Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Uncordon the node during failed updates #1572

Closed
wants to merge 1 commit into from

Conversation

yuqi-zhang
Copy link
Contributor

Today we cordon the node before we write updates to the node. This
means that if a file write fails (e.g. failed to create a directory),
we fail the update but the node stays cordoned. This will cause
deadlocks as the node annotation for desired config will no longer
be updated.

With the rollback added, if you delete the erroneous machineconfig
in question, we will be able to auto-recover from failed writes,
like we do for failed reconciliation. The side effect of this is
that the node will flip between Ready and Ready,Unschedulable,
since each time we receive a node event we will attempt to update
again and go through the full process.

Signed-off-by: Yu Qi Zhang [email protected]

Closes: #1443

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 18, 2020
@yuqi-zhang
Copy link
Contributor Author

Will adapt to #1571

@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 18, 2020
@yuqi-zhang
Copy link
Contributor Author

/retest e2e-aws

@yuqi-zhang
Copy link
Contributor Author

/retest

Today we cordon the node before we write updates to the node. This
means that if a file write fails (e.g. failed to create a directory),
we fail the update but the node stays cordoned. This will cause
deadlocks as the node annotation for desired config will no longer
be updated.

With the rollback added, if you delete the erroneous machineconfig
in question, we will be able to auto-recover from failed writes,
like we do for failed reconciliation. The side effect of this is
that the node will flip between Ready and Ready,Unschedulable,
since each time we receive a node event we will attempt to update
again and go through the full process.

Signed-off-by: Yu Qi Zhang <[email protected]>
@yuqi-zhang
Copy link
Contributor Author

Rebased onto Antonio's PR to bump drain to using upstream libraries.

So after prodding a bit more I've updated my assessment to resolving #1443 as follows:

  1. Do nothing (close this fix)

Pros: we seem to rarely be in a situation where the MCD deadlocks as it stands. In the case of applying a machineconfig causing a failure to update files/units as described in #1443 , we could fix by:

1. manually deleting the bad machineconfig
2. oc edit node/node-in-scheduling-disabled
3. edit desiredconfig back to currentconfig
4. oc adm uncordon node/node-in-scheduling-disabled

Cons: I'd consider this a bug since for other issues (e.g. a bad ignition section) we are able to unstick the node by deleting a bad machineconfig and have it auto recover. The manual workaround, although easy to apply, is not immediately intuitive to users. And not fixing a bug feels bad.

  1. Merge this fix as is (I would say this is more of a patch than a long term solution, see 4.)

Pros: Since we uncordon upon failure, all new MC failures pre-reboots would be consistent (the node would not stay schedulingdisabled). This allows the node annotations to still be written and if the bad MC is deleted, we would "auto recover".

Cons: we will now flip between Ready and Ready,SchedulingDisabled, which means we will drain the node, mark it unschedulable, fail, mark it schedulable again, every ~30 seconds or so on a default cluster. This seems operation heavy and could cause a lot of short lived pods

  1. Rework controller logic to allow writing desired config to unavailable nodes, instead of nothing at all: https://github.com/openshift/machine-config-operator/blob/master/pkg/controller/node/node_controller.go#L796

Pros: would achieve the same result as 2, without the flipping back and forth between cordoned and uncordoned.

Cons: This could potentially open other problems. One example: let's say the node fails after the reboot and not during file/unit updates before the reboot, as highlighted above. We should retain what config the node attempted to update to. If something else gets applied in the meantime, the node controller would update the node annotation, but the daemon running on the node would never see this request and thereby cause a skew. Thus it will not attempt to fix itself, but also potentially confuse a user to what happened and which config failed.

  1. Fundamentally rework the update process, either by making it transactional (update the files in an rpm-ostree prepared root instead of the current running one) or otherwise reworking how we progress updates.

Pros: cleaner operation in general

Cons: probably going to take awhile to implement

@ericavonb
Copy link
Contributor

  1. Do nothing (close this fix)

Pros: we seem to rarely be in a situation where the MCD deadlocks as it stands.

Is this the case? What do we define as rare? Could we get some data on this to help us decide?

In the case of applying a machineconfig causing a failure to update files/units as described in #1443 , we could fix by:


1. manually deleting the bad machineconfig

Deleting MCs can cause an issue if any node somewhere has it set as current/pending, including in the nodeannotations on disk for bootstrapping. I’ll need to think through the exact situations. It might be mostly ok because the current/pending MC are saved on disk in 4.2+ and the MCD uses that. There are still some edge cases though where those got corrupted or don’t match. I’ll need to think through the exact situations more.

  1. oc edit node/node-in-scheduling-disabled

  2. edit desiredconfig back to currentconfig

  3. oc adm uncordon node/node-in-scheduling-disabled




Cons: I'd consider this a bug since for other issues (e.g. a bad ignition section) we are able to unstick the node by deleting a bad machineconfig and have it auto recover. The manual workaround, although easy to apply, is not immediately intuitive to users. And not fixing a bug feels bad.



2. Merge this fix as is (I would say this is more of a patch than a long term solution, see 4.)



Pros: Since we uncordon upon failure, all new MC failures pre-reboots would be consistent (the node would not stay schedulingdisabled). This allows the node annotations to still be written and if the bad MC is deleted, we would "auto recover".

I’m worried about kubelet/crio/api-server skew issues from partial updates.

Cons: we will now flip between Ready and Ready,SchedulingDisabled, which means we will drain the node, mark it unschedulable, fail, mark it schedulable again, every ~30 seconds or so on a default cluster. This seems operation heavy and could cause a lot of short lived pods

Yeah I think this wouldn’t fly. Would at the least cause a lot of noise for people debugging their clusters.

  1. Rework controller logic to allow writing desired config to unavailable nodes, instead of nothing at all: https://github.com/openshift/machine-config-operator/blob/master/pkg/controller/node/node_controller.go#L796

This one makes sense to me.

Pros: would achieve the same result as 2, without the flipping back and forth between cordoned and uncordoned.

Cons: This could potentially open other problems. One example: let's say the node fails after the reboot and not during file/unit updates before the reboot, as highlighted above. We should retain what config the node attempted to update to.

What about if we keep the pending annotation for that information?

If something else gets applied in the meantime, the node controller would update the node annotation, but the daemon running on the node would never see this request and thereby cause a skew.

I’m missing something. Why would the daemon never see the request?

Thus it will not attempt to fix itself, but also potentially confuse a user to what happened and which config failed.

  1. Fundamentally rework the update process, either by making it transactional (update the files in an rpm-ostree prepared root instead of the current running one) or otherwise reworking how we progress updates.

Pros: cleaner operation in general

Cons: probably going to take awhile to implement

I’m in favor of this one as much as possible. We’re never going to get around all the edge cases with the current approach. Doing it right will save us time in the long run.

@cgwalters does it seem like the right time to tackle this?

@yuqi-zhang
Copy link
Contributor Author

Is this the case? What do we define as rare? Could we get some data on this to help us decide?

I don't recall any issues like this personally. The only deadlock'ed situation that was similar was reported by Crawford since he did some manual file editing. And that wasn't reproduced so we closed it.

Deleting MCs can cause an issue if any node somewhere has it set as current/pending, including in the nodeannotations on disk for bootstrapping. I’ll need to think through the exact situations. It might be mostly ok because the current/pending MC are saved on disk in 4.2+ and the MCD uses that. There are still some edge cases though where those got corrupted or don’t match. I’ll need to think through the exact situations more.

to clarify, I don't mean deleting a rendered-mc-xxx, just a regular MC, which we support anyways. I don't think I've seen a scenario where we drop into some weird state but I can see it happening if we somehow trigger a race between an annotation write and a node going schedulingdisabled? Anyhow, out of the context of this PR.

I’m worried about kubelet/crio/api-server skew issues from partial updates.

We shouldn't have any unless we fail the rollback in which case we are in deep trouble anyways. This PR doesn't affect that path since its LIFO deferred and runs last.

Yeah I think this wouldn’t fly. Would at the least cause a lot of noise for people debugging their clusters.

Agreed

I’m missing something. Why would the daemon never see the request?

Hm now thinking about it I think it wouldn't really be a huge issue. What I was thinking was something like:

So if we change controller behaviour, we are basically saying, if a node is schedulingdisabled, you can still write the desiredConfig. Lets say you have nodes on rendered-1. If you apply a MC that is correct, but caused a failure during reboot (say, in the initramfs), it generates a new renderedconfig rendered-2, targets maxUnavailable nodes, and applies the changes and reboots. Lets say during this time you realized something went wrong, did an oc delete bad-mc and applied a new mc that doesn't have the problem, which generates rendered-3. The controller sees that there is 1 unavailable node, and targets that one's desiredConfig to rendered-3, since we allow that now, whereas before it would still be on 2

You then realize that since the node died in the initramfs the cluster is still broken, so you open a BZ with a must-gather. Since we lost the node entirely we no longer have the MCD and logs on it, so we can see in the MCP/node annotation the desiredConfig is rendered-3, which is not what borked the cluster. That said we can probably see in the MCP what was the original operation that failed it?

@openshift-ci-robot
Copy link
Contributor

@yuqi-zhang: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-gcp-upgrade 8ee8efc link /test e2e-gcp-upgrade
ci/prow/e2e-aws-scaleup-rhel7 8ee8efc link /test e2e-aws-scaleup-rhel7

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@yuqi-zhang
Copy link
Contributor Author

Closing this for now as it is not a high priority fix. Will revisit later with better approaches

@yuqi-zhang yuqi-zhang closed this Apr 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The MCD is stuck and unable to recover from file degradations
3 participants