WIP: Uncordon the node during failed updates #1572
Conversation
Will adapt to #1571
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: yuqi-zhang.
/retest e2e-aws
/retest
Today we cordon the node before we write updates to the node. This means that if a file write fails (e.g. failed to create a directory), we fail the update but the node stays cordoned. This will cause deadlocks as the node annotation for desired config will no longer be updated.

With the rollback added, if you delete the erroneous machineconfig in question, we will be able to auto-recover from failed writes, like we do for failed reconciliation. The side effect of this is that the node will flip between Ready and Ready,Unschedulable, since each time we receive a node event we will attempt to update again and go through the full process.

Signed-off-by: Yu Qi Zhang <[email protected]>
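To make the mechanism concrete, here is a minimal, self-contained Go sketch of the flow described above. It is not the MCD's actual code: the helper names (cordonNode, uncordonNode, writeFiles) and the simulated failure are hypothetical stand-ins for the real daemon logic and drain-library calls.

```go
package main

import (
	"errors"
	"fmt"
)

// cordonNode and uncordonNode stand in for the calls that toggle
// node.Spec.Unschedulable (hypothetical helpers, not the MCD's API).
func cordonNode(node string) error   { fmt.Println("cordon", node); return nil }
func uncordonNode(node string) error { fmt.Println("uncordon", node); return nil }

// writeFiles stands in for laying down the MachineConfig files; here it
// simulates the failure mode from #1443 (a directory that cannot be created).
func writeFiles(files []string) error {
	return errors.New("failed to create directory")
}

func applyUpdate(node string, files []string) (retErr error) {
	if err := cordonNode(node); err != nil {
		return fmt.Errorf("cordoning %s: %w", node, err)
	}
	// Deferred rollback: if any later step fails, uncordon the node so its
	// desired-config annotation can still be updated and it can auto-recover
	// once the bad MachineConfig is deleted.
	defer func() {
		if retErr != nil {
			if err := uncordonNode(node); err != nil {
				retErr = fmt.Errorf("%v; additionally failed to uncordon: %v", retErr, err)
			}
		}
	}()

	if err := writeFiles(files); err != nil {
		return fmt.Errorf("writing files: %w", err)
	}
	return nil
}

func main() {
	// The write fails, so the deferred uncordon runs and the node does not
	// stay cordoned after the failed update.
	if err := applyUpdate("worker-0", []string{"/etc/example/foo.conf"}); err != nil {
		fmt.Println("update failed:", err)
	}
}
```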
Force-pushed from 5d45e74 to 8ee8efc.
Rebased onto Antonio's PR that bumps drain to use the upstream libraries. After prodding a bit more, I've updated my assessment of how to resolve #1443 as follows:
Option 1:
Pros: we seem to rarely be in a situation where the MCD deadlocks as it stands. In the case of applying a machineconfig that causes a failure to update files/units, as described in #1443, we could fix it with a manual workaround.
Cons: I'd consider this a bug, since for other issues (e.g. a bad Ignition section) we are able to unstick the node by deleting the bad machineconfig and have it auto-recover. The manual workaround, although easy to apply, is not immediately intuitive to users. And not fixing a bug feels bad.

Option 2 (what this PR does):
Pros: since we uncordon upon failure, all new MC failures pre-reboot would be consistent (the node would not stay SchedulingDisabled). This allows the node annotations to still be written, and if the bad MC is deleted, we would "auto recover".
Cons: we will now flip between Ready and Ready,Unschedulable on every node event.

Option 3 (have the controller write the desiredConfig annotation even when the node is cordoned):
Pros: would achieve the same result as 2, without the flipping back and forth between cordoned and uncordoned.
Cons: this could potentially open other problems. One example: let's say the node fails after the reboot, and not during the file/unit updates before the reboot as highlighted above. We should retain what config the node attempted to update to. If something else gets applied in the meantime, the node controller would update the node annotation, but the daemon running on the node would never see this request, thereby causing a skew. Thus the node would not attempt to fix itself, and a user could also be confused about what happened and which config failed.

Option 4:
Pros: cleaner operation in general.
Cons: probably going to take a while to implement.
Is this the case? What do we define as rare? Could we get some data on this to help us decide?
Deleting MCs can cause an issue if any node somewhere has one set as current/pending, including in the node annotations on disk used for bootstrapping. It might be mostly OK, because the current/pending MC is saved on disk in 4.2+ and the MCD uses that, but there are still some edge cases where those got corrupted or don't match. I'll need to think through the exact situations more.
I’m worried about kubelet/crio/api-server skew issues from partial updates.
Yeah I think this wouldn’t fly. Would at the least cause a lot of noise for people debugging their clusters.
This one makes sense to me.
What about if we keep the
I’m missing something. Why would the daemon never see the request?
I’m in favor of this one as much as possible. We’re never going to get around all the edge cases with the current approach. Doing it right will save us time in the long run. @cgwalters does it seem like the right time to tackle this?
I don't recall any issues like this personally. The only deadlocked situation that was similar was reported by Crawford, since he did some manual file editing. That wasn't reproduced, so we closed it.
To clarify, I don't mean deleting a rendered-mc-xxx, just a regular MC, which we support anyways. I don't think I've seen a scenario where we drop into some weird state, but I can see it happening if we somehow trigger a race between an annotation write and a node going SchedulingDisabled? Anyhow, that's outside the context of this PR.
We shouldn't have any unless we fail the rollback, in which case we are in deep trouble anyways. This PR doesn't affect that path since it's LIFO deferred and runs last.
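As a side note, here is a tiny sketch of the Go defer ordering being relied on here: deferred calls run in last-in, first-out order, so a defer registered at the start of the update (such as an uncordon-on-failure) runs after any rollback defers registered later. The labels below are illustrative, not the daemon's actual defers.

```go
package main

import "fmt"

func main() {
	// Deferred calls run LIFO: the one registered first runs last.
	defer fmt.Println("3: uncordon on failure (registered first, runs last)")
	defer fmt.Println("2: roll back files/units (registered later, runs first)")
	fmt.Println("1: attempting update")
}
```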
Agreed
Hm, now thinking about it, I think it wouldn't really be a huge issue. What I was thinking was something like: if we change controller behaviour, we are basically saying that if a node is SchedulingDisabled, you can still write the desiredConfig. Let's say you have nodes on You then realize that since the node died in the initramfs the cluster is still broken, so you open a BZ with a must-gather. Since we lost the node entirely, we no longer have the MCD and its logs, so we can see in the MCP/node annotation that the desiredConfig is
Closing this for now as it is not a high-priority fix. Will revisit later with better approaches.
Today we cordon the node before we write updates to the node. This
means that if a file write fails (e.g. failed to create a directory),
we fail the update but the node stays cordoned. This will cause
deadlocks as the node annotation for desired config will no longer
be updated.
With the rollback added, if you delete the erroneous machineconfig
in question, we will be able to auto-recover from failed writes,
like we do for failed reconciliation. The side effect of this is
that the node will flip between Ready and Ready,Unschedulable,
since each time we receive a node event we will attempt to update
again and go through the full process.
Signed-off-by: Yu Qi Zhang [email protected]
Closes: #1443