WIP: Uncordon the node during failed updates #1572
Conversation
Will adapt to #1571
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: yuqi-zhang.
/retest e2e-aws
/retest
Today we cordon the node before we write updates to the node. This means that if a file write fails (e.g. failed to create a directory), we fail the update but the node stays cordoned. This will cause deadlocks as the node annotation for desired config will no longer be updated.

With the rollback added, if you delete the erroneous machineconfig in question, we will be able to auto-recover from failed writes, like we do for failed reconciliation. The side effect of this is that the node will flip between Ready and Ready,Unschedulable, since each time we receive a node event we will attempt to update again and go through the full process.

Signed-off-by: Yu Qi Zhang <[email protected]>
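To make the mechanism concrete, here is a minimal, self-contained Go sketch of the flow described above. It is not the MCD's actual code: the helper names (cordonNode, uncordonNode, writeFiles) and the simulated failure are hypothetical stand-ins for the real daemon logic and drain-library calls.

```go
package main

import (
	"errors"
	"fmt"
)

// cordonNode and uncordonNode stand in for the calls that toggle
// node.Spec.Unschedulable (hypothetical helpers, not the MCD's API).
func cordonNode(node string) error   { fmt.Println("cordon", node); return nil }
func uncordonNode(node string) error { fmt.Println("uncordon", node); return nil }

// writeFiles stands in for laying down the MachineConfig files; here it
// simulates the failure mode from #1443 (a directory that cannot be created).
func writeFiles(files []string) error {
	return errors.New("failed to create directory")
}

func applyUpdate(node string, files []string) (retErr error) {
	if err := cordonNode(node); err != nil {
		return fmt.Errorf("cordoning %s: %w", node, err)
	}
	// Deferred rollback: if any later step fails, uncordon the node so its
	// desired-config annotation can still be updated and it can auto-recover
	// once the bad MachineConfig is deleted.
	defer func() {
		if retErr != nil {
			if err := uncordonNode(node); err != nil {
				retErr = fmt.Errorf("%v; additionally failed to uncordon: %v", retErr, err)
			}
		}
	}()

	if err := writeFiles(files); err != nil {
		return fmt.Errorf("writing files: %w", err)
	}
	return nil
}

func main() {
	// The write fails, so the deferred uncordon runs and the node does not
	// stay cordoned after the failed update.
	if err := applyUpdate("worker-0", []string{"/etc/example/foo.conf"}); err != nil {
		fmt.Println("update failed:", err)
	}
}
```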
Force-pushed from 5d45e74 to 8ee8efc.
Rebased onto Antonio's PR that bumps drain to use the upstream libraries. After prodding a bit more, I've updated my assessment of how to resolve #1443 as follows:
Option 1:
Pros: we seem to rarely be in a situation where the MCD deadlocks as it stands. In the case of applying a machineconfig that causes a failure to update files/units, as described in #1443, we could fix it with a manual workaround.
Cons: I'd consider this a bug, since for other issues (e.g. a bad Ignition section) we are able to unstick the node by deleting the bad machineconfig and have it auto-recover. The manual workaround, although easy to apply, is not immediately intuitive to users. And not fixing a bug feels bad.

Option 2 (what this PR does):
Pros: since we uncordon upon failure, all new MC failures pre-reboot would be consistent (the node would not stay SchedulingDisabled). This allows the node annotations to still be written, and if the bad MC is deleted, we would "auto recover".
Cons: we will now flip between Ready and Ready,Unschedulable on every node event.

Option 3 (have the controller write the desiredConfig annotation even when the node is cordoned):
Pros: would achieve the same result as 2, without the flipping back and forth between cordoned and uncordoned.
Cons: this could potentially open other problems. One example: let's say the node fails after the reboot, and not during the file/unit updates before the reboot as highlighted above. We should retain what config the node attempted to update to. If something else gets applied in the meantime, the node controller would update the node annotation, but the daemon running on the node would never see this request, thereby causing a skew. Thus the node would not attempt to fix itself, and a user could also be confused about what happened and which config failed.

Option 4:
Pros: cleaner operation in general.
Cons: probably going to take a while to implement.
Is this the case? What do we define as rare? Could we get some data on this to help us decide?
Deleting MCs can cause an issue if any node somewhere has one set as current/pending, including in the node annotations on disk used for bootstrapping. It might be mostly OK, because the current/pending MC is saved on disk in 4.2+ and the MCD uses that, but there are still some edge cases where those got corrupted or don't match. I'll need to think through the exact situations more.
I’m worried about kubelet/crio/api-server skew issues from partial updates.
Yeah I think this wouldn’t fly. Would at the least cause a lot of noise for people debugging their clusters.
This one makes sense to me.
What about if we keep the
I’m missing something. Why would the daemon never see the request?
I’m in favor of this one as much as possible. We’re never going to get around all the edge cases with the current approach. Doing it right will save us time in the long run. @cgwalters does it seem like the right time to tackle this?
I don't recall any issues like this personally. The only deadlocked situation that was similar was reported by Crawford, since he did some manual file editing. That wasn't reproduced, so we closed it.
To clarify, I don't mean deleting a rendered-mc-xxx, just a regular MC, which we support anyways. I don't think I've seen a scenario where we drop into some weird state, but I can see it happening if we somehow trigger a race between an annotation write and a node going SchedulingDisabled? Anyhow, that's outside the context of this PR.
We shouldn't have any unless we fail the rollback, in which case we are in deep trouble anyways. This PR doesn't affect that path since it's LIFO deferred and runs last.
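As a side note, here is a tiny sketch of the Go defer ordering being relied on here: deferred calls run in last-in, first-out order, so a defer registered at the start of the update (such as an uncordon-on-failure) runs after any rollback defers registered later. The labels below are illustrative, not the daemon's actual defers.

```go
package main

import "fmt"

func main() {
	// Deferred calls run LIFO: the one registered first runs last.
	defer fmt.Println("3: uncordon on failure (registered first, runs last)")
	defer fmt.Println("2: roll back files/units (registered later, runs first)")
	fmt.Println("1: attempting update")
}
```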
Agreed
Hm, now thinking about it, I think it wouldn't really be a huge issue. What I was thinking was something like: if we change controller behaviour, we are basically saying that if a node is SchedulingDisabled, you can still write the desiredConfig. Let's say you have nodes on You then realize that since the node died in the initramfs the cluster is still broken, so you open a BZ with a must-gather. Since we lost the node entirely, we no longer have the MCD and its logs, so we can see in the MCP/node annotation that the desiredConfig is
Closing this for now as it is not a high-priority fix. Will revisit later with better approaches.
Today we cordon the node before we write updates to the node. This
means that if a file write fails (e.g. failed to create a directory),
we fail the update but the node stays cordoned. This will cause
deadlocks as the node annotation for desired config will no longer
be updated.
With the rollback added, if you delete the erroneous machineconfig
in question, we will be able to auto-recover from failed writes,
like we do for failed reconciliation. The side effect of this is
that the node will flip between Ready and Ready,Unschedulable,
since each time we receive a node event we will attempt to update
again and go through the full process.
Signed-off-by: Yu Qi Zhang [email protected]
Closes: #1443