node-controller: Support an annotation to hold updates #2163
Today the MCO arbitrarily chooses a node to update from the candidates. We want to allow admins to avoid specific nodes entirely. (Aside: This replaces the defunct etcd-specific ordering code that we didn't end up using) Add an annotation `machineconfiguration.openshift.io/hold` that allows an external controller (and/or human) to avoid specific nodes. Related: openshift#2059
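As a rough illustration of the idea in the description, the sketch below filters update candidates by the proposed annotation. It is not the code from this PR: the `Node` type and the `filterHeld` helper are hypothetical stand-ins, and it assumes that the mere presence of the annotation key (whatever its value) marks a node as held.

```go
package main

import "fmt"

// holdAnnotation is the annotation key proposed in this PR.
const holdAnnotation = "machineconfiguration.openshift.io/hold"

// Node is a simplified, hypothetical stand-in for a Kubernetes node object.
type Node struct {
	Name        string
	Annotations map[string]string
}

// filterHeld returns the candidates that do not carry the hold annotation.
// Assumption: presence of the key, regardless of its value, means "held".
func filterHeld(candidates []Node) []Node {
	var eligible []Node
	for _, n := range candidates {
		if _, held := n.Annotations[holdAnnotation]; held {
			continue // skip held nodes; they stay in the pool but are not picked
		}
		eligible = append(eligible, n)
	}
	return eligible
}

func main() {
	nodes := []Node{
		{Name: "worker-a"},
		{Name: "worker-b", Annotations: map[string]string{holdAnnotation: "true"}},
		{Name: "worker-c"},
	}
	// Only the nodes without the annotation remain update candidates.
	for _, n := range filterHeld(nodes) {
		fmt.Println(n.Name)
	}
}
```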
Only compile tested locally. |
High-level direction looks good to me. Can you split this into multiple commits since this is also pulling out unrelated functionality? This will need an update to the docs. How does an admin know that machines in the cluster are held? What does the MCO's ClusterOperator object look like when a machine is in the held state? We'll need to make it easy for an admin to know that machines have been held, otherwise people are going to forget and it's going to cause problems with upgrades. |
I actually think pausing nodes indefinitely (and allowing them to skip an update) undermines the point of pools entirely (whereas specific ordering does not). If the point of a pool is knowing that all members of the pool have a certain rendered-config, skipping an update makes the pool status and information meaningless. If you can do that, what does it mean to be a member of a pool? If a node doesn't want the updates given to a specific pool, why is it a member? As a UI, the MCO thinks in pools for its status; how can we reflect a pool as upgraded if all members don't have the same config? In the long term, how do we troubleshoot in a general sense when the state of a node is no longer easily known? If there’s an MCO issue, do I have to dig through annotations and rendered configs? What if a node is paused on a required update path? If the hold is indefinite (and not just a "hold this node for last" or something) I think we really do start to hit some fundamental questions/concerns and some long-term confusion/debugging risks. Note: If I'm misunderstanding what the intention of "hold" is please let me know =) |
Also adding a hold (:drum: ) since the MCO team has some questions/concerns that need to be addressed and would also like more background/use case info. /hold |
This doesn't skip an update - the node isn't updated and the pool wouldn't "skip" the node. The node would still count as not updated. The pool would progress until the only nodes remaining are "held" and at that point I think the status should be the same as if the pool was paused. |
I guess this brings me to the question/possible solution of: If we're holding multiple nodes of a pool, why not designate them into a custom pool to preserve the UI/general sense of the MCO dealing in pools? Is there something preventing that or some problem with custom pools that we could address? I'm starting to compile a list of our questions bc I feel like there's some color we are missing in this. |
(Will circle back to this on Monday but) To expand on the logic here, we're just changing the node controller's Nothing else calls that function - in particular when we go to calculate the pool status, notice:
Now we probably do need to patch this logic to add a "held node" count to the pool status, but the important thing to notice here is that held machines don't count as updated, so the pool won't count as updated until they're un-held. Another way to look at this is that holding a machine is much like marking it unschedulable - the node is still there and counts. |
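A minimal sketch of that accounting follows, assuming a simplified model rather than the MCO's real status code. The `Node` and `PoolCounts` types and the `countPool` helper are hypothetical; the point is only that a held node that has not reached the target config is still counted as a machine and is not counted as updated.

```go
package main

import "fmt"

// Node is a hypothetical, simplified view of a pool member.
type Node struct {
	Name          string
	CurrentConfig string // rendered config the node is currently on
	Held          bool   // whether the hold annotation is set
}

// PoolCounts is a hypothetical summary, including the suggested held count.
type PoolCounts struct {
	Machines int
	Updated  int
	Held     int
}

// countPool tallies the pool against the target rendered config.
// Held nodes are counted separately but get no special treatment in Updated.
func countPool(nodes []Node, target string) PoolCounts {
	c := PoolCounts{Machines: len(nodes)}
	for _, n := range nodes {
		if n.CurrentConfig == target {
			c.Updated++
		}
		if n.Held {
			c.Held++
		}
	}
	return c
}

// IsUpdated reports whether every machine in the pool is on the target config.
func (c PoolCounts) IsUpdated() bool { return c.Updated == c.Machines }

func main() {
	nodes := []Node{
		{Name: "A", CurrentConfig: "rendered-2"},
		{Name: "B", CurrentConfig: "rendered-2"},
		{Name: "C", CurrentConfig: "rendered-1", Held: true},
	}
	c := countPool(nodes, "rendered-2")
	// The pool stays "not updated" for as long as C is held on rendered-1.
	fmt.Printf("%+v updated=%v\n", c, c.IsUpdated())
}
```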
So here's a general concern I have for this: we really only should allow one version to be updated to at a time, and since we don't have the concept of queued updates, paused nodes should block the NEXT update because we can't guarantee edges in the future, somewhat invalidating this path. Let me give an example: Today let's say your worker nodes (named A, B, C, D) are on rendered-1, and you add a MC to create rendered-2. Let's say A starts the update first. While A is updating you add a new MC that causes rendered-3. The worker pool will target that immediately, so A, after it finishes the update, will go to rendered-3, BUT because B, C, and D have not yet started they will skip the in-between step. Basically: At a glance that might seem like behaviour we want, which is fine-ish for our regular operations. But now let's add pausing to the picture, and use more concrete examples with the rendered version. Same nodes, installed on 4.4, and now you're upgrading to 4.5. Let's say you pause D, and A, B, and C finish the upgrade without a problem. And now you have Now you want to start another upgrade to 4.6. There is no edge from 4.4->4.6. A, B, and C, although they are on 4.5 and can upgrade, should not. Otherwise we're going to leave D in a limbo where it can never be un-held without risking a failed update. We also don't have a queue to tell D to eventually go from 4.4->4.5->4.6, so held nodes would forever block updates until there are no held nodes from the previous version, which somewhat kills the point of node holds because you can't drift more than 1 version. Now of course we can add a lot of complex logic here, edge checking, or queued updates, or something, but I think that if we do go down that path I wouldn't recommend a quick and dirty annotation like this, but a more fleshed-out design. Maybe my assessment is not entirely correct but just wanted to add this for discussion. |
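To make the 4.4/4.5/4.6 example concrete, here is a small sketch of the kind of edge check mentioned above. It is purely illustrative: the `edges` graph, `hasEdge`, and `blockers` are hypothetical names, and the MCO has no such check today according to this thread.

```go
package main

import (
	"fmt"
	"sort"
)

// edges lists the allowed direct upgrades. This graph is a hypothetical
// model of the example in the comment: there is no edge from 4.4 to 4.6.
var edges = map[string][]string{
	"4.4": {"4.5"},
	"4.5": {"4.6"},
}

// hasEdge reports whether a direct upgrade from one version to another exists.
func hasEdge(from, to string) bool {
	for _, t := range edges[from] {
		if t == to {
			return true
		}
	}
	return false
}

// blockers returns the nodes (sorted by name) that are neither on the target
// version nor able to reach it through a single direct edge.
func blockers(versions map[string]string, target string) []string {
	var out []string
	for name, v := range versions {
		if v != target && !hasEdge(v, target) {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// A, B, and C finished the 4.5 upgrade; D was held on 4.4.
	nodes := map[string]string{"A": "4.5", "B": "4.5", "C": "4.5", "D": "4.4"}
	// D has no direct path to 4.6, so it would block the next upgrade.
	fmt.Println(blockers(nodes, "4.6"))
}
```

Under this model, an update to 4.6 would be refused while `blockers` is non-empty, which is the "held nodes block the next update" behaviour described in the comment.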
I guess my comment doesn't really apply to this PR specifically very well. It's more so to the use case that some nodes getting paused will block all updates forever, I feel we may want a better solution in that case. |
This discussion is strongly related to the "worker pool blocking cluster updates across major versions" thread that came up (not sure if it's on GH or just in the internal Jira dark matter). Remember today, the worker pool does not block upgrades - we have had in the past (and will continue to have) potential failure scenarios where a worker is stuck on e.g. 4.4 because of some failure to drain and the control plane is all the way on 4.6 and we end up trying to jump the worker (if the drain gets unstuck) straight from 4.4 to 4.6. Or the admin could have just straight up paused the worker pool and forgotten about it. IOW this capability is not introducing a new possible failure mode. (I would probably agree it makes it more likely though) And yes I think the MCO should learn about major version updates vs minor. |
I agree, I'm not against this PR, just that "pausing" seems like a paradigm we should not be supporting for extended lengths of time. You can pause a few nodes for a short time while they're running critical workloads, but you should not be holding them for weeks/months to drift the cluster. |
I think it's a bit more nuanced than that. Remember Red Hat hasn't finished the acquisition of CoreOS so it's only drift if the admin starts an |
Catching up a bit... @kikisdeliveryservice I completely agree with your argument and I think @yuqi-zhang does a good job of spelling out just how nasty things can get. The tentative agreement is that the customers who make use of this feature will enjoy the responsibility of ensuring that the myriad versions of RHCOS that make up their cluster and the updates they apply remain a subgraph of the published Cincinnati graph. The overwhelming majority of cluster administrators will not be able to effectively follow these constraints, so in practice, this holding mechanism will only be useful for temporary actions on a machine (e.g. debugging) and really should not be used across updates. Come to think of it, we should include telemetry and an alerting rule so that administrators are reminded when a machine remains paused for too long. |
@cgwalters: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Note that today the worker pool isn't gating, so arbitrary worker node skew can and definitely does happen. IOW there's no relationship today between Cincinnati and the worker pool. There were discussions about fixing that so that the MCO wouldn't allow upgrading the control plane across majors while workers were at N-1 but it doesn't exist today. I'd agree explicitly supporting it this way makes it more likely for a few nodes to skew, but failure to drain can also cause skew (and that's come up in a few bug reports). |
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle stale |
I know this is mostly a technical discussion between peers, but I hope you will allow me to bring in a cluster admin perspective. We have exactly the situation where our users start critical jobs that run for several hours, and we would like not to interrupt them. On OpenShift 4 it is not apparent how to achieve that. Having multiple machine pools to control upgrades via pause was an initial idea of ours, but it felt a bit like misuse so we discarded it. Holding those few nodes and tainting them against new pods at the beginning of a cluster upgrade sounds manageable in our case. Since critical workloads are properly labeled, we would automate the hold/unhold actions during cluster upgrades or machineconfig changes. Since the risk of skew was discussed here as well: node drains are not always reliable, so looking out for blocking pods and incomplete updates on pools is required as part of day 2 operations anyway. |
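The hold/unhold automation described in that comment could look roughly like the sketch below. Everything here is an assumption made for illustration: the `Pod` type, the `example.com/critical` label key, and the `nodesToHold` and `plan` helpers are hypothetical, and actually applying or removing the annotation through the Kubernetes API is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// criticalLabel is a hypothetical label key marking long-running critical jobs.
const criticalLabel = "example.com/critical"

// Pod is a hypothetical, simplified view of a scheduled pod.
type Pod struct {
	Node   string
	Labels map[string]string
}

// nodesToHold returns the set of nodes currently running critical pods.
func nodesToHold(pods []Pod) map[string]bool {
	hold := map[string]bool{}
	for _, p := range pods {
		if p.Labels[criticalLabel] == "true" {
			hold[p.Node] = true
		}
	}
	return hold
}

// plan compares the nodes currently held with the desired set and returns
// which nodes should gain the hold annotation and which should be released.
func plan(current, desired map[string]bool) (hold, release []string) {
	for n := range desired {
		if !current[n] {
			hold = append(hold, n)
		}
	}
	for n := range current {
		if !desired[n] {
			release = append(release, n)
		}
	}
	sort.Strings(hold)
	sort.Strings(release)
	return hold, release
}

func main() {
	pods := []Pod{
		{Node: "worker-1", Labels: map[string]string{criticalLabel: "true"}},
		{Node: "worker-2"},
	}
	currentlyHeld := map[string]bool{"worker-3": true}
	hold, release := plan(currentlyHeld, nodesToHold(pods))
	fmt.Println("hold:", hold, "release:", release)
}
```

Releasing nodes as soon as their critical pods finish keeps holds short, which matches the concern raised earlier in the thread that holds should be temporary.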
That's fine, our discussions are public partially for this reason!
Broadly speaking, https://kubernetes.io/docs/tasks/run-application/configure-pdb/ |
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle rotten |
/remove-lifecycle rotten |
We're having a big debate on a related issue in coreos/zincati#245 I think the MCO should honor Doing this via systemd has a lot of nice properties:
|
@cgwalters: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /close |
@openshift-bot: Closed this PR. In response to this: