-
Notifications
You must be signed in to change notification settings - Fork 476
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Non-disruptive upgrades in DPU clusters #1005
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,318 @@ | ||
--- | ||
title: Non-Disruptive DPU Infra Cluster Upgrades | ||
authors: | ||
- "@danwinship" | ||
reviewers: | ||
- "@dhellmann" | ||
- "@sdodson" | ||
- "@wking" | ||
- "@romfreiman" | ||
- "@zshi" | ||
approvers: | ||
- TBD | ||
api-approvers: # in case of new or modified APIs or API extensions (CRDs, aggregated apiservers, webhooks, finalizers) | ||
- TBD | ||
creation-date: 2021-01-11 | ||
last-updated: 2021-01-11 | ||
tracking-link: | ||
- https://issues.redhat.com/browse/SDN-2603 | ||
see-also: | ||
- "/enhancements/network/dpu/overview.md" | ||
--- | ||
|
||
# Non-Disruptive DPU Infra Cluster Upgrades | ||
|
||
## Summary | ||
|
||
Our architecture for [using DPUs (eg, BlueField-2 NICs) in | ||
OCP](../network/dpu/overview.md) involves creating one ordinary | ||
"tenant" OCP cluster consisting of x86 hosts, and a second "infra" OCP | ||
cluster consisting of the ARM systems running on the NICs installed in | ||
the x86 hosts. | ||
|
||
In the current Dev Preview of DPU support, there is no coordination | ||
between the infra and tenant clusters during upgrades. Thus, when | ||
upgrading the infra cluster, each NIC will end up rebooting at some | ||
point, after it has drained _its own_ pods, but without having made | ||
any attempt to drain the pods of its x86 tenant node. When this | ||
happens, the x86 node (and its pods) will lose network connectivity | ||
until the NIC finishes rebooting. | ||
|
||
In order to make upgrades in DPU clusters work smoothly, we need some | ||
way to ensure that tenant nodes do not get rebooted when they are | ||
running user workloads. | ||
|
||
## Glossary | ||
|
||
(Copied from the [DPU Overview | ||
Enhancement](../network/dpu/overview.md).) | ||
|
||
- **DPU** / **Data Processing Unit** - a Smart NIC with a full CPU, | ||
RAM, storage, etc, running a full operating system, able to offload | ||
network processing and potentially other tasks from its host system. | ||
For example, the NVIDIA BlueField-2. | ||
|
||
- **DPU NIC** - refers to the DPU and its CPU, OS, etc (as opposed to | ||
the CPU, OS, etc, of the x86 server that the DPU is installed in). A | ||
**DPU NIC node** or **DPU NIC worker** is an OCP worker node | ||
running on a DPU NIC. | ||
|
||
- **DPU Host** - refers to an x86 server (and its CPU, OS, etc) which | ||
contains a DPU. A **DPU Host node** or **DPU Host worker** is an OCP | ||
worker node running on a DPU Host. | ||
|
||
- **Two-Cluster Design** - any architecture for deploying an OCP | ||
cluster with (some) nodes containing DPUs, where the DPUs are nodes | ||
in a second OCP cluster. (For purposes of this term, the | ||
architecture still counts as "two-cluster" even if HyperShift is | ||
involved and there are actually three clusters.) | ||
|
||
- **Infra Cluster** - in a Two-Cluster Design, the OCP cluster which | ||
contains the DPU NIC workers. (It may also include master nodes | ||
and/or non-DPU workers, depending on the particular details of the | ||
two-cluster design.) | ||
|
||
- **Tenant Cluster** - in a Two-Cluster Design, the OCP cluster | ||
containing the DPU Host workers. (It may also include master nodes | ||
and/or workers without DPUs.) | ||
|
||
## Motivation | ||
|
||
### Goals | ||
|
||
- Make upgrades of DPU Infra Clusters work without interrupting | ||
network connectivity in their corresponding Tenant Clusters. | ||
|
||
### Non-Goals | ||
|
||
- Requiring Infra and Tenant Clusters to upgrade at the same time; | ||
either cluster should be able to do an upgrade without involving the | ||
other. | ||
|
||
- _Allowing_ Infra and Tenant Clusters to upgrade _efficiently_ at the | ||
same time. If an administrator wants to upgrade both clusters at the | ||
same time, then we would ideally do this in a way that would only | ||
drain each Tenant node once. But this would require substantially | ||
more synchronization between the two clusters, and is not handled by | ||
this proposal. | ||
|
||
## Proposal | ||
|
||
### User Stories | ||
|
||
#### User Story 1 | ||
|
||
As the administrator of a cluster using DPUs, I want to be able to do | ||
y-stream and z-stream upgrades without causing unnecessary network outages. | ||
|
||
#### User Story 2 | ||
|
||
As an infrastructure provider of clusters using DPUs, I want to be | ||
able to upgrade the DPU Infra Clusters without having to coordinate | ||
with the administrators of the individual Tenant Clusters. | ||
|
||
### API Extensions | ||
|
||
TBD | ||
|
||
### Implementation Details/Notes/Constraints | ||
|
||
#### Setup | ||
|
||
We can set some things up at install time (eg, creating credentials to | ||
allow certain operators in the two clusters to talk to each other). | ||
|
||
#### Inter-Cluster Version Skew | ||
|
||
There is very little explicit communication between the two clusters: | ||
|
||
1. The dpu-operator in the infra cluster talks to the apiserver in | ||
the tenant cluster. | ||
|
||
2. The ovn-kubernetes components in the two clusters do not | ||
communicate directly, but they do make some assumptions about | ||
each other's behavior (which is necessary to coordinate the | ||
plumbing of pod networking). | ||
|
||
The apiserver communication is very ordinary and boring and not likely | ||
to be subject to any interesting version skew problems (even if we | ||
ended up with, say, a 3-minor-version skew between the clusters). | ||
|
||
Thus, other than ovn-kubernetes, no OCP component needs to be | ||
concerned about version skew between the two clusters, because the | ||
components in the two clusters are completely unaware of each other. | ||
|
||
For ovn-kubernetes, if we ever change the details of the cross-cluster | ||
communication, then we will need to add proper checks to enforce | ||
tolerable cross-cluster skew at that time. | ||
|
||
## Design Details | ||
|
||
If an Infra Node is being cordoned and drained, then we can assume | ||
that the administrator (or an operator) is planning to do something | ||
disruptive to it. Anything that disrupts the Infra Node will disrupt | ||
its Tenant Node as well. | ||
|
||
Therefore, if an Infra Node is being cordoned and drained, we should | ||
respond by cordoning and draining its Tenant Node, and not allowing | ||
the Infra Node to be fully drained until the Tenant has also been | ||
drained. | ||
|
||
Given that behavior, when an Infra Cluster upgrade occurs, if it is | ||
necessary to reboot nodes, then each node's tenant should be drained | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Whenever we reboot a DPU, we also have to reboot the host. So that VFs can be recreated and VF representers can be discovered by the DPU. Without the VF reps, the ovnkube-node pod cannot start on the DPU. |
||
before the infra node reboots, so there should be no pod network | ||
disruption. | ||
|
||
### "Drain Mirroring" Controller | ||
|
||
In order to prevent MCO from rebooting an Infra node until its Tenant | ||
has been drained, we will need to run at least one drainable pod on | ||
each node, so that we can refuse to let that pod be drained until the | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Let's clarify here and say "infra node"? In general throughout the text I think we need to be very clear for type of node. Perhaps a shorthand "inode" and "tnode"? Or "infra-node" and "tenant-node"? |
||
infra node's tenant is drained. | ||
|
||
Annoyingly, although we need this pod to run on every worker node, it | ||
can't be deployed by a DaemonSet, because DaemonSet pods are not | ||
subject to draining. Thus, we will have to have the DPU operator | ||
Comment on lines
+173
to
+175
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The daemonset draining is configurable I think; if we wanted to do that we could. See e.g. https://github.com/openshift/machine-config-operator/blob/62fb49f2d14519e215125faa6b817662cdb7ffd0/pkg/daemon/daemon.go#L316 That said, we'd probably need to change the MCO to opt-in specific daemonsets to being drained. This also relates to openshift/machine-config-operator#2978 I think. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (And in practice, we probably do need to care about things other than the MCO doing draining, so relying on a daemonset drain is probably not viable in the near future) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The choice isn't "drain DaemonSets" or "don't drain DaemonSets". It's "bail out with an error if there are DaemonSets since they can't be drained" or "don't bail out with an error if there are DaemonSets". They never get drained. |
||
manage this manually. To minimize privileges and cross-cluster | ||
communication, it would also make sense to have the DPU operator | ||
handle the actual mirroring of infra state to the tenant cluster, | ||
rather than having each infra node manage its own tenant node's state. | ||
|
||
So, the "drain mirroring" controller of the DPU operator will: | ||
|
||
- Deploy a "drain blocker" pod to each infra worker node. This pod | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: linter has some list-indent opinions:
|
||
will do nothing (ie, it just runs the "pause" image, or | ||
equivalent), and will have a | ||
`dpu-operator.openshift.io/drain-mirroring` finalizer. | ||
|
||
- Whenever an infra worker is marked `Unschedulable`, mark its | ||
tenant node `Unschedulable` as well, with a `Condition` indicating | ||
why it was made unschedulable. | ||
|
||
- If a "drain blocker" pod on an unschedulable infra node is marked | ||
for deletion, then drain the corresponding tenant node as well | ||
before removing the finalizer from the pod. | ||
|
||
- Whenever an infra worker is marked not-`Unschedulable`: | ||
|
||
- If we were in the process of draining its tenant, then abort. | ||
|
||
- If the tenant node is also `Unschedulable` and has a | ||
`Condition` indicating that we are the one who made it | ||
`Unschedulable`, then remove the `Condition` and the | ||
`Unschedulable` state from the tenant node. | ||
|
||
- If the infra node does not currently have a "drain blocker" | ||
pod, then deploy a new one to it. | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As I mentioned above, we need to add a step to reboot the tenant node after the DPU reboot. However, I am not quite sure how DPU can do that. One option is to add a reboot node API to MCO and let DPU invoke that. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hm... do we actually need to reboot? It seems to me like there are two possibilities here:
Right? We don't have the code to do 2 now, but since we don't have the code to do 1 now either, we have to write code either way. And it seems like making the tenant more able to recover from unexpected infra problems would be a good thing in general? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok, if we want to do 2. Shall we also describe the logic in this doc? |
||
### Open Questions | ||
|
||
Do finalizers block draining/eviction, or do we need to set | ||
`TerminationGracePeriodSeconds` too? (If the latter, then this gets | ||
more complicated since we need to control when the "drain blocker" | ||
actually exits.) | ||
|
||
### Risks and Mitigations | ||
|
||
TBD | ||
|
||
### Test Plan | ||
|
||
TBD | ||
|
||
### Graduation Criteria | ||
|
||
TBD | ||
|
||
#### Dev Preview -> Tech Preview | ||
|
||
#### Tech Preview -> GA | ||
|
||
#### Removing a deprecated feature | ||
|
||
### Upgrade / Downgrade Strategy | ||
|
||
This is a modification to the upgrade process, not something that can | ||
be upgrade or downgraded on its own. | ||
|
||
TBD, as the details depend on the eventual design. | ||
|
||
### Version Skew Strategy | ||
|
||
TBD, as the details depend on the eventual design. | ||
|
||
### Operational Aspects of API Extensions | ||
|
||
TBD | ||
|
||
#### Failure Modes | ||
|
||
TBD | ||
|
||
#### Support Procedures | ||
|
||
TBD | ||
|
||
## Implementation History | ||
|
||
- Initial proposal: 2021-01-11 | ||
- Updated for initial feedback: 2021-01-24 | ||
- Updated again: 2021-03-02 | ||
|
||
## Drawbacks | ||
|
||
This makes the upgrade process more complicated, which risks rendering | ||
clusters un-upgradeable without manual intervention. | ||
|
||
However, without some form of synchronization, it is impossible to | ||
have non-disruptive tenant cluster upgrades. | ||
|
||
## Alternatives | ||
|
||
### Never Reboot the DPUs | ||
|
||
This implies never upgrading OCP on the DPUs. I don't see how this | ||
could work. | ||
|
||
### Don't Have an Infra Cluster | ||
|
||
If the DPUs were not all part of a single OCP cluster (for example, | ||
they were just "bare" RHCOS hosts, or they were each running | ||
Single-Node OpenShift), then it might be simpler to synchronize the | ||
DPU upgrades with the tenant upgrades, because then each tenant could | ||
coordinate the actions of its own DPU by itself. | ||
|
||
The big problem with this is that, for security reasons, we don't want | ||
the tenants to have any control over their DPUs. (For some future use | ||
cases, the DPUs will be used to enforce security policies on their | ||
tenants.) | ||
|
||
### Synchronized Infra and Tenant Cluster Upgrades | ||
|
||
In [the original proposal], the plan was for infra and tenant cluster | ||
uprades to be synchronized, with the Infra and Tenant MCOs | ||
communicating to ensure that workers were upgraded in the same order | ||
in both clusters, and that reboots of the Infra and Tenant workers | ||
happened at the same time. | ||
|
||
After some discussion, we decided we didn't need that much | ||
synchronization. | ||
|
||
[the original proposal]: https://github.com/openshift/enhancements/commit/8033a923 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. A definite downside of the github-markdown thing we have is that a lot of important commentary and information lives only in PR review, and that earlier revisions get lost. As far as I understand things, since that commit is not part of any branch, it will be subject to GC and potentially lost. Maybe we could handle this a bit crudely by having explicit versions, something like sticking that content at Dunno. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In my experience, github doesn't ever GC revisions that were commented on as part of a PR. (Yes, there's that "I can't find those revisions anywhere" error but I think that's more about it trying to stop you from accidentally following out-of-date links.) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Instead of relying on GitHub's history preservation, we can just inline the relevant bits in this alternative section. In this case "we didn't need that" is a bit fuzzy. I think the relevant bits were something like:
|
||
|
||
### Simultaneous but Asynchronous Infra and Tenant Cluster Upgrades | ||
|
||
[The second version of the proposal] took advantage of the fact that | ||
it's presumed safe to reboot infra nodes whenever their tenant nodes | ||
are idle, but still assumed that the infra and tenant clusters would | ||
be upgraded at the same time. In this version, the two clusters | ||
upgraded independently until they reached the MCO upgrade stage. Then | ||
the infra cluster MCO was supposed to queue up RHCOS changes without | ||
actually rebooting nodes, and then the tenant cluster MCO would | ||
proceed with its upgrade, and each infra worker node would watch to | ||
see when its tenant rebooted, and reboot itself at the same time. | ||
|
||
This approach also had various problems, and probably would not have | ||
worked well. | ||
|
||
[The second version of the proposal]: https://github.com/openshift/enhancements/commit/b4650860 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the MCO on the infra node is still going to drain all the other pods on the infra node except for the drain mirror pod - wouldn't that lead to disruption? Or is it that the main workload today on the infra pod is SDN daemonset and hence won't be drained?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DPU NIC nodes do not currently run any non-DaemonSet pods, and that's not expected to change. So in fact, the drain-blocker pod is actually the only pod that will get drained.
(I've been imprecise with terminology here... I talk about "infra nodes" and "infra workers", but there are actually three machinesets in the infra cluster: "master", "worker", and "dpu-worker", and it's specifically the "dpu-worker" nodes we're talking about here, and they don't run any general-purpose pods. They only run pods that are specifically part of providing functionality to the corresponding DPU Host Node.)