---
title: Synchronized Upgrades Between Clusters
authors:
  - "@danwinship"
reviewers:
  - TBD
approvers:
  - TBD
api-approvers: # in case of new or modified APIs or API extensions (CRDs, aggregated apiservers, webhooks, finalizers)
  - TBD
creation-date: 2021-01-11
last-updated: 2021-01-11
tracking-link:
  - https://issues.redhat.com/browse/SDN-2603
see-also:
  - "/enhancements/network/dpu/overview.md"
---

# Synchronized Upgrades Between Clusters

## Summary

In a [cluster with DPUs](../network/dpu/overview.md) (e.g., BlueField-2
NICs), the x86 hosts form one OCP cluster, and the DPU ARM systems
form a second OCP cluster. This makes upgrades to new OCP releases
complicated: rebooting the BF-2 systems as part of the MCO upgrade
causes a network outage on the corresponding x86 systems, and there is
currently no way to synchronize upgrades between the two clusters. For
upgrades to work smoothly, we need to synchronize the reboots between
the two clusters, so that a BF-2 system is only rebooted once its
corresponding x86 host has been cordoned and drained.

For terminology, please refer to the "Glossary" section of the [DPU
Overview Enhancement](../network/dpu/overview.md).

## Motivation

### Goals

- Make upgrades work smoothly in clusters running with DPU support by
  synchronizing the reboots of nodes between the infra cluster and the
  tenant cluster.

### Non-Goals

- Supporting synchronized upgrades of more than 2 clusters at once.

## Proposal

### User Stories

As the administrator of a cluster using DPUs, I want to be able to do
z-stream upgrades without causing unnecessary network outages.

### API Extensions

TBD

### Implementation Details/Notes/Constraints [optional]

TBD

### Risks and Mitigations

TBD

## Design Details

### Open Questions

Basically everything...

The general idea is:

- We can set some things up at install time (e.g., creating credentials
  to allow certain operators in the two clusters to talk to each
  other).

- As part of the DPU security model, the tenant cluster cannot have
  any power over the infra cluster. (In particular, it must not be
  possible for an administrator of the tenant cluster to force the
  infra cluster to upgrade or downgrade to any particular version.)
  Thus, the upgrade must be initiated on the infra cluster side, and
  the infra side will tell the tenant cluster to upgrade as well.
  (Alternatively, the upgrade must be initiated in both clusters at
  roughly the same time, if we don't want the infra cluster to hold a
  credential that lets it initiate an upgrade in the tenant cluster.)

- An upgrade should not be able to start unless both clusters are able
  to upgrade.

  - In particular:

    - There can be no `Upgradeable: False` operators in either cluster.

    - The version to upgrade to must be available to both clusters
      (i.e., it must be available for both x86 and ARM).

  - This could be implemented via some sort of "dpu-cluster-upgrade"
    operator running in both clusters, where the two operators
    communicate with each other and set their `Upgradeable` state to
    reflect the state of the other cluster (see the sketch after this
    list). If the "dpu-cluster-upgrade" operator was placed before
    every other operator in upgrade priority, then it could also block
    disallowed upgrades by failing its own upgrade, e.g., if an admin
    tries to upgrade one cluster without the other, or tries to
    upgrade the two clusters to different versions.

  - (Or should it be possible to do z-stream upgrades of the tenant
    cluster without bothering to upgrade the infra cluster too?)

- The two clusters upgrade all of the operators up to the MCO in
  parallel.

- Whichever cluster reaches the MCO upgrade first needs to wait for
  the other cluster to get there before proceeding. The two MCOs then
  need to coordinate to complete the upgrade: first, they have to
  agree on what order the physical hosts will be upgraded in; second,
  for each physical host, they have to properly synchronize the
  upgrades of its infra node and its tenant node.

  - More specifically, for each physical host, in some order:

    - The Infra MCO will cordon and drain that host's infra node, and
      the Tenant MCO will cordon and drain that host's tenant node.
      (This can happen in parallel.)

    - The Infra MCO will then upgrade the infra node (causing it to
      reboot and temporarily break network connectivity to the tenant
      node).

    - Once the infra node upgrade completes, the Tenant MCO will
      reboot and upgrade the tenant node.

    - (This seems like it will absolutely require MCO changes.)

  - One way to do this would be to have a CRD with an array of hosts,
    indicating the ordering and the current status of each host, which
    the two MCOs could update and watch in order to monitor each
    other's progress. (A sketch of such a CRD appears under
    "Operational Aspects of API Extensions" below.)
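To make the `Upgradeable` handshake above concrete, here is a minimal
sketch of what the hypothetical "dpu-cluster-upgrade" operator's sync
loop might look like. It uses the real `ClusterOperator` types from
`github.com/openshift/api/config/v1` and the clientset from
`github.com/openshift/client-go`, but the operator name, the control
flow, and the existence of a (read-only) client for the peer cluster
are all assumptions of this sketch, not a committed design:

```go
// Hypothetical sketch only: the "dpu-cluster-upgrade" operator name and
// the idea of holding a client for the peer cluster are assumptions of
// this enhancement, not existing OpenShift behavior.
package dpuupgrade

import (
	"context"
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// peerIsUpgradeable reports whether any ClusterOperator in the peer
// cluster is currently Upgradeable=False.
func peerIsUpgradeable(ctx context.Context, peer configclient.Interface) (bool, string, error) {
	operators, err := peer.ConfigV1().ClusterOperators().List(ctx, metav1.ListOptions{})
	if err != nil {
		return false, "", err
	}
	for _, co := range operators.Items {
		for _, cond := range co.Status.Conditions {
			if cond.Type == configv1.OperatorUpgradeable && cond.Status == configv1.ConditionFalse {
				return false, fmt.Sprintf("peer cluster operator %q is not upgradeable: %s", co.Name, cond.Message), nil
			}
		}
	}
	return true, "", nil
}

// syncUpgradeable mirrors the peer cluster's state into the local
// "dpu-cluster-upgrade" ClusterOperator, so that the local CVO will
// refuse to begin an upgrade that the peer cluster could not follow.
func syncUpgradeable(ctx context.Context, local, peer configclient.Interface) error {
	upgradeable, reason, err := peerIsUpgradeable(ctx, peer)
	if err != nil {
		return err
	}

	co, err := local.ConfigV1().ClusterOperators().Get(ctx, "dpu-cluster-upgrade", metav1.GetOptions{})
	if err != nil {
		return err
	}

	cond := configv1.ClusterOperatorStatusCondition{
		Type:               configv1.OperatorUpgradeable,
		Status:             configv1.ConditionTrue,
		LastTransitionTime: metav1.Now(),
	}
	if !upgradeable {
		cond.Status = configv1.ConditionFalse
		cond.Reason = "PeerClusterNotUpgradeable"
		cond.Message = reason
	}
	setCondition(&co.Status.Conditions, cond)

	_, err = local.ConfigV1().ClusterOperators().UpdateStatus(ctx, co, metav1.UpdateOptions{})
	return err
}

// setCondition replaces or appends a condition in place. (A real
// operator would use library-go's condition helpers, and would only
// bump LastTransitionTime when the status actually changes.)
func setCondition(conds *[]configv1.ClusterOperatorStatusCondition, c configv1.ClusterOperatorStatusCondition) {
	for i := range *conds {
		if (*conds)[i].Type == c.Type {
			(*conds)[i] = c
			return
		}
	}
	*conds = append(*conds, c)
}
```

If this operator were also placed early in the CVO's upgrade ordering,
it could additionally fail its own upgrade to block the disallowed
cases described above (upgrading one cluster without the other, or to
mismatched versions).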
### Test Plan

TBD

### Graduation Criteria

TBD

#### Dev Preview -> Tech Preview

#### Tech Preview -> GA

### Upgrade / Downgrade Strategy

This is a modification to the upgrade process, not something that can
be upgraded or downgraded on its own.

TBD, as the details depend on the eventual design.

### Version Skew Strategy

TBD, as the details depend on the eventual design.

We will need to deal both with skew within a single cluster and with
skew between the infra and tenant clusters.

### Operational Aspects of API Extensions

TBD

The only currently-proposed CRD is for Infra MCO to Tenant MCO
communication, and would not be used by any other components. A
possible shape for it is sketched below.
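As a strawman for that CRD, here is a sketch of what its Go types
might look like. Every name here (the `DPUClusterUpgrade` kind, the
phase values, the field names) is invented for illustration; the real
design is still an open question:

```go
// Hypothetical API sketch: none of these types exist today, and all
// names are placeholders for whatever the eventual design chooses.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// HostUpgradePhase tracks how far one physical host has progressed.
type HostUpgradePhase string

const (
	// HostPending: neither of the host's nodes has started upgrading.
	HostPending HostUpgradePhase = "Pending"
	// HostDraining: the infra and tenant nodes are being cordoned and
	// drained; the two MCOs can do this in parallel.
	HostDraining HostUpgradePhase = "Draining"
	// HostInfraUpdating: both nodes are drained and the infra (DPU)
	// node is rebooting, temporarily taking down the tenant node's
	// network.
	HostInfraUpdating HostUpgradePhase = "InfraUpdating"
	// HostTenantUpdating: the infra node is back up, so the tenant
	// node can now safely reboot and upgrade.
	HostTenantUpdating HostUpgradePhase = "TenantUpdating"
	// HostDone: both nodes are upgraded and uncordoned.
	HostDone HostUpgradePhase = "Done"
)

// HostUpgradeStatus is the per-host entry that both MCOs update and
// watch to monitor each other's progress.
type HostUpgradeStatus struct {
	// InfraNode and TenantNode name the two nodes backed by this
	// physical host.
	InfraNode  string           `json:"infraNode"`
	TenantNode string           `json:"tenantNode"`
	Phase      HostUpgradePhase `json:"phase"`
	// InfraDrained and TenantDrained let each MCO independently signal
	// that its half of the parallel drain step is complete.
	InfraDrained  bool `json:"infraDrained"`
	TenantDrained bool `json:"tenantDrained"`
}

// DPUClusterUpgradeSpec describes the upgrade both clusters agreed to.
type DPUClusterUpgradeSpec struct {
	// Version is the release both clusters are upgrading to.
	Version string `json:"version"`
}

// DPUClusterUpgradeStatus holds the shared per-host state.
type DPUClusterUpgradeStatus struct {
	// Hosts is ordered: the array order is the agreed-upon order in
	// which the physical hosts will be upgraded.
	Hosts []HostUpgradeStatus `json:"hosts"`
}

// DPUClusterUpgrade coordinates a single upgrade between the two
// clusters' MCOs.
type DPUClusterUpgrade struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   DPUClusterUpgradeSpec   `json:"spec"`
	Status DPUClusterUpgradeStatus `json:"status,omitempty"`
}
```

The intent would be that each MCO only advances the phases it owns:
for example, the Infra MCO moves a host from `Draining` to
`InfraUpdating` once both drained flags are set, and the Tenant MCO
waits for the Infra MCO to report the infra node upgraded before
rebooting the tenant node.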
#### Failure Modes

- The system might get confused and spuriously block upgrades that
  should be allowed.

- Communication failures might lead to upgrades failing without the
  tenant cluster being able to figure out why they failed.

- TBD

#### Support Procedures

TBD

## Implementation History

- Initial proposal: 2021-01-11

## Drawbacks

This makes the upgrade process more complicated, which risks rendering
clusters un-upgradeable without manual intervention.

However, without some form of synchronization, it is impossible to
have non-disruptive tenant cluster upgrades.

## Alternatives

The fundamental problem is that rebooting the DPU causes a network
outage on the tenant.

### Never Reboot the DPUs

This implies never upgrading OCP on the DPUs. I don't see how this
could work.

### Don't Have an Infra Cluster

If the DPUs were not all part of a single OCP cluster (for example, if
they were just "bare" RHCOS hosts, or if each was running Single-Node
OpenShift), then it might be simpler to synchronize the DPU upgrades
with the tenant upgrades, because each tenant could coordinate the
actions of its own DPU by itself.

The big problem with this is that, for security reasons, we don't want
the tenants to have any control over their DPUs. (For some future use
cases, the DPUs will be used to enforce security policies on their
tenants.)