---
title: Synchronized Upgrades Between Clusters
authors:
- "@danwinship"
reviewers:
- TBD
approvers:
- TBD
api-approvers: # in case of new or modified APIs or API extensions (CRDs, aggregated apiservers, webhooks, finalizers)
- TBD
creation-date: 2022-01-11
last-updated: 2022-01-11
tracking-link:
- https://issues.redhat.com/browse/SDN-2603
see-also:
- "/enhancements/network/dpu/overview.md"
---

# Synchronized Upgrades Between Clusters

## Summary

In a [cluster with DPUs](../network/dpu/overview.md) (eg, BlueField-2
NICs), the x86 hosts form one OCP cluster, and the DPU ARM systems
form a second OCP cluster. This makes upgrades to new OCP releases
complicated: there is currently no way to synchronize upgrades
between the two clusters, but rebooting the BF-2 systems as part of
the MCO upgrade causes a network outage on the x86 systems. For
upgrades to work smoothly, we need to synchronize the reboots between
the two clusters, so that the BF-2 systems are only rebooted once
their corresponding x86 hosts have been cordoned and drained.

Please refer to the "Glossary" section of the [DPU Overview
Enhancement](../network/dpu/overview.md) for definitions of the terms
used in this document.

## Motivation

### Goals

- Make upgrades work smoothly in clusters running with DPU support, by
synchronizing the reboots of nodes between the infra cluster and the
tenant cluster.

### Non-Goals

- Supporting synchronized upgrades of more than 2 clusters at once.

## Proposal

### User Stories

As the administrator of a cluster using DPUs, I want to be able to do
z-stream upgrades without causing unnecessary network outages.

### API Extensions

TBD

### Implementation Details/Notes/Constraints [optional]

TBD

### Risks and Mitigations

TBD

## Design Details

### Open Questions

Basically everything...

The general idea is:

- We can set some things up at install time (eg, creating credentials
to allow certain operators in the two clusters to talk to each
other).

- As part of the DPU security model, the tenant cluster cannot have
any power over the infra cluster. (In particular, it can't be
possible for an administrator in the tenant cluster to force the
infra cluster to upgrade/downgrade to any particular version.) Thus,
the upgrade must be initiated on the infra cluster side, and the
infra side will tell the tenant cluster to upgrade as well.
(Alternatively, the upgrade could be initiated more-or-less
simultaneously in both clusters, if we don't want the infra cluster
to hold a credential that lets it initiate an upgrade in the tenant
cluster.)

- An upgrade should not be able to start unless both clusters are able
to upgrade.

- In particular:

- There can be no `Upgradeable: False` operators in either cluster.

- The version to upgrade to must be available to both clusters
(ie, it must be available for both x86 and ARM).

- This could be implemented via some sort of "dpu-cluster-upgrade"
operator running in both clusters, where the two operators
communicate with each other and set their `Upgradeable` state to
reflect the state of the other cluster. If the
"dpu-cluster-upgrade" operator were placed before every other
operator in upgrade priority, then it could also block disallowed
upgrades by failing its own upgrade (eg, if an admin tries to
upgrade one cluster without the other, or tries to upgrade the two
clusters to different versions).

- (Or should it be possible to do z-stream upgrades of the tenant
cluster without bothering to upgrade the infra cluster too?)

- The two clusters upgrade all of the operators up to MCO in parallel.

- Whichever cluster reaches the MCO upgrade first needs to wait for
the other cluster to get there before proceeding. The two MCOs then
need to coordinate to complete the upgrade: First, they have to
agree on what order the physical hosts will be upgraded in. Second,
for each physical host, they have to properly synchronize the
upgrades of its infra node and its tenant node.

- More specifically, for each physical host, in some order:

- The Infra MCO will cordon and drain that host's infra node, and
the Tenant MCO will cordon and drain that host's tenant node.
(This can happen in parallel.)

- The Infra MCO will then upgrade the infra node (causing it to
reboot and temporarily break network connectivity to the tenant
node).

- Once the infra node upgrade completes, the Tenant MCO will
reboot and upgrade the tenant node.

- (This seems like it will absolutely require MCO changes.)

- One way to do this would be to have a CRD with an array of hosts,
indicating the ordering, and the current status of each host, and
the two MCOs could update and watch for updates to monitor each
other's progress.
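
The "both clusters must be able to upgrade" precondition above could be
checked with logic along these lines. This is only a sketch: the
`ClusterReadiness` type, field names, and `canStartUpgrade` function are
invented for illustration and are not a proposed API.

```go
package main

import "fmt"

// ClusterReadiness is a hypothetical summary of one cluster's upgrade
// state, as the proposed "dpu-cluster-upgrade" operator might gather it.
type ClusterReadiness struct {
	// Operators currently reporting Upgradeable: False.
	UpgradeableFalse []string
	// Release versions available to this cluster's architecture.
	AvailableVersions map[string]bool
}

// canStartUpgrade checks the preconditions described above: no
// Upgradeable: False operators in either cluster, and the target version
// available to both clusters (ie, published for both x86 and ARM).
func canStartUpgrade(infra, tenant ClusterReadiness, version string) (bool, string) {
	for name, c := range map[string]ClusterReadiness{"infra": infra, "tenant": tenant} {
		if len(c.UpgradeableFalse) > 0 {
			return false, fmt.Sprintf("%s cluster is not upgradeable: %v", name, c.UpgradeableFalse)
		}
		if !c.AvailableVersions[version] {
			return false, fmt.Sprintf("version %s is not available to the %s cluster", version, name)
		}
	}
	return true, ""
}

func main() {
	infra := ClusterReadiness{AvailableVersions: map[string]bool{"4.11.5": true}}
	tenant := ClusterReadiness{AvailableVersions: map[string]bool{"4.11.5": false}}
	ok, reason := canStartUpgrade(infra, tenant, "4.11.5")
	fmt.Println(ok, reason)
}
```

In the real operator this state would presumably be exchanged over the
cross-cluster credentials set up at install time, with each side
reflecting the result into its own `Upgradeable` condition.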

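The per-host ordering and status from the last bullet could be modeled
roughly as follows. Again, every name here (the phases, the `HostStatus`
entry in the hypothetical CRD's host array, `nextPhase`) is an
assumption made up for this sketch, not a finalized design.

```go
package main

import "fmt"

// HostUpgradePhase tracks one physical host through the coordinated
// reboot sequence described above.
type HostUpgradePhase string

const (
	PhasePending         HostUpgradePhase = "Pending"
	PhaseDraining        HostUpgradePhase = "Draining"        // both MCOs cordon and drain in parallel
	PhaseInfraUpgrading  HostUpgradePhase = "InfraUpgrading"  // infra node reboots and upgrades
	PhaseTenantUpgrading HostUpgradePhase = "TenantUpgrading" // then the tenant node does
	PhaseDone            HostUpgradePhase = "Done"
)

// HostStatus is one entry in the hypothetical CRD's host array; the
// array index defines the order in which physical hosts are upgraded,
// and both MCOs watch the Phase field to track each other's progress.
type HostStatus struct {
	Name  string           `json:"name"`
	Phase HostUpgradePhase `json:"phase"`
}

// nextPhase encodes the ordering from the proposal: drain both nodes
// first, then upgrade the infra node, then the tenant node.
func nextPhase(p HostUpgradePhase) HostUpgradePhase {
	switch p {
	case PhasePending:
		return PhaseDraining
	case PhaseDraining:
		return PhaseInfraUpgrading
	case PhaseInfraUpgrading:
		return PhaseTenantUpgrading
	default:
		return PhaseDone
	}
}

func main() {
	hosts := []HostStatus{
		{Name: "host-0", Phase: PhasePending},
		{Name: "host-1", Phase: PhasePending},
	}
	// Upgrade hosts strictly in array order, one host at a time.
	for i := range hosts {
		for hosts[i].Phase != PhaseDone {
			hosts[i].Phase = nextPhase(hosts[i].Phase)
			fmt.Printf("%s: %s\n", hosts[i].Name, hosts[i].Phase)
		}
	}
}
```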

### Test Plan

TBD

### Graduation Criteria

TBD

#### Dev Preview -> Tech Preview

#### Tech Preview -> GA

### Upgrade / Downgrade Strategy

This is a modification to the upgrade process, not something that can
be upgraded or downgraded on its own.

TBD, as the details depend on the eventual design.

### Version Skew Strategy

TBD, as the details depend on the eventual design.

We will need to deal with skew both within a single cluster, as well
as skew between the infra and tenant clusters.

### Operational Aspects of API Extensions

TBD

The only currently-proposed CRD is for Infra MCO to Tenant MCO
communication, and would not be used by any other components.

#### Failure Modes

- The system might get confused and spuriously block upgrades that
should be allowed.

- Communications failures might lead to upgrades failing without the
tenant cluster being able to figure out why they failed.

- TBD

#### Support Procedures

TBD

## Implementation History

- Initial proposal: 2022-01-11

## Drawbacks

This makes the upgrade process more complicated, which risks rendering
clusters un-upgradeable without manual intervention.

However, without some form of synchronization, it is impossible to
have non-disruptive tenant cluster upgrades.

## Alternatives

The fundamental problem is that rebooting the DPU causes a network
outage on the tenant.

### Never Reboot the DPUs

This implies never upgrading OCP on the DPUs. I don't see how this
could work.

### Don't Have an Infra Cluster

If the DPUs were not all part of a single OCP cluster (for example,
they were just "bare" RHCOS hosts, or they were each running
Single-Node OpenShift), then it might be simpler to synchronize the
DPU upgrades with the tenant upgrades, because then each tenant could
coordinate the actions of its own DPU by itself.

The big problem with this is that, for security reasons, we don't want
the tenants to have any control over their DPUs. (For some future use
cases, the DPUs will be used to enforce security policies on their
tenants.)
