---
title: Synchronized Upgrades Between Clusters
authors:
- "@danwinship"
reviewers:
- TBD
approvers:
- TBD
api-approvers: # in case of new or modified APIs or API extensions (CRDs, aggregated apiservers, webhooks, finalizers)
- TBD
creation-date: 2022-01-11
last-updated: 2022-01-11
tracking-link:
- https://issues.redhat.com/browse/SDN-2603
see-also:
- "/enhancements/network/dpu/overview.md"
---

# Synchronized Upgrades Between Clusters

## Summary

In a [cluster with DPUs](../network/dpu/overview.md) (eg, BlueField-2
NICs), the x86 hosts form one OCP cluster, and the DPU ARM systems
form a second OCP cluster. This makes upgrades to new OCP releases
complicated: there is currently no way to synchronize upgrades
between the two clusters, but rebooting the BF-2 systems as part of
the MCO upgrade causes a network outage on the x86 systems. For
upgrades to work smoothly, we need to synchronize the reboots between
the two clusters, so that the BF-2 systems are only rebooted once
their corresponding x86 hosts have been cordoned and drained.

Please refer to the "Glossary" section of the [DPU Overview
Enhancement](../network/dpu/overview.md) for definitions of the terms
used in this document.

## Motivation

### Goals

- Make upgrades work smoothly in clusters running with DPU support, by
synchronizing the reboots of nodes between the infra cluster and the
tenant cluster.

### Non-Goals

- Supporting synchronized upgrades of more than 2 clusters at once.

## Proposal

### User Stories

As the administrator of a cluster using DPUs, I want to be able to do
z-stream upgrades without causing unnecessary network outages.

### API Extensions

TBD

### Implementation Details/Notes/Constraints [optional]

TBD

### Risks and Mitigations

TBD

## Design Details

### Open Questions

Basically everything...

The general idea is:

- We can set some things up at install time (eg, creating credentials
to allow certain operators in the two clusters to talk to each
other).

- As part of the DPU security model, the tenant cluster cannot have
any power over the infra cluster. (In particular, it can't be
possible for an administrator in the tenant cluster to force the
infra cluster to upgrade/downgrade to any particular version.) Thus,
the upgrade must be initiated on the infra cluster side, and the
infra side will tell the tenant cluster to upgrade as well.
(Alternatively, the upgrade could be initiated more-or-less
simultaneously in both clusters, if we don't want the infra cluster
to hold a credential that lets it initiate an upgrade in the tenant
cluster.)

- An upgrade should not be able to start unless both clusters are able
to upgrade.

- In particular:

- There can be no `Upgradeable: False` operators in either cluster.

- The version to upgrade to must be available to both clusters
(ie, it must be available for both x86 and ARM).

- This could be implemented via some sort of "dpu-cluster-upgrade"
operator running in both clusters, where the two operators
communicate with each other and set their `Upgradeable` state to
reflect the state of the other cluster. If the
"dpu-cluster-upgrade" operator were placed before every other
operator in upgrade priority, then it could also block disallowed
upgrades by failing its own upgrade (eg, if an admin tries to
upgrade one cluster without the other, or tries to upgrade the two
clusters to different versions).

- (Or should it be possible to do z-stream upgrades of the tenant
cluster without bothering to upgrade the infra cluster too?)

- The two clusters upgrade all of the operators up to MCO in parallel.

- Whichever cluster reaches the MCO upgrade first needs to wait for
the other cluster to get there before proceeding. The two MCOs then
need to coordinate to complete the upgrade: First, they have to
agree on what order the physical hosts will be upgraded in. Second,
for each physical host, they have to properly synchronize the
upgrades of its infra node and its tenant node.

- More specifically, for each physical host, in some order:

- The Infra MCO will cordon and drain that host's infra node, and
the Tenant MCO will cordon and drain that host's tenant node.
(This can happen in parallel.)

- The Infra MCO will then upgrade the infra node (causing it to
reboot and temporarily break network connectivity to the tenant
node).

- Once the infra node upgrade completes, the Tenant MCO will
reboot and upgrade the tenant node.

- (This seems like it will absolutely require MCO changes.)

- One way to do this would be to have a CRD with an array of hosts,
indicating the ordering, and the current status of each host, and
the two MCOs could update and watch for updates to monitor each
other's progress.
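
The "both clusters must be able to upgrade" precondition above could be
checked with logic along these lines. This is only a sketch: the
`ClusterReadiness` type, field names, and `canStartUpgrade` function are
invented for illustration and are not a proposed API.

```go
package main

import "fmt"

// ClusterReadiness is a hypothetical summary of one cluster's upgrade
// state, as the proposed "dpu-cluster-upgrade" operator might gather it.
type ClusterReadiness struct {
	// Operators currently reporting Upgradeable: False.
	UpgradeableFalse []string
	// Release versions available to this cluster's architecture.
	AvailableVersions map[string]bool
}

// canStartUpgrade checks the preconditions described above: no
// Upgradeable: False operators in either cluster, and the target version
// available to both clusters (ie, published for both x86 and ARM).
func canStartUpgrade(infra, tenant ClusterReadiness, version string) (bool, string) {
	for name, c := range map[string]ClusterReadiness{"infra": infra, "tenant": tenant} {
		if len(c.UpgradeableFalse) > 0 {
			return false, fmt.Sprintf("%s cluster is not upgradeable: %v", name, c.UpgradeableFalse)
		}
		if !c.AvailableVersions[version] {
			return false, fmt.Sprintf("version %s is not available to the %s cluster", version, name)
		}
	}
	return true, ""
}

func main() {
	infra := ClusterReadiness{AvailableVersions: map[string]bool{"4.11.5": true}}
	tenant := ClusterReadiness{AvailableVersions: map[string]bool{"4.11.5": false}}
	ok, reason := canStartUpgrade(infra, tenant, "4.11.5")
	fmt.Println(ok, reason)
}
```

In the real operator this state would presumably be exchanged over the
cross-cluster credentials set up at install time, with each side
reflecting the result into its own `Upgradeable` condition.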

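The per-host ordering and status from the last bullet could be modeled
roughly as follows. Again, every name here (the phases, the `HostStatus`
entry in the hypothetical CRD's host array, `nextPhase`) is an
assumption made up for this sketch, not a finalized design.

```go
package main

import "fmt"

// HostUpgradePhase tracks one physical host through the coordinated
// reboot sequence described above.
type HostUpgradePhase string

const (
	PhasePending         HostUpgradePhase = "Pending"
	PhaseDraining        HostUpgradePhase = "Draining"        // both MCOs cordon and drain in parallel
	PhaseInfraUpgrading  HostUpgradePhase = "InfraUpgrading"  // infra node reboots and upgrades
	PhaseTenantUpgrading HostUpgradePhase = "TenantUpgrading" // then the tenant node does
	PhaseDone            HostUpgradePhase = "Done"
)

// HostStatus is one entry in the hypothetical CRD's host array; the
// array index defines the order in which physical hosts are upgraded,
// and both MCOs watch the Phase field to track each other's progress.
type HostStatus struct {
	Name  string           `json:"name"`
	Phase HostUpgradePhase `json:"phase"`
}

// nextPhase encodes the ordering from the proposal: drain both nodes
// first, then upgrade the infra node, then the tenant node.
func nextPhase(p HostUpgradePhase) HostUpgradePhase {
	switch p {
	case PhasePending:
		return PhaseDraining
	case PhaseDraining:
		return PhaseInfraUpgrading
	case PhaseInfraUpgrading:
		return PhaseTenantUpgrading
	default:
		return PhaseDone
	}
}

func main() {
	hosts := []HostStatus{
		{Name: "host-0", Phase: PhasePending},
		{Name: "host-1", Phase: PhasePending},
	}
	// Upgrade hosts strictly in array order, one host at a time.
	for i := range hosts {
		for hosts[i].Phase != PhaseDone {
			hosts[i].Phase = nextPhase(hosts[i].Phase)
			fmt.Printf("%s: %s\n", hosts[i].Name, hosts[i].Phase)
		}
	}
}
```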

### Test Plan

TBD

### Graduation Criteria

TBD

#### Dev Preview -> Tech Preview

#### Tech Preview -> GA

### Upgrade / Downgrade Strategy

This is a modification to the upgrade process, not something that can
be upgraded or downgraded on its own.

TBD, as the details depend on the eventual design.

### Version Skew Strategy

TBD, as the details depend on the eventual design.

We will need to deal with skew both within a single cluster, as well
as skew between the infra and tenant clusters.

### Operational Aspects of API Extensions

TBD

The only currently-proposed CRD is for Infra MCO to Tenant MCO
communication, and would not be used by any other components.

#### Failure Modes

- The system might get confused and spuriously block upgrades that
should be allowed.

- Communications failures might lead to upgrades failing without the
tenant cluster being able to figure out why they failed.

- TBD

#### Support Procedures

TBD

## Implementation History

- Initial proposal: 2022-01-11

## Drawbacks

This makes the upgrade process more complicated, which risks rendering
clusters un-upgradeable without manual intervention.

However, without some form of synchronization, it is impossible to
have non-disruptive tenant cluster upgrades.

## Alternatives

The fundamental problem is that rebooting the DPU causes a network
outage on the tenant.

### Never Reboot the DPUs

This implies never upgrading OCP on the DPUs. I don't see how this
could work.

### Don't Have an Infra Cluster

If the DPUs were not all part of a single OCP cluster (for example,
they were just "bare" RHCOS hosts, or they were each running
Single-Node OpenShift), then it might be simpler to synchronize the
DPU upgrades with the tenant upgrades, because then each tenant could
coordinate the actions of its own DPU by itself.

The big problem with this is that, for security reasons, we don't want
the tenants to have any control over their DPUs. (For some future use
cases, the DPUs will be used to enforce security policies on their
tenants.)
