---
title: SDN Live Migration
authors:
- "@pliurh"
reviewers:
- "@danwinship"
- "@trozet"
- "@dcbw"
approvers:
-
creation-date: 2022-03-04
last-updated: 2022-03-04
status: implementable
---

# SDN Live Migration

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)


## Summary

Migrate the CNI network provider of a running cluster from OpenShift SDN to
OVN Kubernetes without service interruption. During the migration, we will
partition the cluster into two sets of nodes controlled by different network
plugins, and utilize the hybrid overlay feature of OVN Kubernetes to connect
the networks of the two CNI plugins, so that pods on each side can still talk
to pods on the other side.

## Motivation

Some OpenShift users have very high requirements on service availability. The
current SDN migration solution, which causes a service interruption, is not
acceptable to them.

### Goals

- Migrate the cluster network provider from OpenShift SDN to OVN Kubernetes for
  an existing cluster.
- The migration is done in place, without requiring extra nodes.
- The impact of the migration on workloads shall be comparable to that of an OCP
  upgrade.
- The solution shall work at scale, e.g. in a large cluster with hundreds of
  nodes.
- The migration operation shall be able to be rolled back if needed.

### Non-Goals

- Support for migration to other network providers.
- The necessary GUI changes in OpenShift Cluster Manager.
- Live migration of the egress IP and firewall configuration.

## Pre-requisites

- This solution relies on the ovn-kubernetes hybrid overlay feature; the bug
  https://bugzilla.redhat.com/show_bug.cgi?id=2040779 needs to be fixed first.

## Proposal

The key problem of doing a live SDN migration is that we need to maintain the
connectivity of the cluster network during the migration, while pods are
attached to different networks. We propose to utilize the OVN Kubernetes hybrid
overlay feature to connect the networks owned by OpenShift SDN and OVN
Kubernetes.

- We will run different plugins on different nodes, but both plugins will know
how to reach pods owned by the other plugin, so all pods/services/etc remain
connected.
- During migration, CNO will take origin-plugin nodes one by one and convert them
to destination-plugin nodes, rebooting them in the process.
- The cluster network CIDR will not change. In fact, none of the node host
subnets will change.
- NetworkPolicy will work correctly throughout the migration.

### Limitations

- Multicast is not handled by this proposal.
- Migration from SDN Multitenant mode is not handled by this proposal.

### User Stories

The service delivery (SD) team (managed OpenShift services ARO, OSD, ROSA) has a
unique set of requirements around downtime, node reboots, and a high degree of
automation. Specifically, SD needs a way to migrate its managed fleet that is no
more impactful to customers' workloads than an OCP upgrade, and that can be done
at scale, in a safe, automated way that can be made self-service and does not
require SD to negotiate maintenance windows with customers. The current
migration solution needs to be revisited to support these (relatively) more
stringent requirements.

### Risks and Mitigations

## Design Details

The existing ovn-kubernetes hybrid overlay feature was developed for hybrid
Windows/Linux clusters. Each ovn-kubernetes node manages an external-to-OVN OVS
bridge, named br-ext, which acts as the VXLAN source and endpoint for packets
moving between pods on the node and their cluster-external destinations. The
br-ext SDN switch acts as a transparent gateway and routes traffic towards
Windows nodes.

In the SDN live migration use case, we can enhance this feature to connect the
nodes managed by different CNI plugins. To minimize the implementation effort
and keep the code maintainable, we will try to reuse the whole hybrid overlay
implementation and only make the necessary changes to both CNI plugins.

On the OVN Kubernetes side, all the cross-CNI traffic shall follow the same path
as the current hybrid overlay implementation. For OVN Kubernetes, we need the
following enhancements (see the sketch after this list):

1. Instead of using a hardcoded VNID, we shall make it configurable. OpenShift
   SDN uses 0 as a privileged VNID which can talk to pods in all namespaces, so
   we shall let OVN Kubernetes use VNID 0 to connect to OpenShift SDN.
2. We need to modify OVN-K to allow overlapping between the cluster network and
   the hybrid overlay CIDR, so that we can reuse the cluster network in the
   migration.
3. We need to modify OVN-K to allow modifying the hybrid overlay CIDR on the
   fly.
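
As a minimal sketch, assuming the VNID becomes a command-line option, the
ovnkube-node arguments in migration mode might look like the following. The
`--enable-hybrid-overlay` and `--hybrid-overlay-cluster-subnets` options already
exist for the Windows use case; the VNID option name
(`--hybrid-overlay-vxlan-vnid`) is hypothetical and only illustrates point 1.

```
# Sketch of ovnkube-node container args in "migration mode"; the VNID option
# name is hypothetical.
containers:
- name: ovnkube-node
  args:
  - --enable-hybrid-overlay
  # The hybrid overlay CIDR is allowed to overlap the cluster network
  # (10.128.0.0/14 is a common OpenShift default), per point 2 above.
  - --hybrid-overlay-cluster-subnets=10.128.0.0/14
  # Hypothetical new option: use the privileged OpenShift SDN VNID 0 instead of
  # the hardcoded hybrid overlay VNID, per point 1 above.
  - --hybrid-overlay-vxlan-vnid=0
```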

On the OpenShift SDN side, when a node is converted to OVN Kubernetes, the
change shall be almost transparent to the control plane. But we need to
introduce a 'migration mode' for OpenShift SDN that will:

1. Change ingress NetworkPolicy processing to be based entirely on pod IPs
   rather than using namespace VNIDs, since packets from OVN nodes will have
   VNID 0 set.
2. To be compatible with the Windows node VXLAN implementation, the OVN-K hybrid
   overlay uses the host interface MAC as the VXLAN inner MAC. When such packets
   arrive at br0 on an SDN node, they cannot be forwarded to the pod interface
   due to the MAC mismatch, so we need to add flows for each pod that swap the
   destination MAC to the pod interface MAC.

### The Traffic Path

#### Packets going from OpenShift SDN to OVN Kubernetes

On the SDN side, a node doesn't need to know whether the peer node is an SDN
node or an OVN node. We reuse the existing VXLAN tunnel rules on the SDN side:
- Egress NetworkPolicy rules and service proxying happen as normal.
- When the packet reaches table 90, it will hit a “send via vxlan” rule that was
  generated based on a HostSubnet object.

On the OVN side:
- OVN accepts the packet via the VXLAN tunnel, ignores the VNID set by SDN, and
  then just routes it normally.
- Ingress NetworkPolicy processing will happen when the packet reaches the
  destination pod’s switch port, just like normal.
- Our NetworkPolicy rules are all based on IP addresses, not “logical input
  port”, etc., so it doesn’t matter that the packets came from outside OVN and
  have no useful OVN metadata.

#### Packets going from OVN Kubernetes to OpenShift SDN

On the OVN side:
- The packet follows the same path as hybrid overlay traffic; it just has to get
  routed out the VXLAN tunnel with VNID 0.

On the SDN side:
- We have to change ingress NetworkPolicy processing to be based entirely on pod
  IPs rather than using namespace VNIDs, since packets from OVN nodes won’t have
  the namespace VNID set (they arrive with VNID 0). There is already code to
  generate the rules that way, because egress NetworkPolicy already works that
  way.

### The Migration Process

#### Migration Setup

1. The admin kicks off the migration process by editing the CNO configuration
(see the API section below).

2. CNO will label each existing node with, e.g.:

```
migration.network.openshift.io/plugin: OpenShiftSDN
k8s.ovn.org/hybrid-overlay-node-subnet: 10.129.0.0/23
```
(CNO knows how to figure out the host subnet by looking at openshift-sdn
HostSubnet objects. When migrating in the reverse direction, it knows to look
at the ovn-kubernetes annotations.)
3. CNO will redeploy the openshift-sdn DaemonSets, enabling “migration mode”,
and adding node affinity to the node DaemonSet:
```
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: migration.network.openshift.io/plugin
          operator: In
          values: ["OpenShiftSDN"]
```
so that openshift-sdn will now only run on nodes labeled
“migration.network.openshift.io/plugin: OpenShiftSDN” (which is currently all
of them).
4. CNO will then deploy the ovnkube-master and ovnkube-node DaemonSets, also
enabling “hybrid overlay mode” in their config, and adding equivalent node
affinity to the ovnkube-node DaemonSet (to only run on nodes labeled
“migration.network.openshift.io/plugin: OVNKubernetes”; see the sketch at the
end of this section). Since there are not currently any nodes labeled that way,
this means no actual ovnkube-node pods will be created at this step.

At this point, neither plugin is willing to run on a node that has no
“migration.network.openshift.io/plugin” label at all. This means that if a new
node gets added to the cluster in the middle of the migration process, neither
plugin will get deployed to it until after the migration of the older nodes is
complete and the node affinity is removed.
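
For completeness, the node affinity added to the ovnkube-node DaemonSet would be
the mirror image of the openshift-sdn one above, selecting the other label value
(sketch only):

```
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: migration.network.openshift.io/plugin
          operator: In
          values: ["OVNKubernetes"]
```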

#### Migration

In order to run ovn-kubernetes in shared gateway mode, we need MCO to update
ovs-configuration.service. To apply the MachineConfig update, a reboot will be
triggered by MCO. To coordinate the draining and rebooting between CNO and MCO,
we need to modify MCO as follows (the resulting node annotations are illustrated
below):

- Add a `machineconfiguration.openshift.io/rebootPausedBy: network-operator`
  node annotation to MCO, so that CNO can block MCO from rebooting the node.
- Add a new value, `RebootPending`, to the existing
  `machineconfiguration.openshift.io/state` node annotation. It will be set by
  MCD if the reboot is blocked, and indicates that draining the node is done and
  the reboot is paused.
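
Put together, a node that has been drained but whose reboot is currently blocked
by CNO would carry annotations along these lines (the node name is illustrative):

```
apiVersion: v1
kind: Node
metadata:
  name: worker-0
  annotations:
    # Set by CNO to block MCO from rebooting the node.
    machineconfiguration.openshift.io/rebootPausedBy: network-operator
    # Set by MCD once draining is done and the reboot is paused.
    machineconfiguration.openshift.io/state: RebootPending
```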

When we kick off the migration from CNO, CNO will update the
`status.networkType` field of the `network.config` CR. This will trigger MCO to
apply a new MachineConfig to each node:

1. MCO will:
   - re-render the MachineConfig for each node
   - cordon and drain the nodes, one node at a time
2. When CNO finds that a node is being drained by MCO, by watching the MCO node
   annotations, it will:
   - pause the node reboot by annotating the node with
     `machineconfiguration.openshift.io/rebootPausedBy: network-operator`
3. MCO will set the node annotation `machineconfiguration.openshift.io/state:
   RebootPending` when draining the node is completed.
4. When CNO finds that the node has been drained by MCO, it will (the resulting
   node metadata is sketched after this list):
   - annotate the node with the ovnkube annotation
     `k8s.ovn.org/node-subnets: '{"default":"10.131.0.0/23"}'`
   - remove the "migration.network.openshift.io/plugin" label from the node and
     wait for the sdn-node DaemonSet to decrement its NumberReady
   - label the node with "migration.network.openshift.io/plugin: OVNKubernetes"
   - remove the `rebootPausedBy` node annotation
5. MCO will reboot the node.
6. After booting up, br-ex will be created and ovnkube-node will run in hybrid
   overlay mode on the node.
7. MCO will uncordon the node. Pods will be created on the node using
   ovn-kubernetes as the default CNI plugin.

The above process is repeated for each node, until all the nodes have had the
new MachineConfig applied and have been converted to OVN Kubernetes.
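
As a sketch of step 4, just before MCO reboots it, a converted node would carry
labels and annotations roughly like the following (node name and subnet are
illustrative):

```
apiVersion: v1
kind: Node
metadata:
  name: worker-1
  labels:
    # Switched by CNO from OpenShiftSDN to OVNKubernetes in step 4.
    migration.network.openshift.io/plugin: OVNKubernetes
  annotations:
    # The node keeps its original host subnet; only the owning plugin changes.
    k8s.ovn.org/node-subnets: '{"default":"10.131.0.0/23"}'
```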

#### Migration Cleanup

Once migration is complete, CNO will:

- delete the openshift-sdn DaemonSets
- redeploy ovn-kubernetes in “normal” mode (no migration mode config, no node
  affinity)
- remove the migration-related labels from the nodes

### API

To start the migration, users need to update the `network.operator` CR by
adding:

```
{
  "spec": {
    "migration": {
      "networkType": "OVNKubernetes",
      "type": "live"
    }
  }
}
```
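
Expressed as the full `network.operator` CR in YAML, the same change would look
roughly like this; the `type: live` field is the new field proposed by this
enhancement:

```
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  migration:
    networkType: OVNKubernetes
    # New field proposed by this enhancement to request a live migration.
    type: live
```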

On removal of the `spec.migration` field, CNO will start the migration cleanup.
CNO will also report the migration state under the `status` field of the
`network.operator` CR:

```
{
  "status": {
    "migration": {
      // The state can be 'Setup', 'Working', 'Done' or 'Error'.
      "state": "Working",
      // The reason needs to be filled in when the state is 'Error'.
      "reason": ""
    }
  }
}
```

### Lifecycle Management

This is a one-time operation for a cluster, therefore no lifecycle management.

### Test Plan

TBD

### Graduation Criteria

Graduation criteria follows:

#### Dev Preview -> Tech Preview

- End user documentation, relative API stability
- Sufficient test coverage
- Gather feedback from users rather than just developers

#### Tech Preview -> GA

- More testing (upgrade, scale)
- Add CI job
- Sufficient time for feedback

#### Removing a deprecated feature

N/A

### Upgrade / Downgrade Strategy

This is a one-time operation for a cluster, therefore no upgrade / downgrade
strategy.

### Version Skew Strategy

N/A

### API Extensions

N/A

### Operational Aspects of API Extensions

N/A

#### Failure Modes

N/A

#### Support Procedures

N/A

## Implementation History

N/A

## Drawbacks

N/A

## Alternatives

Instead of switching the network provider of an existing cluster, we can spin up
a new cluster and then move the workload to it.

## Infrastructure Needed
