---
title: SDN Live Migration
authors:
- "@pliurh"
reviewers:
- "@danwinship"
- "@trozet"
- "@dcbw"
approvers:
-
creation-date: 2022-03-04
last-updated: 2022-03-04
status: implementable
---

# SDN Live Migration

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)


## Summary

Migrate the CNI network provider of a running cluster from OpenShift SDN to
OVN Kubernetes without service interruption. During the migration, we will
partition the cluster into two sets of nodes controlled by different network
plugins, and utilize the hybrid overlay feature of OVN Kubernetes to connect
the networks of the two CNI plugins, so that pods on each side can still talk
to pods on the other side.

## Motivation

Some OpenShift users have very high requirements on service availability. The
current SDN migration solution, which causes a service interruption, is not
acceptable to them.

### Goals

- Migrate the cluster network provider from OpenShift SDN to OVN Kubernetes for
  an existing cluster.
- The migration is done in place, without requiring extra nodes.
- The impact of the migration on workloads shall be comparable to that of an OCP
  upgrade.
- The solution shall work at scale, e.g. in a large cluster with hundreds of
  nodes.
- The migration operation shall be able to be rolled back if needed.

### Non-Goals

- Support for migration to other network providers.
- The necessary GUI changes in OpenShift Cluster Manager.
- Live migration of the egress IP and firewall configuration.

## Pre-requisites

- This solution relies on the ovn-kubernetes hybrid overlay feature; the bug
  https://bugzilla.redhat.com/show_bug.cgi?id=2040779 needs to be fixed first.

## Proposal

The key problem of doing a live SDN migration is that we need to maintain the
connectivity of the cluster network during the migration, while pods are
attached to different networks. We propose to utilize the OVN Kubernetes hybrid
overlay feature to connect the networks owned by OpenShift SDN and OVN
Kubernetes.

- We will run different plugins on different nodes, but both plugins will know
how to reach pods owned by the other plugin, so all pods/services/etc remain
connected.
- During migration, CNO will take origin-plugin nodes one by one and convert them
to destination-plugin nodes, rebooting them in the process.
- The cluster network CIDR will not change. In fact, none of the node host
subnets will change.
- NetworkPolicy will work correctly throughout the migration.

### Limitations

- Multicast is not handled by this proposal.
- Migration from SDN Multitenant mode is not handled by this proposal.

### User Stories

The service delivery (SD) team (managed OpenShift services ARO, OSD, ROSA) has a
unique set of requirements around downtime, node reboots, and a high degree of
automation. Specifically, SD needs a way to migrate its managed fleet that is no
more impactful to customers' workloads than an OCP upgrade, and that can be done
at scale, in a safe, automated way that can be made self-service and does not
require SD to negotiate maintenance windows with customers. The current
migration solution needs to be revisited to support these (relatively) more
stringent requirements.

### Risks and Mitigations

## Design Details

The existing ovn-kubernetes hybrid overlay feature was developed for hybrid
Windows/Linux clusters. Each ovn-kubernetes node manages an external-to-OVN OVS
bridge, named br-ext, which acts as the VXLAN source and endpoint for packets
moving between pods on the node and their cluster-external destinations. The
br-ext SDN switch acts as a transparent gateway and routes traffic towards
Windows nodes.

In the SDN live migration use case, we can enhance this feature to connect the
nodes managed by different CNI plugins. To minimize the implementation effort
and keep the code maintainable, we will try to reuse the whole hybrid overlay
implementation and only make the necessary changes to both CNI plugins.

On the OVN Kubernetes side, all the cross-CNI traffic shall follow the same path
as the current hybrid overlay implementation. For OVN Kubernetes, we need the
following enhancements (see the sketch after this list):

1. Instead of using a hardcoded VNID, we shall make it configurable. OpenShift
   SDN uses 0 as a privileged VNID which can talk to pods in all namespaces, so
   we shall let OVN Kubernetes use VNID 0 to connect to OpenShift SDN.
2. We need to modify OVN-K to allow overlapping between the cluster network and
   the hybrid overlay CIDR, so that we can reuse the cluster network in the
   migration.
3. We need to modify OVN-K to allow modifying the hybrid overlay CIDR on the
   fly.
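
As a minimal sketch, assuming the VNID becomes a command-line option, the
ovnkube-node arguments in migration mode might look like the following. The
`--enable-hybrid-overlay` and `--hybrid-overlay-cluster-subnets` options already
exist for the Windows use case; the VNID option name
(`--hybrid-overlay-vxlan-vnid`) is hypothetical and only illustrates point 1.

```
# Sketch of ovnkube-node container args in "migration mode"; the VNID option
# name is hypothetical.
containers:
- name: ovnkube-node
  args:
  - --enable-hybrid-overlay
  # The hybrid overlay CIDR is allowed to overlap the cluster network
  # (10.128.0.0/14 is a common OpenShift default), per point 2 above.
  - --hybrid-overlay-cluster-subnets=10.128.0.0/14
  # Hypothetical new option: use the privileged OpenShift SDN VNID 0 instead of
  # the hardcoded hybrid overlay VNID, per point 1 above.
  - --hybrid-overlay-vxlan-vnid=0
```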

On the OpenShift SDN side, when a node is converted to OVN Kubernetes, the
change shall be almost transparent to the control plane. But we need to
introduce a 'migration mode' for OpenShift SDN that will:

1. Change ingress NetworkPolicy processing to be based entirely on pod IPs
   rather than using namespace VNIDs, since packets from OVN nodes will have
   VNID 0 set.
2. To be compatible with the Windows node VXLAN implementation, the OVN-K hybrid
   overlay uses the host interface MAC as the VXLAN inner MAC. When such packets
   arrive at br0 on an SDN node, they cannot be forwarded to the pod interface
   due to the MAC mismatch, so we need to add flows for each pod that swap the
   destination MAC to the pod interface MAC.

### The Traffic Path

#### Packets going from OpenShift SDN to OVN Kubernetes

On the SDN side, a node doesn't need to know whether the peer node is an SDN
node or an OVN node. We reuse the existing VXLAN tunnel rules on the SDN side:
- Egress NetworkPolicy rules and service proxying happen as normal.
- When the packet reaches table 90, it will hit a “send via vxlan” rule that was
  generated based on a HostSubnet object.

On the OVN side:
- OVN accepts the packet via the VXLAN tunnel, ignores the VNID set by SDN, and
  then just routes it normally.
- Ingress NetworkPolicy processing will happen when the packet reaches the
  destination pod’s switch port, just like normal.
- Our NetworkPolicy rules are all based on IP addresses, not “logical input
  port”, etc., so it doesn’t matter that the packets came from outside OVN and
  have no useful OVN metadata.

#### Packets going from OVN Kubernetes to OpenShift SDN

On the OVN side:
- The packet follows the same path as hybrid overlay traffic; it just has to get
  routed out the VXLAN tunnel with VNID 0.

On the SDN side:
- We have to change ingress NetworkPolicy processing to be based entirely on pod
  IPs rather than using namespace VNIDs, since packets from OVN nodes won’t have
  the namespace VNID set (they arrive with VNID 0). There is already code to
  generate the rules that way, because egress NetworkPolicy already works that
  way.

### The Migration Process

#### Migration Setup

1. The admin kicks off the migration process by editing the CNO configuration
(see the API section below).

2. CNO will label each existing node with, e.g.:

```
migration.network.openshift.io/plugin: OpenShiftSDN
k8s.ovn.org/hybrid-overlay-node-subnet: 10.129.0.0/23
```
(CNO knows how to figure out the host subnet by looking at openshift-sdn
HostSubnet objects. When migrating in the reverse direction, it knows to look
at the ovn-kubernetes annotations.)
3. CNO will redeploy the openshift-sdn DaemonSets, enabling “migration mode”,
and adding node affinity to the node DaemonSet:
```
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: migration.network.openshift.io/plugin
          operator: In
          values: ["OpenShiftSDN"]
```
so that openshift-sdn will now only run on nodes labeled
“migration.network.openshift.io/plugin: OpenShiftSDN” (which is currently all
of them).
4. CNO will then deploy the ovnkube-master and ovnkube-node DaemonSets, also
enabling “hybrid overlay mode” in their config, and adding equivalent node
affinity to the ovnkube-node DaemonSet (to only run on nodes labeled
“migration.network.openshift.io/plugin: OVNKubernetes”; see the sketch at the
end of this section). Since there are not currently any nodes labeled that way,
this means no actual ovnkube-node pods will be created at this step.

At this point, neither plugin is willing to run on a node that has no
“migration.network.openshift.io/plugin” label at all. This means that if a new
node gets added to the cluster in the middle of the migration process, neither
plugin will get deployed to it until after the migration of the older nodes is
complete and the node affinity is removed.
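
For completeness, the node affinity added to the ovnkube-node DaemonSet would be
the mirror image of the openshift-sdn one above, selecting the other label value
(sketch only):

```
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: migration.network.openshift.io/plugin
          operator: In
          values: ["OVNKubernetes"]
```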

#### Migration

In order to run ovn-kubernetes in shared gateway mode, we need MCO to update
ovs-configuration.service. To apply the MachineConfig update, a reboot will be
triggered by MCO. To coordinate the draining and rebooting between CNO and MCO,
we need to modify MCO as follows (the resulting node annotations are illustrated
below):

- Add a `machineconfiguration.openshift.io/rebootPausedBy: network-operator`
  node annotation to MCO, so that CNO can block MCO from rebooting the node.
- Add a new value, `RebootPending`, to the existing
  `machineconfiguration.openshift.io/state` node annotation. It will be set by
  MCD if the reboot is blocked, and indicates that draining the node is done and
  the reboot is paused.
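
Put together, a node that has been drained but whose reboot is currently blocked
by CNO would carry annotations along these lines (the node name is illustrative):

```
apiVersion: v1
kind: Node
metadata:
  name: worker-0
  annotations:
    # Set by CNO to block MCO from rebooting the node.
    machineconfiguration.openshift.io/rebootPausedBy: network-operator
    # Set by MCD once draining is done and the reboot is paused.
    machineconfiguration.openshift.io/state: RebootPending
```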

When we kick off the migration from CNO, CNO will update the
`status.networkType` field of the `network.config` CR. This will trigger MCO to
apply a new MachineConfig to each node:

1. MCO will:
   - re-render the MachineConfig for each node
   - cordon and drain the nodes, one node at a time
2. When CNO finds that a node is being drained by MCO, by watching the MCO node
   annotations, it will:
   - pause the node reboot by annotating the node with
     `machineconfiguration.openshift.io/rebootPausedBy: network-operator`
3. MCO will set the node annotation `machineconfiguration.openshift.io/state:
   RebootPending` when draining the node is completed.
4. When CNO finds that the node has been drained by MCO, it will (the resulting
   node metadata is sketched after this list):
   - annotate the node with the ovnkube annotation
     `k8s.ovn.org/node-subnets: '{"default":"10.131.0.0/23"}'`
   - remove the "migration.network.openshift.io/plugin" label from the node and
     wait for the sdn-node DaemonSet to decrement its NumberReady
   - label the node with "migration.network.openshift.io/plugin: OVNKubernetes"
   - remove the `rebootPausedBy` node annotation
5. MCO will reboot the node.
6. After booting up, br-ex will be created and ovnkube-node will run in hybrid
   overlay mode on the node.
7. MCO will uncordon the node. Pods will be created on the node using
   ovn-kubernetes as the default CNI plugin.

The above process is repeated for each node, until all the nodes have had the
new MachineConfig applied and have been converted to OVN Kubernetes.
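
As a sketch of step 4, just before MCO reboots it, a converted node would carry
labels and annotations roughly like the following (node name and subnet are
illustrative):

```
apiVersion: v1
kind: Node
metadata:
  name: worker-1
  labels:
    # Switched by CNO from OpenShiftSDN to OVNKubernetes in step 4.
    migration.network.openshift.io/plugin: OVNKubernetes
  annotations:
    # The node keeps its original host subnet; only the owning plugin changes.
    k8s.ovn.org/node-subnets: '{"default":"10.131.0.0/23"}'
```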

#### Migration Cleanup

Once migration is complete, CNO will:

- delete the openshift-sdn DaemonSets
- redeploy ovn-kubernetes in “normal” mode (no migration mode config, no node
  affinity)
- remove the migration-related labels from the nodes

### API

To start the migration, users need to update the `network.operator` CR by
adding:

```
{
  "spec": {
    "migration": {
      "networkType": "OVNKubernetes",
      "type": "live"
    }
  }
}
```
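
Expressed as the full `network.operator` CR in YAML, the same change would look
roughly like this; the `type: live` field is the new field proposed by this
enhancement:

```
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  migration:
    networkType: OVNKubernetes
    # New field proposed by this enhancement to request a live migration.
    type: live
```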

On removal of the `spec.migration` field, CNO will start the migration cleanup.
CNO will also report the migration state under the `status` field of the
`network.operator` CR:

```
{
  "status": {
    "migration": {
      // The state can be 'Setup', 'Working', 'Done' or 'Error'.
      "state": "Working",
      // The reason needs to be filled in when the state is 'Error'.
      "reason": ""
    }
  }
}
```

### Lifecycle Management

This is a one-time operation for a cluster, therefore no lifecycle management.

### Test Plan

TBD

### Graduation Criteria

Graduation criteria follows:

#### Dev Preview -> Tech Preview

- End user documentation, relative API stability
- Sufficient test coverage
- Gather feedback from users rather than just developers

#### Tech Preview -> GA

- More testing (upgrade, scale)
- Add CI job
- Sufficient time for feedback

#### Removing a deprecated feature

N/A

### Upgrade / Downgrade Strategy

This is a one-time operation for a cluster, therefore no upgrade / downgrade
strategy.

### Version Skew Strategy

N/A

### API Extensions

N/A

### Operational Aspects of API Extensions

N/A

#### Failure Modes

N/A

#### Support Procedures

N/A

## Implementation History

N/A

## Drawbacks

N/A

## Alternatives

Instead of switching the network provider of an existing cluster, we can spin up
a new cluster and then move the workload to it.

## Infrastructure Needed
