---
title: SDN Live Migration
authors:
  - "@pliurh"
reviewers:
  - "@danwinship"
  - "@trozet"
  - "@dcbw"
approvers:
  -
creation-date: 2022-03-04
last-updated: 2022-03-04
status: implementable
---
# SDN Live Migration

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)
## Summary

Migrate the CNI network provider of a running cluster from OpenShift SDN to
OVN Kubernetes without service interruption. During the migration, we will
partition the cluster into two sets of nodes controlled by different network
plugins. We will utilize the hybrid overlay feature of OVN Kubernetes to
connect the networks of the two CNI network plugins, so that pods on each side
can still talk to pods on the other side.
## Motivation

Some OpenShift users have very high requirements on service availability. For
them, the current SDN migration solution, which causes a service interruption,
is not acceptable.
### Goals

- Migrate the cluster network provider from OpenShift SDN to OVN Kubernetes for
  an existing cluster.
- This is an in-place migration that does not require extra nodes.
- The impact of the migration on workloads shall be similar to that of an OCP
  upgrade.
- The solution shall work at scale, e.g. in a large cluster with hundreds of
  nodes.
- The migration operation shall be able to be rolled back if needed.
### Non-Goals

- Support for migration to other network providers.
- The necessary GUI change in OpenShift Cluster Manager.
- The live migration of the egress IP and firewall configuration.
## Pre-requisites

- This solution relies on the ovn-kubernetes hybrid overlay feature. The bug
  https://bugzilla.redhat.com/show_bug.cgi?id=2040779 needs to be fixed.
## Proposal

The key problem of doing a live SDN migration is that we need to maintain the
connectivity of the cluster network during the migration, while pods are
attached to different networks. We propose to utilize the OVN Kubernetes hybrid
overlay feature to connect the networks owned by OpenShift SDN and OVN
Kubernetes.

- We will run different plugins on different nodes, but both plugins will know
  how to reach pods owned by the other plugin, so all pods/services/etc. remain
  connected.
- During migration, CNO will take origin-plugin nodes one by one and convert
  them to destination-plugin nodes, rebooting them in the process.
- The cluster network CIDR will not change. In fact, none of the node host
  subnets will change.
- NetworkPolicy will work correctly throughout the migration.
### Limitations

- Multicast is not handled by this proposal.
- Migration from SDN Multitenant mode is not handled by this proposal.
### User Stories

The service delivery (SD) team (managed OpenShift services ARO, OSD, ROSA) has a
unique set of requirements around downtime, node reboots, and a high degree of
automation. Specifically, SD needs a way to migrate its managed fleet that is no
more impactful to the customer's workloads than an OCP upgrade, and that can be
done at scale, in a safe, automated way that can be made self-service and does
not require SD to negotiate maintenance windows with customers. The current
migration solution needs to be revisited to support these (relatively) more
stringent requirements.
### Risks and Mitigations

## Design Details
The existing ovn-kubernetes hybrid overlay feature was developed for hybrid
Windows/Linux clusters. Each ovn-kubernetes node manages an external-to-OVN OVS
bridge, named br-ext, which acts as the VXLAN source and endpoint for packets
moving between pods on the node and their cluster-external destinations. The
br-ext SDN switch acts as a transparent gateway and routes traffic towards
Windows nodes.

In the SDN live migration use case, we can enhance this feature to connect the
nodes managed by different CNI plugins. To minimize the implementation effort
and keep the code maintainable, we will try to reuse the whole hybrid overlay
feature and only make the necessary changes to both CNI plugins.

On the OVN Kubernetes side, all the cross-CNI traffic shall follow the same path
as the current hybrid overlay implementation. For OVN Kubernetes, we need to
make the following enhancements:

1. Instead of using a hardcoded VNID, we shall make it configurable. OpenShift
   SDN uses 0 as a privileged VNID which can talk to pods in all namespaces, so
   we shall let OVN Kubernetes use VNID 0 to connect to OpenShift SDN.
2. We need to modify OVN-K to allow overlapping between the cluster network and
   the hybrid overlay CIDR, so that we can reuse the cluster network in the
   migration.
3. We need to modify OVN-K to allow modifying the hybrid overlay CIDR on the
   fly.

On the OpenShift SDN side, when a node is converted to OVN Kubernetes, it shall
be almost transparent to the control plane. But we need to introduce a
'migration mode' for OpenShift SDN that will:

1. Change ingress NetworkPolicy processing to be based entirely on pod IPs
   rather than using namespace VNIDs, since packets from OVN nodes will have
   VNID 0 set.
2. To be compatible with the Windows node VXLAN implementation, the OVN-K hybrid
   overlay uses the host interface MAC as the VXLAN inner MAC. When packets
   arrive at br0 of the SDN node, they cannot be forwarded to the pod interface,
   due to the MAC mismatch. We need to add flows for each pod that swap the dst
   MAC to the pod interface MAC (a sketch of such a flow is shown below).
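
A minimal sketch of what such a MAC-rewriting flow could look like; the bridge
name br0 comes from openshift-sdn, but the table numbers, priority, pod IP, and
pod MAC below are illustrative placeholders, not the actual openshift-sdn table
layout:

```
# Illustrative only: rewrite the destination MAC of traffic arriving from an
# OVN node so that it matches the target pod's interface MAC, then continue
# with normal local delivery. All numeric values are placeholders.
ovs-ofctl -O OpenFlow13 add-flow br0 \
  "table=70, priority=300, ip, nw_dst=10.128.2.5, \
   actions=set_field:0a:58:0a:80:02:05->eth_dst, goto_table:80"
```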

### The Traffic Path

#### Packets going from OpenShift SDN to OVN Kubernetes
On the SDN side, a node doesn't need to know whether the peer node is an SDN
node or an OVN node; we reuse the existing VXLAN tunnel rules on the SDN side.

- Egress NetworkPolicy rules and service proxying happen as normal.
- When the packet reaches table 90, it will hit a "send via vxlan" rule that was
  generated based on a HostSubnet object (an illustrative rule is shown below).
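
For reference, the openshift-sdn "send via vxlan" rules in table 90 are roughly
of the following shape; the subnet, tunnel destination IP, and output port are
illustrative placeholders:

```
# Illustrative shape of an existing openshift-sdn table 90 rule: traffic for a
# remote node's host subnet is tagged with the namespace VNID (kept in REG0)
# and sent out the VXLAN tunnel towards that node. Values are placeholders.
table=90, priority=100, ip, nw_dst=10.129.0.0/23,
  actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],
          set_field:192.0.2.10->tun_dst,
          output:vxlan0
```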

On the OVN side:

- OVN accepts the packet via the VXLAN tunnel, ignores the VNID set by SDN, and
  then just routes it normally.
- Ingress NetworkPolicy processing will happen when the packet reaches the
  destination pod's switch port, just like normal.
- Our NetworkPolicy rules are all based on IP addresses, not "logical input
  port", etc., so it doesn't matter that the packets came from outside OVN and
  have no useful OVN metadata.
#### Packets going from OVN Kubernetes to OpenShift SDN

On the OVN side:

- The packet just follows the same path as the hybrid overlay; it has to get
  routed out the VXLAN tunnel with VNID 0.

On the SDN side:

- We have to change ingress NetworkPolicy processing to be based entirely on pod
  IPs rather than using namespace VNIDs, since packets from OVN nodes won't have
  the VNID set. There is already code to generate the rules that way though,
  because egress NetworkPolicy already works that way.
### The Migration Process

#### Migration Setup

1. The admin kicks off the migration process by editing the CNO configuration.

2. CNO will label each existing node with, e.g.:

   ```
   migration.network.openshift.io/plugin: OpenShiftSDN
   k8s.ovn.org/hybrid-overlay-node-subnet: 10.129.0.0/23
   ```

   (CNO knows how to figure out the host subnet by looking at openshift-sdn
   HostSubnet objects; an example HostSubnet object is shown after this list.
   When migrating in the reverse direction, it knows to look at the
   ovn-kubernetes annotations.)
3. CNO will redeploy the openshift-sdn DaemonSets, enabling "migration mode",
   and adding node affinity to the node DaemonSet:

   ```
   affinity:
     nodeAffinity:
       requiredDuringSchedulingIgnoredDuringExecution:
         nodeSelectorTerms:
         - matchExpressions:
           - key: migration.network.openshift.io/plugin
             operator: In
             values: ["OpenShiftSDN"]
   ```

   so that openshift-sdn will now only run on nodes labeled
   "migration.network.openshift.io/plugin: OpenShiftSDN" (which is currently all
   of them).
4. CNO will then deploy the ovnkube-master and ovnkube-node DaemonSets, also
   enabling "hybrid overlay mode" in their config, and adding equivalent node
   affinity to the ovnkube-node DaemonSet (to only run on nodes labeled
   "migration.network.openshift.io/plugin: OVNKubernetes"). Since there are not
   currently any nodes labeled that way, this means no actual ovnkube-node pods
   will be created at this step.

At this point, neither plugin is willing to run on a node that has no
"migration.network.openshift.io/plugin" label at all. This means that if a new
node gets added to the cluster in the middle of the migration process, neither
plugin will get deployed to it until after the migration of the older nodes is
complete and the node affinity is removed.
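
For reference, an openshift-sdn HostSubnet object looks roughly like the
following; the node name, host IP, and subnet are illustrative values, and the
`subnet` field is what CNO would copy into the hybrid-overlay label in step 2:

```
apiVersion: network.openshift.io/v1
kind: HostSubnet
metadata:
  name: worker-0        # HostSubnet objects are named after the node
host: worker-0          # node that owns this subnet
hostIP: 192.0.2.10      # node's primary IP (placeholder)
subnet: 10.129.0.0/23   # per-node pod subnet reused during the migration
```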

#### Migration

In order to run ovn-kubernetes in shared-gw mode, we need MCO to update the
ovs-configuration.service. To apply the MachineConfig update, a reboot will be
triggered by MCO. To coordinate the draining and rebooting between CNO and MCO,
we need to modify MCO as follows:

- Add a `machineconfiguration.openshift.io/rebootPausedBy: network-operator`
  node annotation to MCO, so that CNO can block MCO from rebooting the node.
- Add a new value `RebootPending` to the existing
  `machineconfiguration.openshift.io/state` node annotation. It will be set by
  MCD if the reboot is blocked. It indicates that draining the node is done and
  the reboot is paused.

When we kick off the migration from CNO, CNO will update the
`status.networkType` field of the `network.config` CR. This will trigger MCO to
apply the new MachineConfig to each node:

1. MCO will:
   - rerender the MachineConfig for each node
   - try to cordon and drain the nodes, one node at a time
2. When CNO finds that a node is being drained by MCO (by watching the MCO node
   annotations), it will:
   - pause the node reboot by annotating the node with
     `machineconfiguration.openshift.io/rebootPausedBy: network-operator`.
3. MCO will set the node annotation `machineconfiguration.openshift.io/state:
   RebootPending` when draining the node is completed.
4. When CNO finds that the node has been drained by MCO, it will (see the sketch
   after this list):
   - annotate the node with the ovnkube annotation
     `k8s.ovn.org/node-subnets: '{"default":"10.131.0.0/23"}'`
   - remove the "migration.network.openshift.io/plugin" label from the node and
     wait for the sdn-node DaemonSet to decrement its NumberReady
   - label the node with "migration.network.openshift.io/plugin: OVNKubernetes"
   - remove the `rebootPausedBy` node annotation.
5. MCO will reboot the node.
6. After booting up, br-ex will be created and ovnkube-node will run in hybrid
   overlay mode on the node.
7. MCO will uncordon the node. Pods will be created on the node using
   ovn-kubernetes as the default CNI plugin.

The above process will be repeated for each node, until all the nodes have had
the new MachineConfig applied and have been converted to OVN Kubernetes.
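
For illustration only, the per-node updates CNO makes in step 4 are equivalent
to the following commands; the node name is a placeholder, and in the real
implementation CNO performs these updates through the Kubernetes API rather
than the CLI:

```
NODE=worker-0   # placeholder node name

# Hand the node its pod subnet in the ovn-kubernetes annotation format.
oc annotate node "$NODE" k8s.ovn.org/node-subnets='{"default":"10.131.0.0/23"}'

# Drop the OpenShiftSDN label so the sdn-node DaemonSet stops targeting the
# node, then mark it as an OVN Kubernetes node so ovnkube-node gets scheduled.
oc label node "$NODE" migration.network.openshift.io/plugin-
oc label node "$NODE" migration.network.openshift.io/plugin=OVNKubernetes

# Allow MCO to proceed with the reboot.
oc annotate node "$NODE" machineconfiguration.openshift.io/rebootPausedBy-
```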

#### Migration Cleanup

Once migration is complete, CNO will:

- delete the openshift-sdn DaemonSets
- redeploy ovn-kubernetes in "normal" mode (no migration mode config, no node
  affinity)
- remove the migration-related labels from the nodes
### API

To start the migration, users need to update the `network.operator` CR by
adding:

```
{
  "spec": {
    "migration": {
      "networkType": "OVNKubernetes",
      "type": "live"
    }
  }
}
```
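
For example, assuming the usual `Network.operator.openshift.io/cluster` CR name,
an admin might apply this with a patch along the following lines; the
`"type": "live"` field is the new API proposed by this enhancement:

```
oc patch Network.operator.openshift.io cluster --type=merge \
  --patch '{"spec":{"migration":{"networkType":"OVNKubernetes","type":"live"}}}'
```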

On removal of the `spec.migration` field, CNO will start the migration cleanup.
CNO will also report the migration state under the `status` of the
`network.operator` CR:

```
{
  "status": {
    "migration": {
      // The state can be 'Setup', 'Working', 'Done' or 'Error'.
      "state": "Working",
      // The reason needs to be filled in when the state is 'Error'.
      "reason": ""
    }
  }
}
```
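
Assuming the status fields above, the migration state could then be checked
with something like:

```
oc get Network.operator.openshift.io cluster \
  -o jsonpath='{.status.migration.state}'
```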

### Lifecycle Management

This is a one-time operation for a cluster, therefore no lifecycle management is
needed.

### Test Plan

TBD

### Graduation Criteria

The graduation criteria are as follows:

#### Dev Preview -> Tech Preview

- End user documentation, relative API stability
- Sufficient test coverage
- Gather feedback from users rather than just developers

#### Tech Preview -> GA

- More testing (upgrade, scale)
- Add CI job
- Sufficient time for feedback

#### Removing a deprecated feature

N/A

### Upgrade / Downgrade Strategy

This is a one-time operation for a cluster, therefore no upgrade / downgrade
strategy is needed.

### Version Skew Strategy

N/A

### API Extensions

N/A

### Operational Aspects of API Extensions

N/A

#### Failure Modes

N/A

#### Support Procedures

N/A

## Implementation History

N/A

## Drawbacks

N/A

## Alternatives

Instead of switching the network provider for an existing cluster, we could spin
up a new cluster and then move the workload to it.

## Infrastructure Needed