From 91354c0bb0eb3604336e10566ba90a6bf28180f9 Mon Sep 17 00:00:00 2001 From: Ronny Baturov Date: Wed, 1 May 2024 17:42:52 +0300 Subject: [PATCH 01/53] Add support for performance profile status in hypershift Up until now, there was no defined resolution over how PerformanceProfile status conditions with the status of the different components created from the PerformanceProfile (MachineConfig, KubeletConfig, Tuned) are handled and exposed. In this PR, we suggest adding a custom ConfigMap that will be used as a middleware object. It will hold the updated status populated by the performance profile controller and will be watched by the NodePool controller, which will calculate a brief overview of the current status provided and reflect it under NodePool.status.condition. Signed-off-by: Ronny Baturov --- ...erformanceprofilecontroller-node-tuning.md | 55 +++++++++++++++++-- 1 file changed, 49 insertions(+), 6 deletions(-) diff --git a/enhancements/hypershift/performanceprofilecontroller-node-tuning.md b/enhancements/hypershift/performanceprofilecontroller-node-tuning.md index e65db2a96e..155c27047c 100644 --- a/enhancements/hypershift/performanceprofilecontroller-node-tuning.md +++ b/enhancements/hypershift/performanceprofilecontroller-node-tuning.md @@ -3,6 +3,7 @@ title: performanceprofilecontroller-node-tuning authors: - "@jlojosnegros" + - "@rbaturov" reviewers: - "@dagrayvid" @@ -32,7 +33,7 @@ see-also: - "/enhancements/hypershift/node-tuning.md" creation-date: 2022-09-19 -last-updated: 2022-12-14 +last-updated: 2024-05-01 --- # Performance Profile Controller Adaptation to Hypershift @@ -105,7 +106,7 @@ The proposal for this output object is to use the way NTO has already put in pla - Once Performance Profile Controller has created the `tuned` object as usual, it will embeded the `tuned` into a `configmap` in the `hosted-control-plane-namespace`. - This `configmap` will have: - - name: will be function of the `PeformanceProfile` name + - name: function of the `PeformanceProfile` name - label: `hypershift.openshift.io/tuned-config` : `true` - label: `hypershift.openshift.io/nodePool` : `NodePool` API name where the `PeformanceProfile` which generate this `tuned` was referenced. - label: `hypershift.openshift.io/performanceProfileName` : `PerformanceProfile.name`, @@ -119,7 +120,7 @@ The proposal for this output object is to use the way NTO has already put in pla - Once Performance Profile Controller has created the `MachineConfig` object as usual, it will embeded the object into a `configmap` in the `hosted-control-plane-namespace`. - This `configmap` will have: - - name: will be function of the `PeformanceProfile` name + - name: function of the `PeformanceProfile` name - label: `hypershift.openshift.io/nto-generated-machine-config` : `true` - label: `hypershift.openshift.io/nodePool` : `NodePool` API name where the `PeformanceProfile` which generate this `MachineConfig` was referenced. - label: `hypershift.openshift.io/performanceProfileName` : `PerformanceProfile.name`, @@ -133,7 +134,7 @@ Being this object also handled by MachineConfig Operator (MCO) as `MachineConfig - Once Performance Profile Controller has created the `Kubeletconfig` object as usual, it will embeded the object into a `configmap` in the `hosted-control-plane-namespace`. 
- This `configmap` will have: - - name: will be function of the `PeformanceProfile` name + - name: function of the `PeformanceProfile` name - label: `hypershift.openshift.io/nto-generated-kubelet-config` : `true` - label: `hypershift.openshift.io/nodePool` : `NodePool` API name where the `PeformanceProfile` which generate this `MachineConfig` was referenced. - label: `hypershift.openshift.io/performanceProfileName` : `PerformanceProfile.name`, @@ -145,6 +146,22 @@ Being this object also handled by MachineConfig Operator (MCO) as `MachineConfig The proposal is to handle these objects like NTO handles its `tuned` configurations once they are in the `hosted-control-plane-namespace`, that is synchronize them directly with the ones in the hosted cluster using the proper KubeConfig. +#### Performance Profile Status + +Since we cannot change the `configmap` nor the `PeformanceProfile` embedded in it since it controlled by the NodePool controller, a different resolution is needed to handle and expose the performance profile status. A proper place to inform an overview of it would be `NodePool.status.condition`. +Since `NTO` can't modify the `NodePool` directly, a `configmap` will be used to pass this data as a middle-ware obj, watched by the `NodePool` controller which will reflect this overview under `NodePool.status.condition`. + +- Once Performance Profile Controller has created the performance profiles components listed above, a `configmap` will be created in the `hosted-control-plane-namespace`, in order to inform the current status conditions. +- This `configmap` will have: + - name: function of the `PeformanceProfile` name + - label: `hypershift.openshift.io/nto-generated-performance-profile-status` : `true` + - label: `hypershift.openshift.io/nodePool` : `NodePool` API name where the `PeformanceProfile` which generates this `ConfigMap` was referenced. + - label: `hypershift.openshift.io/performanceProfileName` : `PerformanceProfile.name`, + - annotation: `hypershift.openshift.io/nodePool` : `NodePool` API namespaced name where the `PeformanceProfile` which generates this `ConfigMap` was referenced. + - data: `PerformanceProfile.status` serialized object in the "status" key. +- This will trigger the reconcile operation in NodePool Controller for these objects. + + ### Workflow Diagram ```mermaid @@ -166,6 +183,10 @@ graph LR; end subgraph hosted-control-plane-namespace + NodePoolController -->|8:Reconcile| PP_Status(ConfigMap
name:func_of_PerformanceProfile_name
label:`hypershift.openshift.io/nto-generated-performance-profile-status` : `true`
label: `hypershift.openshift.io/performanceProfileName` : `PerformanceProfile.name`
label: `hypershift.openshift.io/nodePool` : `NodePool` API name
data.status: < serialized PerformanceProfile.status object >) + PPController-->|6:Creates|PP_Status + NodePoolController -->|9:Updates status| NodePool_A + NodePoolController ==>|2:Propagate|PP_ConfigMap(ConfigMap
name:PPConfigMap
tuned: PerformanceProfile) NodePoolController ==>|3:Add label|PP_ConfigMap_01(ConfigMap
name:PPConfigMap
tuned: PerformanceProfile
label:hypershift.openshift.io/performanceprofile-config=true) PP_ConfigMap-. 2:is moved to .->PP_ConfigMap_01 @@ -212,7 +233,7 @@ class PPController,NodePoolController,ClusterServiceCustomer,NTO actor class PP_01,RTC_01,RTC_02,MC_01,KC_01,Tuned_01,NodePool_A object_prim -class PP_ConfigMap,PP_ConfigMap_01,Tuned_CM,KC_CM,MC_CM object_sec +class PP_ConfigMap,PP_ConfigMap_01,Tuned_CM,KC_CM,MC_CM,PP_Status object_sec class StartHere starthere @@ -244,10 +265,12 @@ classDiagram PerformanceProfile --> ConfigMap_T : name & label PerformanceProfile --> ConfigMap_M : name & label PerformanceProfile --> ConfigMap_K : name & label + PerformanceProfile --> ConfigMap_PPS : name & label ConfigMap_PP --> ConfigMap_T : OwnerReference ConfigMap_PP --> ConfigMap_M : OwnerReference ConfigMap_PP --> ConfigMap_K : OwnerReference + ConfigMap_PP --> ConfigMap_PPS : OwnerReference class NodePoolController { @@ -302,6 +325,20 @@ classDiagram "hypershift.openshift.io/nodePool" : NodePoolNamespacedName ] } + class ConfigMap_PPS { + name: PerformanceProfile.name + "-status" + labels : [ + "hypershift.openshift.io/nto-generated-performance-profile-status" : "true", + "hypershift.openshift.io/performanceProfileName" : PerformanceProfile.name, + "hypershift.openshift.io/nodePool" : NodePool.Name + ] + annotations: [ + "hypershift.openshift.io/nodePool" : NodePoolNamespacedName + ] + data: [ + status: PerformanceProfile.status + ] + } ``` ### API Extensions @@ -361,6 +398,11 @@ See openshift/hypershift#1782 To be able to handle `KubeletConfig` propagation properly NodePool controller should be changed to recognize a `KubeletConfig` as a valid content and set its defaults properly, that is mainly to ensure label `machineconfiguration.openshift.io/role=worker` is set. This can be made by changing [`defaultAndValidateConfigManifest`](https://github.com/openshift/hypershift/blob/fa0ca3d09fab02ebff64d45b97cc1abaf4f1c27a/hypershift-operator/controllers/nodepool/nodepool_controller.go#L1439) to handle `KubeletConfig` almost as it is handling `MachineConfig` right now. +To view if the performance profile has been applied successfully, NodePool controller will watch the performance profile status configmap located in the hosted control plane namespace, and updates the PerformanceProfileAppliedSuccessfully condition under NodePool.status.conditions. +This condition signal if the performance profile has been applied successfully. +The condition will be set to false in two scenarios: either the performance profile tuning has not been completed yet, or a failure has occurred. +The reason and message will provide details specific to the scenario. + ### Performance Profile controller 1. Change SetUp to watch over ConfigMaps with Label: `hypershift.openshift.io/performanceprofile-config=true` @@ -381,7 +423,8 @@ In Hypershift Performance Profile controller (PPC) will not be reconciling Perfo - If it already exist - Extract the content and only if there is a difference update it properly and write it again. - Update PerformanceProfile status conditions with the status of the different components created from the PerformanceProfile (MachineConfig, KubeletConfig, Tuned) - - Still not defined if and how this will be done. + - Performance Profile controller will create and update the performance profile status ConfigMap in the hosted control plane namespace, to reflect the updated status of the components. 
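+
+For illustration only, the status `configmap` described in the Performance Profile Status section might look like the hypothetical example below (resource names and the namespace are placeholders; the labels, annotation and `status` data key are the ones listed above):
+
+```yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: example-profile-status             # derived from the PerformanceProfile name
+  namespace: clusters-example-hcp          # the hosted-control-plane-namespace (placeholder)
+  labels:
+    hypershift.openshift.io/nto-generated-performance-profile-status: "true"
+    hypershift.openshift.io/performanceProfileName: example-profile
+    hypershift.openshift.io/nodePool: example-nodepool
+  annotations:
+    hypershift.openshift.io/nodePool: clusters/example-nodepool
+data:
+  status: |
+    < serialized PerformanceProfile.status object >
+```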
+ ## Background ### Hypershift docs From bf8c3514505bf75fd4af74bf511678d01c8f1b2e Mon Sep 17 00:00:00 2001 From: Ronny Baturov Date: Thu, 2 May 2024 12:19:57 +0300 Subject: [PATCH 02/53] Addressing typos Fixing some typos in this doc. Signed-off-by: Ronny Baturov --- .../performanceprofilecontroller-node-tuning.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/enhancements/hypershift/performanceprofilecontroller-node-tuning.md b/enhancements/hypershift/performanceprofilecontroller-node-tuning.md index 155c27047c..5ea355d023 100644 --- a/enhancements/hypershift/performanceprofilecontroller-node-tuning.md +++ b/enhancements/hypershift/performanceprofilecontroller-node-tuning.md @@ -183,9 +183,9 @@ graph LR; end subgraph hosted-control-plane-namespace - NodePoolController -->|8:Reconcile| PP_Status(ConfigMap
name:func_of_PerformanceProfile_name
label:`hypershift.openshift.io/nto-generated-performance-profile-status` : `true`
label: `hypershift.openshift.io/performanceProfileName` : `PerformanceProfile.name`
label: `hypershift.openshift.io/nodePool` : `NodePool` API name
data.status: < serialized PerformanceProfile.status object >) + NodePoolController -->|4:Watch| PP_Status(ConfigMap
name:func_of_PerformanceProfile_name
label:`hypershift.openshift.io/nto-generated-performance-profile-status` : `true`
label: `hypershift.openshift.io/performanceProfileName` : `PerformanceProfile.name`
label: `hypershift.openshift.io/nodePool` : `NodePool` API name
data.status: < serialized PerformanceProfile.status object >) PPController-->|6:Creates|PP_Status - NodePoolController -->|9:Updates status| NodePool_A + NodePoolController -->|8:Reconcile| NodePool_A NodePoolController ==>|2:Propagate|PP_ConfigMap(ConfigMap
name:PPConfigMap
tuned: PerformanceProfile) NodePoolController ==>|3:Add label|PP_ConfigMap_01(ConfigMap
name:PPConfigMap
tuned: PerformanceProfile
label:hypershift.openshift.io/performanceprofile-config=true) @@ -208,7 +208,7 @@ graph LR; PPController-->|7:Creates and Embed|MC_CM(ConfigMap
name:func_of_PerformanceProfile_name
label:`hypershift.openshift.io/nto-generated-machine-config` : `true`
label: `hypershift.openshift.io/performanceProfileName` : `PerformanceProfile.name`
label: `hypershift.openshift.io/nodePool` : `NodePool` API name
data.config: < serialized MachineConfig object >) MC_01-. embedded into .-MC_CM - PPController-->|7:Creates and Embed|KC_CM(ConfigMap
name:func_of_PerformanceProfile_name
label:`hypershift.openshift.io/nto-generated-kubelet-config` : `true` : `true`
label: `hypershift.openshift.io/performanceProfileName` : `PerformanceProfile.name`
label: `hypershift.openshift.io/nodePool` : `NodePool` API name
data.config: < serialized KubeletConfig object >) + PPController-->|7:Creates and Embed|KC_CM(ConfigMap
name:func_of_PerformanceProfile_name
label:`hypershift.openshift.io/nto-generated-kubelet-config` : `true`
label: `hypershift.openshift.io/performanceProfileName` : `PerformanceProfile.name`
label: `hypershift.openshift.io/nodePool` : `NodePool` API name
data.config: < serialized KubeletConfig object >) KC_01-. embedded into .-KC_CM @@ -317,7 +317,7 @@ classDiagram class ConfigMap_K { name: "kc-" + PerformanceProfile.name labels : [ - "hypershift.openshift.io/nto-generated-machine-config" : "true", + "hypershift.openshift.io/nto-generated-kubelet-config" : "true", "hypershift.openshift.io/performanceProfileName" : PerformanceProfile.name, "hypershift.openshift.io/nodePool" : NodePool.Name ] From f2aa85bbcba88d19f03560e69dd7714450bf4b44 Mon Sep 17 00:00:00 2001 From: Prashanth684 Date: Tue, 9 Apr 2024 10:13:15 -0700 Subject: [PATCH 03/53] Dynamically set Imagestream importMode based on cluster payload type Foloowing up on conversations around not altering scripts and oc commands behaviour for importing imagestreams as a single or manifestlist, this enhancement proposes a way to dynamically set the import mode at install/upgrade time or even toggle it manually. https://issues.redhat.com/browse/MULTIARCH-4552 --- .../dynamic-imagestream-importmode-setting.md | 504 ++++++++++++++++++ 1 file changed, 504 insertions(+) create mode 100644 enhancements/multi-arch/dynamic-imagestream-importmode-setting.md diff --git a/enhancements/multi-arch/dynamic-imagestream-importmode-setting.md b/enhancements/multi-arch/dynamic-imagestream-importmode-setting.md new file mode 100644 index 0000000000..8d0537572b --- /dev/null +++ b/enhancements/multi-arch/dynamic-imagestream-importmode-setting.md @@ -0,0 +1,504 @@ + +--- +title: dynamic-imagestream-importmode-setting +authors: + - "@Prashanth684" +reviewers: + - "@deads2k, apiserver" + - "@JoelSpeed, API" + - "@wking, CVO" + - "@dmage, Imagestreams" + - "@soltysh, Workloads, oc" +approvers: + - "@deads2k" +api-approvers: + - "@JoelSpeed" +creation-date: 2024-04-02 +last-updated: 2024-04-02 +tracking-link: + - https://issues.redhat.com/browse/MULTIARCH-4552 +--- + +# Dynamically set Imagestream importMode based on cluster payload type + +## Summary + +This is a proposal to set ImportMode for +imagestreams dynamically based on the payload type +of the cluster. This means that if the OCP release +payload is a singlemanifest payload, the +importMode for all imagestreams by default will be +`Legacy` and if the payload is a multi payload the +default will be `PreserveOriginal`. + +## Motivation + +This change ensures that imagestreams will be +compatible for single and multi arch compute +clusters without needing manual intervention. +Users who install or upgrade to the multi +release payload +(https://mirror.openshift.com/pub/openshift-v4/multi/) +and want to add compute nodes of differing +architectures, need not change the behaviour +of any newly imported imagestreams or any scripts +which import them. For a cluster installed with +single arch payload, imagestreams default import +mode would always be set to `Legacy` because it +is sufficient if the single image manifest is +imported and for a cluster installed with the +multi payload it would be `PreserveOriginal` +which would import the manifestlist in anticipation +that users would add/have already added nodes +of other architectures (which do not match the +control plane's). + +### User Stories + +- As an Openshift user, I do not want to manually + change the importMode of imagestreams I want to + import on the cluster if I add/remove nodes of + a different architecture to it. +- As an Openshift user, I do not want to change my + existing scripts and oc commands to add extra + flags to enable manifestlist imports on + imagestreams. 
+- As an Openshift user, I want my imagestreams to + work with workloads deployed on any node + irrespective of whether I deploy on a + single or a multi-arch compute cluster. +- As an Openshift user, I want to be able to toggle + importMode globally for all imagestreams through + configuration. + +### Goals + +- Set the imagestream's import mode based on + payload architecture (single vs multi) for newly + created imagestreams. +- Dynamically alter import mode type on upgrade + from a single arch to a multi payload and vice + versa (Note: multi->single arch upgrades are + not officially supported, but can be done + explicitly). +- Allow users to control the import mode setting + manually through the `image.config.openshift.io` + image config CRD. + +### Non-Goals + +- Describe importmode and what it accomplishes + relating to imagestreams. +- Existing imagestreams in the cluster will not be + altered. +- No other imagestream fields are being considered + for such a change. +- Sparse manifest imports is not being considered. + +## Proposal + +The proposal is for CVO to expose the +type of payload installed on the +cluster which will be consumed by imageregistry +operator which sets a field in the image config +status and for apiserver to use that information +to set the importMode for imagestreams globally. + +### Imagestream ImportModes + +The `ImportMode` API was introduced in the 4.12 +timeframe as part of the multi-arch compute cluster +project (then called heterogeneous clusters). The +API essentially allowed users to toggle between: + - importing a single manifest image matching the + architecture of the control plane (`Legacy`). + - importing the entire manifestlist (`PreserveOriginal`). + +### Workflow Description + +#### Installation +1. User triggers installation through +openshift-installer. +2. CVO inspects the payload and updates the status +field of `ClusterVersion` with the payload type +(inferred through +`release.openshift.io/architecture`). +3. cluster-image-registry-operator updates the +`imageStreamImportMode` status field based on +whether the image config CRD spec has the +`ImageStreamImportMode` set. If it does, it gets +that value. If not, it looks at the `ClusterVersion` +status and gets the value of the payload type and +based on that, determines and sets the importmode. +4. openshift-apiserver-operator gets the value of +`ImageStreamImportMode` from the image config status. +5. the operator updates the `OpenshiftAPIServer`'s +observed config ImagePolicyConfig field with the +import mode. +6. openshift-apiserver uses the value of the +import mode in the observed config's +ImagePolicyConfig field to set the default import +mode for any newly created imagestream. + +#### Upgrade +1. User has a cluster installed with a single arch +payload. +2. User triggers an update to multi-arch payload +using `oc adm upgrade --to-multi-arch`. +3. CVO triggers the update, inspects the payload +and updates the status field of `ClusterVersion` +with the payload type. +4. The rest is the same as steps 3-6 in the +install section. + +#### Manual +1. User updates the `ImageStreamImportMode` spec +field in the `image.config.openshift.io` `cluster` +CR. +3. cluster-image-registry-operator reconciles the CR +to see if the `ImageStreamImportMode` value is set +and and updates the image config status with the same +value. +4. openshift-apiserver-operator updates the +`OpenshiftAPIServer`'s observed config ImagePolicyConfig +field with the import mode from the image config status. +5. 
openshift-apiserver uses the value of the +import mode in the observed config's +ImagePolicyConfig field to set the default import +mode for any newly created imagestream. + +Note that a change to the import mode in the image +config CR would trigger a redeployment of the +openshift-apiserver + +### API Extensions + +- image.config.openshift.io: Add + `ImageStreamImportMode` config flag to CRD spec + and status. + +``` +type ImageSpec struct { +... +... +// imagestreamImportMode controls the import +// mode behaviour of imagestreams. It can be set +// to `Legacy` or `PreserveOriginal`. The +// default behaviour is to default to Legacy. +// If this value is specified, this setting is +// applied to all imagestreams which do not +// have the value set. +// +optional +ImageStreamImportMode imagev1.ImportModeType +`json:"imageStreamImportMode,omitempty" }` +``` +``` +type ImageStatus struct { +... +... +// imageStreamImportMode controls the import +// mode behaviour of imagestreams. It can be set +// to `Legacy` or `PreserveOriginal`. The +// default behaviour is to default to Legacy. +// If this value is specified, this setting is +// applied to all new imagestreams which do not +// have the value set. +// +optional +ImageStreamImportMode imagev1.ImportModeType +`json:"imageStreamImportMode,omitempty" +} +``` +- openshiftcontrolplane/v1/: Add + `ImageStreamImportMode` config flag to + ImagePolicyConfig struct. + +``` +type ImagePolicyConfig { +... +... +// imageStreamImportMode controls the import +mode behaviour of imagestreams. It can be set +to `Legacy` or `PreserveOriginal`. The +// default behaviour is to default to Legacy. +// If this value is specified, this setting is +// applied to all imagestreams which do not +// have the value set. +// +optional +ImageStreamImportMode imagev1.ImportModeType +`json:"imageStreamImportMode,omitempty"` + +} +``` +- An example CR for `image.config.openshift.io` + would look like below if the import mode is set: +``` +apiVersion: config.openshift.io/v1 +kind: Image +metadata: + name: cluster + ... + ... +spec: + imageStreamImportMode: PreserveOriginal +status: + internalRegistryHostname: image-registry.openshift-image-registry.svc:5000 + imageStreamImportMode: PreserveOriginal + ... + ... + +``` +- The ClusterVersion status field CR would look + like below after CVO has inferred the payload + type: +``` +apiVersion: v1 +items: +- apiVersion: config.openshift.io/v1 + kind: ClusterVersion + spec: + channel: candidate-4.15 + desiredUpdate: + force: false + version: 4.15.0 + status: + availableUpdates: + ... + capabilities: + ... + conditions: + ... + desired: + architecture: Multi # optional, unset/empty-string if it's single arch + history: + ... +``` + +### Topology Considerations + +#### Hypershift / Hosted Control Planes + +- On HyperShift, having the hosted-API-side +ClusterVersion driving control-plane side +functionality(like OpenShift API server behavior) +exposes you to some hosted-admin interference +(see this comment[^1], explaining why we jump through +hoops to avoid touching hosted-API stuff when +setting the upstream update service for HyperShift +clusters). For this particular case, this is not +a problem as the registry lives on the hosted +compute nodes. + +#### Standalone Clusters + +N/A + +#### Single-node Deployments or MicroShift + +N/A + +### Implementation Details/Notes/Constraints + +- CVO: The cluster version operator would expose a + field in its status indicating the + type of the payload. 
This is inferred by CVO from + the `release.openshift.io/architecture` property + in the release payload. If the payload is a multi + payload, the value is `Multi` otherwise it is empty + indicating a single arch payload. +- Cluster-image-registry-operator: the operator + sets the `imageStreamImportMode` field in the + status of the `images.config.openshift.io` + cluster image config based on whether: + - `images.config.openshift.io` image config + has importmode set in its spec. + - if no global importmode is set in the image + config, the operator sets it based on the + payload type (inferred from CVO).Multi would + mean `PreserveOriginal` and any other value + would mean `Legacy`(single-arch). +- Cluster-openshift-apiserver-operator: the + operator sets the import mode value in the + OpenshiftAPIServer CR's observed config based on + the `imageStreamImportMode` in the image config + status. +- Openshift-apiserver: The apiserver would look + at the observed config, check the import mode + value and set the default for the imagestreams + based on that value. + +### Motivations for a new ClusterVersion status property + +There were a few options discussed in lieu of +introducing a new ClusterVersion status field +and the potential risks for doing so. The +alternatives are highlighted with reasoning +given for why they were not pursued: +- Default ImportMode to PreserveOriginal + everywhere: single-arch-release users maybe + concerned about import size and the lack of + metadata like `dockerImageLayers` and + `dockerImageMetadata` for manifestlisted + imagestream tags. +- Clusters with homogeneous nodes + running the multi payload who do not + want to import manifestlists: The clusters + can either migrate to single arch payloads + or manually toggle the importMode through + the image config CRD. +- CVO provides architecural knowledge to + the cluster-image-registry-operator through + a configmap or the image config CRD: To + limit the risk of many external consumers + using CVO's status field to determine that + their cluster is multi-arch ready, the idea + was to expose this information to the specific + controller. This solution is not necessary as + we let other controller implementers decide + if the CVO's new status field is the best fit + for their use case. + +### Risks and Mitigations + +- There is a potential growth of space consumption + in etcd. For each sub manifest present in a manifest + list, there will be an equivalent Image object with + a list of its layers and other basic info. While Image + objects aren't extremely big, this would mean for each + imagestream tag we will have a few image objects + rather than just one. +- If imagestreams in a cluster have the `referencePolicy` + set to `Local`, the images are imported to the cluster's + internal registry. If many imagestreams follow this pattern, + there might be space growth in the internal registry. +- Both of the above problems can be mitigated by manually + inspecting imagestreams and set the importMode to + `PreserveOriginal` only for the necessary imagestreams or + change the `referencePolicy` to `Source`. +- Scripts inspecting image objects and imagestream tags to + access metadata info like `dockerImageLayers` and + `dockerImageMetadata` will break and will need to change to + import arch specific tags to get metadata information. + +### Drawbacks + +- This change would mean users using the multi + payload would have their imagestream imports be + manifestlisted by default. 
For some users this + might be problematic if they are importing locally + and would have to deal with mirroring 4x the size + of the original single manifest image. In this + case, they might want to change the global to + `Legacy` and pick and choose which imagestreams + which would absolutely need to be manifestlisted. + +## Open Questions + +- How should CVO expose the payload type in the + Status section? is it through: + - string value of `single` and `multi` + - individual architectures, i.e + `amd64`/`arm64`/`ppc64le`/`s390x` and + `multi`? + - array which lists all architectures for + `multi` and just one for single arch? + (in case there is ever a possibility of multiple + payloads with only certain architectures). + +## Test Plan + +1. e2e test in openshift-apiserver-operator for +checking the observed config. +2. e2e test in openshift-apiserver. +3. QE test for CVO payload type status reporting +on installs and upgrades. +4. e2e test in the cluster-image-registry-operator +for `imageStreamImportMode` setting in the cluster +image config's status. +4. QE test for installing a cluster on multi +payload and creating imagestreams to check if the +importMode is `PreserveOriginal` by default. +5. QE test for single -> multi and multi -> single +(not supported but can be explicit) upgrades and +confirm appropriate imagestream modes for new +imagestreams. + +## Graduation Criteria + +- Explicit documentation + (https://issues.redhat.com/browse/MULTIARCH-4560) + around this change is necessary to inform users on + the impact of installing/upgrading to a cluster + with multi payload. +- implementation of e2e and QE test plans + +### Dev Preview -> Tech Preview + +N/A + +### Tech Preview -> GA + +N/A + +### Removing a deprecated feature + +N/A + +## Upgrade / Downgrade Strategy + +- For N->N+1 upgrades (assuming N+1 is the first + release of this feature), the new status field + will be available in the ClusterVersion + +## Version Skew Strategy + +- No issues expected with skews + +## Operational Aspects of API Extensions + +- The `imageStreamImportMode` API setting in the +image config CRD's spec will not be documented. A +cluster-fleet-evaluation[^2] would be done which accesses +Telemetry data to determines the usage. If the field +is not used widely after a set number of releases, deprecate +and remove the field after appropriately informing the +affected customers about the remediation. + +## Support Procedures + +### Cluster fleet evaluation considerations +- The cluster image registry operator will introduce + a new `EvaluationConditionsDetected`[^3] condition + type with a `OperatorEvaluationConditionDetected` + reason. +- This condition will be set if the `ImageStreamImportMode` + field is present in the `image.config.openshift.io` + cluster CR spec. An appropriate message will also be included, + informing user about this condition. +- Telemetry data will be queried[^4] to determine the number + of clusters which have this condition prevalent. +- Monitor the telemetry data for x number (four suggested) + of releases. +- After set number of releases, customers will be alerted[^5] + of this condition and the remediation through: + - A KCS article which would recommend the user to remove the + `ImageStreamImportMode` setting in favor of changing the + import mode of affected imagestreams directly through + manual patching or a script. 
+ - Since we estimate that not many customers might use this + setting, it will be communicated through email to affected + customers informing them of the remediation. + +## Alternatives + +- Manual setting of import mode through editing + imagestreams or patching. +- If oc commands are used to import imagestreams + add the `import-mode` flag to set it + appropriately. + +## References +[^1]:https://github.com/openshift/cluster-version-operator/pull/1035#issue-2133730335 +[^2]:https://github.com/openshift/enhancements/blob/master/dev-guide/cluster-fleet-evaluation.md +[^3]:https://github.com/openshift/api/blob/master/config/v1/types_cluster_operator.go#L211 +[^4]:https://github.com/openshift/enhancements/blob/master/dev-guide/cluster-fleet-evaluation.md#interacting-with-telemetry +[^5]:https://github.com/openshift/enhancements/blob/master/dev-guide/cluster-fleet-evaluation.md#alerting-customers \ No newline at end of file From b5b977997bcde5d36e64944242fa7b6ea87c33d0 Mon Sep 17 00:00:00 2001 From: Jeff Cantrill Date: Tue, 30 Apr 2024 16:14:46 -0400 Subject: [PATCH 04/53] LOG-5428: Add OTLP support to ClusterLogForwarder --- .../cluster-logging/forwarder-to-otlp.md | 203 ++++++++++++++++++ 1 file changed, 203 insertions(+) create mode 100644 enhancements/cluster-logging/forwarder-to-otlp.md diff --git a/enhancements/cluster-logging/forwarder-to-otlp.md b/enhancements/cluster-logging/forwarder-to-otlp.md new file mode 100644 index 0000000000..46484db6eb --- /dev/null +++ b/enhancements/cluster-logging/forwarder-to-otlp.md @@ -0,0 +1,203 @@ +--- +title: forwarder-to-otlp +authors: + - "@jcantrill" +reviewers: + - "@alanconway" + - "@cahartma" + - "@pavolloffay" + - "@periklis" + - "@xperimental" +approvers: + - "@alanconway" +api-approvers: + - "@alanconway" +creation-date: 2024-04-30 +last-updated: 2024-04-30 +tracking-link: + - "https://issues.redhat.com/browse/LOG-4225" +see-also: + - "/enhancements//cluster-logging/cluster-logging-v2-apis.md" +replaces: [] +superseded-by: [] +--- + +# Log Forwarder to OTLP endpoint + +Spec **ClusterLogForwarder.obervability.openshift/io** to forward logs to an **OTLP** endpoint + +## Summary + +The enhancement defines the modifications to the **ClusterLogForwarder** spec +and Red Hat log collector to allow administrators to collect and forward logs +to an OTLP receiver as defined by the [OpenTelemetry Observability framework](https://opentelemetry.io/docs/specs/otlp/) + +## Motivation + +Customers continue to look for greater insight into the operational aspects of their clusters by using +observability tools and open standards to allow them to avoid vedor lock-in. OpenTelemetry + +> ... is an Observability framework and toolkit designed to create and manage telemetry +data such as traces, metrics, and logs. Crucially, OpenTelemetry is vendor- and tool-agnostic, meaning +that it can be used with a broad variety of Observability backends...as well as commercial offerings. + +This framework defines the [OTLP Specification](https://opentelemetry.io/docs/specs/otlp/) for vendors +to implement. This proposal defines the use of that protocol to specifically forward logs collected on +a cluster. Implementation of this proposal is the first step to embracing a community standard +and to possibly deprecate the data model currently supported by Red Hat logging. + +### User Stories + +* As an administrator, I want to forward logs to an OTLP enabled, Red Hat managed LokiStack +so that I can aggregate logs in my existing Red Hat managed log storage solution. 
+**Note:** The realization of this use-case depends on [LOG-5523](https://issues.redhat.com/browse/LOG-5523) + +* As an administrator, I want to forward logs to an OTLP receiver +so that I can use my observability tools to evaluate all my signals (e.g. logs, traces, metrics). + +### Goals + +* Implement OTLP over HTTP using text(i.e. JSON) in the log collector to forward to any receiver that implements OTLP +**Note:** Stretch goal to implement OTLP over HTTP using binary(i.e. protobuf) + +* Deprecate the [ViaQ Data Model](https://github.com/openshift/cluster-logging-operator/blob/release-5.9/docs/reference/datamodels/viaq/v1.adoc) in +favor of [OpenTelemetry Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/) +* Allow customer to continue to use **ClusterLogForwarder** as-is in cases where they are not ready or not interested in OpenTelementry + +### Non-Goals + +* Replace the existing log collection agent with the [OTEL Collector](https://opentelemetry.io/docs/collector/) + +## Proposal + +This section should explain what the proposal actually is. Enumerate +*all* of the proposed changes at a *high level*, including all of the +components that need to be modified and how they will be +different. Include the reason for each choice in the design and +implementation that is proposed here. + +To keep this section succinct, document the details like API field +changes, new images, and other implementation details in the +**Implementation Details** section and record the reasons for not +choosing alternatives in the **Alternatives** section at the end of +the document. + +This proposal inte + +### Workflow Description + +**cluster administrator** is a human responsible for administering the **cluster-logging-operator** +and **ClusterLogForwarders** + +1. The cluster administrator deployes the cluster-logging-operator if it is already not deployed +1. The cluster administrator edits or creates a **ClusterLogForwarder** and defines an OTLP output +1. The cluster administrator references the OTLP output in a pipeline +1. The cluster-logging-operator reconciles the **ClusterLogForwarder**, generates a new collector configuration, +and updates the collector deployment + + +### API Extensions + +```yaml +apiVersion: "observability.openshift.io/v1" +kind: ClusterLogForwarder +spec: +outputs: +- name: + type: # add otlp to the enum + tls: + secret: #the default resource to search for keys + name: # the name of resource + cacert: + secret: #enum: secret, configmap + name: # the name of resource + key: #the key in the resource + cert: + key: #the key in the resource + key: + key: #the key in the resource + insecureSkipVerify: + securityProfile: + otlp: + url: #must terminate with '/v1/logs' + authorization: + secret: #the secret to search for keys + name: + username: + password: + token: + tuning: + delivery: + maxWrite: # quantity (e.g. 500k) + compression: # gzip,zstd,snappy,zlib,deflate + minRetryDuration: + maxRetryDuration: +``` + +### Topology Considerations + +#### Hypershift / Hosted Control Planes + +There is a separate log collection and forwarding effort to support writing logs directly to the hosted infrastructure (e.g. AWS, Azure, GCP) using +short-lived tokens and native authentication schemes. This proposal does not address this specifically for HCP and assumes any existing +ClusterLogForwarders will continue to use existing functionality to support HCP. 
+ +#### Standalone Clusters + +#### Single-node Deployments or MicroShift + +### Implementation Details/Notes/Constraints + +The [log collector](https://github.com/vectordotdev/vector/issues/13622), at the time of this writing, does not directly support OTLP. This proposal, +intends to provide OTLP over HTTP using a combination of the HTTP sink and transforms. Initial work demonstrates this is feasible by verifying +the log collector can successfully write logs to a configuration of the OTEL collector. The vector components required to make this possible: + +* Remap transform that formats records according to the OTLP specification and OpenTelemetry scemantic conventions +* Reduce transform to batch records by resource type (e.g. container workloads, journal logs, audit logs) +* HTTP sink to forward logs + +### Risks and Mitigations + +This change is relatively low risk since we have verified its viability and we do not need +to wait for an upstream change in order to release a usable solution. + +### Drawbacks + +Many customers see the trend and advantage of using OpenTelemetry +and have asked for its support. By not providing a solution, the Red Hat log collection product +is not following industry trends and risks becoming isolated from the larger observability community. + +## Open Questions [optional] + +## Test Plan + +* Verify forwarding logs to the OTEL collector +* Verify forwarding logs to Red Hat managed LokiStack + +## Graduation Criteria + +### Dev Preview -> Tech Preview + +### Tech Preview -> GA + +### Removing a deprecated feature + +- Announce deprecation and support policy of the existing feature +- Deprecate the feature + +## Upgrade / Downgrade Strategy + +Not Applicable since this adds a new feature + +## Version Skew Strategy + +## Operational Aspects of API Extensions + +## Support Procedures + +## Alternatives + +Wait for the upstream collector project to implement OTLP sink + +## Infrastructure Needed [optional] + From 8412237f733833904f56ae4bee1df2720debd1ef Mon Sep 17 00:00:00 2001 From: Jeff Cantrill Date: Wed, 10 Jan 2024 14:28:05 -0500 Subject: [PATCH 05/53] LOG-4928: Cluster logging v2 APIs --- .../cluster-logging-log-forwarding.md | 3 +- .../logs-observability-openshift-io-apis.md | 600 ++++++++++++++++++ 2 files changed, 602 insertions(+), 1 deletion(-) create mode 100644 enhancements/cluster-logging/logs-observability-openshift-io-apis.md diff --git a/enhancements/cluster-logging/cluster-logging-log-forwarding.md b/enhancements/cluster-logging/cluster-logging-log-forwarding.md index df08b40f64..551bd14a0a 100644 --- a/enhancements/cluster-logging/cluster-logging-log-forwarding.md +++ b/enhancements/cluster-logging/cluster-logging-log-forwarding.md @@ -16,7 +16,8 @@ last-updated: 2020-07-20 status: implementable see-also:[] replaces:[] -superseded-by:[] +superseded-by: + - "/enhancements/cluster-logging-v2-apis.md" --- # cluster-logging-log-forwarding diff --git a/enhancements/cluster-logging/logs-observability-openshift-io-apis.md b/enhancements/cluster-logging/logs-observability-openshift-io-apis.md new file mode 100644 index 0000000000..98d6c60364 --- /dev/null +++ b/enhancements/cluster-logging/logs-observability-openshift-io-apis.md @@ -0,0 +1,600 @@ +--- +title: logs-observability-openshift-io-apis +authors: +- "@jcantril" +reviewers: +- "@alanconway, Red Hat Logging Architect" +- "@xperimental" +- "@syedriko" +- "@cahartma" +approvers: +- "@alanconway" +api-approvers: +- "@alanconway" +creation-date: 2024-04-18 +last-updated: 2023-04-18 
+tracking-link: +- https://issues.redhat.com/browse/OBSDA-550 +see-also: + - "/enhancements/cluster-logging-log-forwarding.md" + - "/enhancements/forwarder-input-selectors.md" + - "/enhancements/cluster-logging/multi-cluster-log-forwarder.md" +replaces: [] +superseded-by: [] +--- + + +# observability.openshift.io/v1 API for Log Forwarding and Log Storage +## Summary + +Logging for Red Hat OpenShift has evolved since its initial release in OpenShift 3.x from an on-cluster, highly opinionated offering to a more flexible log forwarding solution that supports multiple internal (e.g LokiStack, Elasticsearch) and externally managed log storage. Given the original components (e.g. Elasticsearch, Fluentd) have been deprecated for various reasons, this enhancement introduces the next version of the APIs in order to formally drop support for those features as well as to generally provide an API to reflect the future direction of log storage and forwarding. + +## Motivation + +### User Stories + +The next version of the APIs should continue to support the primary objectives of the project which are: + +* Collect logs from various sources and services running on a cluster +* Normalize the logs to common format to include workload metadata (i.e. labels, namespace, name) +* Forward logs to storage of an administrator's choosing (e.g. LokiStack) +* Provide a Red Hat managed log storage solution +* Provide an interface to allow users to review logs from a Red Hat managed storage solution + +The following user stories describe deployment scenarios to support these objectives: + +* As an administrator, I want to deploy a complete operator managed logging solution that includes collection, storage, and visualization so I can evaluate log records while on the cluster +* As an administrator, I want to deploy an operator managed log collector only so that I can forward logs to an existing storage solution +* As an administrator, I want to deploy an operator managed instance of LokiStack and visualization + +The administrator role is any user who has permissions to deploy the operator and the cluster-wide resources required to deploy the logging components. + +### Goals + +* Drop support for the **ClusterLogging** custom resource +* Drop support for **ElasticSearch**, **Kibana** custom resources and the **elasticsearch-operator** +* Drop support for Fluentd collector implementation, Red Hat managed Elastic stack (e.g. Elasticsearch, Kibana) +* Drop support in the **cluster-logging-operator** for **logging-view-plugin** management +* Support log forwarder API with minimal or no dependency upon reserved words (e.g. default) +* Support an API to spec a Red Hat managed LokiStack with the logging tenancy model +* Continue to allow deployment of a log forwarder to the output sinks of the administrators choosing +* Automated migration path from *ClusterLogForwarder.logging.openshift.io/v1* to *ClusterLogForwarder.observability.openshift.io/v1* + +### Non-Goals + +* "One click" deployment of a full logging stack as provided by **ClusterLogging** v1 +* Complete backwards compatibility to **ClusterLogForwarder.logging.openshift.io/v1** v1 + + +## Proposal + + +### Workflow Description + +The following workflow describes the first user story which is a superset of the others and allows deployment of a full logging stack to collect and forward logs to a Red Hat managed log store. 
+ +**cluster administrator** is a human responsible for: + +* Managing and deploying day 2 operators +* Managing and deploying an on-cluster LokiStack +* Managing and deploying a cluster-wide log forwarder + +**cluster-observability-operator** is an operator responsible for: + +* managing and deploying observability operands (e.g. LokiStack, ClusterLogForwarder, Tracing) and console plugins (e.g console-logging-plugin) + +**loki-operator** is an operator responsible for managing a loki stack. + +**cluster-logging-operator** is an operator responsible for managing log collection and forwarding. + +The cluster administrator does the following: + +1. Deploys the Red Hat **cluster-observability-operator** +1. Deploys the Red Hat **loki-operator** +1. Deploys an instance of **LokiStack** in the `openshift-logging` namespace +1. Deploys the Red Hat **cluster-logging-operator** +1. Creates a **ClusterLogForwarder** custom resource for the **LokiStack** + +The **cluster-observability-operator**: +1. Deploys the console-logging-plugin for reading logs in the OpenShift console + +The **loki-operator**: +1. Deploys the **LokiStack** for storing logs on-cluster + +The **cluster-logging-operator**: + +1. Deploys the log collector to forward logs to log storage in the `openshift-logging` namespace + +### API Extensions + +This API defines the following opinionated input sources which is a continuation of prior cluster logging versions: + +* **application**: Logs of container workloads running in all namespaces except **default**, **openshift***, and **kube*** +* **infrastructure**: journald logs from OpenShift nodes and container workloads running only in namespaces **default**, **openshift***, and **kube*** +* **audit**: The logs from OpenShift nodes written to the node filesystem by: Kubernetes API server, OpenShift API server, Auditd, and OpenShift Virtual Network (OVN). + +These are **reserved** words that represent input sources that can be referenced by a pipeline without an explicit input specification. + +More explicit specification of **audit** and **infrastructure** logs is allowed by creating a named input of that type and specifiying at least one of the allowed sources. + +This is a namespaced resource that follows the rules and [design](https://github.com/openshift/enhancements/blob/master/enhancements/cluster-logging/multi-cluster-log-forwarder.md) described in the multi-ClusterLogForwarder proposal with the following exceptions: + +* Drops the `legacy` mode described in the proposal. +* Moves collector specification to the **ClusterLogForwarder** + +#### ClusterLogForwarer CustomResourceDefinition: + +Following defines the next version of a ClusterLogForwarder. **Note:** The next version of this resources is part of a new API group to align log collection with +the objectives of Red Hat observability. 
+ +```yaml +apiVersion: "observability.openshift.io/v1" +kind: ClusterLogForwarder +metadata: + name: +spec: + managementState: #enum: Managed, Unmanaged + serviceAccount: + name: + collector: + resources: #corev1.ResourceRequirements + limits: #cpu, memory + requests: + nodeSelector: #map[string]string + tolerations: #corev1.Toleration + inputs: + - name: + type: #enum: application,infrastructure,audit + application: + selector: #labelselector + includes: + - namespace: + container: + excludes: + - namespace: + container: + tuning: + ratelimitPerContainer: #rate limit applied to each container selected by this input + recordsPerSecond: #int (no multiplier, a each container only runs on one node at a time.) + infrastructure: + sources: [] #enum: node,container + audit: + sources: [] #enum: auditd,kubeAPI,openshiftAPI,ovn + receiver: + type: #enum: syslog,http + port: + http: + format: #enum: kubeAPIAudit , format of incoming data + tls: + ca: + key: #the key in the resource + configmap: + name: # the name of resource + secret: + name: # the name of resource + certificate: + key: #the key in the resource + configmap: + name: # the name of resource + secret: + name: # the name of resource + key: + key: #the key in the resource + secret: + name: # the name of resource + keyPassphrase: + key: #the key in the resource + secret: + name: # the name of resource + filters: + - name: + type: #enum: kubeAPIaudit, detectMultilineException, parse, openshiftLabels, drop, prune + kubeAPIAudit: + parse: + pipelines: + - inputRefs: [] + outputRefs: [] + filterRefs: [] + outputs: + - name: + type: #enum: azureMonitor,cloudwatch,elasticsearch,googleCloudLogging,http,kafka,loki,lokiStack,splunk,syslog + tls: + ca: + key: #the key in the resource + configmap: + name: # the name of resource + secret: + name: # the name of resource + certificate: + key: #the key in the resource + configmap: + name: # the name of resource + secret: + name: # the name of resource + key: + key: #the key in the resource + secret: + name: # the name of resource + keyPassphrase: + key: #the key in the resource + insecureSkipVerify: #bool + securityProfile: #openshiftv1.TLSSecurityProfile + rateLimit: + recordsPerSecond: #int - document per-forwarder/per-node multiplier + azureMonitor: + customerId: + logType: + azureResourceId: + host: + authorization: + sharedKey: + key: + secret: + name: # the name of resource + tuning: + delivery: # enum: AtMostOnce, AtLeastOnce + maxWrite: # quantity (e.g. 500k) + minRetryDuration: + maxRetryDuration: + cloudwatch: + region: + groupBy: # enum. should support templating? + groupPrefix: # should support templating? + authorization: # output specific auth keys + tuning: + delivery: # enum: AtMostOnce, AtLeastOnce + maxWrite: # quantity (e.g. 500k) + compression: # enum of supported algos specific to the output + minRetryDuration: + maxRetryDuration: + elasticsearch: + url: + version: + index: # templating? do we need structured key/name or is this good enough + authorization: # output specific auth keys + tuning: + delivery: # enum: AtMostOnce, AtLeastOnce + maxWrite: # quantity (e.g. 500k) + compression: # enum of supported algos specific to the output + minRetryDuration: + maxRetryDuration: + googleCloudLogging: + billingAccountID: + organizationID: + folderID: # templating? + projectID: # templating? + logID: # templating? + authorization: # output specific auth keys + tuning: + delivery: # enum: AtMostOnce, AtLeastOnce + maxWrite: # quantity (e.g. 
500k) + compression: # enum of supported algos specific to the output + minRetryDuration: + maxRetryDuration: + http: + url: + headers: + timeout: + method: + authorization: # output specific auth keys + tuning: + delivery: # enum: AtMostOnce, AtLeastOnce + maxWrite: # quantity (e.g. 500k) + compression: # enum of supported algos specific to the output + minRetryDuration: + maxRetryDuration: + kafka: + url: + topic: #templating? + brokers: + authorization: # output specific auth keys + tuning: + delivery: # enum: AtMostOnce, AtLeastOnce + maxWrite: # quantity (e.g. 500k) + compression: # enum of supported algos specific to the output + loki: + url: + tenant: # templating? + labelKeys: + authorization: # output specific auth keys + tuning: + delivery: # enum: AtMostOnce, AtLeastOnce + maxWrite: # quantity (e.g. 500k) + compression: # enum of supported algos specific to the output + minRetryDuration: + maxRetryDuration: + lokiStack: # RH managed loki stack with RH tenant model + target: + name: + namespace: + labelKeys: + authorization: + token: + key: + secret: + name: # the name of resource + serviceAccount: + name: + username: + key: + secret: + name: # the name of resource + password: + key: + secret: + name: # the name of resource + tuning: + delivery: # enum: AtMostOnce, AtLeastOnce + maxWrite: # quantity (e.g. 500k) + compression: # enum of supported algos specific to the output + minRetryDuration: + maxRetryDuration: + splunk: + url: + index: #templating? + authorization: + secret: #the secret to search for keys + name: + # output specific auth keys + tuning: + delivery: # enum: AtMostOnce, AtLeastOnce + maxWrite: # quantity (e.g. 500k) + compression: # enum of supported algos specific to the output + minRetryDuration: + maxRetryDuration: + syslog: #only supports RFC5424? + url: + severity: + facility: + trimPrefix: + tagKey: #templating? + payloadKey: #templating? + addLogSource: + appName: #templating? + procID: #templating? + msgID: #templating? 
+status: + conditions: # []metav1.conditions + inputs: # []metav1.conditions + outputs: # []metav1.conditions + filters: # []metav1.conditions + pipelines: # []metav1.conditions +``` + +Example: + +```yaml +apiVersion: "observability.openshift.io/v1" +kind: ClusterLogForwarder +metadata: + name: log-collector + namespace: acme-logging +spec: + outputs: + - name: rh-loki + type: lokiStack + service: + namespace: openshift-logging + name: rh-managed-loki + authorization: + resource: + name: audit-collector-sa-token + token: + key: token + inputs: + - name: infra-container + type: infrastructure + infrastructure: + sources: [container] + serviceAccount: + name: audit-collector-sa + pipelines: + - inputRefs: + - infra-container + - audit + outputRefs: + - rh-loki +``` + +This example: + +* Deploys a log collector to the `acme-logging` namespace +* Expects the administrator to have created a service account named `audit-collector-sa` in that namespace +* Expects the administrator to have created a secret named `audit-collector-sa-token` in that namespace with a key named token that is a bearer token +* Expects the administrator to have bound the roles `collect-audit-logs`, `collect-infrastructure-logs` to the service account +* Expects the administrator created a **LokiStack** CR named `rh-managed-loki` in the `openshift-logging` namespace +* Collects all audit log sources and only infrastructure container logs and writes them to the Red Hat managed lokiStack + +### Topology Considerations +#### Hypershift / Hosted Control Planes +#### Standalone Clusters +#### Single-node Deployments or MicroShift + + +### Implementation Details/Notes/Constraints [optional] + +#### Log Storage + +Deployment of log storage is a separate task of the administrator. They deploy a custom resource to be managed by the **loki-operator**. They will additionally specify forwarding logs to this storage by defining an output in the **ClusterLogForwarder**. Deployment of Red Hat managed log storage is optional and not a requirement for log forwarding. + +#### Log Visualization + +The **cluster-observability-operator** will take ownership of the management of the **console-logging-plugin** which replaces the **log-view-plugin**. This requires feature changes to the operator and the OpenShift console before being fully realized. Earlier version of the **cluster-logging-operator** will be updated with logic (TBD) to recognize the **cluster-observability-operator** is able to deploy the plugin and will remove its own deployment in deference to the **cluster-observability-operator**. Deployment of log visualization is optional and not a requirement for log forwarding. + +#### Log Collection and Forwarding + +*observability.openshift.io/v1* of the **ClusterLogForwarder** depends upon a **ServiceAccount** to which roles must be bound that allow elevated permissions (e.g. mounting node filesystem, collecting logs). + +The Red Hat managed logstore is represented by a `lokiStack` output type defined without an URL +with the following assumptions: + +* Named the same as a **LokiStack** CR deployed in the `openshift-logging` namespace +* Follows the logging tenant model + +The **cluster-logging-operator** will: + +* Internally migrate the **ClusterLogForwarder** to craft the URL to the **LokiStack** + +#### Data Model + +**ClusterLogForwarder** API allows for users to spec the format of data that is forwarded to an output. Various models are provided to allow users to embrace industry trends (e.g. 
OTEL) +while also offering the capability to continue with the current model. This will allow consumers to continue to use existing tooling while offering options for transitioning to other models +when they are ready. + +##### ViaQ + +The ViaQ model is the original data model that has been provided since the inception of OpenShift logging. The model has not been generally publicly documented until relatively recently. It +can be verbose and was subject to subtle change causing issues for users because of the lack of documentation. This enhancement document intends to rectify that. + +###### V1 + +Refer to the following reference documentation for model details: + +* [Container Logs](https://github.com/openshift/cluster-logging-operator/blob/release-5.9/docs/reference/datamodels/viaq/v1.adoc#viaq-data-model-for-containers) +* [Journald Node Logs](https://github.com/openshift/cluster-logging-operator/blob/release-5.9/docs/reference/datamodels/viaq/v1.adoc#viaq-data-model-for-journald) +* [Kubernetes & OpenShift API Events](https://github.com/openshift/cluster-logging-operator/blob/release-5.9/docs/reference/datamodels/viaq/v1.adoc#viaq-data-model-for-kubernetes-api-events) + +###### V2 + +The progression of the ViaQ data model strives to be succinct by removing fields that have been reported by customers as extraneous. + +Container log: +```yaml +model_version: v2.0 +timestamp: +hostname: +severity: +kubernetes: + container_image: + container_name: + pod_name: + namespace_name: + namespace_labels: #map[string]string: underscore, dedotted, deslashed + labels: #map[string]string: underscore, dedotted, deslashed + stream: #enum: stdout,stderr +message: #string: optional. only preset when structured is not +structured: #map[string]: optional. only present when message is not +openshift: + cluster_id: + log_type: #enum: application, infrastructure, audit + log_source: #journal, ovn, etc + sequence: #int: atomically increasing number during the life of the collector process to be used with the timestamp + labels: #map[string]string: additional labels added to the record defined on a pipeline +``` + +Event Log: + +```yaml +model_version: v2.0 +timestamp: +hostname: +event: + uid: + object_ref_api_group: + object_ref_api_version: + object_ref_name: + object_ref_resource: + request_received_timestamp: + response_status_code: + stage: + stage_timestamp: + user_groups: [] + user_name: + user_uid: + user_agent: + verb: +openshift: + cluster_id: + log_type: #audit + log_source: #enum: kube,openshift,ovn,auditd + labels: #map[string]string: additional labels added to the record defined on a pipeline +``` +Journald Log: + +```yaml + model_version: v2.0 + timestamp: + message: + hostname: + systemd: + t: #map + u: #map + openshift: + cluster_id: + log_type: #infrastructure + log_source: #journald + labels: #map[string]string: additional labels added to the record defined on a pipeline +``` + +### Risks and Mitigations + +#### User Experience + +The product is no longer offering a "one-click" experience for deploying a full logging stack from collection to storage. Given we started moving away from this experience when Loki was introduced, this should be low risk. Many customers already have their own log storage solution so they are only making use of log forwarding. Additionally, it is intended for the **cluster-observability-operator** to recognize the existance of the internally managed log storage and automatically deploy the view plugin. This should reduce the burden of administrators. 
+
+#### Security
+
+The risk of forwarding logs to unauthorized destinations remains the same as in previous releases. This enhancement embraces the design from
+[multi cluster log forwarding](https://github.com/openshift/enhancements/blob/master/enhancements/cluster-logging/multi-cluster-log-forwarder.md) by requiring administrators to provide a
+service account with the proper permissions. The permission scheme relies upon RBAC offered by the platform and places the control in the hands of administrators.
+
+### Drawbacks
+
+The largest drawback to implementing new APIs is that the product continues to identify the
+availability of technologies which are deprecated and will soon no longer be supported. This will
+continue to confuse consumers of logging and will require documentation and explanations of our technology decisions. Furthermore, some customers will continue to delay the move to the newer technologies provided by Red Hat.
+
+## Open Questions [optional]
+
+## Test Plan
+
+* Execute all existing tests for log collection, forwarding and storage with the exception of tests specifically intended to test deprecated features (e.g. Elasticsearch). Functionally, other tests are still applicable.
+* Execute a test to verify the flow defined for collecting, storing, and visualizing logs from an on-cluster, Red Hat operator managed LokiStack
+* Execute a test to verify legacy deployments of logging are no longer managed by the **cluster-logging-operator** after upgrade.
+
+## Graduation Criteria
+
+### Dev Preview -> Tech Preview
+
+### Tech Preview -> GA
+
+This release:
+
+* Intends to support the use-cases described within this proposal
+* Intends to distribute the *ClusterLogForwarder.observability.openshift.io/v1* API described within this proposal
+* Drops support of the *ClusterLogging.logging.openshift.io/v1* API
+* Deprecates support of the *ClusterLogForwarder.logging.openshift.io/v1* API
+* Stops any feature development to support the *ClusterLogForwarder.logging.openshift.io/v1* API
+* May support multiple data models (e.g. OpenTelemetry, ViaQ v2)
+
+### Removing a deprecated feature
+
+Upon GA release of this enhancement:
+
+- The internally managed Elastic (e.g. Elasticsearch, Kibana) offering will no longer be available.
+- The Fluentd collector implementation will no longer be available.
+- The *ClusterLogForwarder.logging.openshift.io/v1* API is deprecated and is intended to be removed two z-stream releases after GA of this enhancement.
+- The *ClusterLogging.logging.openshift.io/v1* API will no longer be available.
+
+## Upgrade / Downgrade Strategy
+
+The **cluster-logging-operator** will internally convert the *ClusterLogForwarder.logging.openshift.io/v1* resources to
+*ClusterLogForwarder.observability.openshift.io/v1* and identify the original resource as deprecated. The operator will return an error for any resource
+that cannot be converted, for example, a forwarder that is utilizing the FluentdForward output type. Once migrated, the operator will continue to reconcile it. Log forwarders depending upon Fluentd collectors will be re-deployed with Vector collectors. Fluentd deployments forwarding to FluentdForward endpoints will be unsupported.
+
+**Note:** No new features will be added to *ClusterLogForwarder.logging.openshift.io/v1*.
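+
+For illustration only, the kind of conversion described above might look roughly like the sketch below.
+The legacy fields shown and the exact shape of the converted resource are assumptions; the operator
+determines the actual mapping and the service account it associates with the migrated resource.
+
+```yaml
+# Hypothetical legacy forwarder (logging.openshift.io/v1); names are illustrative
+apiVersion: logging.openshift.io/v1
+kind: ClusterLogForwarder
+metadata:
+  name: instance
+  namespace: openshift-logging
+spec:
+  outputs:
+  - name: my-loki
+    type: loki
+    url: https://loki.example.com:3100
+  pipelines:
+  - name: app-to-loki
+    inputRefs: [application]
+    outputRefs: [my-loki]
+---
+# Possible converted form (observability.openshift.io/v1); field layout follows the example
+# earlier in this document, and the serviceAccount name is an assumption
+apiVersion: observability.openshift.io/v1
+kind: ClusterLogForwarder
+metadata:
+  name: instance
+  namespace: openshift-logging
+spec:
+  serviceAccount:
+    name: logcollector # assumption: an existing SA bound to a collect-application-logs role
+  outputs:
+  - name: my-loki
+    type: loki
+    loki:
+      url: https://loki.example.com:3100
+  pipelines:
+  - name: app-to-loki
+    inputRefs: [application]
+    outputRefs: [my-loki]
+```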
+ +**LokiStack** is unaffected by this proposal and not managed by the **cluster-logging-operator** + +## Version Skew Strategy + +## Operational Aspects of API Extensions + +## Support Procedures + +## Alternatives + +Given most of the changes will result in an operator that manages only log collection and forwarding, we could release a new operator for that purpose only that provides only *ClusterLogForwarder.observability.openshift.io/v1* APIs + +## Infrastructure Needed [optional] + From 8ecd1c466f5ce96da55b165e70c9e515190ccfab Mon Sep 17 00:00:00 2001 From: Jeff Cantrill Date: Mon, 3 Jun 2024 11:55:42 -0400 Subject: [PATCH 06/53] LOG-5613: Refactor logging API to make GCP account an enum --- .../logs-observability-openshift-io-apis.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/enhancements/cluster-logging/logs-observability-openshift-io-apis.md b/enhancements/cluster-logging/logs-observability-openshift-io-apis.md index 98d6c60364..25a4efa08b 100644 --- a/enhancements/cluster-logging/logs-observability-openshift-io-apis.md +++ b/enhancements/cluster-logging/logs-observability-openshift-io-apis.md @@ -259,10 +259,9 @@ spec: minRetryDuration: maxRetryDuration: googleCloudLogging: - billingAccountID: - organizationID: - folderID: # templating? - projectID: # templating? + ID: + type: #enum: billingAccount,folder,project,organization + value: logID: # templating? authorization: # output specific auth keys tuning: From 44f95123c3adf21fab68683c618d686316199850 Mon Sep 17 00:00:00 2001 From: Prashanth684 Date: Mon, 3 Jun 2024 16:51:52 -0700 Subject: [PATCH 07/53] Update CVO status field details --- .../dynamic-imagestream-importmode-setting.md | 49 ++++++++++++------- 1 file changed, 31 insertions(+), 18 deletions(-) diff --git a/enhancements/multi-arch/dynamic-imagestream-importmode-setting.md b/enhancements/multi-arch/dynamic-imagestream-importmode-setting.md index 8d0537572b..6ab1368b96 100644 --- a/enhancements/multi-arch/dynamic-imagestream-importmode-setting.md +++ b/enhancements/multi-arch/dynamic-imagestream-importmode-setting.md @@ -119,9 +119,9 @@ API essentially allowed users to toggle between: #### Installation 1. User triggers installation through openshift-installer. -2. CVO inspects the payload and updates the status -field of `ClusterVersion` with the payload type -(inferred through +2. CVO inspects the payload and updates the `Desired` +status field of `ClusterVersion` with the payload +type (inferred through `release.openshift.io/architecture`). 3. cluster-image-registry-operator updates the `imageStreamImportMode` status field based on @@ -146,8 +146,8 @@ payload. 2. User triggers an update to multi-arch payload using `oc adm upgrade --to-multi-arch`. 3. CVO triggers the update, inspects the payload -and updates the status field of `ClusterVersion` -with the payload type. +and updates the `Desired` status field of +`ClusterVersion` with the payload type. 4. The rest is the same as steps 3-6 in the install section. @@ -173,6 +173,27 @@ openshift-apiserver ### API Extensions +- CVO would add a new field to the `Desired` + status whose value would either be `multi` or + empty (single-arch) - which indicates the version it + is trying to reconcile to. This would be the same + as the `Architecture` field that can optionally be + set for updates[^6]. + +``` +type Release struct { +... +... +// architecture is an optional field that indicates the +// value of the cluster architecture. 
In this context cluster +// architecture means either a single architecture or a multi +// architecture. +// Valid values are 'Multi' and empty. +// +// +optional +Architecture ClusterVersionArchitecture `json:"architecture,omitempty"` +} +``` - image.config.openshift.io: Add `ImageStreamImportMode` config flag to CRD spec and status. @@ -299,7 +320,7 @@ N/A ### Implementation Details/Notes/Constraints - CVO: The cluster version operator would expose a - field in its status indicating the + field in its `Desired` status indicating the type of the payload. This is inferred by CVO from the `release.openshift.io/architecture` property in the release payload. If the payload is a multi @@ -393,16 +414,7 @@ given for why they were not pursued: ## Open Questions -- How should CVO expose the payload type in the - Status section? is it through: - - string value of `single` and `multi` - - individual architectures, i.e - `amd64`/`arm64`/`ppc64le`/`s390x` and - `multi`? - - array which lists all architectures for - `multi` and just one for single arch? - (in case there is ever a possibility of multiple - payloads with only certain architectures). +N/A ## Test Plan @@ -499,6 +511,7 @@ affected customers about the remediation. ## References [^1]:https://github.com/openshift/cluster-version-operator/pull/1035#issue-2133730335 [^2]:https://github.com/openshift/enhancements/blob/master/dev-guide/cluster-fleet-evaluation.md -[^3]:https://github.com/openshift/api/blob/master/config/v1/types_cluster_operator.go#L211 +[^3]:https://github.com/openshift/api/blob/b01900f1982a40d2b71a3c742de5755f3f28264f/config/v1/types_cluster_operator.go#L211 [^4]:https://github.com/openshift/enhancements/blob/master/dev-guide/cluster-fleet-evaluation.md#interacting-with-telemetry -[^5]:https://github.com/openshift/enhancements/blob/master/dev-guide/cluster-fleet-evaluation.md#alerting-customers \ No newline at end of file +[^5]:https://github.com/openshift/enhancements/blob/master/dev-guide/cluster-fleet-evaluation.md#alerting-customers +[^6]:https://github.com/openshift/api/blob/b01900f1982a40d2b71a3c742de5755f3f28264f/config/v1/types_cluster_version.go#L654-L664 \ No newline at end of file From cab9bec7f3a2aa58c9729af2fa95b5295501bdfb Mon Sep 17 00:00:00 2001 From: Hongkai Liu Date: Mon, 10 Jun 2024 17:29:29 -0400 Subject: [PATCH 08/53] dev-guide/operators.md: update the CI cluster's domain --- dev-guide/operators.md | 24 +++++++++--------------- 1 file changed, 9 insertions(+), 15 deletions(-) diff --git a/dev-guide/operators.md b/dev-guide/operators.md index 2c926a8477..6ff6cf7f76 100644 --- a/dev-guide/operators.md +++ b/dev-guide/operators.md @@ -55,12 +55,12 @@ the CVO manages all of the COs (in this way ClusterOperators are also operands). ## What is an OpenShift release image? To get a list of the components and their images that comprise an OpenShift release image, grab a -release from the [openshift release page](https://openshift-release.svc.ci.openshift.org/) and run: +release from the [openshift release page](https://amd64.ocp.releases.ci.openshift.org/) and run: ```console -$ oc adm release info registry.svc.ci.openshift.org/ocp/release:version +$ oc adm release info registry.ci.openshift.org/ocp/release:version ``` -If the above command fails, you may need to authenticate against `registry.svc.ci.openshift.org`. +If the above command fails, you may need to authenticate against `registry.ci.openshift.org`. 
If you are an OpenShift developer, see [authenticating against ci registry](#authenticating-against-ci-registry) You'll notice that currently the release payload is just shy of 100 images. @@ -232,7 +232,7 @@ or remove the overrides section you added in `clusterversion/version`. ### OPTION B - LAUNCH A CLUSTER WITH YOUR CHANGES #### Build a new release image that has your test components built in For this example I'll start with the release image -`registry.svc.ci.openshift.org/ocp/release:4.2` +`registry.ci.openshift.org/ocp/release:4.2` and test a change to the `github.com/openshift/openshift-apiserver` repository. 1. Build the image and push it to a registry (use any containers cli, quay.io, docker.io) @@ -246,15 +246,15 @@ $ buildah push quay.io/yourname/openshift-apiserver:test 2. Assemble a release payload with your test image and push it to a registry Get the name of the image (`openshift-apiserver`) you want to substitute: ```console -$ oc adm release info registry.svc.ci.openshift.org/ocp/release:4.2 +$ oc adm release info registry.ci.openshift.org/ocp/release:4.2 ``` -If the above command fails, you may need to authenticate against `registry.svc.ci.openshift.org`. +If the above command fails, you may need to authenticate against `registry.ci.openshift.org`. If you are an OpenShift developer, see [authenticating against ci registry](#authenticating-against-ci-registry) This command will assemble a release payload incorporating your test image _and_ will push it to the quay.io repository. Be sure to set this repository in quay.io as `public`. ```console -$ oc adm release new --from-release registry.svc.ci.openshift.org/ocp/release:4.2 \ +$ oc adm release new --from-release registry.ci.openshift.org/ocp/release:4.2 \ openshift-apiserver=quay.io/yourname/openshift-apiserver:test \ --to-image quay.io/yourname/release:test ``` @@ -301,20 +301,14 @@ plus operator image build for such operators. (Internal Red Hat registries for developer testing) ### registry.ci.openshift.org -- Login at https://oauth-openshift.apps.ci.l2s4.p1.openshiftapps.com/oauth/token/request with your github account. This may require you to have access to the internal "OpenShift" Github organization so if you have access issues, double-check that you have access to that org and try requesting it if you don't have it. +- Login at https://oauth-openshift.apps.ci.l2s4.p1.openshiftapps.com/oauth/token/request with your Kerberos ID at Red Hat SSO. - Once logged in, an API token will be displayed. Please copy it. 
- Then login to a `registry.json` file like this ```bash -$ podman login --authfile registry.json -u ${GITHUB_USER} -p ${TOKEN} +$ podman login --authfile registry.json -u ${KERBEROS_ID} -p ${TOKEN} ``` -### registry.svc.ci.openshift.org -Add the necessary credentials to your local `~/.docker/config.json` (or equivalent file) like so: -- visit `https://api.ci.openshift.org`, `upper right corner '?'` dropdown to `Command Line Tools` -- copy the given `oc login https://api.ci.openshift.org --token=`, paste in your terminal -- then run `oc registry login` to add your credentials to your local config file _usually ~/.docker/config.json_ - ## Authenticating against quay registry Add the necessary credentials to your local `~/.docker/config.json` (or equivalent file) like so: - Visit `https://try.openshift.com`, `GET STARTED`, login w/ RedHat account if not already, From 6a5fbafd7d3018b072d7ae9bf192151972a63b01 Mon Sep 17 00:00:00 2001 From: Patryk Matuszak Date: Tue, 11 Jun 2024 09:02:41 +0200 Subject: [PATCH 09/53] low latency workloads on microshift --- .../low-latency-workloads-on-microshift.md | 612 ++++++++++++++++++ 1 file changed, 612 insertions(+) create mode 100644 enhancements/microshift/low-latency-workloads-on-microshift.md diff --git a/enhancements/microshift/low-latency-workloads-on-microshift.md b/enhancements/microshift/low-latency-workloads-on-microshift.md new file mode 100644 index 0000000000..15a113f6f9 --- /dev/null +++ b/enhancements/microshift/low-latency-workloads-on-microshift.md @@ -0,0 +1,612 @@ +--- +title: low-latency-workloads-on-microshift +authors: + - "@pmtk" +reviewers: + - "@sjug, Performance and Scalability expert" + - "@DanielFroehlich, PM" + - "@jogeo, QE lead" + - "@eslutsky, working on MicroShift workload partitioning" + - "@pacevedom, MicroShift lead" +approvers: + - "@jerpeter1" +api-approvers: + - TBD +creation-date: 2024-06-12 +last-updated: 2024-06-12 +tracking-link: + - USHIFT-2981 +--- + +# Low Latency workloads on MicroShift + +## Summary + +This enhancement describes how low latency workloads will be supported on MicroShift hosts. + + +## Motivation + +Some customers want to run latency sensitive workload like software defined PLCs. + +Currently it's possible, but requires substantial amount of knowledge to correctly configure +all involved components whereas in OpenShift everything is abstracted using Node Tuning Operator's +PerformanceProfile CR. +Therefore, this enhancement focuses on making this configuration easy for customers by shipping +ready to use packages to kickstart customers' usage of low latency workloads. + + +### User Stories + +* As a MicroShift administrator, I want to configure MicroShift host, + so that I can run low latency workloads. + + +### Goals + +Provide relatively easy way to configure system for low latency workload running on MicroShift: +- Prepare low latency TuneD profile for MicroShift +- Prepare necessary CRI-O configurations +- Allow configuration of Kubelet via MicroShift config +- Add small systemd daemon to enable TuneD profile and (optionally) reboot the host if the kernel + arguments change to make them effective + + +### Non-Goals + +- Workload partitioning (i.e. 
pinning MicroShift control plane components) +- Duplicate all capabilities of Node Tuning Operator + + +## Proposal + +To ease configuration of the system for running low latency workloads on MicroShift following +parts need to be put in place: +- TuneD profile +- CRI-O configuration + Kubernetes' RuntimeClass +- Kubelet configuration (CPU, Memory, and Topology Managers) +- Small systemd daemon to activate TuneD profile on boot and reboot the host if the kernel args + are changed. + +New RPM will be created that will contain tuned profile, CRI-O configs, and mentioned systemd daemon. +We'll leverage existing know how of Performance and Scalability team expertise and look at +Node Tuning Operator capabilities. + +To allow customization of supplied TuneD profile for specific system, this new profile will +include instruction to include file with variables, which can be overridden by the user. + +All of this will be accompanied by step by step documentation on how to use this feature, +tweak values for specific system, and what are the possibilities and limitations. + +Optionally, a new subcommand `microshift doctor low-latency` might be added to main +MicroShift binary to provide some verification checks if system configuration is matching +expectations according to our knowledge. It shall not configure system - only report potential problems. + + +### Workflow Description + +Workflow consists of two parts: +1. system and MicroShift configuration +1. preparing Kubernetes manifests for low latency + +#### System and MicroShift configuration + +##### OSTree + +1. User creates an osbuild blueprint: + - (optional) User configures `[customizations.kernel]` in the blueprint if the values are known + beforehand. This could prevent from necessary reboot after applying tuned profile. + - User adds `kernel-rt` package to the blueprint + - User adds `microshift-low-latency` RPM to the blueprint + - User enables `microshift-tuned.service` + - User supplies additional configs using blueprint: + - /etc/tuned/microshift-low-latency-variables.conf + - /etc/microshift/config.yaml to configure Kubelet +1. User builds the blueprint +1. User deploys the commit / installs the system. +1. System boots +1. `microshift-tuned.service` starts (after `tuned.service`, before `microshift.service`): + - Saves current kernel args + - Applies tuned `microshift-low-latency` profile + - Verifies expected kernel args + - ostree: `rpm-ostree kargs` + - rpm: `grubby` + - If the current and expected kernel args are different, reboot the node +1. Host boots again, everything for low latency is in place, + `microshift.service` can continue start up. + +##### RPM + +1. User installs `microshift-low-latency` RPM. +1. User creates following configs: + - /etc/tuned/microshift-low-latency-variables.conf + - /etc/microshift/config.yaml to configure Kubelet +1. User enables `microshift-tuned.service` or uses `tuned-adm` directly to activate the profile + (and reboot the host if needed). +1. If host was not rebooted, CRI-O and MicroShift services need to be restarted to make new settings + active. + +#### Preparing low latency workload + +- Setting `.spec.runtimeClassName: microshift-low-latency` in Pod spec. +- Setting Pod's memory limit and memory request to the same value, and + setting CPU limit and CPU request to the same value to ensure Pod has guaranteed QoS class. 
+- Use annotations to get desired behavior: + - `cpu-load-balancing.crio.io: "disable"` - disable CPU load balancing for Pod + (only use with CPU Manager `static` policy and for Guaranteed QoS Pods using whole CPUs) + - `cpu-quota.crio.io: "disable"` - disable Completely Fair Scheduler (CFS) + - `irq-load-balancing.crio.io: "disable"` - disable interrupt processing + (only use with CPU Manager `static` policy and for Guaranteed QoS Pods using whole CPUs) + - `cpu-c-states.crio.io: "disable"` - disable C-states + - `cpu-freq-governor.crio.io: ""` - specify governor type for CPU Freq scaling (e.g. `performance`) + +### API Extensions + +Following API extensions are expected: +- A passthrough from MicroShift to Kubelet config. +- Variables file for TuneD profile to allow customization of the profile for specific host. + + +### Topology Considerations + +#### Hypershift / Hosted Control Planes + +N/A + +#### Standalone Clusters + +N/A + +#### Single-node Deployments or MicroShift + +Purely MicroShift enhancement. + +### Implementation Details/Notes/Constraints + +#### TuneD Profile + +New `microshift-low-latency` tuned profile will be created and will include existing `cpu-partitioning` profile. + +`/etc/tuned/microshift-low-latency-variables.conf` will be used by users to provide custom values for settings such as: +- isolated CPU set +- hugepage count (both 2M and 1G) +- additional kernel arguments + +```ini +[main] +summary=Optimize for running low latency workloads on MicroShift +include=cpu-partitioning + +[variables] +include=/etc/tuned/microshift-low-latency-variables.conf + +[bootloader] +cmdline_microshift=+default_hugepagesz=${hugepagesDefaultSize} hugepagesz=2M hugepages=${hugepages2M} hugepagesz=1G hugepages=${hugepages1G} +cmdline_additionalArgs=+${additionalArgs} +``` + +```ini +### cpu-partitioning variables +# +# Core isolation +# +# The 'isolated_cores=' variable below controls which cores should be +# isolated. By default we reserve 1 core per socket for housekeeping +# and isolate the rest. But you can isolate any range as shown in the +# examples below. Just remember to keep only one isolated_cores= line. +# +# Examples: +# isolated_cores=2,4-7 +# isolated_cores=2-23 +# +# Reserve 1 core per socket for housekeeping, isolate the rest. +# Change this for a core list or range as shown above. +isolated_cores=${f:calc_isolated_cores:1} + +# To disable the kernel load balancing in certain isolated CPUs: +# no_balance_cores=5-10 + +### microshift-low-latency variables +# Default hugepages size +hugepagesDefaultSize = 2M + +# Amount of 2M hugepages +hugepages2M = 128 + +# Amount of 1G hugepages +hugepages1G = 0 + +# Additional kernel arguments +additionalArgs = "" +``` + +#### CRI-O configuration + +```ini +[crio.runtime.runtimes.high-performance] +runtime_path = "/bin/crun" +runtime_type = "oci" +runtime_root = "/bin/crun" +allowed_annotations = ["cpu-load-balancing.crio.io", "cpu-quota.crio.io", "irq-load-balancing.crio.io", "cpu-c-states.crio.io", "cpu-freq-governor.crio.io"] +``` + +#### Kubelet configuration + +Because of multitude of option in Kubelet configuration, a simple passthrough (copy paste) +will be implemented, rather than exposing every single little configuration variable. 
+ +```yaml +# /etc/microshift/config.yaml +kubelet: + cpuManagerPolicy: static + cpuManagerPolicyOptions: + full-pcpus-only: "true" + cpuManagerReconcilePeriod: 5s + memoryManagerPolicy: Static + topologyManagerPolicy: single-numa-node + reservedSystemCPUs: 0,28-31 + reservedMemory: + - limits: + memory: 1100Mi + numaNode: 0 + kubeReserved: + memory: 500Mi + systemReserved: + memory: 500Mi + evictionHard: + imagefs.available: 15% + memory.available: 100Mi + nodefs.available: 10% + nodefs.inodesFree: 5% + evictionPressureTransitionPeriod: 0s +``` +will be passed-through to kubelet config as: +```yaml +cpuManagerPolicy: static +cpuManagerPolicyOptions: + full-pcpus-only: "true" +cpuManagerReconcilePeriod: 5s +memoryManagerPolicy: Static +topologyManagerPolicy: single-numa-node +reservedSystemCPUs: 0,28-31 +reservedMemory: +- limits: + memory: 1100Mi + numaNode: 0 +kubeReserved: + memory: 500Mi +systemReserved: + memory: 500Mi +evictionHard: + imagefs.available: 15% + memory.available: 100Mi + nodefs.available: 10% + nodefs.inodesFree: 5% +evictionPressureTransitionPeriod: 0s +``` + +#### Extra manifests + +Connects Pod's `.spec.runtimeClassName` to CRI-O's runtime. +If Pod has `.spec.runtimeClassName: microshift-low-latency`, +it can use annotations specified in CRI-O config with `crio.runtime.runtimes.high-performance`. + +```yaml +apiVersion: node.k8s.io/v1 +handler: high-performance +kind: RuntimeClass +metadata: + name: microshift-low-latency +``` + + +### Risks and Mitigations + +Biggest risk is system misconfiguration. +It is not known to author of the enhancement if there are configurations (like kernel params, sysctl, etc.) +that could brick the device, though it seems rather unlikely. +Even if kernel panic occurs after staging a deployment with new configuration, +thanks to greenboot functionality within the grub itself, the system will eventually rollback to +previous deployment. +Also, it is assumed that users are not pushing new image to production devices without prior verification on reference device. + +It may happen that some users need to use TuneD plugins that are not handled by the profile we'll create. +In such case we may investigate if it's something generic enough to include, or we can instruct them +to create new profile that would include `microshift-low-latency` profile. + +Systemd daemon we'll provide to enable TuneD profile should have a strict requirement before it +reboots the node, so it doesn't put it into a boot loop. +This pattern of reboot after booting affects the number of "effective" greenboot retries, +so customers might need to account for that by increasing the number of retries. + + +### Drawbacks + +Approach described in this enhancement does not provide much of the NTO's functionality +due to the "static" nature of RPMs and packaged files (compared to NTO's dynamic templating), +but it must be noted that NTO is going beyond low latency. + +One of the NTO's strengths is that it can create systemd units for runtime configuration +(such as offlining CPUs, setting hugepages per NUMA node, clearing IRQ balance banned CPUs, +setting RPS masks). Such dynamic actions are beyond capabilities of static files shipped via RPM. +If such features are required by users, we could ship such systemd units and they could be no-op +unless they're turned on in MicroShift's config. However, it is unknown to author of the enhancement +if these are integral part of the low latency. 
+ +## Open Questions [optional] + +- Verify if osbuild blueprint can override a file from RPM + (variables.conf needs to exist for tuned profile, so it's nice to have some fallback)? +- NTO runs tuned in non-daemon one shot mode using systemd unit. + Should we try doing the same or we want the tuned daemon to run continuously? +- NTO's profile includes several other beside cpu-partitioning: + [openshift-node](https://github.com/redhat-performance/tuned/blob/master/profiles/openshift-node/tuned.conf) + and [openshift](https://github.com/redhat-performance/tuned/blob/master/profiles/openshift/tuned.conf) - should we include them or incorporate their settings? +- NTO took an approach to duplicate many of the setting from included profiles - should we do the same? + > Comment: Probably no need to do that. `cpu-partitioning` profile is not changed very often, + > so the risk of breakage is low, but if they change something, we should get that automatically, right? +- Should we also provide NTO's systemd units for offlining CPUs, setting hugepages per NUMA node, clearing IRQ balance, setting RPS masks? + +## Test Plan + +Two aspect of testing: +- Configuration verification - making sure that what we ship configures what we need. + Previously mentioned `microshift doctor low-latency` might be reference point. + +- Runtime verification - making sure that performance is as expected. + This might include using tools such as: hwlatdetect, oslat, cyclictest, rteval, and others. + Some of the mentioned tools are already included in the `openshift-tests`. + This step is highly dependent on the hardware, so we might need to long-term lease some hardware in + Beaker to have consistent environment and results that can be compared between runs. + +## Graduation Criteria + +Feature is meant to be GA on first release. + +### Dev Preview -> Tech Preview + +Not applicable. + +### Tech Preview -> GA + +Not applicable. + +### Removing a deprecated feature + +Not applicable. + +## Upgrade / Downgrade Strategy + +Upgrade / downgrade strategy is not needed because there are almost no runtime components or configs +that would need migration. + +User installs the RPM with TuneD profile and configures MicroShift (either manually, +using blueprint, or using image mode) and that exact configuration is applied on boot +and MicroShift start. + +## Version Skew Strategy + +Potentially breaking changes to TuneD and CRI-O: +- Most likely only relevant when RHEL is updated to next major version. + - To counter this we might want a job that runs performance testing on specified hardware + to find regressions. +- We might introduce some CI job to keep us updated on changes to NTO's functionality related to TuneD. + +Changes to Kubelet configuration: +- Breaking changes to currently used `kubelet.config.k8s.io/v1beta1` are not expected. +- Using new version of the `KubeletConfiguration` will require deliberate changes in MicroShift, + so this aspect of MicroShift Config -> Kubelet Config will not go unnoticed. + + +## Operational Aspects of API Extensions + +Kubelet configuration will be exposed in MicroShift config as a passthrough. + + +## Support Procedures + +To find out any configuration issues: +- Documentation of edge cases and potential pitfalls discovered during implementation. +- `microshift doctor low-latency` command to verify if the pieces involved in tuning host for low latency + are as expected according to developers' knowledge. 
Mainly comparing values between different + config files, verifying that RT kernel is installed and booted, tuned profile is active, etc. + +To discover any performance issues not related to missing configuration: +- Adapting some parts of OpenShift documentation. +- Referring user to [Red Hat Enterprise Linux for Real Time](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_for_real_time/9) documentation. + +## Alternatives + +### Deploying Node Tuning Operator + +Most of the functionality discussed in scope of this enhancement is already handled by Node Tuning +Operator (NTO). However incorporating it in the MicroShift is not the best way for couple reasons: +- NTO depends on Machine Config Operator which also is not supported on MicroShift, +- MicroShift takes different approach to host management than OpenShift, +- MicroShift being intended for edge devices aims to reduce runtime resource consumption and + introducing operator is against this goal. + + +### Reusing NTO code + +Instead of deploying NTO, its code could be partially incorporated in the MicroShift. +However this doesn't improve the operational aspects: MicroShift would transform a CR into TuneD, +CRI-O config, and kubelet configuration, which means it's still a controller, just running in +different binary and that doesn't help with runtime resource consumption. + +Parts that depend on the MCO would need to be rewritten and maintained. + +Other aspect is that NTO is highly generic, supporting many configuration options to mix and match +by the users, but this enhancement focuses solely on Low Latency. + + +### Providing users with upstream documentations on how to use TuneD and configure CRI-O + +This is least UX friendly way of providing the functionality. +Responsibility of dev team is to remove common hurdles from user's path so they make less mistakes +and want to continue using the product. + + +## Infrastructure Needed [optional] + +Nothing. + +## Appendix + +### Mapping NTO's PerformanceProfile + +NTO's PerformanceProfile is transformed into following artifacts (depending on CR's content): +- [tuned profiles](https://github.com/openshift/cluster-node-tuning-operator/tree/master/assets/performanceprofile/tuned) +- [runtime scripts ran using systemd units](https://github.com/openshift/cluster-node-tuning-operator/tree/master/assets/performanceprofile/scripts) +- [static config files, e.g. CRI-O, systemd slices, etc.](https://github.com/openshift/cluster-node-tuning-operator/tree/master/assets/performanceprofile/configs) + + +Following is PerformanceProfileSpec broken into pieces and documented how each value affects +Kubelet, CRI-O, Tuned, Sysctls, or MachineConfig. + +- .CPU + - .Reserved - CPU set not used for any container workloads initiated by kubelet. Used for cluster and OS housekeeping duties. + > Relevant for workload partitioning, out of scope for low latency + - KubeletConfig: .ReservedSystemCPUs, unless .MixedCPUsEnabled=true, then .ReservedSystemCPUs = .Reserved Union .Shared + - CRI-O: + - `assets/performanceprofile/configs/99-workload-pinning.conf` + - `assets/performanceprofile/configs/99-runtimes.conf`: `infra_ctr_cpuset = "{{.ReservedCpus}}"` + - Sysctl: `assets/performanceprofile/configs/99-default-rps-mask.conf`: RPS Mask == .Reserved + - Kubernetes: `assets/performanceprofile/configs/openshift-workload-pinning` + - Tuned: `/sys/devices/system/cpu/cpufreq/policy{{ <> }}/scaling_max_freq={{$.ReservedCpuMaxFreq}}` + + - .Isolated - CPU set for the application container workloads. 
Should be used for low latency workload. + > Relevant for low latency + - Tuned: `isolated_cores={{.IsolatedCpus}}` + - Tuned: `/sys/devices/system/cpu/cpufreq/policy{{ <> }}/scaling_max_freq={{$.IsolatedCpuMaxFreq}}` + > Impossible to do without dynamic templating (each CPU in .Isolated CPU set needs separate line) + + - .BalanceIsolated - toggles if the .Isolated CPU set is eligible for load balancing work loads. + If `false`, Isolated CPU set is static, meaning workloads have to explicitly assign each thread + to a specific CPU in order to work across multiple CPUs. + - Tuned: true -> cmdline isolcpus=domain,managed_irq,${isolated_cores}, otherwise isolcpus=managed_irq,${isolated_cores} + > Not implemented. Users can use `cpu-load-balancing.crio.io` annotation instead. + + - .Offlined - CPU set be unused and set offline. + > Out of scope + - Systemd: unit running `assets/performanceprofile/scripts/set-cpus-offline.sh` + + - .Shared - CPU set shared among guaranteed workloads needing additional CPUs which are not exclusive. + > User configures in Kubelet config + - KubeletConfig: if .MixedCPUsEnabled=true, then .ReservedSystemCPUs = .Reserved Union .Shared + +- .HardwareTuning + - Tuned: if !.PerPodPowerManagement, then cmdline =+ `intel_pstate=active` + > cpu-partitioning sets `intel_pstate=disable`, if user wants different value they can use + > `additionalArgs` in `microshift-low-latency-variables.conf` - in case of duplicated parameters, + > last one takes precedence + - .IsolatedCpuFreq (int) - defines a minimum frequency to be set across isolated cpus + - Tuned: `/sys/devices/system/cpu/cpufreq/policy{{ <> }}/scaling_max_freq={{$.IsolatedCpuMaxFreq}}` + > Not doable without dynamic templating + - .ReservedCpuFreq (int) - defines a minimum frequency to be set across reserved cpus + - Tuned: `/sys/devices/system/cpu/cpufreq/policy{{ <> }}/scaling_max_freq={{$.ReservedCpuMaxFreq}}` + > Not doable without dynamic templating + +- .HugePages + - .DefaultHugePagesSize (string) + - Tuned: cmdline =+ default_hugepagesz=%s + > Handled + - .Pages (slice) + - .Size + - Tuned: cmdline =+ hugepagesz=%s + > Handled + - .Count + - Tuned: cmdline =+ hugepages=%d + > Handled + - .Node - NUMA node, if not provided, hugepages are set in kernel args + - If provided, systemd unit running `assets/performanceprofile/scripts/hugepages-allocation.sh` - creates hugepages for specific NUMA on boot + > Not supported. + +- .MachineConfigLabel - map[string]string of labels to add to the MachineConfigs created by NTO. +- .MachineConfigPoolSelector - defines the MachineConfigPool label to use in the MachineConfigPoolSelector of resources like KubeletConfigs created by the operator. +- .NodeSelector - NodeSelector defines the Node label to use in the NodeSelectors of resources like Tuned created by the operator. + +- .RealTimeKernel + > RT is implied with low latency, so no explicit setting like this. + - .Enabled - true = RT kernel should be installed + - MachineConfig: .Spec.KernelType = `realtime`, otherwise `default` + +- .AdditionalKernelArgs ([]string) + > Supported + - Tuned: cmdline += .AdditionalKernelArgs + +- .NUMA + > All of these settings are "exposed" as kubelet config for user to set themselves. + - .TopologyPolicy (string), defaults to best-effort + - Kubelet: .TopologyManagerPolicy. 
+ - If it's `restricted` or `single-numa-node` then also: + - kubelet.MemoryManagerPolicy = `static` + - kubelet.ReservedMemory + - Also, if `single-numa-node`: + - kubelet.CPUManagerPolicyOptions["full-pcpus-only"] = `true` + +- .Net + > Doing [net] per device would need templating. + > Doing global [net] is possible, although "Reserved CPU Count" + > suggests it's for control plane (workload partitioning) hence out of scope. + - .UserLevelNetworking (bool, default false) - true -> sets either all or specified net devices queue size to the amount of reserved CPUs + - Tuned: + - if .Device is empty, then: + ``` + [net] + channels=combined << ReserveCPUCount >> + nf_conntrack_hashsize=131072 + ``` + - if .Device not empty, then: each device gets following entry in tuned profile: + ``` + [netN] + type=net + devices_udev_regex=<< UDev Regex >> + channels=combined << ReserveCPUCount >> + nf_conntrack_hashsize=131072 + ``` + - .Device (slice) + - .InterfaceName + - .VendorID + - .DeviceID + +- .GloballyDisableIrqLoadBalancing (bool, default: false) - true: disable IRQ load balancing for the Isolated CPU set. + false: allow the IRQs to be balanced across all CPUs. IRQs LB can be disabled per Pod CPUs by using `irq-load-balancing.crio.io` and `cpu-quota.crio.io` annotations + ``` + [irqbalance] + enabled=false + ``` + > Not supported (though this is not difficult). Users can use `irq-load-balancing.crio.io: "disable"` annotation. + +- .WorkloadHints + - .HighPowerConsumption (bool) + - Tuned: cmdline =+ `processor.max_cstate=1 intel_idle.max_cstage=0` + - .RealTime (bool) + - MachineConfig: if false, don't add "setRPSMask" systemd or RPS sysctls + > Not a requirement. Sysctls can be handled with tuned (hardcoded), but systemd unit is out of scope. + - Tuned: cmdline =+ `nohz_full=${isolated_cores} tsc=reliable nosoftlockup nmi_watchdog=0 mce=off skew_tick=1 rcutree.kthread_prio=11` + > We can adapt some of kargs, but otherwise users can use `additionalArgs` variable. + - .PerPodPowerManagement (bool) + - Tuned: if true: cmdline += `intel_pstate=passive` + > Users can use `additionalArgs` to override default cpu-partitioning's `intel_pstate=disable` + - .MixedCPUs (bool) - enables mixed-cpu-node-plugin + > Seems to be special kind of plugin ([repo](https://github.com/openshift-kni/mixed-cpu-node-plugin)). + > Not present in MicroShift - not supported. + - Used for validation: error if: .MixedCPUs == true && .CPU.Shared == "". + - if true, then .ReservedSystemCPUs = .Reserved Union .Shared + +Default values: +- Kubelet + - .CPUManagerPolicy = `static` (K8s default: none) + - .CPUManagerReconcilePeriod = 5s (K8s default: 10s) + - .TopologyManagerPolicy = `best-effort` (K8s default: none) + - .KubeReserved[`memory`] = 500Mi + - .SystemReserved[`memory`] = 500Mi + - .EvictionHard[`memory.available`] = 100Mi (same as Kubernetes default) + - .EvictionHard[`nodefs.available`] = 10% (same as Kubernetes default) + - .EvictionHard[`imagefs.available`] = 15% (same as Kubernetes default) + - .EvictionHard[`nodefs.inodesFree`] = 5% (same as Kubernetes default) + +- MachineConfig: + - Also runs `assets/performanceprofile/scripts/clear-irqbalance-banned-cpus.sh` + > Unsupported. 
From 190596209fe583ce77391f1888d08df4f7f2b148 Mon Sep 17 00:00:00 2001 From: Hemant Kumar Date: Thu, 13 Jun 2024 15:52:00 -0400 Subject: [PATCH 10/53] Move steps for migrating code to csi-operator to ocp enhancementsn --- .../storage/csi-driver-operator-merge.md | 64 ++++++++++++++++++- 1 file changed, 63 insertions(+), 1 deletion(-) diff --git a/enhancements/storage/csi-driver-operator-merge.md b/enhancements/storage/csi-driver-operator-merge.md index e607cda448..bfd1f69ec6 100644 --- a/enhancements/storage/csi-driver-operator-merge.md +++ b/enhancements/storage/csi-driver-operator-merge.md @@ -469,6 +469,68 @@ N/A Same as today. +## Process of moving operators to csi-operator monorepo + +We have come with following flow for moving operators from their own repository into csi-operator mono repo. + +### Avoid .gitignore related footguns + +Remove any .gitignore entries in the source repository that would match a directory / file that we need. For example, `azure-disk-csi-driver-operator` in .gitignore matched `cmd/azure-disk-csi-driver-operator` directory that we really need not to be ignored. See https://github.com/openshift/csi-operator/pull/110, where we had to fix after merge to csi-operator. + +### Move existing code into csi-operator repository + +Using git-subtree move your existing operator code to https://github.com/openshift/csi-operator/tree/master/legacy + +``` +git subtree add --prefix legacy/azure-disk-csi-driver-operator https://github.com/openshift/azure-disk-csi-driver-operator.git master --squash +git subtree push --prefix legacy/azure-disk-csi-driver-operator https://github.com/openshift/azure-disk-csi-driver-operator.git master +``` + +### Add Dockerfiles for building images from new location + +Place a `Dockerfile.` and `Dockerfile..test` at top of csi-operator tree and make sure that you are able to build an image of the operator from csi-operator repository. + +### Update openshift/release to build image from new location +Make a PR to openshift/release repository to build the operator from csi-operator. For example - https://github.com/openshift/release/pull/46233. + +1. Update also `storage-conf-csi--commands.sh`, the test manifest will be at a different location. +2. Make sure that rehearse jobs for both older versions of operator and newer versions of operator pass. + +### Change ocp-build-data repository to ship image from new location + +Make a PR to [ocp-build-data](https://github.com/openshift-eng/ocp-build-data) repository to change location of the image etc - https://github.com/openshift-eng/ocp-build-data/pull/4148 + +1. Notice the `cachito` line in the PR - we need to build with the vendor from legacy/ directory. + +2. Ask ART for a scratch build. Make sure you can install a cluster with that build. + +``` +oc adm release new \ +--from-release=registry.ci.openshift.org/ocp/release:4.15.0-0.nightly.XYZ \ +azure-disk-csi-driver-operator= \ +--to-image=quay.io/jsafrane/scratch:release1 \ +--name=4.15.0-0.nightly.jsafrane.1 + +oc adm release extract --command openshift-install quay.io/jsafrane/scratch:release1 +``` + +### Co-ordinating merges in ocp-build-data and release repository + +Both PRs in openshift/release and ocp-build-data must be merged +/- at the same time. There is a robot that syncs some data from ocp-build-data to openshift/release and actually breaks things when these two repos use different source repository to build images. 
+ +### Enjoy the build from csi-operator repository + +After aforementioned changes, your new operator should be able to be built from csi-operator repo and everything should work. + +### Moving operator to new structure in csi-operator + +So in previous section we merely copied existing code from operator’s own repository into `csi-operator` repository. We did not change anything. + +But once your operator has been changed to conform to new code in csi-operator repo, You need to perform following additional steps: + +1. Make sure that `Dockerfile.` at top of the `csi-operator` tree refers to new location of code and not older `legacy/` location.See example of existing Dockerfiles. +2. After your changes to `csi-operator` are merged, you should remove the old location from cachito - https://github.com/openshift-eng/ocp-build-data/pull/4219 + ## Implementation History Major milestones in the life cycle of a proposal should be tracked in `Implementation @@ -548,4 +610,4 @@ Advantages: ## Infrastructure Needed [optional] -N/A (other than the usual CI + QE) \ No newline at end of file +N/A (other than the usual CI + QE) From fb448e2c3cf351ce4083204dde0b99d0cd028f33 Mon Sep 17 00:00:00 2001 From: Justin Pierce Date: Tue, 18 Jun 2024 11:20:20 -0400 Subject: [PATCH 11/53] Improve YAML header parsing The previous implementation required yaml doc separator to ignore everything after the second delimiter. I believe the YAML standard expects subsequent documents to also be valid YAML. This change truncates the markdown prior to trying to parse the YAML header. --- hack/Dockerfile.markdownlint | 2 +- tools/enhancements/metadata.go | 10 +++++++++- 2 files changed, 10 insertions(+), 2 deletions(-) diff --git a/hack/Dockerfile.markdownlint b/hack/Dockerfile.markdownlint index 5ad62848d0..d511fdf66d 100644 --- a/hack/Dockerfile.markdownlint +++ b/hack/Dockerfile.markdownlint @@ -1,4 +1,4 @@ -FROM fedora +FROM fedora:38 WORKDIR /workdir RUN dnf install -y git golang COPY install-markdownlint.sh /tmp diff --git a/tools/enhancements/metadata.go b/tools/enhancements/metadata.go index ee29f98237..8e94fb650e 100644 --- a/tools/enhancements/metadata.go +++ b/tools/enhancements/metadata.go @@ -3,6 +3,7 @@ package enhancements import ( "fmt" "net/url" + "strings" "gopkg.in/yaml.v3" ) @@ -21,7 +22,14 @@ type MetaData struct { func NewMetaData(content []byte) (*MetaData, error) { result := MetaData{} - err := yaml.Unmarshal(content, &result) + strContent := string(content) + parts := strings.Split(strContent, "---") + if len(parts) < 3 { + return nil, fmt.Errorf("could not extract meta data from header: yaml was not delineated by '---' per the template") + } + yamlContent := strings.TrimSpace(parts[1]) + yamlBytes := []byte(yamlContent) + err := yaml.Unmarshal(yamlBytes, &result) if err != nil { return nil, fmt.Errorf("could not extract meta data from header: %w", err) } From 6af0b78288320a68d0bd008325632810a3d11249 Mon Sep 17 00:00:00 2001 From: Patryk Matuszak Date: Thu, 20 Jun 2024 10:58:43 +0200 Subject: [PATCH 12/53] review changes --- .../low-latency-workloads-on-microshift.md | 160 +++++++++++++++--- 1 file changed, 136 insertions(+), 24 deletions(-) diff --git a/enhancements/microshift/low-latency-workloads-on-microshift.md b/enhancements/microshift/low-latency-workloads-on-microshift.md index 15a113f6f9..c8369ff1f0 100644 --- a/enhancements/microshift/low-latency-workloads-on-microshift.md +++ b/enhancements/microshift/low-latency-workloads-on-microshift.md @@ -11,11 +11,11 @@ reviewers: approvers: - 
"@jerpeter1" api-approvers: - - TBD + - "@jerpeter1" creation-date: 2024-06-12 last-updated: 2024-06-12 tracking-link: - - USHIFT-2981 + - https://issues.redhat.com/browse/USHIFT-2981 --- # Low Latency workloads on MicroShift @@ -38,23 +38,23 @@ ready to use packages to kickstart customers' usage of low latency workloads. ### User Stories -* As a MicroShift administrator, I want to configure MicroShift host, +* As a MicroShift administrator, I want to configure MicroShift host and all involved subsystems so that I can run low latency workloads. ### Goals -Provide relatively easy way to configure system for low latency workload running on MicroShift: +Provide guidance and example artifacts for configuring the system for low latency workload running on MicroShift: - Prepare low latency TuneD profile for MicroShift - Prepare necessary CRI-O configurations - Allow configuration of Kubelet via MicroShift config -- Add small systemd daemon to enable TuneD profile and (optionally) reboot the host if the kernel - arguments change to make them effective - +- Introduce a mechanism to automatically apply a tuned profile upon boot. +- Document how to create a new tuned profile for users wanting more control. + ### Non-Goals -- Workload partitioning (i.e. pinning MicroShift control plane components) +- Workload partitioning (i.e. pinning MicroShift control plane components) (see [OCPSTRAT-1068](https://issues.redhat.com/browse/OCPSTRAT-1068)) - Duplicate all capabilities of Node Tuning Operator @@ -62,11 +62,11 @@ Provide relatively easy way to configure system for low latency workload running To ease configuration of the system for running low latency workloads on MicroShift following parts need to be put in place: -- TuneD profile +- `microshift-low-latency` TuneD profile - CRI-O configuration + Kubernetes' RuntimeClass -- Kubelet configuration (CPU, Memory, and Topology Managers) -- Small systemd daemon to activate TuneD profile on boot and reboot the host if the kernel args - are changed. +- Kubelet configuration (CPU, Memory, and Topology Managers and other) +- `microshift-tuned.service` to activate user selected TuneD profile on boot and reboot the host + if the kernel args are changed. New RPM will be created that will contain tuned profile, CRI-O configs, and mentioned systemd daemon. We'll leverage existing know how of Performance and Scalability team expertise and look at @@ -96,12 +96,13 @@ Workflow consists of two parts: 1. User creates an osbuild blueprint: - (optional) User configures `[customizations.kernel]` in the blueprint if the values are known beforehand. This could prevent from necessary reboot after applying tuned profile. - - User adds `kernel-rt` package to the blueprint - - User adds `microshift-low-latency` RPM to the blueprint + - (optional) User adds `kernel-rt` package to the blueprint + - User adds `microshift-tuned.rpm` to the blueprint - User enables `microshift-tuned.service` - User supplies additional configs using blueprint: - /etc/tuned/microshift-low-latency-variables.conf - /etc/microshift/config.yaml to configure Kubelet + - /etc/microshift/tuned.json to configure `microshift-tuned.service` 1. User builds the blueprint 1. User deploys the commit / installs the system. 1. 
System boots @@ -109,41 +110,136 @@ Workflow consists of two parts: - Saves current kernel args - Applies tuned `microshift-low-latency` profile - Verifies expected kernel args - - ostree: `rpm-ostree kargs` + - ostree: `rpm-ostree kargs` or checking if new deployment was created[0] - rpm: `grubby` - If the current and expected kernel args are different, reboot the node 1. Host boots again, everything for low latency is in place, `microshift.service` can continue start up. +[0] changing kernel arguments on ostree system results in creating new deployment. + +Example blueprint: + +```toml +name = "microshift-low-latency" +version = "0.0.1" +modules = [] +groups = [] +distro = "rhel-94" + +[[packages]] +name = "microshift" +version = "4.17.*" + +[[packages]] +name = "microshift-tuned" +version = "4.17.*" + +[[customizations.services]] +enabled = ["microshift", "microshift-tuned"] + +[[customizations.kernel]] +append = "some already known kernel args" +name = "KERNEL-rt" + +[[customizations.files]] +path = "/etc/tuned/microshift-low-latency-variables.conf" +data = """ +isolated_cores=1-2 +hugepagesDefaultSize = 2M +hugepages2M = 128 +hugepages1G = 0 +additionalArgs = "" +""" + +[[customizations.files]] +path = "/etc/microshift/config.yaml" +data = """ +kubelet: + cpuManagerPolicy: static + memoryManagerPolicy: Static +""" + +[[customizations.files]] +path = "/etc/microshift/tuned.json" +data = """ +{ + "auto_reboot_enabled": "true", + "profile": "microshift-low-latency" +} +""" +``` + + +##### bootc + +1. User creates Containerfile that: + - (optional) installs `kernel-rt` + - installs `microshift-tuned.rpm` + - enables `microshift-tuned.service` + - adds following configs + - /etc/tuned/microshift-low-latency-variables.conf + - /etc/microshift/config.yaml to configure Kubelet + - /etc/microshift/tuned.json to configure `microshift-tuned.service` +1. User builds the blueprint +1. User deploys the commit / installs the system. +1. System boots - rest is just like in OSTree flow + +Example Containerfile: + +``` +FROM registry.redhat.io/rhel9/rhel-bootc:9.4 + +# ... MicroShift installation ... + +RUN dnf install kernel-rt microshift-tuned +COPY microshift-low-latency-variables.conf /etc/tuned/microshift-low-latency-variables.conf +COPY microshift-config.yaml /etc/microshift/config.yaml +COPY microshift-tuned.json /etc/microshift/tuned.json + +RUN systemctl enable microshift-tuned.service +``` + ##### RPM 1. User installs `microshift-low-latency` RPM. 1. User creates following configs: - /etc/tuned/microshift-low-latency-variables.conf - /etc/microshift/config.yaml to configure Kubelet -1. User enables `microshift-tuned.service` or uses `tuned-adm` directly to activate the profile - (and reboot the host if needed). -1. If host was not rebooted, CRI-O and MicroShift services need to be restarted to make new settings - active. + - /etc/microshift/tuned.json to configure `microshift-tuned.service` +1. user starts/enables `microshift-tuned.service`: + - Saves current kernel args + - Applies tuned `microshift-low-latency` profile + - Verifies expected kernel args + - ostree: `rpm-ostree kargs` + - rpm: `grubby` + - If the current and expected kernel args are different, reboots the node +1. Host boots again, everything for low latency is in place, +1. User starts/enables `microshift.service` + #### Preparing low latency workload - Setting `.spec.runtimeClassName: microshift-low-latency` in Pod spec. 
- Setting Pod's memory limit and memory request to the same value, and setting CPU limit and CPU request to the same value to ensure Pod has guaranteed QoS class. -- Use annotations to get desired behavior: +- Use annotations to get desired behavior + (unless link to a documentation is present, these annotations only take two values: enabled and disabled): - `cpu-load-balancing.crio.io: "disable"` - disable CPU load balancing for Pod (only use with CPU Manager `static` policy and for Guaranteed QoS Pods using whole CPUs) - `cpu-quota.crio.io: "disable"` - disable Completely Fair Scheduler (CFS) - `irq-load-balancing.crio.io: "disable"` - disable interrupt processing (only use with CPU Manager `static` policy and for Guaranteed QoS Pods using whole CPUs) - `cpu-c-states.crio.io: "disable"` - disable C-states - - `cpu-freq-governor.crio.io: ""` - specify governor type for CPU Freq scaling (e.g. `performance`) + ([see doc for possible values](https://docs.openshift.com/container-platform/4.15/scalability_and_performance/low_latency_tuning/cnf-provisioning-low-latency-workloads.html#cnf-configuring-high-priority-workload-pods_cnf-provisioning-low-latency)) + - `cpu-freq-governor.crio.io: ""` - specify governor type for CPU Freq scaling (e.g. `performance`) + ([see doc for possible values](https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt)) + ### API Extensions Following API extensions are expected: -- A passthrough from MicroShift to Kubelet config. +- A passthrough from MicroShift's config to Kubelet config. - Variables file for TuneD profile to allow customization of the profile for specific host. @@ -220,6 +316,18 @@ hugepages1G = 0 additionalArgs = "" ``` +#### `microshift-tuned.service` configuration + +Config file to specify which profile to re-apply each boot and if host should be rebooted if +the kargs before and after applying profile are mismatched. + +```json +{ + "auto_reboot_enabled": "true", + "profile": "microshift-low-latency" +} +``` + #### CRI-O configuration ```ini @@ -337,8 +445,9 @@ if these are integral part of the low latency. - Verify if osbuild blueprint can override a file from RPM (variables.conf needs to exist for tuned profile, so it's nice to have some fallback)? -- NTO runs tuned in non-daemon one shot mode using systemd unit. - Should we try doing the same or we want the tuned daemon to run continuously? +- ~~NTO runs tuned in non-daemon one shot mode using systemd unit.~~ + ~~Should we try doing the same or we want the tuned daemon to run continuously?~~ + > Let's stick to default RHEL behaviour. MicroShift doesn't own the OS. - NTO's profile includes several other beside cpu-partitioning: [openshift-node](https://github.com/redhat-performance/tuned/blob/master/profiles/openshift-node/tuned.conf) and [openshift](https://github.com/redhat-performance/tuned/blob/master/profiles/openshift/tuned.conf) - should we include them or incorporate their settings? @@ -384,6 +493,9 @@ User installs the RPM with TuneD profile and configures MicroShift (either manua using blueprint, or using image mode) and that exact configuration is applied on boot and MicroShift start. +For the newly added section in MicroShift config, if it's present after downgrading to previous +MicroShift minor version, the section will be simply ignored because it's not represented in the Go structure. 
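+
+As a minimal illustration (values taken from the blueprint example above), an older MicroShift release
+that does not know the `kubelet` section would skip it entirely and start Kubelet with its defaults:
+
+```yaml
+# /etc/microshift/config.yaml
+kubelet:
+  cpuManagerPolicy: static
+  memoryManagerPolicy: Static
+```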
+ ## Version Skew Strategy Potentially breaking changes to TuneD and CRI-O: From d485f88c318581819b0a2b0654a91249ad790ded Mon Sep 17 00:00:00 2001 From: Patryk Matuszak Date: Mon, 24 Jun 2024 16:46:35 +0200 Subject: [PATCH 13/53] update names --- .../low-latency-workloads-on-microshift.md | 56 +++++++++---------- 1 file changed, 28 insertions(+), 28 deletions(-) diff --git a/enhancements/microshift/low-latency-workloads-on-microshift.md b/enhancements/microshift/low-latency-workloads-on-microshift.md index c8369ff1f0..57d660fad9 100644 --- a/enhancements/microshift/low-latency-workloads-on-microshift.md +++ b/enhancements/microshift/low-latency-workloads-on-microshift.md @@ -62,13 +62,13 @@ Provide guidance and example artifacts for configuring the system for low latenc To ease configuration of the system for running low latency workloads on MicroShift following parts need to be put in place: -- `microshift-low-latency` TuneD profile +- `microshift-baseline` TuneD profile - CRI-O configuration + Kubernetes' RuntimeClass - Kubelet configuration (CPU, Memory, and Topology Managers and other) - `microshift-tuned.service` to activate user selected TuneD profile on boot and reboot the host if the kernel args are changed. -New RPM will be created that will contain tuned profile, CRI-O configs, and mentioned systemd daemon. +New `microshift-low-latency` RPM will be created that will contain tuned profile, CRI-O configs, and mentioned systemd daemon. We'll leverage existing know how of Performance and Scalability team expertise and look at Node Tuning Operator capabilities. @@ -97,18 +97,18 @@ Workflow consists of two parts: - (optional) User configures `[customizations.kernel]` in the blueprint if the values are known beforehand. This could prevent from necessary reboot after applying tuned profile. - (optional) User adds `kernel-rt` package to the blueprint - - User adds `microshift-tuned.rpm` to the blueprint + - User adds `microshift-low-latency.rpm` to the blueprint - User enables `microshift-tuned.service` - User supplies additional configs using blueprint: - - /etc/tuned/microshift-low-latency-variables.conf - - /etc/microshift/config.yaml to configure Kubelet - - /etc/microshift/tuned.json to configure `microshift-tuned.service` + - `/etc/tuned/microshift-baseline-variables.conf` to configure tuned profile + - `/etc/microshift/config.yaml` to configure Kubelet + - `/etc/microshift/microshift-tuned.json` to configure `microshift-tuned.service` 1. User builds the blueprint 1. User deploys the commit / installs the system. 1. System boots 1. 
`microshift-tuned.service` starts (after `tuned.service`, before `microshift.service`): - Saves current kernel args - - Applies tuned `microshift-low-latency` profile + - Applies tuned `microshift-baseline` profile - Verifies expected kernel args - ostree: `rpm-ostree kargs` or checking if new deployment was created[0] - rpm: `grubby` @@ -132,7 +132,7 @@ name = "microshift" version = "4.17.*" [[packages]] -name = "microshift-tuned" +name = "microshift-low-latency" version = "4.17.*" [[customizations.services]] @@ -143,7 +143,7 @@ append = "some already known kernel args" name = "KERNEL-rt" [[customizations.files]] -path = "/etc/tuned/microshift-low-latency-variables.conf" +path = "/etc/tuned/microshift-baseline-variables.conf" data = """ isolated_cores=1-2 hugepagesDefaultSize = 2M @@ -161,11 +161,11 @@ kubelet: """ [[customizations.files]] -path = "/etc/microshift/tuned.json" +path = "/etc/microshift/microshift-tuned.json" data = """ { "auto_reboot_enabled": "true", - "profile": "microshift-low-latency" + "profile": "microshift-baseline" } """ ``` @@ -175,12 +175,12 @@ data = """ 1. User creates Containerfile that: - (optional) installs `kernel-rt` - - installs `microshift-tuned.rpm` + - installs `microshift-low-latency.rpm` - enables `microshift-tuned.service` - adds following configs - - /etc/tuned/microshift-low-latency-variables.conf - - /etc/microshift/config.yaml to configure Kubelet - - /etc/microshift/tuned.json to configure `microshift-tuned.service` + - `/etc/tuned/microshift-baseline-variables.conf` + - `/etc/microshift/config.yaml` to configure Kubelet + - `/etc/microshift/microshift-tuned.json` to configure `microshift-tuned.service` 1. User builds the blueprint 1. User deploys the commit / installs the system. 1. System boots - rest is just like in OSTree flow @@ -193,9 +193,9 @@ FROM registry.redhat.io/rhel9/rhel-bootc:9.4 # ... MicroShift installation ... RUN dnf install kernel-rt microshift-tuned -COPY microshift-low-latency-variables.conf /etc/tuned/microshift-low-latency-variables.conf -COPY microshift-config.yaml /etc/microshift/config.yaml -COPY microshift-tuned.json /etc/microshift/tuned.json +COPY microshift-baseline-variables.conf /etc/tuned/microshift-low-latency-variables.conf +COPY microshift-config.yaml /etc/microshift/config.yaml +COPY microshift-tuned.json /etc/microshift/microshift-tuned.json RUN systemctl enable microshift-tuned.service ``` @@ -204,9 +204,9 @@ RUN systemctl enable microshift-tuned.service 1. User installs `microshift-low-latency` RPM. 1. User creates following configs: - - /etc/tuned/microshift-low-latency-variables.conf - - /etc/microshift/config.yaml to configure Kubelet - - /etc/microshift/tuned.json to configure `microshift-tuned.service` + - `/etc/tuned/microshift-low-latency-variables.conf` + - `/etc/microshift/config.yaml` to configure Kubelet + - `/etc/microshift/microshift-tuned.json` to configure `microshift-tuned.service` 1. user starts/enables `microshift-tuned.service`: - Saves current kernel args - Applies tuned `microshift-low-latency` profile @@ -261,9 +261,9 @@ Purely MicroShift enhancement. #### TuneD Profile -New `microshift-low-latency` tuned profile will be created and will include existing `cpu-partitioning` profile. +New `microshift-baseline` tuned profile will be created and will include existing `cpu-partitioning` profile. 
-`/etc/tuned/microshift-low-latency-variables.conf` will be used by users to provide custom values for settings such as: +`/etc/tuned/microshift-baseline-variables.conf` will be used by users to provide custom values for settings such as: - isolated CPU set - hugepage count (both 2M and 1G) - additional kernel arguments @@ -274,7 +274,7 @@ summary=Optimize for running low latency workloads on MicroShift include=cpu-partitioning [variables] -include=/etc/tuned/microshift-low-latency-variables.conf +include=/etc/tuned/microshift-baseline-variables.conf [bootloader] cmdline_microshift=+default_hugepagesz=${hugepagesDefaultSize} hugepagesz=2M hugepages=${hugepages2M} hugepagesz=1G hugepages=${hugepages1G} @@ -302,7 +302,7 @@ isolated_cores=${f:calc_isolated_cores:1} # To disable the kernel load balancing in certain isolated CPUs: # no_balance_cores=5-10 -### microshift-low-latency variables +### microshift-baseline variables # Default hugepages size hugepagesDefaultSize = 2M @@ -324,7 +324,7 @@ the kargs before and after applying profile are mismatched. ```json { "auto_reboot_enabled": "true", - "profile": "microshift-low-latency" + "profile": "microshift-baseline" } ``` @@ -420,7 +420,7 @@ Also, it is assumed that users are not pushing new image to production devices w It may happen that some users need to use TuneD plugins that are not handled by the profile we'll create. In such case we may investigate if it's something generic enough to include, or we can instruct them -to create new profile that would include `microshift-low-latency` profile. +to create new profile that would include `microshift-baseline` profile. Systemd daemon we'll provide to enable TuneD profile should have a strict requirement before it reboots the node, so it doesn't put it into a boot loop. @@ -610,7 +610,7 @@ Kubelet, CRI-O, Tuned, Sysctls, or MachineConfig. - .HardwareTuning - Tuned: if !.PerPodPowerManagement, then cmdline =+ `intel_pstate=active` > cpu-partitioning sets `intel_pstate=disable`, if user wants different value they can use - > `additionalArgs` in `microshift-low-latency-variables.conf` - in case of duplicated parameters, + > `additionalArgs` in `microshift-baseline-variables.conf` - in case of duplicated parameters, > last one takes precedence - .IsolatedCpuFreq (int) - defines a minimum frequency to be set across isolated cpus - Tuned: `/sys/devices/system/cpu/cpufreq/policy{{ <> }}/scaling_max_freq={{$.IsolatedCpuMaxFreq}}` From 6b0c1b85353d8b48586432fc78fe410f9318db8c Mon Sep 17 00:00:00 2001 From: Patryk Matuszak Date: Mon, 24 Jun 2024 16:48:11 +0200 Subject: [PATCH 14/53] change config from json to yaml --- .../low-latency-workloads-on-microshift.md | 26 ++++++++----------- 1 file changed, 11 insertions(+), 15 deletions(-) diff --git a/enhancements/microshift/low-latency-workloads-on-microshift.md b/enhancements/microshift/low-latency-workloads-on-microshift.md index 57d660fad9..a038630ef9 100644 --- a/enhancements/microshift/low-latency-workloads-on-microshift.md +++ b/enhancements/microshift/low-latency-workloads-on-microshift.md @@ -102,7 +102,7 @@ Workflow consists of two parts: - User supplies additional configs using blueprint: - `/etc/tuned/microshift-baseline-variables.conf` to configure tuned profile - `/etc/microshift/config.yaml` to configure Kubelet - - `/etc/microshift/microshift-tuned.json` to configure `microshift-tuned.service` + - `/etc/microshift/microshift-tuned.yaml` to configure `microshift-tuned.service` 1. User builds the blueprint 1. 
User deploys the commit / installs the system. 1. System boots @@ -161,12 +161,10 @@ kubelet: """ [[customizations.files]] -path = "/etc/microshift/microshift-tuned.json" +path = "/etc/microshift/microshift-tuned.yaml" data = """ -{ - "auto_reboot_enabled": "true", - "profile": "microshift-baseline" -} +auto_reboot_enabled: True +profile: microshift-baseline """ ``` @@ -180,7 +178,7 @@ data = """ - adds following configs - `/etc/tuned/microshift-baseline-variables.conf` - `/etc/microshift/config.yaml` to configure Kubelet - - `/etc/microshift/microshift-tuned.json` to configure `microshift-tuned.service` + - `/etc/microshift/microshift-tuned.yaml` to configure `microshift-tuned.service` 1. User builds the blueprint 1. User deploys the commit / installs the system. 1. System boots - rest is just like in OSTree flow @@ -195,7 +193,7 @@ FROM registry.redhat.io/rhel9/rhel-bootc:9.4 RUN dnf install kernel-rt microshift-tuned COPY microshift-baseline-variables.conf /etc/tuned/microshift-low-latency-variables.conf COPY microshift-config.yaml /etc/microshift/config.yaml -COPY microshift-tuned.json /etc/microshift/microshift-tuned.json +COPY microshift-tuned.yaml /etc/microshift/microshift-tuned.yaml RUN systemctl enable microshift-tuned.service ``` @@ -204,9 +202,9 @@ RUN systemctl enable microshift-tuned.service 1. User installs `microshift-low-latency` RPM. 1. User creates following configs: - - `/etc/tuned/microshift-low-latency-variables.conf` + - `/etc/tuned/microshift-baseline-variables.conf` - `/etc/microshift/config.yaml` to configure Kubelet - - `/etc/microshift/microshift-tuned.json` to configure `microshift-tuned.service` + - `/etc/microshift/microshift-tuned.yaml` to configure `microshift-tuned.service` 1. user starts/enables `microshift-tuned.service`: - Saves current kernel args - Applies tuned `microshift-low-latency` profile @@ -321,11 +319,9 @@ additionalArgs = "" Config file to specify which profile to re-apply each boot and if host should be rebooted if the kargs before and after applying profile are mismatched. -```json -{ - "auto_reboot_enabled": "true", - "profile": "microshift-baseline" -} +```yaml +auto_reboot_enabled: True +profile: microshift-baseline ``` #### CRI-O configuration From 3dd0cbec65423bbd64bb6a960cf521f2d529e231 Mon Sep 17 00:00:00 2001 From: Patryk Matuszak Date: Mon, 24 Jun 2024 17:03:21 +0200 Subject: [PATCH 15/53] don't reapply the profile on each boot --- .../low-latency-workloads-on-microshift.md | 33 ++++++++++--------- 1 file changed, 17 insertions(+), 16 deletions(-) diff --git a/enhancements/microshift/low-latency-workloads-on-microshift.md b/enhancements/microshift/low-latency-workloads-on-microshift.md index a038630ef9..a11f2795e2 100644 --- a/enhancements/microshift/low-latency-workloads-on-microshift.md +++ b/enhancements/microshift/low-latency-workloads-on-microshift.md @@ -107,12 +107,12 @@ Workflow consists of two parts: 1. User deploys the commit / installs the system. 1. System boots 1. `microshift-tuned.service` starts (after `tuned.service`, before `microshift.service`): - - Saves current kernel args - - Applies tuned `microshift-baseline` profile - - Verifies expected kernel args - - ostree: `rpm-ostree kargs` or checking if new deployment was created[0] - - rpm: `grubby` - - If the current and expected kernel args are different, reboot the node + - Queries tuned for active profile + - Compares active profile with requested profile + - If profile are the same: do nothing, exit. 
+ - If profiles are different: + - Apply requested profile + - If `reboot_after_apply` is True, then reboot the host 1. Host boots again, everything for low latency is in place, `microshift.service` can continue start up. @@ -163,7 +163,7 @@ kubelet: [[customizations.files]] path = "/etc/microshift/microshift-tuned.yaml" data = """ -auto_reboot_enabled: True +reboot_after_apply: True profile: microshift-baseline """ ``` @@ -204,14 +204,8 @@ RUN systemctl enable microshift-tuned.service 1. User creates following configs: - `/etc/tuned/microshift-baseline-variables.conf` - `/etc/microshift/config.yaml` to configure Kubelet - - `/etc/microshift/microshift-tuned.yaml` to configure `microshift-tuned.service` -1. user starts/enables `microshift-tuned.service`: - - Saves current kernel args - - Applies tuned `microshift-low-latency` profile - - Verifies expected kernel args - - ostree: `rpm-ostree kargs` - - rpm: `grubby` - - If the current and expected kernel args are different, reboots the node +1. User runs `sudo tuned-adm profile microshift-baseline` to enable the profile. +1. User reboots the host to make changes to kernel arguments active. 1. Host boots again, everything for low latency is in place, 1. User starts/enables `microshift.service` @@ -320,8 +314,8 @@ Config file to specify which profile to re-apply each boot and if host should be the kargs before and after applying profile are mismatched. ```yaml -auto_reboot_enabled: True profile: microshift-baseline +reboot_after_apply: True ``` #### CRI-O configuration @@ -555,6 +549,13 @@ Responsibility of dev team is to remove common hurdles from user's path so they and want to continue using the product. +### Applying user requested TuneD profile on every start of the `microshift-tuned.service` + +This was preferred option before it was discovered that when profile is applied, tuned will +first remove all the old kernel args, then append new kernel args resulting in creation of new staged OSTree deployment. +This happens even if the the same profile is being reapplied. + + ## Infrastructure Needed [optional] Nothing. From d9733df04b8255e576f6bd427639bc7109095a62 Mon Sep 17 00:00:00 2001 From: Patryk Matuszak Date: Tue, 25 Jun 2024 18:38:11 +0200 Subject: [PATCH 16/53] use checksums to decide if profile should be re-applied --- .../low-latency-workloads-on-microshift.md | 34 ++++++++++++++----- 1 file changed, 25 insertions(+), 9 deletions(-) diff --git a/enhancements/microshift/low-latency-workloads-on-microshift.md b/enhancements/microshift/low-latency-workloads-on-microshift.md index a11f2795e2..ac198977c5 100644 --- a/enhancements/microshift/low-latency-workloads-on-microshift.md +++ b/enhancements/microshift/low-latency-workloads-on-microshift.md @@ -107,12 +107,13 @@ Workflow consists of two parts: 1. User deploys the commit / installs the system. 1. System boots 1. `microshift-tuned.service` starts (after `tuned.service`, before `microshift.service`): - - Queries tuned for active profile - Compares active profile with requested profile - - If profile are the same: do nothing, exit. - - If profiles are different: - - Apply requested profile - - If `reboot_after_apply` is True, then reboot the host + - If requested profile is already active: + - Compare checksum of requested profile with cached checksum. + - If checksums are the same - exit. + - Apply requested profile + - Calculate checksum of the profile and the variables file and save it + - If `reboot_after_apply` is True, then reboot the host 1. 
Host boots again, everything for low latency is in place, `microshift.service` can continue start up. @@ -204,10 +205,25 @@ RUN systemctl enable microshift-tuned.service 1. User creates following configs: - `/etc/tuned/microshift-baseline-variables.conf` - `/etc/microshift/config.yaml` to configure Kubelet -1. User runs `sudo tuned-adm profile microshift-baseline` to enable the profile. -1. User reboots the host to make changes to kernel arguments active. -1. Host boots again, everything for low latency is in place, -1. User starts/enables `microshift.service` +1. Development environment + - User runs `sudo tuned-adm profile microshift-baseline` to enable the profile. + - User reboots the host to make changes to kernel arguments active. + - Host boots again, everything for low latency is in place, + - User starts/enables `microshift.service` +1. Production environment + - User creates `/etc/microshift/microshift-tuned.yaml` to configure `microshift-tuned.service` + - User enables `microshift.service` + - User enables and starts `microshift-tuned.service` which: + - Compares active profile with requested profile + - If requested profile is already active: + - Compare checksum of requested profile with cached checksum. + - If checksums are the same - exit. + - Apply requested profile + - Calculate checksum of the profile and the variables file and save it + - If `reboot_after_apply` is True, then reboot the host + - Host is rebooted: MicroShift starts because it was enabled + - Host doesn't need reboot: + - User starts `microshift.service` #### Preparing low latency workload From d8258d7f0be0f2352a9f3ec2c129914fbf17efa4 Mon Sep 17 00:00:00 2001 From: Patryk Matuszak Date: Thu, 27 Jun 2024 10:55:27 +0200 Subject: [PATCH 17/53] why passthrough > drop-in config dir --- .../microshift/low-latency-workloads-on-microshift.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/enhancements/microshift/low-latency-workloads-on-microshift.md b/enhancements/microshift/low-latency-workloads-on-microshift.md index ac198977c5..6abf136116 100644 --- a/enhancements/microshift/low-latency-workloads-on-microshift.md +++ b/enhancements/microshift/low-latency-workloads-on-microshift.md @@ -572,6 +572,14 @@ first remove all the old kernel args, then append new kernel args resulting in c This happens even if the the same profile is being reapplied. +### Using Kubelet's drop-in config directory feature + +This is alternative to "MicroShift config -> Kubelet config" passthrough. +Using drop-ins would mean there is less implementation on MicroShift side, but it also gives away +full power of Kubelet config. With passthrough we can later introduce some validation or +limit configuration options to keep the exposed configuration surface minimal. + + ## Infrastructure Needed [optional] Nothing. 
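For reference, the checksum-based re-apply flow described in the workflows above could be approximated by
a shell sketch along the following lines. This is hypothetical: the cache location, the way the config file
is parsed, and the profile file paths are illustrative only and do not describe the real implementation.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the microshift-tuned.service re-apply logic.

cfg=/etc/microshift/microshift-tuned.yaml
requested="$(awk '/^profile:/ {print $2}' "$cfg")"
reboot_after_apply="$(awk '/^reboot_after_apply:/ {print $2}' "$cfg")"

# Active profile as reported by tuned, e.g. "Current active profile: microshift-baseline".
active="$(tuned-adm active | awk -F': ' '{print $2}')"

# Checksum of the requested profile plus its variables file (cache path is illustrative).
sum="$(cat "/usr/lib/tuned/${requested}/tuned.conf" \
           "/etc/tuned/${requested}-variables.conf" | sha256sum | cut -d' ' -f1)"
cache=/var/lib/microshift-tuned/profile.sum

if [ "$active" = "$requested" ] && [ -e "$cache" ] && [ "$(cat "$cache")" = "$sum" ]; then
    # Requested profile already active and unchanged since the last apply: nothing to do.
    exit 0
fi

tuned-adm profile "$requested"
mkdir -p "$(dirname "$cache")"
echo "$sum" > "$cache"

if [ "${reboot_after_apply,,}" = "true" ]; then
    systemctl reboot
fi
```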
From 941f5c6391830d5e4a94e65d742acbcaf9b8eda9 Mon Sep 17 00:00:00 2001 From: Tim Rozet Date: Fri, 3 May 2024 18:57:10 -0400 Subject: [PATCH 18/53] Add support for user-defined networks in OVNK Signed-off-by: Tim Rozet --- enhancements/network/images/VRFs.svg | 1 + .../network/images/egress-ip-l2-primary.svg | 1 + .../network/images/egress-ip-vrf-lgw.svg | 1 + .../network/images/egress-ip-vrf-sgw.svg | 1 + .../images/local-gw-node-setup-vrfs.svg | 1 + .../network/images/multi-homing-l2-gw.svg | 1 + .../images/openshift-router-multi-network.svg | 1 + .../user-defined-network-segmentation.md | 969 ++++++++++++++++++ 8 files changed, 976 insertions(+) create mode 100644 enhancements/network/images/VRFs.svg create mode 100644 enhancements/network/images/egress-ip-l2-primary.svg create mode 100644 enhancements/network/images/egress-ip-vrf-lgw.svg create mode 100644 enhancements/network/images/egress-ip-vrf-sgw.svg create mode 100644 enhancements/network/images/local-gw-node-setup-vrfs.svg create mode 100644 enhancements/network/images/multi-homing-l2-gw.svg create mode 100644 enhancements/network/images/openshift-router-multi-network.svg create mode 100644 enhancements/network/user-defined-network-segmentation.md diff --git a/enhancements/network/images/VRFs.svg b/enhancements/network/images/VRFs.svg new file mode 100644 index 0000000000..855417fa73 --- /dev/null +++ b/enhancements/network/images/VRFs.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/enhancements/network/images/egress-ip-l2-primary.svg b/enhancements/network/images/egress-ip-l2-primary.svg new file mode 100644 index 0000000000..e1454122a9 --- /dev/null +++ b/enhancements/network/images/egress-ip-l2-primary.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/enhancements/network/images/egress-ip-vrf-lgw.svg b/enhancements/network/images/egress-ip-vrf-lgw.svg new file mode 100644 index 0000000000..cb6222bf6a --- /dev/null +++ b/enhancements/network/images/egress-ip-vrf-lgw.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/enhancements/network/images/egress-ip-vrf-sgw.svg b/enhancements/network/images/egress-ip-vrf-sgw.svg new file mode 100644 index 0000000000..e2387ae778 --- /dev/null +++ b/enhancements/network/images/egress-ip-vrf-sgw.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/enhancements/network/images/local-gw-node-setup-vrfs.svg b/enhancements/network/images/local-gw-node-setup-vrfs.svg new file mode 100644 index 0000000000..9b7ba269a5 --- /dev/null +++ b/enhancements/network/images/local-gw-node-setup-vrfs.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/enhancements/network/images/multi-homing-l2-gw.svg b/enhancements/network/images/multi-homing-l2-gw.svg new file mode 100644 index 0000000000..f633254ac4 --- /dev/null +++ b/enhancements/network/images/multi-homing-l2-gw.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/enhancements/network/images/openshift-router-multi-network.svg b/enhancements/network/images/openshift-router-multi-network.svg new file mode 100644 index 0000000000..0386aecabf --- /dev/null +++ b/enhancements/network/images/openshift-router-multi-network.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/enhancements/network/user-defined-network-segmentation.md b/enhancements/network/user-defined-network-segmentation.md new file mode 100644 index 0000000000..3a30845530 --- /dev/null +++ b/enhancements/network/user-defined-network-segmentation.md @@ -0,0 +1,969 @@ +--- +title: user-defined-network-segmentation +authors: + - 
"@trozet" + - "@qinqon" +reviewers: + - "@tssurya" + - "@danwinship" + - "@fedepaol" + - "@maiqueb" + - "@jcaamano" + - "@Miciah" + - "@dceara" + - "@dougbtv" +approvers: + - "@tssurya" + - "@jcaamano" +api-approvers: + - "None" +creation-date: 2024-05-03 +last-updated: 2024-05-28 +tracking-link: + - https://issues.redhat.com/browse/SDN-4789 +--- + +# User-Defined Network Segmentation + +## Summary + +OVN-Kubernetes today allows multiple different types of networks per secondary network: layer 2, layer 3, or localnet. +Pods can be connected to different networks without discretion. For the primary network, OVN-Kubernetes only supports all +pods connecting to the same layer 3 virtual topology. The scope of this effort is to bring the same flexibility of the +secondary network to the primary network. Therefore, pods are able to connect to different types of networks as their +primary network. + +Additionally, multiple and different instances of primary networks may co-exist for different users, and they will provide +native network isolation. + +## Terminology + +* **Primary Network** - The network which is used as the default gateway for the pod. Typically recognized as the eth0 +interface in the pod. +* **Secondary Network** - An additional network and interface presented to the pod. Typically created as an additional +Network Attachment Definition (NAD), leveraging Multus. Secondary Network in the context of this document refers to a +secondary network provided by the OVN-Kubernetes CNI. +* **Cluster Default Network** - This is the routed OVN network that pods attach to by default today as their primary network. +The pods default route, service access, as well as kubelet probe are all served by the interface (typically eth0) on this network. +* **User-Defined Network** - A network that may be primary or secondary, but is declared by the user. +* **Layer 2 Type Network** - An OVN-Kubernetes topology rendered into OVN where pods all connect to the same distributed +logical switch (layer 2 segment) which spans all nodes. Uses Geneve overlay. +* **Layer 3 Type Network** - An OVN-Kubernetes topology rendered into OVN where pods have a per-node logical switch and subnet. +Routing is used for pod to pod communication across nodes. This is the network type used by the cluster default network today. +Uses Geneve overlay. +* **Localnet Type Network** - An OVN-Kubernetes topology rendered into OVN where pods connect to a per-node logical switch +that is directly wired to the underlay. + +## Motivation + +As users migrate from OpenStack to Kubernetes, there is a need to provide network parity for those users. In OpenStack, +each tenant (akin to Kubernetes namespace) by default has a layer 2 network, which is isolated from any other tenant. +Connectivity to other networks must be specified explicitly as network configuration via a Neutron router. In Kubernetes +the paradigm is opposite, by default all pods can reach other pods, and security is provided by implementing Network Policy. +Network Policy can be cumbersome to configure and manage for a large cluster. It also can be limiting as it only matches +TCP, UDP, and SCTP traffic. Furthermore, large amounts of network policy can cause performance issues in CNIs. With all +these factors considered, there is a clear need to address network security in a native fashion, by using networks per +tenant to isolate traffic. 
+ +### User Stories + +* As a user I want to be able to migrate applications traditionally on OpenStack to Kubernetes, keeping my tenant network + space isolated and having the ability to use a layer 2 network. +* As a user I want to be able to ensure network security between my namespaces without having to manage and configure + complex network policy rules. +* As an administrator, I want to be able to provision networks to my tenants to ensure their networks and applications + are natively isolated from other tenants. +* As a user, I want to be able to request a unique, primary network for my namespace without having to get administrator + permission. +* As a user, I want user-defined primary networks to be able to have similar functionality as the cluster default network, + regardless of being on a layer 2 or layer 3 type network. Features like Egress IP, Egress QoS, Kubernetes services, + Ingress, and pod Egress should all function as they do today in the cluster default network. +* As a user, I want to be able to use my own consistent IP addressing scheme in my network. I want to be able to specify + and re-use the same IP subnet for my pods across different namespaces and clusters. This provides a consistent + and repeatable network environment for administrators and users. + +### Goals + +* Provide a configurable way to indicate that a pod should be connected to a user-defined network of a specific type as a +primary interface. +* The primary network may be configured as a layer 3 or layer 2 type network. +* Allow networks to have overlapping pod IP address space. This range may not overlap with the default cluster subnet +used for allocating pod IPs on the cluster default network today. +* The cluster default primary network defined today will remain in place as the default network pods attach to. The cluster +default network will continue to serve as the primary network for pods in a namespace that has no primary user-defined network. Pods +with primary user-defined networks will still attach to the cluster default network with limited access to Kubernetes system resources. +Pods with primary user-defined networks will have at least two network interfaces, one connected to the cluster default network and one +connected to the user-defined network. Pods with primary user-defined networks will use the user-defined network as their default +gateway. +* Allow multiple namespaces per network. +* Support cluster ingress/egress traffic for user-defined networks, including secondary networks. +* Support for ingress/egress features on user-defined primary networks where possible: + * EgressQoS + * EgressService + * EgressIP + * Load Balancer and NodePort Services, as well as services with External IPs. +* In addition to ingress service support, there will be support for Kubernetes services in user-defined networks. The +scope of reachability to that service as well as endpoints selected for that service will be confined to the network +and corresponding namespace(s) where that service was created. +* Support for pods to continue to have access to the cluster default primary network for DNS and KAPI service access. +* Kubelet healthchecks/probes will still work on all pods. +* OpenShift Router/Ingress will work with some limitations for user-defined networks. + +### Non-Goals + +* Allowing different service CIDRs to be used in different networks. +* Localnet will not be supported initially for primary networks. +* Allowing multiple primary networks per namespace. 
+* Hybrid overlay support on user-defined networks. + +### Future-Goals + +* DNS lookup for pods returning records for IPs on the user-defined network. In the first phase DNS will return the pod +IP on the cluster default network instead. +* Admin ability to configure networks to have access to all services and/or expose services to be accessible from all +networks. +* Ability to advertise user-defined networks to external networks using BGP/EVPN. This will enable things like: + * External -> Pod ingress per VRF (Ingress directly to pod IP) + * Multiple External Gateway (MEG) in a BGP context, with ECMP routes +* Allow connection of multiple networks via explicit router API configuration. +* An API to allow user-defined ports for pods to be exposed on the cluster default network. This may be used for things +like promethus metric scraping. +* Potentially, coming up with an alternative solution for requiring the cluster default network connectivity to the pod, +and presenting the IP of the pod to Kubernetes as the user-defined primary network IP, rather than the cluster default +network IP. + +## Proposal + +By default in OVN-Kubernetes, pods are attached to what is known as the “cluster default" network, which is a routed network +divided up into a subnet per node. All pods will continue to have an attachment to this network, even when assigned a +different primary network. Therefore, when a pod is assigned to a user-defined network, it will have two interfaces, one +to the cluster default network, and one to the user-defined network. The cluster default network is required in order to provide: + +1. KAPI service access +2. DNS service access +3. Kubelet healthcheck probes to the pod + +All other traffic from the pod will be dropped by firewall rules on this network, when the pod is assigned a user-defined +primary network. Routes will be added to the pod to route KAPI/DNS traffic out towards the cluster default network. Note, +it may be desired to allow access to any Kubernetes service on the cluster default network (instead of just KAPI/DNS), +but at a minimum KAPI/DNS will be accessible. Furthermore, the IP of the pod from the Kubernetes API will continue to +show the IP assigned in the cluster default network. + +In OVN-Kubernetes secondary networks are defined using Network Attachment Definitions (NADs). For more information on +how these are configured, refer to: + +[https://github.com/ovn-org/ovn-kubernetes/blob/master/docs/features/multi-homing.md](https://github.com/ovn-org/ovn-kubernetes/blob/master/docs/features/multi-homing.md) + +The proposal here is to leverage this existing mechanism to create the network. A new field, “primaryNetwork” is +introduced to the NAD spec which indicates that this network should be used for the pod's primary network. Additionally, +a new "joinSubnet" field is added in order to specify the join subnet used inside the OVN network topology. An +example OVN-Kubernetes NAD may look like: + +``` +apiVersion: k8s.cni.cncf.io/v1 +kind: NetworkAttachmentDefinition +metadata: + name: l3-network + namespace: default +spec: + config: |2 + { + "cniVersion": "0.3.1", + "name": "l3-network", + "type": "ovn-k8s-cni-overlay", + "topology":"layer3", + "subnets": "10.128.0.0/16/24,2600:db8::/29", + "joinSubnet": "100.65.0.0/24,fd99::/64", + "mtu": 1400, + "netAttachDefName": "default/l3-network", + "primaryNetwork": true + } +``` + +The NAD must be created before any pods are created for this namespace. 
If cluster default networked pods existed before +the user-defined network was created, any further pods created in this namespace after the NAD was created will return +an error on CNI ADD. + +Only one primary network may exist per namespace. If more than one user-defined network is created with the +"primaryNetwork" key set to true, then future pod creations will return an error on CNI ADD until the network +configuration is corrected. + +A pod may not connect to multiple primary networks other than the cluster default. When the NAD is created, +OVN-Kubernetes will validate the configuration, as well as that no pods have been created in the namespace already. If +pods existed before the NAD was created, errors will be logged, and no further pods will be created in this namespace +until the network configuration is fixed. + +After creating the NAD, pods created in this namespace will connect to the newly defined network as their primary +network. The primaryNetwork key is used so that OVN-Kubernetes knows which network should be used, in case there are multiple +NADs created for a namespace (secondary networks). + +After a pod is created that shall connect to a user-defined network, it will then be annotated by OVN-Kubernetes with the +appropriate networking config: + +``` +trozet@fedora:~/Downloads$ oc get pods -o yaml -n ns1 +apiVersion: v1 +items: +- apiVersion: v1 + kind: Pod + metadata: + annotations: + k8s.ovn.org/pod-networks: 'k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["10.244.1.6/24","fd00:10:244:2::6/64"],"mac_address":"0a:58:0a:f4:01:06","routes":[{"dest":"10.244.0.0/16","nextHop":"10.244.1.1"},{"dest":"100.64.0.0/16","nextHop":"10.244.1.1"},{"dest":"fd00:10:244::/48","nextHop":"fd00:10:244:2::1"},{"dest":"fd98::/64","nextHop":"fd00:10:244:2::1"}],"type":default},"default/l3-network":{"ip_addresses":["10.128.1.3/24","2600:db8:0:2::3/64"],"mac_address":"0a:58:0a:80:01:03","gateway_ips":["10.128.1.1","2600:db8:0:2::1"],"routes":[{"dest":"10.128.0.0/16","nextHop":"10.128.1.1"},{"dest":"10.96.0.0/16","nextHop":"10.128.1.1"},{"dest":"100.64.0.0/16","nextHop":"10.128.1.1"},{"dest":"2600:db8::/29","nextHop":"2600:db8:0:2::1"},{"dest":"fd00:10:96::/112","nextHop":"2600:db8:0:2::1"},{"dest":"fd99::/64","nextHop":"2600:db8:0:2::1"}],"type":primary}}' + k8s.v1.cni.cncf.io/network-status: |- + [{ + "name": "default", + "interface": "eth0", + "ips": [ + "10.244.1.6" + ]}, + "mac": "0a:58:0a:f4:02:03", + "default": true, + "dns": {} + }] +status: + phase: Running + podIP: 10.244.1.6 + podIPs: + - ip: 10.244.1.6 + - ip: fd00:10:244:2::6 +``` + +In the above output the primary network is listed within the k8s.ovn.org/pod-networks annotation. However, the network-status +cncf annotation does not contain the primary network. This is due to the fact that OVNK will do an implicit CNI ADD for both +the default cluster network and the primary network. This way a user does not have to manually request that the pod is attached +to the primary network. + +Multiple namespaces may also be configured to use the same network. In this case the underlying OVN network will be the +same, following a similar pattern to what is [already supported today for secondary networks](https://docs.openshift.com/container-platform/4.15/networking/multiple_networks/configuring-additional-network.html#configuration-ovnk-network-plugin-json-object_configuring-additional-network). 
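For example, a second namespace (a hypothetical `ns2` here) could be attached to the same `l3-network` by
creating an additional NAD that reuses the same network `name`, with only the NAD's own namespace and the
`netAttachDefName` value differing. The values below simply mirror the earlier example and are illustrative:

```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: l3-network
  namespace: ns2
spec:
  # The same "name" in the CNI config identifies the same underlying OVN network.
  config: |2
    {
      "cniVersion": "0.3.1",
      "name": "l3-network",
      "type": "ovn-k8s-cni-overlay",
      "topology": "layer3",
      "subnets": "10.128.0.0/16/24,2600:db8::/29",
      "joinSubnet": "100.65.0.0/24,fd99::/64",
      "mtu": 1400,
      "netAttachDefName": "ns2/l3-network",
      "primaryNetwork": true
    }
```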
+ +### CRDs for Managing Networks + +Network Attachment Definitions (NADs) are the current way to configure the network in OVN-Kubernetes today, and the +method proposed in this enhancement. There are two major shortcomings of NAD: + +1. It has free-form configuration that depends on the CNI. There is no API validation of what a user enters, leading to +mistakes which are not caught at configuration time and may cause unexpected functional behavior at runtime. +2. It requires cluster admin RBAC in order to create the NAD. + +In order to address these issues, a proper CRD may be implemented which indirectly creates the NAD for OVN-Kubernetes. +This solution may consist of more than one CRD, namely an Admin based CRD and one that is namespace scoped for tenants. +The reasoning behind this is we want tenants to be able to create their own user-defined network for their namespace, +but we do not want them to be able to connect to another namespace’s network without permission. The Admin based version +would give higher level access and allow an administrator to create a network that multiple namespaces could connect to. +It may also expose more settings in the future for networks that would not be safe in the hands of a tenant, like +deciding if a network is able to reach other services in other networks. With tenants having access to be able to create +multiple networks, we need to consider potential attack vectors like a tenant trying to exhaust OVN-Kubernetes +resources by creating too many secondary networks. + +Furthermore, by utilizing a CRD, the status of the network CR itself can be used to indicate whether it is configured +by OVN-Kubernetes. For example, if a user creates a network CR and there is some problem (like pods already existed) then +an error status can be reported to the CR, rather than relying on the user to check OVN-Kubernetes logs. + +### IP Addressing + +As previously mentioned, one of the goals is to allow user-defined networks to have overlapping pod IP addresses. This +is enabled by allowing a user to configure what CIDR to use for pod addressing when they create the network. However, +this range cannot overlap with the default cluster CIDR used by the cluster default network today. + +Furthermore, the internal masquerade subnet and the Kubernetes service subnet will remain unique and will exist globally +to serve all networks. The masquerade subnet must be large enough to accommodate enough networks. Therefore, the +subnet size of the masquerade subnet is equal to the number of desired networks * 2, as we need 2 masquerade IPs per +network. The masquerade subnet remains localized to each node, so each node can use the same IP addresses and the size +of the subnet does not scale with number of nodes. + +The transit switch subnets may overlap between all networks. This network is just used for transport between nodes, and +is never seen by the pods or external clients. + +The join subnet of the default cluster network may not overlap with the join subnet of user-defined networks. This is +due to the fact that the pod is connected to the default network, as well as the user-defined primary network. The join +subnet is SNAT'ed by the GR of that network in order to facilitate ingress reply service traffic going back to the +proper GR, in case it traverses the overlay. 
For this reason, the pods may see this IP address and routes are added to +the pod to steer the traffic to the right interface (100.64.0.0/16 is the default cluster network join subnet): + +``` +[root@pod3 /]# ip route show +default via 10.244.1.1 dev eth0 +10.96.0.0/16 via 10.244.1.1 dev eth0 +10.244.0.0/16 via 10.244.1.1 dev eth0 +10.244.1.0/24 dev eth0 proto kernel scope link src 10.244.1.8 +100.64.0.0/16 via 10.244.1.1 dev eth0 +``` + +Since the pod needs routes for each join subnet, any layer 3 or layer 2 network that is attached to the pod needs a unique +join subnet. Consider a pod connected to the default cluster network, a user-defined, layer 3, primary network, and a +layer 2, secondary network: + +| Network | Pod Subnet | Node Pod Subnet | Join Subnet | +|-----------------|---------------|-----------------|---------------| +| Cluster Default | 10.244.0.0/16 | 10.244.0.0/24 | 100.64.0.0/16 | +| Layer 3 | 10.245.0.0/16 | 10.245.0.0/24 | 100.65.0.0/16 | +| Layer 2 | 10.246.0.0/16 | N/A | 100.66.0.0/16 | + + +The routing table would look like: + +``` +[root@pod3 /]# ip route show +default via 10.245.0.1 dev eth1 +10.96.0.0/16 via 10.245.0.1 dev eth1 +10.244.0.0/16 via 10.244.0.1 dev eth0 +10.245.0.0/16 via 10.245.0.1 dev eth1 +10.244.0.0/24 dev eth0 proto kernel scope link src 10.244.0.8 +10.245.0.0/24 dev eth1 proto kernel scope link src 10.245.0.8 +10.246.0.0/16 dev eth2 proto kernel scope link src 10.246.0.8 +100.64.0.0/16 via 10.244.0.1 dev eth0 +100.65.0.0/16 via 10.245.0.1 dev eth1 +100.66.0.0/16 via 10.246.0.1 dev eth2 +``` + +Therefore, when specifying a user-defined network it will be imperative to ensure that the networks a pod will connect to +do not have overlapping pod network or join network subnets. OVN-Kubernetes should be able to detect this scenario and +refuse to CNI ADD a pod with conflicts. + +### DNS + +DNS lookups will happen via every pod’s access to the DNS service on the cluster default network. CoreDNS lookups for +pods will resolve to the pod’s IP on the cluster default network. This is a limitation of the first phase of this feature +and will be addressed in a future enhancement. DNS lookups for services and external entities will function correctly. + +### Services + +Services in Kubernetes are namespace scoped. Any creation of a service in a namespace without a user-defined network +(using cluster default network as primary) will only be accessible by other namespaces also using the default network as +their primary network. Services created in namespaces served by user-defined networks, will only be accessible to +namespaces connected to the user-defined network. + +Since most applications require DNS and KAPI access, there is an exception to the above conditions where pods that are +connected to user-defined networks are still able to access KAPI and DNS services that reside on the cluster default +network. In the future, access to more services on the default network may be granted. However, that would require more +groundwork around enforcing network policy (which is evaluated typically after service DNAT) as potentially nftables +rules. Such work is considered a future enhancement and beyond the scope of this initial implementation. + +With this proposal, OVN-Kubernetes will check which network is being used for this namespace, and then only enable the +service there. The cluster IP of the service will only be available in the network of that service, except for KAPI and +DNS as previously explained. 
Host networked pods in a namespace with a user-defined primary network will also be limited +to only accessing the cluster IP of the services for that network. Load balancer IP and nodeport services are also +supported on user-defined networks. Service selectors are only able to select endpoints from the same namespace where the +service exists. Services that exist before the user-defined network is assigned to a namespace will result in +OVN-Kubernetes executing a re-sync on all services in that namespace, and updating all load balancers. Keep in mind that +pods must not exist in the namespace when the namespace is assigned to a new network or the new network assignment will +not be accepted by OVN-Kubernetes. + +Services in a user-defined network will be reachable by other namespaces that share the same network. + +As previously mentioned, Kubernetes API and DNS services will be accessible by all pods. + +Endpoint slices will provide the IPs of the cluster default network in Kubernetes API. For this implementation the required +endpoints are those IP addresses which reside on the user-defined primary network. In order to solve this problem, +OVN-Kubernetes may create its own endpoint slices or may choose to do dynamic lookups at runtime to map endpoints to +their primary IP address. Leveraging a second set of endpoint slices will be the preferred method, as it creates less +indirection and gives explicit Kube API access to what IP addresses are being used by OVN-Kubernetes. + +Kubelet health checks to pods are queried via the cluster default network. When endpoints are considered unhealthy they +will be removed from the endpoint slice, and thus their primary IP will be removed from the OVN load balancer. However, +it is important to note that the healthcheck is being performed via the cluster default network interface on the pod, +which ensures the application is alive, but does not confirm network connectivity of the primary interface. Therefore, +there could be a situation where OVN networking on the primary interface is broken, but the default interface continues +to work and reports 200 OK to Kubelet, thus rendering the pod serving in the endpoint slice, but unable to function. +Although this is an unlikely scenario, it is good to document. + +### Network Policy + +Network Policy will be fully supported for user-defined primary networks as it is today with the cluster default network. +However, configuring network policies that allow traffic between namespaces that connect to different user-defined +primary networks will have no effect. This traffic will not be allowed, as the networks have no connectivity to each other. +These types of policies will not be invalidated by OVN-Kubernetes, but the configuration will have no effect. Namespaces +that share the same user-defined primary network will still benefit from network policy that applies access control over +a shared network. Additionally, policies that block/allow cluster egress or ingress traffic will still be enforced for +any user-defined primary network. + +### API Extensions + +The main API extension here will be a namespace scoped network CRD as well a cluster scoped network CRD. These CRDs +will be registered by Cluster Network Operator (CNO). See the [CRDs for Managing Networks](#crds-for-managing-networks) +section for more information on how the CRD will work. There will be a finalizer on the CRDs, so that upon deletion +OVN-Kubernetes can validate that there are no pods still using this network. 
If there are pods still attached to this +network, the network will not be removed. + +### Workflow Description + +#### Tenant Use Case + +As a tenant I want to ensure when I create pods in my namespace their network traffic is isolated from other tenants on +the cluster. In order to ensure this, I first create a network CRD that is namespace scoped and indicate: + + - Type of network (Layer 3 or Layer 2) + - IP addressing scheme I wish to use (optional) + - Indicate this network will be the primary network + +After creating this CRD, I can check the status of the CRD to ensure it is actively being used as the primary network +for my namespace by OVN-Kubernetes. Once verified, I can now create pods and they will be in their own isolated SDN. + +#### Admin Use Case + +As an admin, I have a customer who has multiple namespaces and wants to connect them all to the same private network. In +order to accomplish this, I first create an admin network CRD that is cluster scoped and indicate: + +- Type of network (Layer 3 or Layer 2) +- IP addressing scheme I wish to use (optional) +- Indicate this network will be the primary network +- Selector to decide which namespaces may connect to this network. May use the ```kubernetes.io/metadata.name``` label to +guarantee uniqueness and eliminates the ability to falsify access. + +After creating the CRD, check the status to ensure OVN-Kubernetes has accepted this network to serve the namespaces +selected. Now tenants may go ahead and be provisioned their namespace. + +### Topology Considerations + +#### Hypershift / Hosted Control Planes + +The management cluster should have no reason to use multiple networks, unless it for security reasons it makes sense +to use native network isolation over network policy. It makes more sense that multiple primary networks will be used in +the hosted cluster in order to provide tenants with better isolation from each other, without the need for network +policy. There should be no hypershift platform-specific considerations with this feature. + +#### Standalone Clusters + +Full support. + +#### Single-node Deployments or MicroShift + +SNO and Microshift will both have full support for creating multiple primary networks. There may be some increased +resource usage when scaling up number of networks as it requires more configuration in OVN, and more processing in +OVN-Kubernetes. + +### Risks and Mitigations + +The biggest risk with this feature is hitting scale limitations. With many namespaces and networks, the number of +internal OVN objects will multiply, as well as internal kernel devices, rules, VRFs. There will need to be a large-scale +effort to determine how many networks we can comfortably support. + +There is also a risk of breaking secondary projects that integrate with OVN-Kubernetes, such as Metal LB or Submariner. + +### Drawbacks + +As described in the Design Details section, this proposal will require reserving two IPs per network in the masquerade +subnet. This is a private subnet only used internally by OVN-Kubernetes, but it will require increasing the subnet size +in order to accommodate multiple networks. Today this subnet by default is configured as a /29 for IPv4, and only 6 IP +addresses are used. With this new design, users will need to reconfigure their subnet to be large enough to hold the +desired number of networks. Note, API changes will need to be made in order to support changing the masquerade subnet +post-installation. 
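As a rough sizing sketch, a cluster that wants headroom for a large number of user-defined networks could
be installed with a much larger masquerade subnet than the /29 default. The example below assumes the
existing `internalMasqueradeSubnet` field under `gatewayConfig` in the cluster network operator API; whether
that value can be changed after installation is exactly the API gap noted above, so treat this as
illustrative only:

```yaml
# Illustrative only: enlarge the IPv4 masquerade subnet at install time.
# A /17 provides 32768 addresses; at 2 masquerade IPs per user-defined network
# (plus the handful already used by the cluster default network) that leaves
# room for roughly 16,000 networks on every node.
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  defaultNetwork:
    type: OVNKubernetes
    ovnKubernetesConfig:
      gatewayConfig:
        ipv4:
          internalMasqueradeSubnet: 169.254.0.0/17
```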
+ +### Implementation Details/Notes/Constraints + +OVN offers the ability to create multiple virtual topologies. As with secondary networks in OVN-Kubernetes today, +separate topologies are created whenever a new network is needed. The same methodology will be leveraged for this design. +Whenever a new network of type layer 3 or layer 2 is requested, a new topology will be created for that network where +pods may connect to. + +The limitation today with secondary networks is that there is only support for east/west traffic. This RFE will address +adding support for user-defined primary and secondary network north/south support. In order to support north/south +traffic, pods on different networks need to be able to egress, typically using the host’s IP. Today in shared gateway +mode we use a Gateway Router (GR) in order to provide this external connectivity, while in local gateway mode, the host +kernel handles SNAT’ing and routing out egress traffic. Ingress traffic also follows similar, and reverse paths. There +are some exceptions to these rules: + +1. MEG traffic always uses the GR to send and receive traffic. +2. Egress IP on the primary NIC always uses the GR, even in local gateway mode. +3. Egress Services always use the host kernel for egress routing. + +To provide an ingress/egress point for pods on different networks the most simple solution may appear to be to connect +them all to a single gateway router. This introduces an issue where now networks are all connected to a single router, +and there may be routing happening between networks that were supposed to be isolated from one another. Furthermore in +the future, we will want to extend these networks beyond the cluster, and to do that in OVN would require making a +single router VRF aware, which adds more complexity into OVN. + +The proposal here is to create a GR per network. With this topology, OVN will create a patch port per network to the +br-ex bridge. OVN-Kubernetes will be responsible for being VRF/network aware and forwarding packets via flows in br-ex +to the right GR. Each per-network GR will only have load balancers configured on it for its network, and only be able to +route to pods in its network. The logical topology would look something like this, if we use an example of having a +cluster default primary network, a layer 3 primary network, and a layer 2 primary network: + +![VRF Topology](images/VRFs.svg) + +In the above diagram, each network is assigned a unique conntrack zone and conntrack mark. These are required in order +to be able to handle overlapping networks egressing into the same VRF and SNAT’ing to the host IP. Note, the default +cluster network does not need to use a unique CT mark or zone, and will continue to work as it does today. This is due +to the fact that no user-defined network may overlap with the default cluster subnet. More details in the next section. + +#### Shared Gateway Mode + +##### Pod Egress + +On pod egress, the respective GR of that network will handle doing the SNAT to a unique masquerade subnet IP assigned to +this network. For example, in the above diagram packets leaving GR-layer3 would be SNAT’ed to 169.254.169.5 in zone 64005. +The packet will then enter br-ex, where flows in br-ex will match this packet, and then SNAT the packet to the node IP +in zone 0, and apply its CT mark of 5. Finally, the packet will be recirculated back to table 0, where the packet will +be CT marked with 1 in zone 64000, and sent out of the physical interface. 
In OVN-Kubernetes we use zone 64000 to track +things from OVN or the host and additionally, we mark packets from OVN with a CT Mark of 1 and packets from the host +with 2. Pseudo openflow rules would look like this (assuming node IP of 172.18.0.3): + +``` +pod-->GR(snat, 169.254.169.5, zone 64005)->br-ex(snat, 172.18.0.3, zone 0, mark 5, table=0) -->recirc table0 (commit +zone 64000, mark 1) -->eth0 +``` + +The above design will accommodate for overlapping networks with overlapping ports. The worst case scenario is if two +networks share the same address space, and two pods with identical IPs are trying to connect externally using the same +source and destination port. Although unlikely, we have to plan for this type of scenario. When each pod tries to send a +packet through their respective GR, SNAT’ing to the unique GR masquerade IP differentiates the conntrack entries. Now, +when the final SNAT occurs in br-ex with zone 0, they can be determined as different connections via source IP, and when +SNAT’ing to host IP, conntrack will detect a collision using the same layer 4 port, and choose a different port to use. + +When reply traffic comes back into the cluster, we must now submit the packet to conntrack to find which network this +traffic belongs to. The packet is always first sent into zone 64000, where it is determined whether this packet +belonged to OVN (CT mark of 1) or the host. Once identified by CT mark as OVN traffic, the packet will then be unSNAT’ed +in zone 0 via br-ex rules and the CT mark restored of which network it belonged to. Finally, we can send the packet to +the correct GR via the right patch port, by matching on the restored CT Mark. From there, OVN will handle unSNAT’ing the +masquerade IP and forward the packet to the original pod. + +To support KubeVirt live migration the GR LRP will have an extra address with the configured gateway for the layer2 +subnet (to allow the gateway IP to be independent of the node where the VM is running on). After live migration succeeds, +OVN should send a GARP for VMs to clean up its ARP tables since the gateway IP has different mac now. + +The live migration feature at layer 2 described here will work only with OVN interconnect (OVN IC, which is used by OCP). +Since there is no MAC learning between zones, so we can have the same extra address on every gateway router port, basically +implementing anycast for this SVI address. + +Following is a picture that illustrate all these bits with a topology + +![Layer 2 Egress Topology](images/multi-homing-l2-gw.svg) + +##### Services + +When ingress service traffic enters br-ex, there are flows installed that steer service traffic towards the OVN GR. With +additional networks, these flows will be modified to steer traffic to the correct GR-<network>’s patch port. + +When a host process or host networked pod on a Kubernetes node initiates a connection to a service, iptables rules will +DNAT the nodeport or loadbalancer IP into the cluster IP, and then send the traffic via br-ex where it is masqueraded +and sent into the OVN GR. These flows can all be modified to detect the service IP and then send to the correct +GR-<network> patch port. 
For example, in the br-ex (breth0) bridge today we have flows that match on packets sent +to the service CIDR (10.96.0.0/24): + +``` +[root@ovn-worker ~]# ovs-ofctl dump-flows breth0 table=0 | grep 10.96 + cookie=0xdeff105, duration=22226.373s, table=0, n_packets=41, n_bytes=4598, idle_age=19399,priority=500,ip,in_port=LOCAL,nw_dst=10.96.0.0/16 actions=ct(commit,table=2,zone=64001,nat(src=169.254.169.2)) +``` + +Packets that are destined to the service CIDR are SNAT'ed to the masquerade IP of the host (169.254.169.2) and then +sent to the dispatch table 2: + +``` +[root@ovn-worker ~]# ovs-ofctl dump-flows breth0 table=2 + cookie=0xdeff105, duration=22266.310s, table=2, n_packets=41, n_bytes=4598, actions=mod_dl_dst:02:42:ac:12:00:03,output:"patch-breth0_ov" +``` + +In the above flow, all packets have the dest MAC address changed to be that of the OVN GR, and then sent on the patch port +towards the OVN GR. With multiple networks, host access to cluster IP service flows will now be modified to be on a per +cluster IP basis. For example, if we assume two services exist on two user defined namespaces with cluster IPs 10.96.0.5 +and 10.96.0.6. The flows would look like: + +``` +[root@ovn-worker ~]# ovs-ofctl dump-flows breth0 table=0 | grep 10.96 + cookie=0xdeff105, duration=22226.373s, table=0, n_packets=41, n_bytes=4598, idle_age=19399,priority=500,ip,in_port=LOCAL,nw_dst=10.96.0.5 actions=set_field:2->reg1,ct(commit,table=2,zone=64001,nat(src=169.254.169.2)) + cookie=0xdeff105, duration=22226.373s, table=0, n_packets=41, n_bytes=4598, idle_age=19399,priority=500,ip,in_port=LOCAL,nw_dst=10.96.0.6 actions=set_field:3->reg1,ct(commit,table=2,zone=64001,nat(src=169.254.169.2)) +``` + +The above flows are now per cluster IP and will send the packet to the dispatch table while also setting unique register +values to differentiate which OVN network these packets should be delivered to: + +``` +[root@ovn-worker ~]# ovs-ofctl dump-flows breth0 table=2 + cookie=0xdeff105, duration=22266.310s, table=2, n_packets=41, n_bytes=4598, reg1=0x2 actions=mod_dl_dst:02:42:ac:12:00:05,output:"patch-breth0-net1" + cookie=0xdeff105, duration=22266.310s, table=2, n_packets=41, n_bytes=4598, reg1=0x3 actions=mod_dl_dst:02:42:ac:12:00:06,output:"patch-breth0-net2" +``` + +Furthermore, host networked pod access to services will be restricted to the network it belongs to. For more information +see the [Host Networked Pods](#host-networked-pods) section. + +Additionally, in the case where there is hairpin service traffic to the host +(Host->Service->Endpoint is also the host), the endpoint reply traffic will need to be distinguishable on a per network +basis. In order to achieve this, each OVN GR’s unique masquerade IP will be leveraged. + +For service access towards KAPI/DNS or potentially other services on the cluster default network, there are two potential +technical solutions. Assume eth0 is the pod interface connected to the cluster default network, and eth1 is connected to the +user-defined primary network: + +1. Add routes for KAPI/DNS specifically into the pod to go out eth0, while all other service access will go to eth1. +This will then just work normally with the load balancers on the switches for the respective networks. + +2. Do not send any service traffic out of eth0, instead all service traffic goes to eth1. In this case all service +traffic is flowing through the user-defined primary network, where only load balancers for that network are configured +on that network's OVN worker switch. 
Therefore, packets to KAPI/DNS (services not on this network) are not DNAT'ed at
+the worker switch and are instead forwarded onwards to the ovn_cluster_router_<user-defined network> or
+GR-<node-user-defined-network> for layer 3 or layer 2 networks, respectively. This router is
+configured to send service CIDR traffic to ovn-k8s-mp0-<user-defined network>. Iptables rules in the host only permit
+access to KAPI/DNS and drop all other service traffic coming from ovn-k8s-mp0-<user-defined network>. The traffic then
+gets routed to br-ex and the default GR, where it hits the OVN load balancer and is forwarded to the right endpoint.
+
+While the second option is more complex, it allows for not configuring routes to service addresses in the pod that could
+hypothetically change.
+
+##### Egress IP
+
+This feature works today by labeling and choosing a node+network to be used for egress, and then OVN logical routes and
+logical route policies are created which steer traffic from a pod towards a specific gateway router (for primary network
+egress). From there the packets are SNAT’ed by the OVN GR to the egress IP, and sent to br-ex. Egress IP is cluster
+scoped, but applies to selected namespaces, which will allow us to only apply the SNAT and routes to the GR and OVN
+topology elements of that network. In the layer 3 case, the current design used today for the cluster default primary
+network will need some changes. Since Egress IP may be served on multiple namespaces and thus networks, it is possible
+that there could be a collision as previously mentioned in the Pod Egress section. Therefore, the same solution provided
+in that section where the GR SNATs to the masquerade subnet must be utilized. However, once the packet arrives in br-ex
+we will need a way to tell if it was sent from a pod affected by a specific egress IP. To address this, pkt_mark will be
+used to mark egress IP packets and signify to br-ex which egress IP to SNAT to. An example where the egress IP
+1.1.1.1 maps to pkt_mark 10 would look something like this:
+
+![Egress IP VRF SGW](images/egress-ip-vrf-sgw.svg)
+
+For layer 2, egress IP has never been supported before. With the IC design, there is no need to have an
+ovn_cluster_router and join switch separating the layer 2 switch network (transit switch) from the GR. For non-IC cases
+this might be necessary, but for OpenShift purposes we will only describe the behavior of IC in this proposal. In the layer
+2 IC model, GRs per node on a network will all be connected to the layer 2 transit switch:
+
+![Egress IP Layer 2](images/egress-ip-l2-primary.svg)
+
+In the above diagram, Node 2 is chosen to be the egress IP node for any pods in namespace A. Pod 1 and Pod 2 have
+default gateway routes to their respective GR on their node. When egress traffic leaves Pod 2, it is sent towards its
+GR-A on node 2, where it is SNAT’ed to the egress IP and the traffic is sent to br-ex. For Pod 1, its traffic is sent to
+its GR-A on Node 1, where it is then rerouted towards GR-A on Node 2 for egress.
+
+##### Egress Firewall
+
+Egress firewall is enforced at the OVN logical switch, and this proposal has no effect on its functionality.
+
+##### Egress QoS
+
+Egress QoS is namespace scoped and functions by marking packets at the OVN logical switch, and this proposal has no
+effect on its functionality.
+
+##### Egress Service
+
+Egress service is namespace scoped and its primary function is to SNAT egress packets to a load balancer IP. 
As +previously mentioned, the feature works the same in shared and local gateway mode, by leveraging the local gateway mode +path. Therefore, its design will be covered in the Local Gateway Mode section of the Design Details. + +##### Multiple External Gateways (MEG) + +There will be no support for MEG or pod direct ingress on any network other than the primary, cluster default network. +This support may be enhanced later by extending VRFs/networks outside the cluster. + +#### Local Gateway Mode + +With local gateway mode, egress/ingress traffic uses the kernel’s networking stack as a next hop. OVN-Kubernetes +leverages an interface named “ovn-k8s-mp0” in order to facilitate sending traffic to and receiving traffic from the +host. For egress traffic, the host routing table decides where to send the egress packet, and then the source IP is +masqueraded to the node IP of the egress interface. For ingress traffic, the host routing table steers packets destined +for pods via ovn-k8s-mp0 and SNAT’s the packet to the interface address. + +For multiple networks to use local gateway mode, some changes are necessary. The ovn-k8s-mp0 port is a logical port in +the OVN topology tied to the cluster default network. There will need to be multiple ovn-k8s-mp0 ports created, one per +network. Additionally, all of these ports cannot reside in the default VRF of the host network. Doing so would result in +an inability to have overlapping subnets, as well as the host VRF would be capable of routing packets between namespace +networks, which is undesirable. Therefore, each ovn-k8s-mp0-<network> interface must be placed in its own VRF: + +![Local GW Node Setup](images/local-gw-node-setup-vrfs.svg) + +The VRFs will clone the default routing table, excluding routes that are created by OVN-Kubernetes for its networks. +This is similar to the methodology in place today for supporting +[Egress IP with multiple NICs](https://github.com/openshift/enhancements/blob/master/enhancements/network/egress-ip-multi-nic.md). + +##### Pod Egress + +Similar to the predicament outlined in Shared Gateway mode, we need to solve the improbable case where two networks have +the same address space, and pods with the same IP/ports are trying to talk externally to the same server. In this case, +OVN-Kubernetes will reserve an extra IP from the masquerade subnet per network. This masquerade IP will be used to SNAT +egress packets from pods leaving via mp0. The SNAT will be performed by ovn_cluster_router for layer 3 networks and +the gateway router (GR) for layer 2 networks using configuration like: + +``` +ovn-nbctl --gateway-port=rtos-ovn-worker lr-nat-add DR snat 10.244.0.6 169.254.169.100 +``` + +Now when egress traffic arrives in the host via mp0, it will enter the VRF, where clone routes will route the packet as +if it was in the default VRF out a physical interface, typically towards br-ex, and the packet is SNAT’ed to the host IP. + +When the egress reply comes back into the host, iptables will unSNAT the packet and the destination will be +169.254.169.100. At this point, an ip rule will match the destination on the packet and do a lookup in the VRF where a +route specifying 169.254.169.100/32 via 10.244.0.1 will cause the packet to be sent back out the right mp0 port for the +respective network. + +Note, the extra masquerade SNAT will not be required on the cluster default network's ovn-k8s-mp0 port. 
This will +preserve the previous behavior, and it is not necessary to introduce this SNAT since the default cluster network subnet +may not overlap with user-defined networks. + +##### Services + +Local gateway mode services function similar to the behavior described in host -> service description in the Shared +Gateway Mode Services section. When the packet enters br-ex, it is forwarded to the host, where it is then DNAT’ed to +the cluster IP and typically sent back into br-ex towards the OVN GR. This traffic will behave the same as previously +described. There are some exceptions to this case, namely when external traffic policy (ETP) is set to local. In this +case traffic is DNAT’ed to a special masquerade IP (169.254.169.3) and sent via ovn-k8s-mp0. There will need to be IP +rules to match on the destination node port and steer traffic to the right VRF for this case. Additionally, with internal +traffic policy (ITP) is set to local, packets are marked in the mangle table and forwarded via ovn-k8s-mp0 with an IP +rule and routing table 7. This logic will need to ensure the right ovn-k8s-mp0 is chosen for this case as well. + +##### Egress IP + +As previously mentioned, egress IP on the primary NIC follows the pathway of shared gateway mode. The traffic is not +routed by the kernel networking stack as a next hop. However, for multi-nic support, packets are sent into the kernel +via the ovn-k8s-mp0 port. Here the packets are matched on, sent to an egress IP VRF, SNAT’ed and sent out the chosen +interface. The detailed steps for a pod with IP address 10.244.2.3 affected by egress IP look like: + +1. Pod sends egress packet, arrives in the kernel via ovn-k8s-mp0 port, the packet is marked with 1008 (0x3f0 in hex) +if it should skip egress IP. It has no mark if the packet should be affected by egress IP. +2. IP rules match the source IP of the packet, and send it into an egress IP VRF (rule 6000): + + ``` + sh-5.2# ip rule + + 0: from all lookup local + 30: from all fwmark 0x1745ec lookup 7 + 5999: from all fwmark 0x3f0 lookup main + 6000: from 10.244.2.3 lookup 1111 + 32766: from all lookup main + 32767: from all lookup default + ``` + +3. Iptables rules save the packet mark in conntrack. This is only applicable to packets that were marked with 1008 and +are bypassing egress IP: + + ``` + sh-5.2# iptables -t mangle -L PREROUTING + + Chain PREROUTING (policy ACCEPT) + target prot opt source destination + CONNMARK all -- anywhere anywhere mark match 0x3f0 CONNMARK save + CONNMARK all -- anywhere anywhere mark match 0x0 CONNMARK restore + ``` + +4. VRF 1111 has a route in it to steer the packet to the right egress interface: + + ``` + sh-5.2# ip route show table 1111 + default dev eth1 + ``` + +5. IPTables rules in NAT table SNAT the packet: + + ``` + -A OVN-KUBE-EGRESS-IP-MULTI-NIC -s 10.244.2.3/32 -o eth1 -j SNAT --to-source 10.10.10.100 + ``` + +6. For reply bypass traffic, the 0x3f0 mark is restored, and ip rules 5999 send it back into default VRF for routing +back into mp0 for non-egress IP packets. This is rule and connmark restoring is required for the packet to pass the +reverse path filter (RPF) check. For egress IP reply packets, there is no connmark restored and the packets hit the +default routing table to go back into mp0. + +This functionality will continue to work, with ip rules steering the packets from the per network VRF to the appropriate +egress IP VRF. CONNMARK will continue to be used so that return traffic is sent back to the correct VRF. 
Step 5 in the +above may need to be tweaked to match on mark in case 2 pods have overlapping IPs, and are both egressing +the same interface with different Egress IPs. The flow would look something like this: + +![Egress IP VRF LGW](images/egress-ip-vrf-lgw.svg) + +##### Egress Firewall + +Egress firewall is enforced at the OVN logical switch, and this proposal has no effect on its functionality. + + +##### Egress QoS + +Egress QoS is namespace scoped and functions by marking packets at the OVN logical switch, and this proposal has no +effect on its functionality. + +##### Egress Service + +Egress service functions similar to Egress IP in local gateway mode, with the exception that all traffic paths go +through the kernel networking stack. Egress Service also uses IP rules and VRFs in order to match on traffic and forward +it out the right network (if specified in the CRD). It uses iptables in order to SNAT packets to the load balancer IP. +Like Egress IP, with user-defined networks there will need to be IP rules with higher precedence to match on packets from +specific networks and direct them to the right VRF. + +##### Multiple External Gateways (MEG) + +There will be no support for MEG or pod direct ingress on any network other than the primary, cluster default network. +Remember, MEG works the same way in local or shared gateway mode, by utilizing the shared gateway path. This support may +be enhanced later by extending VRFs/networks outside the cluster. + +#### Kubernetes Readiness/Liveness Probes + +As previously mentioned, Kubelet probes will continue to work. This includes all types of probes such as TCP, HTTP or +GRPC. Additionally, we want to restrict host networked pods in namespaces that belong to user-defined networks from +being able to access pods in other networks. For that reason, we need to block host networked pods from being able to +access pods via the cluster default network. In order to do this, but still allow Kubelet to send probes; the cgroup +module in iptables will be leveraged. For example: + +``` +root@ovn-worker:/# iptables -L -t raw -v +Chain PREROUTING (policy ACCEPT 6587 packets, 1438K bytes) + pkts bytes target prot opt in out source destination + +Chain OUTPUT (policy ACCEPT 3003 packets, 940K bytes) + pkts bytes target prot opt in out source destination + 3677 1029K ACCEPT all -- any any anywhere anywhere cgroup kubelet.slice/kubelet.service + 0 0 ACCEPT all -- any any anywhere anywhere ctstate ESTABLISHED + 564 33840 DROP all -- any any anywhere 10.244.0.0/16 +``` + +From the output we can see that traffic to the pod network ```10.244.0.0/16``` will be dropped by default. However, +traffic coming from kubelet will be allowed. + +#### Host Networked Pods + +##### VRF Considerations + +By encompassing VRFs into the host, this introduces some constraints and requirements for the behavior of host networked +type pods. If a host networked pod is created in a Kubernetes namespace that has a user-defined network, it should be +confined to only talking to ovn-networked pods on that same user-defined network. + +With Linux VRFs, different socket types behave differently by default. Raw, unbound sockets by default are allowed to +listen and span multiple VRFs, while TCP, UDP, SCTP and other protocols are restricted to the default VRF. 
There are
+settings to control this behavior via sysctl, with the defaults looking like this:
+
+```
+trozet@fedora:~/Downloads/ip-10-0-169-248.us-east-2.compute.internal$ sudo sysctl -A | grep net | grep l3mdev
+net.ipv4.raw_l3mdev_accept = 1
+net.ipv4.tcp_l3mdev_accept = 0
+net.ipv4.udp_l3mdev_accept = 0
+```
+
+Note, there is no current [support in the kernel for SCTP](https://lore.kernel.org/netdev/bf6bcf15c5b1f921758bc92cae2660f68ed6848b.1668357542.git.lucien.xin@gmail.com/),
+and it does not look like there is support for IPv6. Given the desired behavior to restrict host networked pods to
+talking to only pods in their namespace/network, it may make sense to set raw_l3mdev_accept to 0. This is set to 1 by
+default to allow legacy ping applications to work over VRFs. Furthermore, a user modifying sysctl settings to allow
+applications to listen across all VRFs will be unsupported. Reasons include the odd behavior and interactions
+that can occur with applications communicating across multiple VRFs, as well as the fact that this would break the native
+network isolation paradigm offered by this feature.
+
+For host network pods to be able to communicate with pod IPs on their user-defined network, the only supported method
+will be for the applications to bind their socket to the VRF device. Many applications will not be able to support this,
+so in the future it makes sense to come up with a better solution. One possibility is to use ebpf in order to intercept
+the socket bind call of an application (that typically will bind to INADDR_ANY) and force it to bind to the VRF device.
+Note, host network pods will still be able to communicate with pods via services that belong to their user-defined network
+without any limitations. See the next section [Service Access](#service-access) for more information.
+
+Keep in mind that if a host network pod runs and does not bind to the VRF device, it will be able to communicate on the
+default VRF. This means the host networked pod will be able to talk to other host network pods. However, due to nftables
+rules in the host, it will not be able to talk to OVN networked pods via the default cluster network/VRF.
+
+For more details on how VRFs function in Linux and the settings discussed in this section, refer to
+[https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/networking/vrf.rst?h=v6.1](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/networking/vrf.rst?h=v6.1).
+
+##### Service Access
+
+Host networked pods in a user-defined network will be restricted to only accessing services in either:
+1. The cluster default network.
+2. The user-defined network to which the host networked pod's namespace belongs.
+
+This will be enforced by iptables/nftables rules added that match on the cgroup of the host networked pod. 
For example: + +``` +root@ovn-worker:/# iptables -L -t raw -v +Chain PREROUTING (policy ACCEPT 60862 packets, 385M bytes) + pkts bytes target prot opt in out source destination + +Chain OUTPUT (policy ACCEPT 36855 packets, 2504K bytes) + pkts bytes target prot opt in out source destination + 17 1800 ACCEPT all -- any any anywhere 10.96.0.1 cgroup /kubelet.slice/kubelet-kubepods.slice/kubelet-kubepods-besteffort.slice/kubelet-kubepods-besteffort-pod992d3b9e_3f85_42e2_9558_9d4273d4236f.slice +23840 6376K ACCEPT all -- any any anywhere anywhere cgroup kubelet.slice/kubelet.service + 0 0 ACCEPT all -- any any anywhere anywhere ctstate ESTABLISHED + 638 37720 DROP all -- any any anywhere 10.244.0.0/16 + 28 1440 DROP all -- any any anywhere 10.96.0.0/16 +``` +In the example above, access to the service network of ```10.96.0.0/16``` is denied by default. However, one host networked +pod is given access to the 10.96.0.1 cluster IP service, while other host networked pods are blocked from access. + +#### OpenShift Router/Ingress + +One of the goals of this effort is to allow ingress services to pods that live on primary user-defined networks. + +OpenShift router is deployed in two different ways: + +* On-prem: deployed as a host networked pod running haproxy, using a keepalived VIP. +* Cloud: deployed as an OVN networked pod running haproxy, fronted by a cloud external load balancer which forwards +traffic to a nodeport service with ETP Local. + +OpenShift router is implemented as HAProxy, which is capable of terminating TLS or allowing TLS passthrough and +forwarding HTTP connections directly to endpoints of services. + +##### On-Prem + +Due to previously listed constraints in the Host Networked Pods section, host networked pods are unable to talk directly +to OVN networked pods unless bound to the VRF itself. HAProxy needs to communicate with potentially many VRFs, so +therefore this functionality does not work. However, a host networked pod is capable of reaching any service CIDR. In +order to solve this problem, openshift-ingress-operator will be modified so that when it configures HAProxy, it will +check if the service belongs to a user-defined network, and then use the service CIDR as the forwarding path. + +##### Cloud + +In the cloud case, the ovn-networked HAProxy pod cannot reach pods on other networks. Like the On-Prem case, HAProxy +will be modified for user-defined networks to forward to the service CIDR rather than the endpoint directly. The caveat +here is that an OVN networked pod only has service access to: + +1. CoreDNS and Kube API via the cluster default network +2. Any service that resides on the same namespace as the pod + +To solve this problem, the OpenShift ingress namespace will be given permission to access any service, by forwarding +service access via mp0, and allowing the access with ipTables rules. In OVN-Kubernetes upstream, we can allow a +configuration to list certain namespaces that are able to have access to all services. This should be reserved for +administrators to configure. + +A diagram of how the Cloud traffic path will work: + +![OpenShift Router Multi Network](images/openshift-router-multi-network.svg) + +##### Limitations + +With OpenShift Router, stickiness is used to ensure that traffic for a session always reaches the same endpoint. For TLS +passthrough, HAProxy uses the source IP address as that's all it has (it's not decrypting the connection, so it can't +look at cookies, for example). Otherwise, it uses a cookie. 
+ +By changing the behavior of HAProxy to forward to a service CIDR instead of an endpoint directly, we cannot accommodate +using cookies for stickiness. However, a user can configure the service with sessionAffinity, which will cause OVN to +use stickiness by IP. Therefore, users who wish to use OpenShift router with user-defined networks will be limited to +only enabling session stickiness via client IP. + +## Test Plan + +* E2E upstream and downstream jobs covering supported features across multiple networks. +* E2E tests which ensure network isolation between OVN networked and host networked pods, services, etc. +* E2E tests covering network subnet overlap and reachability to external networks. +* Scale testing to determine limits and impact of multiple user-defined networks. This is not only limited to OVN, but + also includes OVN-Kubernetes’ design where we spawn a new network controller for every new network created. +* Integration testing with other features like IPSec to ensure compatibility. + +## Graduation Criteria + +### Dev Preview -> Tech Preview + +There will be no dev or tech preview for this feature. + +### Tech Preview -> GA + +Targeting GA in OCP version 4.17. + +### Removing a deprecated feature + +N/A + +## Upgrade / Downgrade Strategy + +There are no specific requirements to be able to upgrade. The cluster default primary network will continue to function +as it does in previous releases on upgrade. Only newly created namespaces and pods will be able to leverage this feature +post upgrade. + +## Version Skew Strategy + +N/A + +## Operational Aspects of API Extensions + +Operating with user-defined primary networks using a network CRD API will function mostly the same as default cluster +network does today. Creating the user-defined networks shall be done for namespace(s) before pods are created in that +namespace. Otherwise, pods will fail during CNI add. Any functional limitations with user-defined networks are outlined +in other sections of this document. + +## Support Procedures + +## Alternatives From d0883f9c7a30ebecd6c73b9ed5a194a58e6b0ae2 Mon Sep 17 00:00:00 2001 From: Hemant Kumar Date: Mon, 1 Jul 2024 11:59:57 -0400 Subject: [PATCH 19/53] Add a note about ocp-build-data changes to be merged same time as csi-operato --- enhancements/storage/csi-driver-operator-merge.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/enhancements/storage/csi-driver-operator-merge.md b/enhancements/storage/csi-driver-operator-merge.md index bfd1f69ec6..fb4316e309 100644 --- a/enhancements/storage/csi-driver-operator-merge.md +++ b/enhancements/storage/csi-driver-operator-merge.md @@ -531,6 +531,8 @@ But once your operator has been changed to conform to new code in csi-operator r 1. Make sure that `Dockerfile.` at top of the `csi-operator` tree refers to new location of code and not older `legacy/` location.See example of existing Dockerfiles. 2. After your changes to `csi-operator` are merged, you should remove the old location from cachito - https://github.com/openshift-eng/ocp-build-data/pull/4219 +Please note the changes to `ocp-build-data`should be merged almost same time as changes into `csi-operator`'s Dockerfile are merged, otherwise we risk builds from breaking. 
+ ## Implementation History Major milestones in the life cycle of a proposal should be tracked in `Implementation From e7d5e020cf8d2ccd9cdf7d7884efdc06229270c5 Mon Sep 17 00:00:00 2001 From: Hemant Kumar Date: Mon, 1 Jul 2024 12:21:46 -0400 Subject: [PATCH 20/53] Add steps for post merge cleanup --- enhancements/storage/csi-driver-operator-merge.md | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/enhancements/storage/csi-driver-operator-merge.md b/enhancements/storage/csi-driver-operator-merge.md index fb4316e309..bd4b24ad59 100644 --- a/enhancements/storage/csi-driver-operator-merge.md +++ b/enhancements/storage/csi-driver-operator-merge.md @@ -533,6 +533,16 @@ But once your operator has been changed to conform to new code in csi-operator r Please note the changes to `ocp-build-data`should be merged almost same time as changes into `csi-operator`'s Dockerfile are merged, otherwise we risk builds from breaking. +### Post migration changes + +Once migration is complete, we should perform following post migration steps to ensure that we are not left over with legacy stuff: + +1. Mark existing `openshift/--operator` repository as deprecated. +2. Ensure that we have test manifest available in `test/e2e2` directory. +3. Make changes into `release` repository so as it longer relies on anything from `legacy` directory. See - https://github.com/openshift/release/pull/49655 for example. +4. Remove code from `vendor/legacy` in `csi-operator` repository. + + ## Implementation History Major milestones in the life cycle of a proposal should be tracked in `Implementation From c9f129f619d89b200b104d274c6759cbb839436e Mon Sep 17 00:00:00 2001 From: Sinny Kumari Date: Mon, 1 Jul 2024 21:54:04 +0200 Subject: [PATCH 21/53] Update approvers list to yuqi-zhang for MCO --- OWNERS | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/OWNERS b/OWNERS index 665f088903..7412a9df95 100644 --- a/OWNERS +++ b/OWNERS @@ -63,7 +63,7 @@ approvers: - sadasu - sdodson - shawn-hurley - - sinnykumari + - yuqi-zhang - sjenning - soltysh - sosiouxme From a6e7d5cea4669f4da12450c59b6d812c1af5cfb7 Mon Sep 17 00:00:00 2001 From: Patryk Matuszak Date: Tue, 2 Jul 2024 13:39:44 +0200 Subject: [PATCH 22/53] support only single size of hugepages due to bug in tuned --- .../low-latency-workloads-on-microshift.md | 24 ++++++++++--------- 1 file changed, 13 insertions(+), 11 deletions(-) diff --git a/enhancements/microshift/low-latency-workloads-on-microshift.md b/enhancements/microshift/low-latency-workloads-on-microshift.md index 6abf136116..0f1110e4c4 100644 --- a/enhancements/microshift/low-latency-workloads-on-microshift.md +++ b/enhancements/microshift/low-latency-workloads-on-microshift.md @@ -273,7 +273,12 @@ New `microshift-baseline` tuned profile will be created and will include existin `/etc/tuned/microshift-baseline-variables.conf` will be used by users to provide custom values for settings such as: - isolated CPU set -- hugepage count (both 2M and 1G) +- hugepage size and count + > Warning: Due to bug in tuned, it is impossible to set up both 2M and 1G hugepages using + > kernel arguments on ostree system (see https://issues.redhat.com/browse/RHEL-45836). + > + > Therefore `microshift-baseline` will "allow" only for single size of hugepages. + > Users are welcomed though to introduce non-kernel-args ways of setting up hugepages in their profiles. 
- additional kernel arguments ```ini @@ -285,8 +290,8 @@ include=cpu-partitioning include=/etc/tuned/microshift-baseline-variables.conf [bootloader] -cmdline_microshift=+default_hugepagesz=${hugepagesDefaultSize} hugepagesz=2M hugepages=${hugepages2M} hugepagesz=1G hugepages=${hugepages1G} -cmdline_additionalArgs=+${additionalArgs} +cmdline_microshift_hp=+hugepagesz=${hugepages_size} hugepages=${hugepages} +cmdline_additional_args=+${additional_args} ``` ```ini @@ -311,17 +316,14 @@ isolated_cores=${f:calc_isolated_cores:1} # no_balance_cores=5-10 ### microshift-baseline variables -# Default hugepages size -hugepagesDefaultSize = 2M +# Size of the hugepages +hugepages_size = 0 -# Amount of 2M hugepages -hugepages2M = 128 - -# Amount of 1G hugepages -hugepages1G = 0 +# Amount of the hugepages +hugepages = 0 # Additional kernel arguments -additionalArgs = "" +additional_args = ``` #### `microshift-tuned.service` configuration From 10efcd2e07f60fb964d989fe5d2dd7b73e851514 Mon Sep 17 00:00:00 2001 From: Fabio Bertinatto Date: Thu, 15 Jun 2023 14:43:36 -0300 Subject: [PATCH 23/53] Add a continuous Kubernetes rebase proposal --- dev-guide/kubernetes-continuous-rebase.md | 153 ++++++++++++++++++++++ 1 file changed, 153 insertions(+) create mode 100644 dev-guide/kubernetes-continuous-rebase.md diff --git a/dev-guide/kubernetes-continuous-rebase.md b/dev-guide/kubernetes-continuous-rebase.md new file mode 100644 index 0000000000..a902031e4a --- /dev/null +++ b/dev-guide/kubernetes-continuous-rebase.md @@ -0,0 +1,153 @@ +--- +title: kubernetes-rebase +authors: + - "@fbertina" +reviewers: + - "@soltysh" +approvers: + - "@soltysh" +creation-date: 2023-06-15 +last-updated: 2023-06-15 +--- + +# Kubernetes Continuous Rebase + +## Goal + +The main goal of this proposal is to proactively identify and address +any potential issues that may arise during the upcoming rebase +process. + +The desired outcome is to be able to land the rebase PR significantly +earlier in the process, potentially aligning with the release of the +upstream tag. + +## Proposal + +Currently, the rebase work is typically spread out over a period of 1 +or 2 months. However, it can potentially be distributed throughout the +development cycle of Kubernetes. To achieve this, we could have an OCP +branch with a continually updated Kubernetes codebase, allowing most +of the work to be completed even before the rebase process begins. + +The main approach involves applying our downstream patches against the +upstream master branch on a daily basis. It is expected that some +patches may fail to be applied multiple times during the development +cycle. However, as soon as such failures occur, we will receive +notifications, and the necessary fixes will be applied to the +downstream patches or the upstream code. + +Implementing this approach brings several benefits: + +1. The rebase process becomes less time-sensitive. +2. We receive early signals if an upstream change breaks OCP, enabling + us to address the issue promptly either in the upstream code or on + our side. +3. The rebase PR should be ready to be landed as soon as the upstream + code becomes generally available (GA). + +To implement this proposal, the following steps are required: + +### Watcher + +For each OCP release, we will designate a watcher to participate in +the process. Ideally, it should be the same person who will execute +the final rebase. + +A watcher is responsible for ensuring that the remaining steps +outlined below are executed without errors. 
+ +Although some manual work is required, it should not occupy their +entire daily working time. + +### A -next branch (optional) + +For each of the dependencies listed below, a new branch called +`ocp-next` is created with their Kubernetes dependencies updated: + +* openshift/api +* openshift/client-go +* openshift/library-go +* openshift/apiserver-library-go + +Initially, this can be done manually on a weekly basis. In the future, +certain parts of this process can potentially be automated, requiring +manual intervention only when the automation fails. + +This process should already uncover some future issues, requiring +fixes on unit tests or Makefiles for instance. + +### CI Job + +The goal of the CI job is to detect if our downstream patches create +any code conflicts when applied to the upstream code. In addition to +that, it will uncover potential issues with dependencies and +generated code. + +In short, the new CI job will: + +1. Take a series of downstream patches and apply them against the + upstream code. +2. Pin the dependencies mentioned above to the HEAD of their + respective `ocp-next` branches. +3. Update the auto-generated code and docs (i.e., `make update`). +4. Make sure the codebase is in a sane state by executing automated + verification and testing with `make` (i.e., `test`, `verify`, + `build`, etc.). +5. Commit and push the local changes to an `ocp-next` branch in a + remote repository. +6. Update or create the Pull Request. + +If the job fails to execute any of the steps above, the watcher is +responsible for fixing whatever is preventing the job from +succeeding. Examples of fixes include: + +1. Making a code change to the downstream patch to address a code + conflict. +2. Creating an upstream PR to correct any breaking change. +3. Creating a new downstream patch to rectify an incorrect assumption + in our operators. + +A prototype of this workflow is available +[here](https://github.com/bertinatto/ocp-next/blob/master/next.go). + +### Open Questions + +This proposal assumes that all downstream patches are located in a +specific directory, such as the `patches` directory in [this +prototype](https://github.com/bertinatto/ocp-next/tree/master/patches). + +However, it is unclear how we can ensure that this directory remains +up-to-date with the latest patches imported into our +openshift/kubernetes fork. + +Here are a few potential options to address this issue: + +1. Establish the patches directory as the source of truth for all + downstream patches. This would require teams to ensure that their + patches are imported into this directory whenever they introduce a + new carry patch. It may be beneficial to implement some automation + to streamline this process. +2. Automate the process of listing and applying patches from the git + log, as described + [here](https://github.com/openshift/kubernetes/blob/master/REBASE.openshift.md#creating-a-spreadsheet-of-carry-commits-from-the-previous-release). + In case the automation fails to cherry-pick a specific patch, it + can then search for the patch in the patches directory. This is the + approach taken by the tooling currently under development + [here](https://github.com/soltysh/rebase). + +## Conclusion + +The proposed approach involves establishing an OCP branch with an +updated Kubernetes codebase, daily application of downstream patches, +and the setup of a CI job to detect code conflicts and and failures in +generated code. 
+ +The implementation of this proposal aims to improve the rebase process +and proactively address potential issues the are currently only +detected when the rebase process starts. + +The ultimate goal is to land the rebase PR considerably early, +potentially aligning with the release of the upstream GA tag. This +will allow us to expose updated features and fixes from upstream to +our OCP teams considerably earlier than we do today. From 4e0d5bd9993a6fad403eb4c8e41a5a7caa62a7a4 Mon Sep 17 00:00:00 2001 From: Hemant Kumar Date: Tue, 2 Jul 2024 15:02:35 -0400 Subject: [PATCH 24/53] Update enhancements/storage/csi-driver-operator-merge.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Roman Bednář --- enhancements/storage/csi-driver-operator-merge.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/enhancements/storage/csi-driver-operator-merge.md b/enhancements/storage/csi-driver-operator-merge.md index bd4b24ad59..af1d470807 100644 --- a/enhancements/storage/csi-driver-operator-merge.md +++ b/enhancements/storage/csi-driver-operator-merge.md @@ -538,7 +538,7 @@ Please note the changes to `ocp-build-data`should be merged almost same time as Once migration is complete, we should perform following post migration steps to ensure that we are not left over with legacy stuff: 1. Mark existing `openshift/--operator` repository as deprecated. -2. Ensure that we have test manifest available in `test/e2e2` directory. +2. Ensure that we have test manifest available in `test/e2e` directory. 3. Make changes into `release` repository so as it longer relies on anything from `legacy` directory. See - https://github.com/openshift/release/pull/49655 for example. 4. Remove code from `vendor/legacy` in `csi-operator` repository. From 75ef5474b783085718f1a02b9c75e791cb2d07cc Mon Sep 17 00:00:00 2001 From: Hemant Kumar Date: Tue, 2 Jul 2024 15:42:11 -0400 Subject: [PATCH 25/53] Update enhancements/storage/csi-driver-operator-merge.md Co-authored-by: Jonathan Dobson --- enhancements/storage/csi-driver-operator-merge.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/enhancements/storage/csi-driver-operator-merge.md b/enhancements/storage/csi-driver-operator-merge.md index af1d470807..3ec64e04c4 100644 --- a/enhancements/storage/csi-driver-operator-merge.md +++ b/enhancements/storage/csi-driver-operator-merge.md @@ -528,7 +528,7 @@ So in previous section we merely copied existing code from operator’s own repo But once your operator has been changed to conform to new code in csi-operator repo, You need to perform following additional steps: -1. Make sure that `Dockerfile.` at top of the `csi-operator` tree refers to new location of code and not older `legacy/` location.See example of existing Dockerfiles. +1. Make sure that `Dockerfile.` at top of the `csi-operator` tree refers to new location of code and not older `legacy/` location. See example of existing Dockerfiles. 2. After your changes to `csi-operator` are merged, you should remove the old location from cachito - https://github.com/openshift-eng/ocp-build-data/pull/4219 Please note the changes to `ocp-build-data`should be merged almost same time as changes into `csi-operator`'s Dockerfile are merged, otherwise we risk builds from breaking. 
From 9660e8c029786f59ce54b56f5e2dbfb133f558ad Mon Sep 17 00:00:00 2001 From: Hemant Kumar Date: Tue, 2 Jul 2024 15:42:51 -0400 Subject: [PATCH 26/53] Add additional docs for OLM based operators --- enhancements/storage/csi-driver-operator-merge.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/enhancements/storage/csi-driver-operator-merge.md b/enhancements/storage/csi-driver-operator-merge.md index 3ec64e04c4..038a5848af 100644 --- a/enhancements/storage/csi-driver-operator-merge.md +++ b/enhancements/storage/csi-driver-operator-merge.md @@ -488,7 +488,7 @@ git subtree push --prefix legacy/azure-disk-csi-driver-operator https://github.c ### Add Dockerfiles for building images from new location -Place a `Dockerfile.` and `Dockerfile..test` at top of csi-operator tree and make sure that you are able to build an image of the operator from csi-operator repository. +Place a `Dockerfile.` at top of csi-operator tree and make sure that you are able to build an image of the operator from csi-operator repository. ### Update openshift/release to build image from new location Make a PR to openshift/release repository to build the operator from csi-operator. For example - https://github.com/openshift/release/pull/46233. @@ -514,6 +514,8 @@ azure-disk-csi-driver-operator= \ oc adm release extract --command openshift-install quay.io/jsafrane/scratch:release1 ``` +This step is only applicable for CVO based operators and not OLM based operators. For OLM based operator - either an image can be built locally and deployed using your personal index image or you can ask ART team for a scratch image when you open `ocp-build-data` PR and proceed to include that image in your personal index image. + ### Co-ordinating merges in ocp-build-data and release repository Both PRs in openshift/release and ocp-build-data must be merged +/- at the same time. There is a robot that syncs some data from ocp-build-data to openshift/release and actually breaks things when these two repos use different source repository to build images. From 51e9eaeb6522c953367d67e04828993619239906 Mon Sep 17 00:00:00 2001 From: Michail Resvanis Date: Tue, 30 Apr 2024 16:58:02 +0200 Subject: [PATCH 27/53] Add image-based installer enhancement Signed-off-by: Michail Resvanis --- .../installer/image-based-installer.md | 371 ++++++++++++++++++ 1 file changed, 371 insertions(+) create mode 100644 enhancements/installer/image-based-installer.md diff --git a/enhancements/installer/image-based-installer.md b/enhancements/installer/image-based-installer.md new file mode 100644 index 0000000000..0460a537c3 --- /dev/null +++ b/enhancements/installer/image-based-installer.md @@ -0,0 +1,371 @@ +--- +title: image-based-installer +authors: + - "@mresvanis" + - "@eranco74" +reviewers: + - "@patrickdillon" + - "@romfreiman" +approvers: + - "@zaneb" +api-approvers: + - None +creation-date: 2024-04-30 +last-updated: 2024-04-30 +tracking-link: + - https://issues.redhat.com/browse/MGMT-17600 +see-also: + - "/enhancements/agent-installer/agent-based-installer.md" +replaces: N/A +superseded-by: N/A +--- + +# Image-based Installer + +## Summary + +The Image-based Installer (IBI) is an installation method for on-premise +single-node OpenShift (SNO) clusters, that will use the following to run on the +hosts that are to become SNO clusters: + +1. 
A seed [OCI image](https://github.com/opencontainers/image-spec/blob/main/spec.md) + [generated](https://github.com/openshift-kni/lifecycle-agent/blob/main/docs/seed-image-generation.md) + via the [lifecycle-agent](https://github.com/openshift-kni/lifecycle-agent) + operator from a SNO system provisioned with the target OpenShift version. +2. A bootable installation ISO generated to provision multiple SNO clusters. +3. A configuration ISO generated to be used in a single SNO cluster installation. + +Each of the aforementioned artifacts is potentially generated by a different +user persona. This enhancement focuses on the last two artifacts, which the +users will generate using a command-line tool. The installation ISO will be +configured with a [seed OCI image](https://github.com/openshift-kni/lifecycle-agent/blob/main/docs/seed-image-generation.md). +The latter is installed onto a SNO as a new [ostree stateroot](https://ostreedev.github.io/ostree/deployment/#stateroot-aka-osname-group-of-deployments-that-share-var) +and includes, among other files, the `/var`, `/etc` (with specific exclusions) +and `/ostree/repo` directories, which contain the target OpenShift version and +most of its configuration, but no site specific configuration and amounts +approximately to just over 1GB in size. The configuration ISO will contain the +site specific configuration data (e.g. cluster name, domain and crypto +objects), which need to be set up per cluster and are derived mainly from the +OpenShift installer [install config](https://github.com/openshift/installer/tree/release-4.15/pkg/asset/installconfig). + +## Motivation + +The primary motivation for relocatable SNO is the fast deployment of single-node +OpenShift. Telecommunications providers continue to deploy OpenShift at the Far +Edge. The acceleration of this adoption and the nature of existing +Telecommunication infrastructure and processes drive the need to improve +OpenShift provisioning speed at the Far Edge site and the simplicity of +preparation and deployment of Far Edge clusters, at scale. + +IBI provides users with such speed and simplicity, but it currently needs the +[multicluster engine](https://docs.openshift.com/container-platform/4.15/architecture/mce-overview-ocp.html) +and/or the [Image-based Install operator](https://github.com/openshift/image-based-install-operator) +to generate the required installation and configuration artifacts. We would like +to enable users to generate the latter intuitively and independently, using +their own automation or even manual intervention to boot the host. + +### User Stories + +- As a user in a on-premise disconnected environment with no existing management + cluster, I want to deploy a single-node OpenShift cluster using a [seed OCI image](https://github.com/openshift-kni/lifecycle-agent/blob/main/docs/seed-image-generation.md) + and my own automation for provisioning. +- As a user in a on-premise disconnected environment with no existing management + cluster, I want to generate a bootable installation ISO, common for a large + number of hosts that are to be provisioned as single-node OpenShift + clusters, using a [seed OCI image](https://github.com/openshift-kni/lifecycle-agent/blob/main/docs/seed-image-generation.md). 
+- As a user in a on-premise disconnected environment with no existing management + cluster, I want to generate a configuration ISO for a specific host that is to + be provisioned as a single-node OpenShift cluster using an already generated + bootable installation ISO containing a [seed OCI image](https://github.com/openshift-kni/lifecycle-agent/blob/main/docs/seed-image-generation.md). + +### Goals + +- Install clusters with single-node topology. +- Install clusters in fully disconnected environments. +- Perform reproducible cluster builds from configuration artifacts. +- Require no machines in the cluster environment other than the one to be the + single node of the cluster. +- Be agnostic to the tools used to provision machines, so that users can + leverage their own tooling and provisioning. + +### Non-Goals + +- Replace any other OpenShift installation method in any capacity. +- Generate image formats other than ISO. +- Automate booting of the ISO image on the machines. +- Support installation configurations for cloud-based platforms. + +## Proposal + +A command-line tool will enable users to build a single custom RHCOS seed image +in ISO format, containing the components needed to provision multiple +single-node OpenShift clusters from that single ISO and multiple site +configuration ISO images, one per cluster to be installed. + +The command-line tool will download the base RHCOS ISO, create an [Ignition](https://coreos.github.io/ignition/) +file with generic configuration data (i.e. configuration that is going to be +included in all clusters to be installed with that ISO) and generate an +image-based installation ISO. The Ignition file will configure the live ISO such +that once the machine is booted with the latter, it will install RHCOS to the +installation disk, mount the installation disk, restore the single-node +OpenShift from the [seed OCI image](https://github.com/openshift-kni/lifecycle-agent/blob/main/docs/seed-image-generation.md) +and optionally precache all release container images under the +`/var/lib/containers` directory. + +The installation ISO approach is very similar to what is already implemented by +the functionality of the [Agent-based Installer](/enhancements/agent-installer-agent-based-installer.md), although +IBI differs from the OpenShift Agent-based Installer in several key aspects: + +- while the Agent-based Installer may offer flexibility and versatility in certain scenarios, + it may not meet the stringent time constraints and requirements of far-edge deployments + in the telecommunications industry due to the inherently long installation process, + exacerbated by low bandwidth and high packet latency. +- with the Agent-based Installer all cluster configuration needs to be provided upfront + during the generation of the ISO image, while with IBI the cluster + configuration is provided in an additional step. + +IBI offers key advantages, where fast and reliable deployment at the edge is +crucial. By generating ISO images containing all the necessary components, +IBI significantly accelerates deployment times. Moreover, unlike the Agent-based +Installer, the image-based approach allows for cluster configuration to be +supplied upon deployment at the edge, rather than during the initial ISO +generation process. This flexibility enables operators to use a single generic +image for installing multiple clusters, streamlining the deployment process and +reducing the need for multiple customized ISO images. 
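+
+To make the installation flow described above more concrete, the sketch below outlines the kind of steps the
+Ignition-configured install service could run on the live system. This is illustrative only: the exact service layout,
+the seed-restore tooling and the names used here (the seed image reference, the installation disk and the
+`restore-seed-stateroot` helper) are assumptions rather than a finalized interface.
+
+```bash
+# Illustrative sketch only; names and flags are assumptions, not the final implementation.
+set -euo pipefail
+
+SEED_IMAGE="quay.io/example/sno-seed:4.16"   # seed OCI image referenced by the ISO configuration (assumed name)
+INSTALL_DISK="/dev/sda"                      # installation disk chosen when the ISO was generated (assumed)
+
+# 1. Write RHCOS from the live ISO to the installation disk.
+coreos-installer install "${INSTALL_DISK}"
+
+# 2. Mount the installed system so the seed content can be restored into it.
+mount /dev/disk/by-partlabel/root /mnt
+
+# 3. Pull the seed OCI image and restore it as a new ostree stateroot on the target
+#    (represented here by a hypothetical helper).
+podman pull "${SEED_IMAGE}"
+# restore-seed-stateroot "${SEED_IMAGE}" /mnt
+
+# 4. Optionally precache the release container images under /var/lib/containers on the target.
+```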
+ +The command-line tool will also support generating a configuration ISO with all +the site specific configuration data for the cluster to be installed provided as +input. The configuration ISO contents are the following: +- `ClusterInfo` (cluster name, base domain, hostname, node IP) +- SSH `authorized_keys` +- Pull Secret +- Extra Manifests +- Generated keys and certs (compatible with the generated admin `kubeconfig`) +- Static networking configuration + +The site specific configuration data will be generated according to information +provided in the `install-config.yaml` and the manifests provided in the +installation directory as input. To complete the installation at the edge site: +- the cluster configuration for the edge location can be delivered by copying + the config ISO content onto the node and placing it under `/opt/openshift/cluster-configuration/`. +- the cluster configuration can also be delivered using an attached ISO, a + systemd service running on the host pre-installed and IBI will + mount that ISO (identified by a known label) and copy the cluster configuration + to `/opt/openshift/cluster-configuration/`. +- the cluster configuration data on the disk will be used to configure the + cluster and allow OCP to start successfully. + +### Workflow Description + +The image-based installation high-level flow consists of the following +stages, each of which is performed by different users: + +1. Generate a [seed OCI image](https://github.com/openshift-kni/lifecycle-agent/blob/main/docs/seed-image-generation.md) + via the [Lifecycle Agent operator](https://github.com/openshift-kni/lifecycle-agent) + and is set up with the desired OpenShift version. This OCI image will be used + for multiple SNO cluster installations. +2. Generate a bootable installation ISO using the seed OCI image generated in + the previous stage. This ISO will be also used for multiple SNO cluster + installations. +3. Extract the installation ISO to the machine's disk at the factory. +4. Ship the machine to the edge site. +5. Generate a configuration ISO at the edge site, which will contain site + specific configuration for a single SNO cluster installation. +6. Extract the configuration ISO contents under `/opt/openshift/cluster-configuration/` + or attach the former (which a pre-installed systemd service will identify by + a known label, mount and copy its contents to the aforementioned filesystem + location), then boot the machine at the edge site to be provisioned as a SNO + cluster. + +### API Extensions + +N/A + +### Topology Considerations + +#### Hypershift / Hosted Control Planes + +N/A + +#### Standalone Clusters + +N/A + +#### Single-node Deployments or MicroShift + +The Image-based Installer targets single-node OpenShift deployments. + +### Implementation Details/Notes/Constraints + +Since we must allow users to provision hosts themselves, either manually or +using automated tooling of their choice, the ISO format offers the widest range +of compatibility. Building a single ISO to boot multiple hosts makes it +considerably easier for the user to manage. The additional site configuration +ISO is necessary for configuring each cluster securely and independently. + +The user, before running the Image-based Installer, must generate a [seed OCI image](https://github.com/openshift-kni/lifecycle-agent/blob/main/docs/seed-image-generation.md) +via the [Lifecycle Agent SeedGenerator Custom Resouce (CR)](https://github.com/openshift-kni/lifecycle-agent/blob/main/docs/seed-image-generation.md). 
+The prerequisites to generating a seed OCI image are the following: + +- an already provisioned single-node OpenShift cluster (seed SNO). + - The CPU topology of that host must align with the target host(s), i.e. it + should have the same number of cores and the same tuned performance + configuration (ie. reserved CPUs). +- the [Lifecycle Agent](https://github.com/openshift-kni/lifecycle-agent/tree/main) + operator must be installed on the seed SNO. + +### Risks and Mitigations + +N/A + +### Drawbacks + +N/A + +## Open Questions [optional] + +- Should the command-line tool that generates the installation ISO be a subcommand + of the OpenShift installer, or a standalone binary? + + Having the functionality provided by the command-line tool in the OpenShift + installer would be beneficial to the users, as the former refers to the + provisioning of single-node OpenShift clusters and it should follow the + OpenShift versions (i.e. the installation ISO generating tool for a specific + OpenShift version should be used to install a seed OCI image with the same + OpenShift version). It generates the required installation artifacts in the + same way as the [Agent-based Installer](/enhancements/agent-installer/agent-based-installer.md) + `openshift-install agent create image` command. + +- Should the command-line tool that generates the configuration ISO be a subcommand + of the OpenShift installer, or a standalone binary? + + Having the functionality provided by the command-line tool in the OpenShift + installer would be a natural addition to the latter, as the former refers to + the provisioning of single-node OpenShift clusters and consumes the OpenShift + installer `install-config.yaml`. It generates the required installation + artifacts in the same way as the [Agent-based Installer](/enhancements/agent-installer/agent-based-installer.md) + `openshift-install agent create config-image` command. + +## Test Plan + +The Image-based Installer will be covered by end-to-end testing using virtual +machines (in a baremetal configuration), automated by some variation on the +metal platform [dev-scripts](https://github.com/openshift-metal3/dev-scripts/#readme). +This is similar to the testing of the Agent-based Installer, the baremetal IPI +and assisted installation flows. + +## Graduation Criteria + +**Note:** *Section not required until targeted at a release.* + +TBD + +### Dev Preview -> Tech Preview + +- Ability to utilize the enhancement end to end +- End user documentation, relative API stability +- Sufficient test coverage +- Gather feedback from users rather than just developers +- Enumerate service level indicators (SLIs), expose SLIs as metrics +- Write symptoms-based alerts for the component(s) + +### Tech Preview -> GA + +- More testing (upgrade, downgrade, scale) +- Sufficient time for feedback +- Available by default +- Backhaul SLI telemetry +- Document SLOs for the component +- Conduct load testing +- User facing documentation created in [openshift-docs](https://github.com/openshift/openshift-docs/) + +**For non-optional features moving to GA, the graduation criteria must include +end to end tests.** + +### Removing a deprecated feature + +N/A + +## Upgrade / Downgrade Strategy + +N/A + +## Version Skew Strategy + +### Configuration ISO and Seed OCI Image + +The IBI configuration ISO contains the [cluster configuration file](https://github.com/openshift-kni/lifecycle-agent/blob/release-4.16/api/seedreconfig/seedreconfig.go). 
+The component responsible for parsing and using the latter is part of the seed +OCI image, which in turn is contained in the IBI installation ISO. That +component validates its compatibility with the cluster configuration file [version](https://github.com/openshift-kni/lifecycle-agent/blob/release-4.16/api/seedreconfig/seedreconfig.go#L6) +during the image-based installation. In the case of incompatibility between the +cluster configuration version and the seed OCI image version, the image-based +installation will fail with the respective error message. + +### RHCOS ISO and Seed OCI Image + +The RHCOS base ISO, which is contained in the IBI installation ISO and derived +from the OpenShift release image, has currently no strict requirements to be +tied to the seed OCI image OpenShift version. The features and configuration of +the underlying tools required to successfully complete an image-based +installation are Podman, SELinux and ostree. In order to remove the risk of +version skew between the RHCOS ISO and the seed OCI image, we plan on restricting +users to generating an installation ISO using the same RHCOS base ISO version as +the one contained in the seed OCI image OpenShift version. + +### RHCOS ISO and Ignition + +Since the IBI installation ISO is customized via an Ignition file, we need to +ensure that the RHCOS base ISO version is compatible with the latter. To that +end, the RHCOS base ISO will be fetched by looking into the [coreos stream metadata](https://github.com/openshift/installer/blob/master/data/data/coreos/rhcos.json) +embedded in the OpenShift installer binary and it will either be extracted from +the OpenShift release image, which contains the same version, or be downloaded +from the mirror URL in the metadata. + +## Operational Aspects of API Extensions + +N/A + +## Support Procedures + +N/A + +## Alternatives + +### Downloading the installation ISO from Assisted Installer SaaS + +The Assisted Installer SaaS could be used to generate and serve the image-based +installation ISO, which would potentially enhance the user experience, compared +to configuring and executing a command-line tool. In addition, even in a fully +disconnected environment, the user could generate the ISO via the Assisted +Installer service and then carry it over to the disconnected environment. + +However, for the Assisted Installer service this would be a completely new +scope, as the image-based installation ISO is not generated per cluster, but for +multiple SNO clusters with similar underlying hardware (due to the requirements +of the IBI flow) and its configuration input is different than what is currently +supported. + +### Building the installation ISO in the Lifecycle Agent Operator + +The Lifecycle Agent Operator could be used to generate and serve the image-based +installation ISO, as this is the component used to generate the seed OCI +image, which is the basis of the IBI flow. + +However, the following reasons constitute a separate command-line tool a better +fit: + +- the seed OCI image is generated by a different user persona (e.g. + release/certification/other team) than the one to generate the image-based + installation ISO. +- the same seed OCI image can be used to generate multiple image-based + installation ISOs (e.g. with different installation disks). +- having 2 (installation ISO and configuration ISO) out of 3 (the 3rd is the + seed OCI image) IBI artifacts generated by the same command-line tool + simplifies the user experience. 
+- as a future enhancement we can provide generic seed OCI images, in order to skip
+  the seed OCI image generation step altogether.
+
+## Infrastructure Needed [optional]
+
+N/A

From 08f10acf7c2e57dcfa5c2bbf5ca48d3a457d56d5 Mon Sep 17 00:00:00 2001
From: Mat Kowalski
Date: Mon, 28 Aug 2023 15:09:22 +0200
Subject: [PATCH 28/53] Mutable dual-stack VIPs

---
 enhancements/network/on-prem-mutable-vips.md | 236 +++++++++++++++++++
 1 file changed, 236 insertions(+)
 create mode 100644 enhancements/network/on-prem-mutable-vips.md

diff --git a/enhancements/network/on-prem-mutable-vips.md b/enhancements/network/on-prem-mutable-vips.md
new file mode 100644
index 0000000000..5ab3acf030
--- /dev/null
+++ b/enhancements/network/on-prem-mutable-vips.md
@@ -0,0 +1,236 @@
+---
+title: on-prem-mutable-vips
+authors:
+  - "@mkowalski"
+reviewers:
+  - "@JoelSpeed, to review API"
+  - "@sinnykumari, to review MCO"
+  - "@cybertron, to peer-review OPNET"
+  - "@cgwalters"
+approvers:
+  - "@JoelSpeed"
+api-approvers:
+  - "@danwinship"
+  - "@JoelSpeed"
+creation-date: 2023-08-28
+last-updated: 2023-08-28
+tracking-link:
+  - https://issues.redhat.com/browse/OCPSTRAT-178
+  - https://issues.redhat.com/browse/OPNET-340
+  - https://issues.redhat.com/browse/OPNET-80
+see-also:
+  - "/enhancements/on-prem-dual-stack-vips.md"
+replaces:
+superseded-by:
+---
+
+# On-Prem Mutable VIPs
+
+## Summary
+
+Originally the on-prem loadbalancer architecture supported only single-stack IPv4 or IPv6 deployments. Later on, after dual-stack support was added to the on-prem deployments, work has been done to allow installing clusters with a pair of virtual IPs (one per IP stack).
+This work, however, only covered installation time, leaving clusters originally installed as single-stack and later converted to dual-stack out of scope.
+
+This design proposes a change that will allow adding a second pair of virtual IPs to clusters that became dual-stack only during their lifetime.
+
+## Motivation
+
+We have customers who installed clusters before the dual-stack VIP feature existed. Currently they have no way of migrating because the new feature is available only during the initial installation of the cluster.
+
+### User Stories
+
+* As a deployer of a dual-stack OpenShift cluster, I want to access the API using both IPv4 and IPv6.
+
+* As a deployer of a dual-stack OpenShift cluster, I want to access the Ingress using both IPv4 and IPv6.
+
+### Goals
+
+* Allow adding a second pair of virtual IPs to an already installed dual-stack cluster.
+
+* Allow deleting a second pair of virtual IPs on an already installed dual-stack cluster.
+
+### Non-Goals
+
+* Modifying existing virtual IP configuration.
+
+* Modifying virtual IP configuration after the second pair of VIPs has been added. We only want to cover the "create" and "delete" operations, and only for the second pair of VIPs.
+
+* Configuration of any VIPs beyond a second pair for the second IP stack. MetalLB is a better solution for creating arbitrary loadbalancers.
+
+## Proposal
+
+### Workflow Description
+
+The proposed workflow after implementing the feature would look as described below:
+
+1. Administrator of a dual-stack cluster with single-stack VIPs wants to add a second pair for the second IP stack configured.
+
+1. Administrator edits the Infrastructure CR named `cluster` by changing the `spec.platformSpec.[*].apiServerInternalIPs` and `spec.platformSpec.[*].ingressIPs` fields. For dual-stack VIPs the sample `platformSpec` would look like shown below under "Sample platformSpec".
+
+1. 
Cluster Network Operator picks up the modification of the object and compares the values with `spec.platformStatus.[*].apiServerInternalIPs` and `spec.platformStatus.[*].ingressIPs`.
+
+1. After validating that the requested change is allowed (i.e. it conforms with the goals and non-goals as well as with the validations performed in o/installer), the change is propagated down to the keepalived template and configuration file.
+
+1. After the keepalived configuration is changed, the service is restarted or reloaded to apply the changes.
+
+### Sample platformSpec
+
+```yaml
+platformSpec:
+  baremetal:
+    apiServerInternalIPs:
+    - "192.0.2.100"
+    - "2001:0DB8::100"
+    ingressIPs:
+    - "192.0.2.101"
+    - "2001:0DB8::101"
+```
+
+### API Extensions
+
+New fields will have to be added to the platform spec section for [baremetal](https://github.com/openshift/api/blob/938af62eda38e539488d6193e7af292e47a09a5e/config/v1/types_infrastructure.go#L694), [vsphere](https://github.com/openshift/api/blob/938af62eda38e539488d6193e7af292e47a09a5e/config/v1/types_infrastructure.go#L1089)
+and [OpenStack](https://github.com/openshift/api/blob/938af62eda38e539488d6193e7af292e47a09a5e/config/v1/types_infrastructure.go#L772). In the first implementation we aim to cover only baremetal.
+
+### Topology Considerations
+
+#### Hypershift / Hosted Control Planes
+
+The change is not designed for Hypershift.
+
+#### Standalone Clusters
+
+No special considerations.
+
+#### Single-node Deployments or MicroShift
+
+No special considerations.
+
+### Implementation Details/Notes/Constraints
+
+Because of the expertise and the team owning this feature, we think the best place for implementing the logic is the Cluster Network Operator. CNO, however, is not an owner of the `cluster` Infrastructure CR, which at first sight may seem like a challenge.
+
+The Kubernetes API provides a solution for this: we can leverage Watches (with a Predicate for optimization) to allow a controller inside CNO to watch an object it does not own. This in fact already happens inside CNO
+in the [infrastructureconfig controller](https://github.com/openshift/cluster-network-operator/blob/2005bcd8c93de5bffc05c9c943b51386007f6b9a/pkg/controller/infrastructureconfig/infrastructureconfig_controller.go#L47). It was added as part of the initial dual-stack VIPs implementation. From the discussion with multiple teams back then it was decided that CNO is the best place to implement it.
+
+The controller will implement logic that validates whether the change requested by the user is allowed, since in the first implementation we only want to allow adding a second pair of VIPs and forbid any modifications of the already configured ones.
+
+The current values of `spec.platformStatus.[*]` are propagated during installation [by the installer](https://github.com/openshift/installer/blob/186cb916c388d29ed3f6ef4e71d9fda409f30bdf/pkg/asset/manifests/infrastructure.go#L160). After implementing this feature, the same function would also set `spec.platformSpec.[*]` so that the cluster is installed with a correct configuration from the very beginning.
+
+When modifying `spec.platformSpec.[*]` the CNO controller would need to propagate this change down to the [keepalived templates managed by the Machine Config Operator](https://github.com/openshift/machine-config-operator/blob/2b31b5b58f0d7d9fe6c3e331e4b3f01c9a1bd00c/templates/common/on-prem/files/keepalived.yaml#L46-L49).
As MCO's ControllerConfig +[already observes changes in the PlatformStatus](https://github.com/openshift/machine-config-operator/blob/5b821a279c88fee1cc1886a6cf1ec774891a2258/lib/resourcemerge/machineconfig.go#L100-L105), no additional changes are needed. + +One of the tasks of CNO is validating if the change is correct (e.g. VIP needs to belong to the node network). To do so, CNO needs to access various network configuration parameters with `MachineNetworks` being one of them. To facilitate it, we store it as part of the `spec.platformStatus.[*]` and `spec.platformSpec.[*]`. + +### Risks and Mitigations + +Minimal risk. The dual-stack VIPs feature is already used. This is just adding an ability to add this feature to an already existing cluster. Because modifying and deleting VIPs is out of scope of this enhancement, the operations with the biggest potential of causing an issue do not need to be covered. + +### Drawbacks + +Because we do not implement Update operation, the main drawback is a scenario when user adds an incorrect address to be configured. We will only allow deleting a second entry, therefore user will need to first delete and then add again a correct pair of VIPs. Because fixing such a typo will require two reboots (one for deleting and one for adding), +enablement of this feature should be performed with some level of care by the end user. + +## Design Details + +This feature will require changes in a few different components: + +* Installer - New fields will need to be populated during the installation. This is a safe part as installer is already populating `status` and now will populate `status` and `spec`. + +* API - New fields will need to be added to the platform spec section of the infrastructure object. + +* Cluster Network Operator - New object will need to be watched by one of its controllers. The core logic of validating the requested change will be implemented there. + +* Machine Config Operator - Definition of the runtimecfg Pod will need to be rendered again when the VIP configuration changes. + +* baremetal-runtimecfg - The code rendering the keepalived template belongs to this repo. As the enhancement does not touch the rendering part no big changes are expected in the runtimecfg. This component is used to re-render configuration in case something changes in the cluster (e.g. new nodes are added) so the fact we modify configuration is not a new feature from its perspective. +Small changes may be needed to acommodate to the new user scenario to provide good user experience. + +### openshift/api + +The new structure of platformSpec would look like this: + +```yaml +platformSpec: + properties: + baremetal: + description: BareMetal contains settings specific to the BareMetal platform. + type: object + properties: + apiServerInternalIPs: + description: apiServerInternalIPs are the IP addresses to... + type: array + ingressIPs: + description: ingressIPs are the external IPs which... + type: array + machineNetworks: + description: IP networks used to connect all the OpenShift cluster nodes. + type: array + ... +``` + +This format (i.e. usage of arrays) is already used by platformStatus so this does not introduce a new type nor schema. + +### Cluster Network Operator + +There already exists `infrastructureconfig_controller` inside CNO that watches for changes in `configv1.Infrastructure` for on-prem platforms. Because of this we do not need to create a new one nor change any fundaments, we only need to extend its logic. 
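+
+As an illustration only, below is a minimal sketch of the kind of check the extended controller could perform when comparing the desired VIPs in `platformSpec` against the ones already reported in `platformStatus`. The function name and the exact rule set are assumptions made for this example, not the final CNO code; the real implementation also has to validate that the VIPs belong to the machine networks.
+
+```go
+package infrastructureconfig
+
+import (
+	"fmt"
+	"net"
+)
+
+// validateVIPChange is a hypothetical helper: the only supported transitions are
+// keeping the current VIPs, appending a single VIP of the other IP family, or
+// dropping a previously added second VIP. The first VIP must never change.
+func validateVIPChange(currentStatus, desiredSpec []string) error {
+	if len(desiredSpec) == 0 {
+		return nil // empty spec means "keep whatever is already in status"
+	}
+	if len(desiredSpec) > 2 {
+		return fmt.Errorf("at most 2 virtual IPs (one per IP family) are supported, got %d", len(desiredSpec))
+	}
+	if len(currentStatus) > 0 && desiredSpec[0] != currentStatus[0] {
+		return fmt.Errorf("modifying the existing VIP %q is not supported", currentStatus[0])
+	}
+	if len(desiredSpec) == 2 {
+		first, second := net.ParseIP(desiredSpec[0]), net.ParseIP(desiredSpec[1])
+		if first == nil || second == nil {
+			return fmt.Errorf("virtual IPs must be valid IP addresses")
+		}
+		if (first.To4() != nil) == (second.To4() != nil) {
+			return fmt.Errorf("the second VIP must belong to the other IP family")
+		}
+	}
+	return nil
+}
+```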
+ +Currently the controller already reconciles on the `CreateEvent` and `UpdateEvent` so we will implement a set of functions that compare and validate `platformSpec` with `platformStatus` and do what's needed. + +### Machine Config Operator + +[Keepalived templates managed by the Machine Config Operator](https://github.com/openshift/machine-config-operator/blob/2b31b5b58f0d7d9fe6c3e331e4b3f01c9a1bd00c/templates/common/on-prem/files/keepalived.yaml#L46-L49) currently use the `PlatformStatus` fields. CNO will be responsible for keeping `PlatformSpec` and `PlatformStatus` in sync. +As MCO [already grabs values from the latter](https://github.com/openshift/machine-config-operator/blob/2b31b5b58f0d7d9fe6c3e331e4b3f01c9a1bd00c/pkg/controller/template/render.go#L551-L575), no changes are needed. + +It is important to note that Machine Config Operator triggers a reboot whenever configuration of the system changes. We already have a history of introducing changes into MCO and forcefully preventing reboot (i.e. single-stack to dual-stack conversion) but this proven to be problematic in a long run (e.g. https://issues.redhat.com/browse/OCPBUGS-15910) and is now being reverted. +Unless there is a strong push towards the mutable VIPs being rebootless, we should follow the default MCO behaviour and let it reboot. + +Similarly to how it happens today, we are not covering a scenario of updating the PlatformStatus only after the change is rolled out to all the nodes. Today installer sets the PlatformStatus as soon as the configuration is desired and keeps it even if the keepalived ultimately did not apply the config. For simplicity we are keeping it that way, effectively meaning that PlatformStatus +and PlatformSpec will contain always the same content. The main reason is that implementing a continuous watch for keepalived runtime would require us to implement a new and relatively complicated controller that tracks the VIP configuration at runtime. Since we have not created one till now, it remains outside of the scope of this enhancement. + +### baremetal-runtimecfg + +Keepalived configuration is rendered based on the command-line parameters provided to the baremetal-runtimecfg. Those come from the Pod definition that is managed by the MCO. Because of that it is valid to say that baremetal-runtimecfg is oblivious to whether `PlatformSpec` or `PlatformStatus` stores the desired config. + +## Test Plan + +We will need to add a few tests that will perform the operation of adding a pair of VIPs to an already existing dual-stack clusters. Once done, we can reuse the already existing steps that test dual-stack clusters with dual-stack VIPs. + +## Graduation Criteria +We do not anticipate needing a graduation process for this feature. The internal loadbalancer implementation has been around for a number of releases at this point and we are just extending it. + +### Dev Preview -> Tech Preview + +NA + +### Tech Preview -> GA + +NA + +### Removing a deprecated feature + +NA + +## Upgrade / Downgrade Strategy + +Upgrades and downgrades will be handled the same way they are for the current internal loadbalancer implementation. On upgrade the existing VIP configuration will be maintained. We will not automatically add additional VIPs to the cluster on upgrade. If an administrator of a dual-stack cluster wants to use the new functionality that will need to happen as a separate operation from the upgrade. 
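+
+To make the expected upgrade behaviour concrete, below is a rough sketch of the spec-from-status propagation described in the next paragraph. The types and the function name are simplified stand-ins for illustration; the real code operates on `configv1.Infrastructure` and is followed by an update call against the API server.
+
+```go
+package infrastructureconfig
+
+// vipFields is a simplified stand-in for the VIP-related fields of the
+// Infrastructure platform spec/status used by this enhancement.
+type vipFields struct {
+	APIServerInternalIPs []string
+	IngressIPs           []string
+}
+
+// seedSpecFromStatus defaults empty spec fields from the currently reported
+// status values, so that clusters upgraded to this version start with a
+// populated platformSpec. It returns true when an update is required.
+func seedSpecFromStatus(spec, status *vipFields) bool {
+	changed := false
+	if len(spec.APIServerInternalIPs) == 0 && len(status.APIServerInternalIPs) > 0 {
+		spec.APIServerInternalIPs = append([]string(nil), status.APIServerInternalIPs...)
+		changed = true
+	}
+	if len(spec.IngressIPs) == 0 && len(status.IngressIPs) > 0 {
+		spec.IngressIPs = append([]string(nil), status.IngressIPs...)
+		changed = true
+	}
+	return changed
+}
+```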
+ +When upgrading a cluster to the version that supports this functionality we will need to propagate new `PlatformSpec` fields from the current `PlatformStatus` fields. We have experience with this as when introducing dual-stack VIPs +the [respective synchronization](https://github.com/openshift/cluster-network-operator/blob/2005bcd8c93de5bffc05c9c943b51386007f6b9a/pkg/controller/infrastructureconfig/sync_vips.go#L18-L22) has been implemented inside CNO. + +## Version Skew Strategy + +The keepalived configuration does not change between the version of the cluster and also when introducing this feature. There is no coordination needed and as mentioned above, enabling this functionality should be a separate operation from upgrading the cluster. + +## Operational Aspects of API Extensions + +NA + +## Support Procedures + +NA + +## Alternatives + +The original enhancement for dual-stack VIPs introduced an idea of creating a separate instance of keepalived for the second IP stack. It was mentioned as something with simpler code changes but following this path would mean we can have two fundamentally different architectures of OpenShift clusters running in the field. + +Because we want clusters with dual-stack VIPs to not differ when they were installed like this from converted clusters, this is probably not the best idea. From 06800ad10a5f362979e8e8b7d6aa8db0ec1f3f33 Mon Sep 17 00:00:00 2001 From: Patryk Matuszak Date: Tue, 9 Jul 2024 12:52:22 +0200 Subject: [PATCH 29/53] move details of microshift-tuned to dedicated section --- .../low-latency-workloads-on-microshift.md | 61 +++++++++++++------ 1 file changed, 41 insertions(+), 20 deletions(-) diff --git a/enhancements/microshift/low-latency-workloads-on-microshift.md b/enhancements/microshift/low-latency-workloads-on-microshift.md index 0f1110e4c4..d994a201ba 100644 --- a/enhancements/microshift/low-latency-workloads-on-microshift.md +++ b/enhancements/microshift/low-latency-workloads-on-microshift.md @@ -106,14 +106,8 @@ Workflow consists of two parts: 1. User builds the blueprint 1. User deploys the commit / installs the system. 1. System boots -1. `microshift-tuned.service` starts (after `tuned.service`, before `microshift.service`): - - Compares active profile with requested profile - - If requested profile is already active: - - Compare checksum of requested profile with cached checksum. - - If checksums are the same - exit. - - Apply requested profile - - Calculate checksum of the profile and the variables file and save it - - If `reboot_after_apply` is True, then reboot the host +1. `microshift-tuned.service` runs (after `tuned.service`, before `microshift.service`) and optionally reboots the system. + See "microshift-tuned service" section below for more information. 1. Host boots again, everything for low latency is in place, `microshift.service` can continue start up. @@ -213,14 +207,8 @@ RUN systemctl enable microshift-tuned.service 1. Production environment - User creates `/etc/microshift/microshift-tuned.yaml` to configure `microshift-tuned.service` - User enables `microshift.service` - - User enables and starts `microshift-tuned.service` which: - - Compares active profile with requested profile - - If requested profile is already active: - - Compare checksum of requested profile with cached checksum. - - If checksums are the same - exit. 
-    - Apply requested profile
-    - Calculate checksum of the profile and the variables file and save it
-    - If `reboot_after_apply` is True, then reboot the host
+  - User enables and starts `microshift-tuned.service` which activates the TuneD profile and optionally reboot the host.:
+    See "microshift-tuned service" section below for more information.
   - Host is rebooted: MicroShift starts because it was enabled
   - Host doesn't need reboot:
     - User starts `microshift.service`
@@ -232,7 +220,7 @@ RUN systemctl enable microshift-tuned.service
   - Setting Pod's memory limit and memory request to the same value, and setting CPU limit and CPU request
     to the same value to ensure Pod has guaranteed QoS class.
 
   - Use annotations to get desired behavior
-    (unless link to a documentation is present, these annotations only take two values: enabled and disabled):
+    (unless link to a documentation is present, these annotations only take two values: `enabled` and `disabled`):
     - `cpu-load-balancing.crio.io: "disable"` - disable CPU load balancing for Pod
       (only use with CPU Manager `static` policy and for Guaranteed QoS Pods using whole CPUs)
     - `cpu-quota.crio.io: "disable"` - disable Completely Fair Scheduler (CFS)
@@ -326,16 +314,49 @@
 hugepages = 0
 additional_args =
 ```
 
-#### `microshift-tuned.service` configuration
+#### microshift-tuned service
 
-Config file to specify which profile to re-apply each boot and if host should be rebooted if
-the kargs before and after applying profile are mismatched.
+`microshift-tuned.service` will be responsible for activating the TuneD profile specified by the user in the config.
+The user will also need to specify whether the host should be rebooted after activating the profile.
 
 ```yaml
 profile: microshift-baseline
 reboot_after_apply: True
 ```
 
+Rationale for creating microshift-tuned.service:
+1. Automatic application of a TuneD profile and reboot if requested - helps with unattended installs of a fleet of devices.
+2. TuneD does not reboot the system in case of a profile change. The TuneD daemon will reapply the profile,
+   but if there were changes to kernel arguments, they will stay inactive until the host is rebooted, which would be a manual operation.
+
+To address 1., microshift-tuned will, on each start, apply the TuneD profile specified by the user and reboot the host if the user requested it (`reboot_after_apply`).
+To address 2., microshift-tuned will calculate a checksum of the TuneD profile contents and of the variables file referenced in the profile.
+These checksums will be persisted on disk and used on the next start of microshift-tuned to decide
+if the profile changed and should be re-applied, optionally followed by a reboot.
+
+`microshift-tuned.service`'s workflow is as follows:
+- Compare the active profile (according to TuneD) with the requested profile (in the config file)
+- If the requested profile is already active:
+  - Compare the checksum of the requested profile with the cached checksum.
+  - If the checksums are the same - exit.
+- Apply the requested profile
+- Calculate the checksum of the profile and the variables file and save it
+- If `reboot_after_apply` is True, then reboot the host
+
+In case of errors:
+- Checksum cache cannot be loaded
+  - The checksum cache is loaded when the active and desired profiles are the same.
+  - If the checksum cannot be loaded, it means that the profile was activated outside `microshift-tuned.service`
+  - The checksum should be calculated and stored in the cache
+  - microshift-tuned.service exits with success
+  - Why not reapply the profile and reboot? Maybe the cache is not there, because it could not be written.
If on missing cache we would potentially reboot, this could mean a boot loop. +- If any operation fails not reaching the `tuned-adm profile $PROFILE`, then the currently active profile will stay active. +- If `tuned-adm profile $PROFILE` fails, it is very likely that the active profile is unloaded and new profile is not active resulting in no active profile. + - This can be further investigated by interacting with TuneD, e.g. + - `sudo tuned-adm active` to get active profile + - Inspecting `/var/log/tuned/tuned.log` for errors + + #### CRI-O configuration ```ini From 8ff43359d2ed4592974c0b221d4ad47f1da02b1f Mon Sep 17 00:00:00 2001 From: Patryk Matuszak Date: Tue, 9 Jul 2024 16:24:30 +0200 Subject: [PATCH 30/53] remove mention of setting kargs via blueprint it is not supported to edge-commit --- enhancements/microshift/low-latency-workloads-on-microshift.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/enhancements/microshift/low-latency-workloads-on-microshift.md b/enhancements/microshift/low-latency-workloads-on-microshift.md index d994a201ba..6a1d4a2892 100644 --- a/enhancements/microshift/low-latency-workloads-on-microshift.md +++ b/enhancements/microshift/low-latency-workloads-on-microshift.md @@ -94,8 +94,6 @@ Workflow consists of two parts: ##### OSTree 1. User creates an osbuild blueprint: - - (optional) User configures `[customizations.kernel]` in the blueprint if the values are known - beforehand. This could prevent from necessary reboot after applying tuned profile. - (optional) User adds `kernel-rt` package to the blueprint - User adds `microshift-low-latency.rpm` to the blueprint - User enables `microshift-tuned.service` From 0c98133781228231c2d8ea0a18a3fd6f2aec7866 Mon Sep 17 00:00:00 2001 From: Patryk Matuszak Date: Tue, 9 Jul 2024 16:44:04 +0200 Subject: [PATCH 31/53] we can include some non-tuned settings with scripts --- .../microshift/low-latency-workloads-on-microshift.md | 6 ------ 1 file changed, 6 deletions(-) diff --git a/enhancements/microshift/low-latency-workloads-on-microshift.md b/enhancements/microshift/low-latency-workloads-on-microshift.md index 6a1d4a2892..0b17cf7dd1 100644 --- a/enhancements/microshift/low-latency-workloads-on-microshift.md +++ b/enhancements/microshift/low-latency-workloads-on-microshift.md @@ -461,12 +461,6 @@ Approach described in this enhancement does not provide much of the NTO's functi due to the "static" nature of RPMs and packaged files (compared to NTO's dynamic templating), but it must be noted that NTO is going beyond low latency. -One of the NTO's strengths is that it can create systemd units for runtime configuration -(such as offlining CPUs, setting hugepages per NUMA node, clearing IRQ balance banned CPUs, -setting RPS masks). Such dynamic actions are beyond capabilities of static files shipped via RPM. -If such features are required by users, we could ship such systemd units and they could be no-op -unless they're turned on in MicroShift's config. However, it is unknown to author of the enhancement -if these are integral part of the low latency. 
## Open Questions [optional] From 91fb92860ec3b587d6db9097f9d4adac61f9268d Mon Sep 17 00:00:00 2001 From: Patryk Matuszak Date: Tue, 9 Jul 2024 16:48:39 +0200 Subject: [PATCH 32/53] update enhancement with implementation findings --- .../low-latency-workloads-on-microshift.md | 63 ++++++++++++------- 1 file changed, 42 insertions(+), 21 deletions(-) diff --git a/enhancements/microshift/low-latency-workloads-on-microshift.md b/enhancements/microshift/low-latency-workloads-on-microshift.md index 0b17cf7dd1..3907333432 100644 --- a/enhancements/microshift/low-latency-workloads-on-microshift.md +++ b/enhancements/microshift/low-latency-workloads-on-microshift.md @@ -13,7 +13,7 @@ approvers: api-approvers: - "@jerpeter1" creation-date: 2024-06-12 -last-updated: 2024-06-12 +last-updated: 2024-07-09 tracking-link: - https://issues.redhat.com/browse/USHIFT-2981 --- @@ -66,9 +66,9 @@ parts need to be put in place: - CRI-O configuration + Kubernetes' RuntimeClass - Kubelet configuration (CPU, Memory, and Topology Managers and other) - `microshift-tuned.service` to activate user selected TuneD profile on boot and reboot the host - if the kernel args are changed. + if the profile was updated. -New `microshift-low-latency` RPM will be created that will contain tuned profile, CRI-O configs, and mentioned systemd daemon. +New `microshift-low-latency` RPM will be created that will contain new artifacts mentioned above. We'll leverage existing know how of Performance and Scalability team expertise and look at Node Tuning Operator capabilities. @@ -100,7 +100,7 @@ Workflow consists of two parts: - User supplies additional configs using blueprint: - `/etc/tuned/microshift-baseline-variables.conf` to configure tuned profile - `/etc/microshift/config.yaml` to configure Kubelet - - `/etc/microshift/microshift-tuned.yaml` to configure `microshift-tuned.service` + - `/etc/microshift/tuned.yaml` to configure `microshift-tuned.service` 1. User builds the blueprint 1. User deploys the commit / installs the system. 1. System boots @@ -139,9 +139,8 @@ name = "KERNEL-rt" path = "/etc/tuned/microshift-baseline-variables.conf" data = """ isolated_cores=1-2 -hugepagesDefaultSize = 2M -hugepages2M = 128 -hugepages1G = 0 +hugepages_size = 2M +hugepages = 128 additionalArgs = "" """ @@ -149,12 +148,14 @@ additionalArgs = "" path = "/etc/microshift/config.yaml" data = """ kubelet: + ... cpuManagerPolicy: static memoryManagerPolicy: Static + ... """ [[customizations.files]] -path = "/etc/microshift/microshift-tuned.yaml" +path = "/etc/microshift/tuned.yaml" data = """ reboot_after_apply: True profile: microshift-baseline @@ -172,8 +173,8 @@ profile: microshift-baseline - `/etc/tuned/microshift-baseline-variables.conf` - `/etc/microshift/config.yaml` to configure Kubelet - `/etc/microshift/microshift-tuned.yaml` to configure `microshift-tuned.service` -1. User builds the blueprint -1. User deploys the commit / installs the system. +1. User builds the Containerfile +1. User deploys the bootc image / installs the system. 1. System boots - rest is just like in OSTree flow Example Containerfile: @@ -184,9 +185,9 @@ FROM registry.redhat.io/rhel9/rhel-bootc:9.4 # ... MicroShift installation ... 
RUN dnf install kernel-rt microshift-tuned -COPY microshift-baseline-variables.conf /etc/tuned/microshift-low-latency-variables.conf +COPY microshift-baseline-variables.conf /etc/tuned/microshift-baseline-variables.conf COPY microshift-config.yaml /etc/microshift/config.yaml -COPY microshift-tuned.yaml /etc/microshift/microshift-tuned.yaml +COPY microshift-tuned.yaml /etc/microshift/tuned.yaml RUN systemctl enable microshift-tuned.service ``` @@ -203,9 +204,9 @@ RUN systemctl enable microshift-tuned.service - Host boots again, everything for low latency is in place, - User starts/enables `microshift.service` 1. Production environment - - User creates `/etc/microshift/microshift-tuned.yaml` to configure `microshift-tuned.service` - - User enables `microshift.service` - - User enables and starts `microshift-tuned.service` which activates the TuneD profile and optionally reboot the host.: + - User creates `/etc/microshift/tuned.yaml` to configure `microshift-tuned.service` + - User enables, but not starts `microshift.service` + - User enables and starts `microshift-tuned.service` which activates the TuneD profile and optionally reboots the host.: See "microshift-tuned service" section below for more information. - Host is rebooted: MicroShift starts because it was enabled - Host doesn't need reboot: @@ -266,6 +267,19 @@ New `microshift-baseline` tuned profile will be created and will include existin > Therefore `microshift-baseline` will "allow" only for single size of hugepages. > Users are welcomed though to introduce non-kernel-args ways of setting up hugepages in their profiles. - additional kernel arguments +- CPU set to offline + +Any other tunables are responsibility of the user. For example, if they want to control hugepages per NUMA node, +they need to create a tuned profile that will include: +```ini +[sysfs] +/sys/devices/system/node/node${NUMA_NODE}/hugepages/hugepages-${HUGEPAGES_SIZE}kB/nr_hugepages=5 +``` + +Probably best resource on TuneD is [RHEL documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/monitoring_and_managing_system_status_and_performance/customizing-tuned-profiles_monitoring-and-managing-system-status-and-performance). + + +##### Expected microshift-baseline tuned profile ```ini [main] @@ -310,6 +324,10 @@ hugepages = 0 # Additional kernel arguments additional_args = + +# CPU set to be offlined +# WARNING: Should not overlap with `isolated_cores` +offline_cpu_set = ``` #### microshift-tuned service @@ -449,9 +467,10 @@ It may happen that some users need to use TuneD plugins that are not handled by In such case we may investigate if it's something generic enough to include, or we can instruct them to create new profile that would include `microshift-baseline` profile. -Systemd daemon we'll provide to enable TuneD profile should have a strict requirement before it -reboots the node, so it doesn't put it into a boot loop. -This pattern of reboot after booting affects the number of "effective" greenboot retries, +To mitigate risk of entering boot loop by continuosly applying and rebooting the node, microshift-tuned +daemon will compare checksums of previously applied version of the profile with current, and reboot the host +only if user allows it in the config file. +This pattern of reboot on first boot after installing/upgrading the system affects the number of "effective" greenboot retries, so customers might need to account for that by increasing the number of retries. 
@@ -464,8 +483,9 @@ but it must be noted that NTO is going beyond low latency. ## Open Questions [optional] -- Verify if osbuild blueprint can override a file from RPM - (variables.conf needs to exist for tuned profile, so it's nice to have some fallback)? +- ~~Verify if osbuild blueprint can override a file from RPM~~ + ~~(variables.conf needs to exist for tuned profile, so it's nice to have some fallback)?~~ + > Yes, it can. - ~~NTO runs tuned in non-daemon one shot mode using systemd unit.~~ ~~Should we try doing the same or we want the tuned daemon to run continuously?~~ > Let's stick to default RHEL behaviour. MicroShift doesn't own the OS. @@ -475,7 +495,8 @@ but it must be noted that NTO is going beyond low latency. - NTO took an approach to duplicate many of the setting from included profiles - should we do the same? > Comment: Probably no need to do that. `cpu-partitioning` profile is not changed very often, > so the risk of breakage is low, but if they change something, we should get that automatically, right? -- Should we also provide NTO's systemd units for offlining CPUs, setting hugepages per NUMA node, clearing IRQ balance, setting RPS masks? +- Should we also provide NTO's systemd units for ~~offlining CPUs,~~ setting hugepages per NUMA node, ~~clearing IRQ balance~~, setting RPS masks? + > We're including offlining CPUs. And `cpu-partitioning` is giving user-provided list of isolated CPUs to `[irqbalance] banned_cpus`. ## Test Plan From 6c591e165857678d22f8325f789a606ddd445e86 Mon Sep 17 00:00:00 2001 From: Patryk Matuszak Date: Thu, 11 Jul 2024 16:12:07 +0200 Subject: [PATCH 33/53] fix value of crio annotations --- enhancements/microshift/low-latency-workloads-on-microshift.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/enhancements/microshift/low-latency-workloads-on-microshift.md b/enhancements/microshift/low-latency-workloads-on-microshift.md index 3907333432..f0af496451 100644 --- a/enhancements/microshift/low-latency-workloads-on-microshift.md +++ b/enhancements/microshift/low-latency-workloads-on-microshift.md @@ -219,7 +219,7 @@ RUN systemctl enable microshift-tuned.service - Setting Pod's memory limit and memory request to the same value, and setting CPU limit and CPU request to the same value to ensure Pod has guaranteed QoS class. 
- Use annotations to get desired behavior - (unless link to a documentation is present, these annotations only take two values: `enabled` and `disabled`): + (unless link to a documentation is present, these annotations only take two values: `enable` and `disable`): - `cpu-load-balancing.crio.io: "disable"` - disable CPU load balancing for Pod (only use with CPU Manager `static` policy and for Guaranteed QoS Pods using whole CPUs) - `cpu-quota.crio.io: "disable"` - disable Completely Fair Scheduler (CFS) From a03e02d4d2df2530885fe3bd1ddb922633b4132d Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jakob=20M=C3=B6ller?= Date: Thu, 2 May 2024 18:16:36 +0200 Subject: [PATCH 34/53] feat: init subprovisioner csi driver integration proposal --- .../subprovisioner-integration-into-lvms.md | 255 ++++++++++++++++++ 1 file changed, 255 insertions(+) create mode 100644 enhancements/local-storage/subprovisioner-integration-into-lvms.md diff --git a/enhancements/local-storage/subprovisioner-integration-into-lvms.md b/enhancements/local-storage/subprovisioner-integration-into-lvms.md new file mode 100644 index 0000000000..f5eba299f7 --- /dev/null +++ b/enhancements/local-storage/subprovisioner-integration-into-lvms.md @@ -0,0 +1,255 @@ +--- +title: Subprovisioner CSI Driver Integration into LVMS +authors: + - "@jakobmoellerdev" +reviewers: + - "CNV Team" + - "LVMS Team" +approvers: + - "@DanielFroehlich" + - "@jerpeter1" + - "@suleymanakbas91" +api-approvers: + - "@DanielFroehlich" + - "@jerpeter1" + - "@suleymanakbas91" +creation-date: 2024-05-02 +last-updated: 2024-05-02 +status: discovery +--- + +# Subprovisioner CSI Driver Integration into LVMS + +[Subprovisioner](https://gitlab.com/subprovisioner/subprovisioner) +is a CSI plugin for Kubernetes that enables you to provision Block volumes +backed by a single, cluster-wide, shared block device (e.g., a single big LUN on a SAN). + +Logical Volume Manager Storage (LVMS) uses the TopoLVM CSI driver to dynamically provision local storage on the OpenShift Container Platform clusters. + +This proposal is about integrating the Subprovisioner CSI driver into the LVMS operator to enable the provisioning of +shared block devices on the OpenShift Container Platform clusters. + +This enhancement will significantly increase scope of LVMS, but allows LVMS to gain the unique value proposition +of serving as a valid layered operator that offers LUN synchronization and provisioning capabilities. + +## Release Signoff Checklist + +- [ ] Enhancement is `implementable` +- [ ] Design details are appropriately documented from clear requirements +- [ ] Test plan is defined +- [ ] Graduation criteria for dev preview, tech preview, GA +- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/) + +## Summary + +This is a proposal to +- Create an enhancement to the "LVMCluster" CRD that is able to differentiate a deviceClass into a new + type of shared storage that can be provisioned side-by-side or in alternative to regular LVMS device-classes managed by TopoLVM. +- Create a productization for a LUN-backed CSI driver alternative to TopoLVM that allows for shared vg usage, especially in the context of virtualization. + + +## Motivation + +TopoLVM as our existing in-tree driver of LVMS is a great solution for local storage provisioning, but it lacks the ability to provision shared storage. 
+This is a significant limitation for virtualization workloads that require shared storage for their VMs that can dynamically be provisioned and deprovisioned
+on multiple nodes. Since OCP 4.15, LVMS supports Multi-Node Deployments as a Topology, but without replication or inbuilt resiliency behavior.
+
+The Subprovisioner CSI driver is a great solution for shared storage provisioning, but it is currently not productized as part of OpenShift Container Platform.
+
+### Goals
+
+- Extension of the LVMCluster CRD to support a new deviceClass policy field that can be used to provision shared storage via Subprovisioner.
+- Find a way to productize the Subprovisioner CSI driver as part of OpenShift Container Platform and increase the value proposition of LVMS.
+- Allow provisioning of regular TopoLVM deviceClasses and shared storage deviceClasses side-by-side in the same cluster.
+
+### Non-Goals
+
+- Compatibility with CSI drivers other than Subprovisioner.
+- Switching the default CSI driver for LVMS from TopoLVM to Subprovisioner or the other way around.
+- Implementing a new CSI driver from scratch.
+- Integrating the Subprovisioner CSI driver into TopoLVM.
+
+### Risks and Mitigations
+- There is a risk of increased maintenance burden by integrating a new CSI driver into LVMS without gaining traction
+  - tested separately in the Subprovisioner project as a pure CSI driver similar to TopoLVM and within LVMS with the help of QE
+  - we will not GA the solution until we have a clear understanding of the maintenance burden. The solution will stay in TechPreview until then.
+- There is a risk that Subprovisioner is so different from TopoLVM that behavior changes can not be accomodated in the current CRD
+  - we will scrap this effort for integration and look for alternative solutions if the integration is not possible with reasonable effort.
+- There is a risk that Subprovisioner is gonna break easily as its a really young project
+  - we will not GA the solution until we have a clear understanding of the stability of the Subprovisioner project. The solution will stay in TechPreview until then.
+
+## Proposal
+
+The proposal is to extend the LVMCluster CRD with a new deviceClass policy field that can be used to provision shared storage via Subprovisioner.
+We will use this field as a hook in lvm-operator, our orchestrating operator, to provision shared storage via Subprovisioner instead of TopoLVM.
+Whenever LVMCluster discovers a new deviceClass with the Subprovisioner-associated policy, it will create a new CSI driver deployment for Subprovisioner and configure it to use the shared storage deviceClass.
+As such, it will hand over the provisioning of shared storage to the Subprovisioner CSI driver. Internal engineering such as sanlock orchestration will also be managed by the driver.
+
+
+### Workflow of Subprovisioner instantiation via LVMCluster
+
+1. The user is informed of the intended use case of Subprovisioner, and decides to use it for its multi-node capabilities before provisioning storage.
+2. The user configures LVMCluster with non-default values for the Volume Group and the deviceClass policy field.
+3. The lvm-operator detects the new deviceClass policy field and creates a new CSI driver deployment for Subprovisioner.
+4. The Subprovisioner CSI driver is configured to use the shared storage deviceClass, initializes the global lock space, and starts provisioning shared storage.
+5. The user can now provision shared storage via Subprovisioner on the OpenShift Container Platform cluster.
+6. 
The user can also provision regular TopoLVM deviceClasses side-by-side with shared storage deviceClasses in the same cluster. Then, TopoLVM gets provisioned side-by-side. + +## Design Details for `LVMCluster CR extension` + +API scheme for `LVMCluster` CR: + +```go + + // The DeviceAccessPolicy type defines the accessibility of the create lvm2 volume group backing the deviceClass. + type DeviceAccessPolicy string + + const ( + DeviceAccessPolicyShared DeviceAccessPolicy = "shared" + DeviceAccessPolicyNodeLocal DeviceAccessPolicy = "nodeLocal" + ) + + // LVMClusterSpec defines the desired state of LVMCluster + type LVMClusterSpec struct { + // Important: Run "make" to regenerate code after modifying this file + + // Tolerations to apply to nodes to act on + // +optional + Tolerations []corev1.Toleration `json:"tolerations,omitempty"` + // Storage describes the deviceClass configuration for local storage devices + // +Optional + Storage Storage `json:"storage,omitempty"` + } + type Storage struct { + // DeviceClasses are a rules that assign local storage devices to volumegroups that are used for creating lvm based PVs + // +Optional + DeviceClasses []DeviceClass `json:"deviceClasses,omitempty"` + } + + type DeviceClass struct { + // Name of the class, the VG and possibly the storageclass. + // Validations to confirm that this field can be used as metadata.name field in storageclass + // ref: https://github.com/kubernetes/apimachinery/blob/de7147/pkg/util/validation/validation.go#L209 + // +kubebuilder:validation:MaxLength=245 + // +kubebuilder:validation:MinLength=1 + // +kubebuilder:validation:Pattern="^[a-z0-9]([-a-z0-9]*[a-z0-9])?$" + Name string `json:"name,omitempty"` + + // DeviceSelector is a set of rules that should match for a device to be included in the LVMCluster + // +optional + DeviceSelector *DeviceSelector `json:"deviceSelector,omitempty"` + + // NodeSelector chooses nodes on which to create the deviceclass + // +optional + NodeSelector *corev1.NodeSelector `json:"nodeSelector,omitempty"` + + // ThinPoolConfig contains configurations for the thin-pool + // +optional + ThinPoolConfig *ThinPoolConfig `json:"thinPoolConfig,omitempty"` + + // Default is a flag to indicate whether the device-class is the default. + // This will mark the storageClass as default. + // +optional + Default bool `json:"default,omitempty"` + + // FilesystemType sets the filesystem the device should use. + // For shared deviceClasses, this field must be set to "" or none. + // +kubebuilder:validation:Enum=xfs;ext4;none;"" + // +kubebuilder:default=xfs + // +optional + FilesystemType DeviceFilesystemType `json:"fstype,omitempty"` + + // Policy defines the policy for the deviceClass. + // TECH PREVIEW: shared will allow accessing the deviceClass from multiple nodes. + // The deviceClass will then be configured via shared volume group. + // +optional + // +kubebuilder:validation:Enum=shared;local ++ DeviceAccessPolicy DeviceAccessPolicy `json:"deviceAccessPolicy,omitempty"` + } + + type ThinPoolConfig struct { + // Name of the thin pool to be created. Will only be used for node-local storage, + // since shared volume groups will create a thin pool with the same name as the volume group. + // +kubebuilder:validation:Required + // +required + Name string `json:"name"` + + // SizePercent represents percentage of space in the volume group that should be used + // for creating the thin pool. 
+ // +kubebuilder:default=90 + // +kubebuilder:validation:Minimum=10 + // +kubebuilder:validation:Maximum=90 + SizePercent int `json:"sizePercent,omitempty"` + + // OverProvisionRatio is the factor by which additional storage can be provisioned compared to + // the available storage in the thin pool. Only applicable for node-local storage. + // +kubebuilder:validation:Minimum=1 + // +kubebuilder:validation:Required + // +required + OverprovisionRatio int `json:"overprovisionRatio"` + } + + // DeviceSelector specifies the list of criteria that have to match before a device is assigned + type DeviceSelector struct { + // A list of device paths which would be chosen for creating Volume Group. + // For example "/dev/disk/by-path/pci-0000:04:00.0-nvme-1" + // We discourage using the device names as they can change over node restarts. ++ // For multiple nodes, all paths MUST be present on all nodes. + // +optional + Paths []string `json:"paths,omitempty"` + + // A list of device paths which could be chosen for creating Volume Group. + // For example "/dev/disk/by-path/pci-0000:04:00.0-nvme-1" + // We discourage using the device names as they can change over node restarts. ++ // For multiple nodes, all paths SHOULD be present on all nodes. + // +optional + OptionalPaths []string `json:"optionalPaths,omitempty"` + + // ForceWipeDevicesAndDestroyAllData runs wipefs to wipe the devices. + // This can lead to data lose. Enable this only when you know that the disk + // does not contain any important data. + // +optional + ForceWipeDevicesAndDestroyAllData *bool `json:"forceWipeDevicesAndDestroyAllData,omitempty"` + } +``` + +## Design Details on volume group orchestration and management via vgmanager + +TBD + +## Design Details for Status Reporting + +TBD + +### Test Plan + +- The integration tests for the LVMS already exist. These tests will need to be updated to test this feature. +- The tests must ensure that detection of devices are working/updating correctly. + +### Graduation Criteria + +TBD + +#### Removing a deprecated feature + +- None of the features are getting deprecated + +### Upgrade / Downgrade Strategy + +TBD + +### Version Skew Strategy + +TBD + +## Implementation History + +TBD + +## Drawbacks + +TBD + +## Alternatives + +TBD \ No newline at end of file From 1d94fa4d5ad1c14c2ed80c4e40674c279b26b260 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jakob=20M=C3=B6ller?= Date: Mon, 27 May 2024 10:07:54 +0200 Subject: [PATCH 35/53] feat: subprovisioner further considerations --- .../subprovisioner-integration-into-lvms.md | 147 +++++++++++++++--- 1 file changed, 126 insertions(+), 21 deletions(-) diff --git a/enhancements/local-storage/subprovisioner-integration-into-lvms.md b/enhancements/local-storage/subprovisioner-integration-into-lvms.md index f5eba299f7..09739d58d7 100644 --- a/enhancements/local-storage/subprovisioner-integration-into-lvms.md +++ b/enhancements/local-storage/subprovisioner-integration-into-lvms.md @@ -213,43 +213,148 @@ API scheme for `LVMCluster` CR: } ``` -## Design Details on volume group orchestration and management via vgmanager +## Design Details on Volume Group Orchestration and Management via vgmanager -TBD +The `vgmanager` component will be responsible for managing volume groups (VGs) and coordinating the orchestration between TopoLVM and Subprovisioner CSI drivers. This includes: -## Design Details for Status Reporting +1. **Detection and Configuration**: + - Detecting devices that match the `DeviceSelector` criteria specified in the `LVMCluster` CR. 
+ - Configuring volume groups based on the `DeviceAccessPolicy` (either `shared` for Subprovisioner or `local` for TopoLVM). + - Ensuring that shared volume groups are correctly initialized and managed across multiple nodes. -TBD +2. **Dynamic Provisioning**: + - Creating and managing VGs dynamically based on incoming requests and the policy defined in the CR. + - For shared deviceClasses, ensure that the VG is accessible and consistent across all nodes in the cluster. + - For shared volume groups mandated by a shared deviceClass, the VG will be created in shared mode and a SAN lock might need to be initialized + +3. **Monitoring and Maintenance**: + - Continuously monitor the health and status of the VGs. + - Handling any required maintenance tasks, such as resizing, repairing, or migrating VGs must be performed manually for shared Volume Groups. -### Test Plan +4. **Synchronization**: + - Ensure synchronization mechanisms (such as locks) are in place for shared VGs to prevent data corruption and ensure consistency. + - Utilize `sanlock` or similar technologies to manage and synchronize access to shared storage at all times. + - For SAN lock initialization, a race-free initialization of the lock space will be required. This can be achieved by using a Lease Object, + which is a Kubernetes object that can be used to coordinate distributed systems. The Lease Object will be used to ensure that only one node + can initialize the lock space at a time. The Lease will be owned on a first-come-first-serve basis, and the node that acquires the Lease will + will be used for shared lockspace initialization. [A sample implementation can be found here](https://github.com/openshift/lvm-operator/commit/8ba6307c7bcaccc02953e0e2bdad5528636d5e2d) -- The integration tests for the LVMS already exist. These tests will need to be updated to test this feature. -- The tests must ensure that detection of devices are working/updating correctly. -### Graduation Criteria +## Design Details for Status Reporting -TBD +The status reporting will include: -#### Removing a deprecated feature +1. **VG Status**: + - Report the health and state of each VG managed by `vgmanager`. + - Include details such as size, available capacity, and any errors or warnings. + - Health reporting per node is still mandatory. -- None of the features are getting deprecated +2. **Node-Specific Information**: + - Report node-specific information related to the VGs, such as which nodes have access to shared VGs. + - Include status of node-local VGs and any issues detected. -### Upgrade / Downgrade Strategy +3. **CSI Driver Status**: + - Provide status updates on the CSI drivers (both TopoLVM and Subprovisioner) deployed in the cluster. + - Include information on driver health, performance metrics, and any incidents. + - Ideally, subprovisioner implements Volume Health Monitoring CSI calls. -TBD +4. **Event Logging**: + - Maintain detailed logs of all events related to VG management and CSI driver operations. + - Ensure that any significant events (such as failovers, recoveries, and maintenance actions) are logged and reported. -### Version Skew Strategy +### Test Plan + +- **Integration Tests**: + - Update existing LVMS integration tests to include scenarios for shared storage provisioning with Subprovisioner. + - Ensure that device detection and VG management are functioning correctly with both TopoLVM and Subprovisioner. + - QE will be extending the existing test suites to include shared storage provisioning and synchronization tests. 
-TBD +- **E2E Tests**: + - Implement end-to-end tests to validate the complete workflow from device discovery to VG provisioning and usage. + - Include multi-node scenarios to test shared storage provisioning and synchronization. -## Implementation History +- **Performance and Stress Tests**: + - Conduct performance tests to assess the scalability and robustness of the VG management and CSI driver operations. + - The performance tests will have the same scope as the existing TopoLVM performance tests, mainly provisioning times and I/O + - Perform stress tests to evaluate system behavior under high load and failure conditions. + - We will run these tests before any graduation to GA at the minimum. -TBD +### Graduation Criteria -## Drawbacks +- **Developer Preview (Early Evaluation and Feedback)**: + - Initial implementation with basic functionality for shared and node-local VG provisioning. + - Basic integration and E2E tests in place. + - Feedback from early adopters and stakeholders collected. + - No official Product Support. + - Functionality is provided with very limited, if any, documentation. Documentation is not included as part of the product’s documentation set. + +- **Technology Preview**: + - Feature-complete implementation with all planned functionality. + - Comprehensive test coverage including performance and stress tests. + - Functionality is documented as part of the products documentation set (on the Red Hat Customer Portal) and/or via the release notes. + - Functionality is provided with LIMITED support by Red Hat. Customers can open support cases, file bugs, and request feature enhancements. However, support is provided with NO commercial SLA and no commitment to implement any changes. + - Functionality has undergone more complete Red Hat testing for the configurations supported by the underlying product. + - Functionality is, with rare exceptions, on Red Hat’s product roadmap for a future release. + +- **GA**: + - Proven stability and performance in production-like environments. + - Positive feedback from initial users. + - Full documentation, including troubleshooting guides and best practices. + - Full LVMS Support Lifecycle -TBD +### Upgrade / Downgrade Strategy -## Alternatives +- **Upgrade**: + - Ensure that upgrades are seamless with no downtime for existing workloads. Migrating to a subprovisioner enabled version is a no-break operation + - Test upgrade paths thoroughly to ensure compatibility and data integrity. The subprovisioner to topolvm (or vice versa) switch should be excluded and forbidden explicitly. + - New deviceClasses with the shared policy should be able to be added to existing LVMClusters without affecting existing deviceClasses. + +- **Downgrade**: + - Allow safe downgrades by maintaining backward compatibility. Downgrading from a subprovisioner enabled version to a purely topolvm enabled version should be a no-break operation for the topolvm part. For the subprovisioner part, the operator should ensure that the shared VGs can be cleaned up manually + - Provide rollback mechanisms and detailed instructions to revert to previous versions. Ensure that downgrades do not result in data loss or service interruptions. + The operator should ensure that the shared VGs can be cleaned up manually. + - Ensure that downgrades do not result in data loss or service interruptions. The operator should ensure that the shared VGs can be cleaned up without data loss on other device classes. 
+ +### Version Skew Strategy -TBD \ No newline at end of file +- Ensure compatibility between different versions of LVMS and the integrated Subprovisioner CSI driver. + - Implement version checks and compatibility checks in the `vgmanager` component. + - Ensure that the operator can handle version skew between the LVMS operator and the Subprovisioner CSI driver where required. + - Provide clear guidelines on how to manage version skew and perform upgrades in a controlled manner. + - One version of LVMS should be able to handle one version of the Subprovisioner CSI driver. +- Document supported version combinations and any known issues with version mismatches. +- Provide clear guidelines on how to manage version skew and perform upgrades in a controlled manner. + +### Security Considerations + +- **Access Control**: + - Ensure that access to shared storage is controlled and restricted to authorized users. Node-level access control should be enforced similarly to TopoLVM. + - Implement RBAC policies to restrict access to VGs and CSI drivers based on user roles and permissions. + - Ensure that shared VGs are only accessible by nodes that are authorized to access them. +- **CVE Scanning**: + - Ensure that the Subprovisioner CSI driver is regularly scanned for vulnerabilities and that any identified issues are addressed promptly. + - Implement a process for CVE scanning and remediation for the Subprovisioner CSI driver. + - Fixes for CVEs should be handled in a dedicated midstream openshift/subprovisioner for critical CVEs when Red Hat decides to no longer solely own the project. Until then, the fixes will be handled by the Red Hat team and a midstream is optional. + +### Implementation Milestones + +- **Phase 1**: Initial design and prototyping. Basic integration with Subprovisioner and updates to the LVMCluster CR. +- **Phase 2**: Development of `vgmanager` functionalities for VG orchestration and management. Integration and E2E testing. +- **Phase 3**: Performance testing, bug fixes, and documentation. Preparing for Alpha release. +- **Phase 4**: Developer Preview release with comprehensive manual and QE testing. Gathering user feedback and making improvements. +- **Phase 5**: Technology PReview with Documentation Extension and preparation of GA. +- **Phase 5**: General Availability (GA) release with proven stability and performance in production environments. + +### Drawbacks + +- Increased complexity in managing both node-local and shared storage. +- Potential for increased maintenance burden with the integration of a new CSI driver. +- Risks associated with the stability and maturity of the Subprovisioner project. +- Complex testing matrix and shared volume group use cases can be hard to debug / troubleshoot. + +### Alternatives + +- Continue using TopoLVM exclusively for local storage provisioning. +- Evaluate and integrate other CSI drivers that support shared storage. +- Develop a custom CSI driver to meet the specific needs of LVMS and OpenShift. +- Move Subprovisioner to CNV and package it in a separate product. 
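+
+To illustrate the race-free, first-come-first-serve lock-space initialization described in the vgmanager section above, a minimal sketch using a Kubernetes Lease is shown below. The function, Lease name, and lease duration are assumptions made for this example; the sample implementation linked earlier in lvm-operator is the authoritative reference.
+
+```go
+package vgmanager
+
+import (
+	"context"
+	"fmt"
+
+	coordinationv1 "k8s.io/api/coordination/v1"
+	apierrors "k8s.io/apimachinery/pkg/api/errors"
+	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	"k8s.io/client-go/kubernetes"
+)
+
+// tryAcquireLockspaceLease creates a Lease named after the shared volume group.
+// Creation is atomic on the API server, so exactly one node wins the race and
+// initializes the sanlock lock space; every other node simply joins it later.
+func tryAcquireLockspaceLease(ctx context.Context, cs kubernetes.Interface, namespace, vgName, nodeName string) (bool, error) {
+	duration := int32(60) // illustrative lease duration in seconds
+	lease := &coordinationv1.Lease{
+		ObjectMeta: metav1.ObjectMeta{
+			Name:      fmt.Sprintf("lockspace-%s", vgName),
+			Namespace: namespace,
+		},
+		Spec: coordinationv1.LeaseSpec{
+			HolderIdentity:       &nodeName,
+			LeaseDurationSeconds: &duration,
+		},
+	}
+	_, err := cs.CoordinationV1().Leases(namespace).Create(ctx, lease, metav1.CreateOptions{})
+	if apierrors.IsAlreadyExists(err) {
+		return false, nil // another node already owns lock-space initialization
+	}
+	if err != nil {
+		return false, err
+	}
+	return true, nil
+}
+```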
From dd0389295cc2f8a3770cd8c23311a44ef4deafeb Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jakob=20M=C3=B6ller?= Date: Mon, 27 May 2024 10:10:05 +0200 Subject: [PATCH 36/53] feat: subprovisioner small PR nits --- .../subprovisioner-integration-into-lvms.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/enhancements/local-storage/subprovisioner-integration-into-lvms.md b/enhancements/local-storage/subprovisioner-integration-into-lvms.md index 09739d58d7..0af2010fb4 100644 --- a/enhancements/local-storage/subprovisioner-integration-into-lvms.md +++ b/enhancements/local-storage/subprovisioner-integration-into-lvms.md @@ -101,19 +101,19 @@ API scheme for `LVMCluster` CR: ```go - // The DeviceAccessPolicy type defines the accessibility of the create lvm2 volume group backing the deviceClass. + // The DeviceAccessPolicy type defines the accessibility of the lvm2 volume group backing the deviceClass. type DeviceAccessPolicy string const ( DeviceAccessPolicyShared DeviceAccessPolicy = "shared" - DeviceAccessPolicyNodeLocal DeviceAccessPolicy = "nodeLocal" + DeviceAccessPolicyNodeLocal DeviceAccessPolicy = "nodeLocal" ) // LVMClusterSpec defines the desired state of LVMCluster type LVMClusterSpec struct { // Important: Run "make" to regenerate code after modifying this file - // Tolerations to apply to nodes to act on + // Tolerations applied to CSI driver pods // +optional Tolerations []corev1.Toleration `json:"tolerations,omitempty"` // Storage describes the deviceClass configuration for local storage devices @@ -143,7 +143,8 @@ API scheme for `LVMCluster` CR: // +optional NodeSelector *corev1.NodeSelector `json:"nodeSelector,omitempty"` - // ThinPoolConfig contains configurations for the thin-pool + // ThinPoolConfig contains configurations for the thin-pool. ++ // MUST NOT be set for shared deviceClasses. // +optional ThinPoolConfig *ThinPoolConfig `json:"thinPoolConfig,omitempty"` From c59642bc270d98fe329d4429ed17e1ccd9d90648 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jakob=20M=C3=B6ller?= Date: Mon, 27 May 2024 10:11:45 +0200 Subject: [PATCH 37/53] feat: subprovisioner small PR nits --- .../subprovisioner-integration-into-lvms.md | 22 +++++++++---------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/enhancements/local-storage/subprovisioner-integration-into-lvms.md b/enhancements/local-storage/subprovisioner-integration-into-lvms.md index 0af2010fb4..a3d44e373c 100644 --- a/enhancements/local-storage/subprovisioner-integration-into-lvms.md +++ b/enhancements/local-storage/subprovisioner-integration-into-lvms.md @@ -101,13 +101,13 @@ API scheme for `LVMCluster` CR: ```go - // The DeviceAccessPolicy type defines the accessibility of the lvm2 volume group backing the deviceClass. - type DeviceAccessPolicy string ++ // The DeviceAccessPolicy type defines the accessibility of the lvm2 volume group backing the deviceClass. ++ type DeviceAccessPolicy string - const ( - DeviceAccessPolicyShared DeviceAccessPolicy = "shared" - DeviceAccessPolicyNodeLocal DeviceAccessPolicy = "nodeLocal" - ) ++ const ( ++ DeviceAccessPolicyShared DeviceAccessPolicy = "shared" ++ DeviceAccessPolicyNodeLocal DeviceAccessPolicy = "nodeLocal" ++ ) // LVMClusterSpec defines the desired state of LVMCluster type LVMClusterSpec struct { @@ -160,11 +160,11 @@ API scheme for `LVMCluster` CR: // +optional FilesystemType DeviceFilesystemType `json:"fstype,omitempty"` - // Policy defines the policy for the deviceClass. 
- // TECH PREVIEW: shared will allow accessing the deviceClass from multiple nodes. - // The deviceClass will then be configured via shared volume group. - // +optional - // +kubebuilder:validation:Enum=shared;local ++ // Policy defines the policy for the deviceClass. ++ // TECH PREVIEW: shared will allow accessing the deviceClass from multiple nodes. ++ // The deviceClass will then be configured via shared volume group. ++ // +optional ++ // +kubebuilder:validation:Enum=shared;local + DeviceAccessPolicy DeviceAccessPolicy `json:"deviceAccessPolicy,omitempty"` } From d57f5b408e787e88334be3d22e47e3ab424b9127 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jakob=20M=C3=B6ller?= Date: Mon, 27 May 2024 10:12:57 +0200 Subject: [PATCH 38/53] feat: subprovisioner small PR nits --- .../local-storage/subprovisioner-integration-into-lvms.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/enhancements/local-storage/subprovisioner-integration-into-lvms.md b/enhancements/local-storage/subprovisioner-integration-into-lvms.md index a3d44e373c..a7d4bdc808 100644 --- a/enhancements/local-storage/subprovisioner-integration-into-lvms.md +++ b/enhancements/local-storage/subprovisioner-integration-into-lvms.md @@ -344,7 +344,7 @@ The status reporting will include: - **Phase 3**: Performance testing, bug fixes, and documentation. Preparing for Alpha release. - **Phase 4**: Developer Preview release with comprehensive manual and QE testing. Gathering user feedback and making improvements. - **Phase 5**: Technology PReview with Documentation Extension and preparation of GA. -- **Phase 5**: General Availability (GA) release with proven stability and performance in production environments. +- **Phase 6**: General Availability (GA) release with proven stability and performance in production environments. ### Drawbacks From 1e3f102517b8fde62913b43eec344195ef757c39 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jakob=20M=C3=B6ller?= Date: Mon, 3 Jun 2024 16:57:13 +0200 Subject: [PATCH 39/53] feat: subprovisioner small PR nits --- .../local-storage/subprovisioner-integration-into-lvms.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/enhancements/local-storage/subprovisioner-integration-into-lvms.md b/enhancements/local-storage/subprovisioner-integration-into-lvms.md index a7d4bdc808..d3d42684f3 100644 --- a/enhancements/local-storage/subprovisioner-integration-into-lvms.md +++ b/enhancements/local-storage/subprovisioner-integration-into-lvms.md @@ -75,7 +75,7 @@ The Subprovisioner CSI driver is a great solution for shared storage provisionin - we will not GA the solution until we have a clear understanding of the maintenance burden. The solution will stay in TechPreview until then. - There is a risk that Subprovisioner is so different from TopoLVM that behavior changes can not be accomodated in the current CRD - we will scrap this effort for integration and look for alternative solutions if the integration is not possible with reasonable effort. -- There is a risk that Subprovisioner is gonna break easily as its a really young project +- There is a risk that Subprovisioner will break easily as its a really young project - we will not GA the solution until we have a clear understanding of the stability of the Subprovisioner project. The solution will stay in TechPreview until then. ## Proposal @@ -308,6 +308,7 @@ The status reporting will include: - **Upgrade**: - Ensure that upgrades are seamless with no downtime for existing workloads. 
Migrating to a subprovisioner enabled version is a no-break operation - Test upgrade paths thoroughly to ensure compatibility and data integrity. The subprovisioner to topolvm (or vice versa) switch should be excluded and forbidden explicitly. + - The "default" deviceClass cannot be changed as well and changeing from shared to local or vice versa is not supported without resetting the LVMCluster. - New deviceClasses with the shared policy should be able to be added to existing LVMClusters without affecting existing deviceClasses. - **Downgrade**: From 518b28188d4805754cb43a6aae887e60513f2c90 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jakob=20M=C3=B6ller?= Date: Wed, 19 Jun 2024 12:24:39 +0200 Subject: [PATCH 40/53] feat: subprovisioner small user stories and refactor --- .../subprovisioner-integration-into-lvms.md | 122 +++++++++++++++--- 1 file changed, 105 insertions(+), 17 deletions(-) diff --git a/enhancements/local-storage/subprovisioner-integration-into-lvms.md b/enhancements/local-storage/subprovisioner-integration-into-lvms.md index d3d42684f3..8948f20038 100644 --- a/enhancements/local-storage/subprovisioner-integration-into-lvms.md +++ b/enhancements/local-storage/subprovisioner-integration-into-lvms.md @@ -69,6 +69,38 @@ The Subprovisioner CSI driver is a great solution for shared storage provisionin - Implementing a new CSI driver from scratch. - Integrating the Subprovisioner CSI driver into TopoLVM. +### User Stories + +As a Data Center OCP Admin: +- I want to seamlessly add my existing SAN infrastructure to OCP nodes to host VM workloads, enabling better (live) migration of VMs from vSphere to OCP Virt and from one OCP node to another. +- I want to provision shared block storage across multiple nodes, ensuring high availability and resiliency for my virtualization workloads. +- I want to manage both local and shared storage within the same OpenShift cluster to optimize resource utilization and simplify storage management. + +As a Developer: +- I want to deploy applications that require shared storage across multiple pods and nodes, ensuring data consistency and high availability. +- I want to use a single, unified API to provision and manage both local and shared storage classes, reducing complexity in my deployment scripts. +- I want to benefit from the unique capabilities of Subprovisioner for shared storage without having to manage separate storage solutions, both TopoLVM and Subprovisioner use lvm2 under the hood. + +As a Storage Administrator: +- I want to easily configure and manage volume groups using the new deviceClass policy field in the LVMCluster CRD, ensuring that my storage setup is consistent and efficient. +- I want to monitor the health and status of my volume groups, receiving alerts and logs for any issues that arise with the shared storage. +- I want to leverage existing expensive SAN infrastructure to provide shared storage, maximizing the return on investment for our hardware. + +As an IT Operations Engineer: +- I want to ensure that upgrades and downgrades of the LVMS operator and Subprovisioner CSI driver are seamless and do not cause downtime for my existing workloads. +- I want to follow clear guidelines and best practices for managing version skew between LVMS and Subprovisioner, ensuring compatibility and stability. +- I want detailed documentation and troubleshooting guides to help resolve any issues that arise during the deployment and operation of shared storage. 
+ +As a Quality Assurance Engineer: +- I want to execute comprehensive integration and end-to-end tests that validate the functionality of shared storage provisioning with Subprovisioner. +- I want to conduct performance and stress tests to ensure that the solution can handle high load and failure conditions without degradation of service. +- I want to gather and analyze feedback from early adopters to improve the stability and performance of the integrated solution before general availability. + +As a Product Manager: +- I want to offer a unique value proposition with LVMS by integrating Subprovisioner, enabling OCP customers to use shared block storage seamlessly. +- I want to ensure that the solution meets the needs of our enterprise customers, providing high availability, resiliency, and performance for their critical workloads. +- I want to manage the roadmap and release cycles effectively, ensuring that each phase of the project is delivered on time and meets quality standards. + ### Risks and Mitigations - There is a risk of increased maintenance burden by integrating a new CSI driver into LVMS without gaining traction - tested separately in the Subprovisioner project as pure CSI Driver similar to TopoLVM and within LVMS with help of QE @@ -85,8 +117,9 @@ We will use this field as a hook in lvm-operator, our orchestrating operator, to Whenever LVMCluster discovers a new deviceClass with the Subprovisioner associated policy, it will create a new CSI driver deployment for Subprovisioner and configure it to use the shared storage deviceClass. As such, it will handover the provisioning of shared storage to the Subprovisioner CSI driver. Also internal engineering such as sanlock orchestration will be managed by the driver. +### Workflow Description -### Workflow of Subprovisioner instantiation via LVMCluster +#### Subprovisioner instantiation via LVMCluster 1. The user is informed of the intended use case of Subprovisioner, and decides to use it for its multi-node capabilities before provisioning Storage 2. The user configures LVMCluster with non-default values for the Volume Group and the deviceClass policy field @@ -95,7 +128,9 @@ As such, it will handover the provisioning of shared storage to the Subprovision 5. The user can now provision shared storage via Subprovisioner on the OpenShift Container Platform cluster. 6. The user can also provision regular TopoLVM deviceClasses side-by-side with shared storage deviceClasses in the same cluster. Then, TopoLVM gets provisioned side-by-side. -## Design Details for `LVMCluster CR extension` +### API Extensions + +#### Design Details for `LVMCluster CR extension` API scheme for `LVMCluster` CR: @@ -214,7 +249,9 @@ API scheme for `LVMCluster` CR: } ``` -## Design Details on Volume Group Orchestration and Management via vgmanager +### Implementation Details/Notes/Constraints + +#### Design Details on Volume Group Orchestration and Management via vgmanager The `vgmanager` component will be responsible for managing volume groups (VGs) and coordinating the orchestration between TopoLVM and Subprovisioner CSI drivers. This includes: @@ -241,7 +278,7 @@ The `vgmanager` component will be responsible for managing volume groups (VGs) a will be used for shared lockspace initialization. 
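For orientation, the lvm2 operations that `vgmanager` automates for a shared deviceClass would look roughly like the
following (a hedged sketch: it assumes lvmlockd and sanlock are already running on each node, the VG name and device
path are placeholders, and exact flags may vary between lvm2 versions):

```shell
# Create the shared volume group on the LUN that every node can see (done once)
$ vgcreate --shared vg-shared /dev/disk/by-id/scsi-example-shared-lun

# Join the shared VG's lockspace on each node that should use it
$ vgchange --lock-start vg-shared
```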
[A sample implementation can be found here](https://github.com/openshift/lvm-operator/commit/8ba6307c7bcaccc02953e0e2bdad5528636d5e2d) -## Design Details for Status Reporting +#### Design Details for Status Reporting The status reporting will include: @@ -263,7 +300,33 @@ The status reporting will include: - Maintain detailed logs of all events related to VG management and CSI driver operations. - Ensure that any significant events (such as failovers, recoveries, and maintenance actions) are logged and reported. -### Test Plan + +### Drawbacks + +- Increased complexity in managing both node-local and shared storage. +- Potential for increased maintenance burden with the integration of a new CSI driver. +- Risks associated with the stability and maturity of the Subprovisioner project. +- Complex testing matrix and shared volume group use cases can be hard to debug / troubleshoot. + +### Topology Considerations + +* The primary use case for Subprovisioner is to enable shared storage across multiple nodes. This capability is critical for environments where high availability and data redundancy are required. +* Ensure that all nodes in the cluster can access the shared storage devices consistently and reliably. This may involve configuring network settings and storage paths appropriately. + +#### Hypershift / Hosted Control Planes + +N/A + +#### Standalone Clusters + +LVMS can be installed on standalone clusters, but the shared storage provisioning will only work in a multi-node environment. + +#### Single-node Deployments or MicroShift + +* While LVMS can be installed on single-node deployments and MicroShift, the shared storage provisioning feature enabled by Subprovisioner is designed for multi-node environments. Single-node setups can still use local storage provisioning through TopoLVM. +* MicroShift deployments will include the Subprovisioner binaries but will not use shared storage provisioning due to the single-node nature of MicroShift. + +## Test Plan - **Integration Tests**: - Update existing LVMS integration tests to include scenarios for shared storage provisioning with Subprovisioner. @@ -280,7 +343,9 @@ The status reporting will include: - Perform stress tests to evaluate system behavior under high load and failure conditions. - We will run these tests before any graduation to GA at the minimum. -### Graduation Criteria +## Graduation Criteria + +### Dev Preview -> Tech Preview - **Developer Preview (Early Evaluation and Feedback)**: - Initial implementation with basic functionality for shared and node-local VG provisioning. @@ -297,13 +362,19 @@ The status reporting will include: - Functionality has undergone more complete Red Hat testing for the configurations supported by the underlying product. - Functionality is, with rare exceptions, on Red Hat’s product roadmap for a future release. +### Tech Preview -> GA + - **GA**: - Proven stability and performance in production-like environments. - Positive feedback from initial users. - Full documentation, including troubleshooting guides and best practices. - Full LVMS Support Lifecycle -### Upgrade / Downgrade Strategy +### Removing a deprecated feature + +N/A + +## Upgrade / Downgrade Strategy - **Upgrade**: - Ensure that upgrades are seamless with no downtime for existing workloads. Migrating to a subprovisioner enabled version is a no-break operation @@ -317,7 +388,7 @@ The status reporting will include: The operator should ensure that the shared VGs can be cleaned up manually. 
- Ensure that downgrades do not result in data loss or service interruptions. The operator should ensure that the shared VGs can be cleaned up without data loss on other device classes. -### Version Skew Strategy +## Version Skew Strategy - Ensure compatibility between different versions of LVMS and the integrated Subprovisioner CSI driver. - Implement version checks and compatibility checks in the `vgmanager` component. @@ -327,7 +398,30 @@ The status reporting will include: - Document supported version combinations and any known issues with version mismatches. - Provide clear guidelines on how to manage version skew and perform upgrades in a controlled manner. -### Security Considerations +## Operational Aspects of API Extensions + +The integration of the Subprovisioner CSI driver into LVMS introduces several new API extensions, primarily within the LVMCluster CRD. These extensions include new fields for the deviceClass policy, specifically designed to support shared storage provisioning. The operational aspects of these API extensions are as follows: + +* Configuration and Management: + * Administrators can configure shared storage by setting the DeviceAccessPolicy field in the DeviceClass section of the LVMCluster CRD to shared. + * The API ensures that only valid configurations are accepted, providing clear error messages for any misconfigurations, such as setting a filesystem type for shared device classes. + +* Validation and Enforcement: + * The operator will enforce constraints on shared storage configurations, such as requiring shared volume groups to use the shared policy and prohibiting thin pool configurations. + * The vgmanager component will validate device paths and ensure that they are consistent across all nodes in the cluster. + +* Dynamic Provisioning: + * When a shared device class is configured, the operator will dynamically create and manage the corresponding Subprovisioner CSI driver deployment, ensuring that the shared storage is properly initialized and synchronized across nodes. + +Monitoring and Reporting: + * The status of the shared storage, including health and capacity metrics, will be reported through the LVMCluster CRD status fields. + * Node-specific information and events related to the shared storage will be logged and made available for troubleshooting and auditing purposes. + +## Support Procedures + +Regular product support for LVMS will continue to be established through the LVMS team. In addition, Subprovisioner will receive upstream issues through consumption in the LVMS project and will serve as a repackaging customer for the Subprovisioner project. + +## Security Considerations - **Access Control**: - Ensure that access to shared storage is controlled and restricted to authorized users. Node-level access control should be enforced similarly to TopoLVM. @@ -338,7 +432,7 @@ The status reporting will include: - Implement a process for CVE scanning and remediation for the Subprovisioner CSI driver. - Fixes for CVEs should be handled in a dedicated midstream openshift/subprovisioner for critical CVEs when Red Hat decides to no longer solely own the project. Until then, the fixes will be handled by the Red Hat team and a midstream is optional. -### Implementation Milestones +## Implementation Milestones - **Phase 1**: Initial design and prototyping. Basic integration with Subprovisioner and updates to the LVMCluster CR. - **Phase 2**: Development of `vgmanager` functionalities for VG orchestration and management. Integration and E2E testing. 
@@ -347,14 +441,8 @@ The status reporting will include: - **Phase 5**: Technology PReview with Documentation Extension and preparation of GA. - **Phase 6**: General Availability (GA) release with proven stability and performance in production environments. -### Drawbacks - -- Increased complexity in managing both node-local and shared storage. -- Potential for increased maintenance burden with the integration of a new CSI driver. -- Risks associated with the stability and maturity of the Subprovisioner project. -- Complex testing matrix and shared volume group use cases can be hard to debug / troubleshoot. -### Alternatives +## Alternatives - Continue using TopoLVM exclusively for local storage provisioning. - Evaluate and integrate other CSI drivers that support shared storage. From 338cd68b6881ccfa4c0392a804e9db027c04d55f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jakob=20M=C3=B6ller?= Date: Wed, 19 Jun 2024 13:08:41 +0200 Subject: [PATCH 41/53] fix: small cleanup and file rename --- ...ms.md => subprovisioner-csi-driver-integration-into-lvms.md} | 2 ++ 1 file changed, 2 insertions(+) rename enhancements/local-storage/{subprovisioner-integration-into-lvms.md => subprovisioner-csi-driver-integration-into-lvms.md} (99%) diff --git a/enhancements/local-storage/subprovisioner-integration-into-lvms.md b/enhancements/local-storage/subprovisioner-csi-driver-integration-into-lvms.md similarity index 99% rename from enhancements/local-storage/subprovisioner-integration-into-lvms.md rename to enhancements/local-storage/subprovisioner-csi-driver-integration-into-lvms.md index 8948f20038..49d1cb7990 100644 --- a/enhancements/local-storage/subprovisioner-integration-into-lvms.md +++ b/enhancements/local-storage/subprovisioner-csi-driver-integration-into-lvms.md @@ -16,6 +16,8 @@ api-approvers: creation-date: 2024-05-02 last-updated: 2024-05-02 status: discovery +tracking-link: + - https://issues.redhat.com/browse/OCPEDGE-1147 --- # Subprovisioner CSI Driver Integration into LVMS From c294a715c8140b62f7923b7f0c742241b1099be5 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jakob=20M=C3=B6ller?= Date: Wed, 19 Jun 2024 15:48:28 +0200 Subject: [PATCH 42/53] fix: small cleanup and file rename --- .../subprovisioner-csi-driver-integration-into-lvms.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/enhancements/local-storage/subprovisioner-csi-driver-integration-into-lvms.md b/enhancements/local-storage/subprovisioner-csi-driver-integration-into-lvms.md index 49d1cb7990..3bbcb370a2 100644 --- a/enhancements/local-storage/subprovisioner-csi-driver-integration-into-lvms.md +++ b/enhancements/local-storage/subprovisioner-csi-driver-integration-into-lvms.md @@ -20,7 +20,7 @@ tracking-link: - https://issues.redhat.com/browse/OCPEDGE-1147 --- -# Subprovisioner CSI Driver Integration into LVMS +# subprovisioner-csi-driver-integration-into-lvms [Subprovisioner](https://gitlab.com/subprovisioner/subprovisioner) is a CSI plugin for Kubernetes that enables you to provision Block volumes From 47c728d0805c67e14384e99367653d08236a0b2a Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jakob=20M=C3=B6ller?= Date: Wed, 19 Jun 2024 15:48:36 +0200 Subject: [PATCH 43/53] fix: small cleanup and file rename --- .../subprovisioner-csi-driver-integration-into-lvms.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/enhancements/local-storage/subprovisioner-csi-driver-integration-into-lvms.md b/enhancements/local-storage/subprovisioner-csi-driver-integration-into-lvms.md index 3bbcb370a2..3ec720982e 100644 --- 
a/enhancements/local-storage/subprovisioner-csi-driver-integration-into-lvms.md +++ b/enhancements/local-storage/subprovisioner-csi-driver-integration-into-lvms.md @@ -1,5 +1,5 @@ --- -title: Subprovisioner CSI Driver Integration into LVMS +title: subprovisioner-csi-driver-integration-into-lvms authors: - "@jakobmoellerdev" reviewers: @@ -20,7 +20,7 @@ tracking-link: - https://issues.redhat.com/browse/OCPEDGE-1147 --- -# subprovisioner-csi-driver-integration-into-lvms +# Subprovisioner CSI Driver Integration into LVMS [Subprovisioner](https://gitlab.com/subprovisioner/subprovisioner) is a CSI plugin for Kubernetes that enables you to provision Block volumes From b1999b051b7d8d0a461772f8d05a5e69ba0f6b55 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jakob=20M=C3=B6ller?= Date: Thu, 20 Jun 2024 15:57:51 +0200 Subject: [PATCH 44/53] chore: Subprovisioner -> KubeSAN --- ...besan-csi-driver-integration-into-lvms.md} | 104 +++++++++--------- 1 file changed, 52 insertions(+), 52 deletions(-) rename enhancements/local-storage/{subprovisioner-csi-driver-integration-into-lvms.md => kubesan-csi-driver-integration-into-lvms.md} (79%) diff --git a/enhancements/local-storage/subprovisioner-csi-driver-integration-into-lvms.md b/enhancements/local-storage/kubesan-csi-driver-integration-into-lvms.md similarity index 79% rename from enhancements/local-storage/subprovisioner-csi-driver-integration-into-lvms.md rename to enhancements/local-storage/kubesan-csi-driver-integration-into-lvms.md index 3ec720982e..1ac9fbd9fb 100644 --- a/enhancements/local-storage/subprovisioner-csi-driver-integration-into-lvms.md +++ b/enhancements/local-storage/kubesan-csi-driver-integration-into-lvms.md @@ -1,5 +1,5 @@ --- -title: subprovisioner-csi-driver-integration-into-lvms +title: kubesan-csi-driver-integration-into-lvms authors: - "@jakobmoellerdev" reviewers: @@ -20,15 +20,15 @@ tracking-link: - https://issues.redhat.com/browse/OCPEDGE-1147 --- -# Subprovisioner CSI Driver Integration into LVMS +# KubeSAN CSI Driver Integration into LVMS -[Subprovisioner](https://gitlab.com/subprovisioner/subprovisioner) +[KubeSAN](https://gitlab.com/kubesan/kubesan) is a CSI plugin for Kubernetes that enables you to provision Block volumes backed by a single, cluster-wide, shared block device (e.g., a single big LUN on a SAN). Logical Volume Manager Storage (LVMS) uses the TopoLVM CSI driver to dynamically provision local storage on the OpenShift Container Platform clusters. -This proposal is about integrating the Subprovisioner CSI driver into the LVMS operator to enable the provisioning of +This proposal is about integrating the KubeSAN CSI driver into the LVMS operator to enable the provisioning of shared block devices on the OpenShift Container Platform clusters. This enhancement will significantly increase scope of LVMS, but allows LVMS to gain the unique value proposition @@ -56,20 +56,20 @@ TopoLVM as our existing in-tree driver of LVMS is a great solution for local sto This is a significant limitation for virtualization workloads that require shared storage for their VMs that can dynamically be provisioned and deprovisioned on multiple nodes. Since OCP 4.15, LVMS support Multi-Node Deployments as a Topology, but without Replication or inbuilt resiliency behavior. -The Subprovisioner CSI driver is a great solution for shared storage provisioning, but it is currently not productized as part of OpenShift Container Platform. 
+The KubeSAN CSI driver is a great solution for shared storage provisioning, but it is currently not productized as part of OpenShift Container Platform. ### Goals -- Extension of the LVMCluster CRD to support a new deviceClass policy field that can be used to provision shared storage via Subprovisioner. -- Find a way to productize the Subprovisioner CSI driver as part of OpenShift Container Platform and increasing the Value Proposition of LVMS. +- Extension of the LVMCluster CRD to support a new deviceClass policy field that can be used to provision shared storage via KubeSAN. +- Find a way to productize the KubeSAN CSI driver as part of OpenShift Container Platform and increasing the Value Proposition of LVMS. - Allow provisioning of regular TopoLVM deviceClasses and shared storage deviceClasses side-by-side in the same cluster. ### Non-Goals -- Compatibility with other CSI drivers than Subprovisioner. -- Switching the default CSI driver for LVMS from TopoLVM to Subprovisioner or the other way around. +- Compatibility with other CSI drivers than KubeSAN. +- Switching the default CSI driver for LVMS from TopoLVM to KubeSAN or the other way around. - Implementing a new CSI driver from scratch. -- Integrating the Subprovisioner CSI driver into TopoLVM. +- Integrating the KubeSAN CSI driver into TopoLVM. ### User Stories @@ -81,7 +81,7 @@ As a Data Center OCP Admin: As a Developer: - I want to deploy applications that require shared storage across multiple pods and nodes, ensuring data consistency and high availability. - I want to use a single, unified API to provision and manage both local and shared storage classes, reducing complexity in my deployment scripts. -- I want to benefit from the unique capabilities of Subprovisioner for shared storage without having to manage separate storage solutions, both TopoLVM and Subprovisioner use lvm2 under the hood. +- I want to benefit from the unique capabilities of KubeSAN for shared storage without having to manage separate storage solutions, both TopoLVM and KubeSAN use lvm2 under the hood. As a Storage Administrator: - I want to easily configure and manage volume groups using the new deviceClass policy field in the LVMCluster CRD, ensuring that my storage setup is consistent and efficient. @@ -89,45 +89,45 @@ As a Storage Administrator: - I want to leverage existing expensive SAN infrastructure to provide shared storage, maximizing the return on investment for our hardware. As an IT Operations Engineer: -- I want to ensure that upgrades and downgrades of the LVMS operator and Subprovisioner CSI driver are seamless and do not cause downtime for my existing workloads. -- I want to follow clear guidelines and best practices for managing version skew between LVMS and Subprovisioner, ensuring compatibility and stability. +- I want to ensure that upgrades and downgrades of the LVMS operator and KubeSAN CSI driver are seamless and do not cause downtime for my existing workloads. +- I want to follow clear guidelines and best practices for managing version skew between LVMS and KubeSAN, ensuring compatibility and stability. - I want detailed documentation and troubleshooting guides to help resolve any issues that arise during the deployment and operation of shared storage. As a Quality Assurance Engineer: -- I want to execute comprehensive integration and end-to-end tests that validate the functionality of shared storage provisioning with Subprovisioner. 
+- I want to execute comprehensive integration and end-to-end tests that validate the functionality of shared storage provisioning with KubeSAN. - I want to conduct performance and stress tests to ensure that the solution can handle high load and failure conditions without degradation of service. - I want to gather and analyze feedback from early adopters to improve the stability and performance of the integrated solution before general availability. As a Product Manager: -- I want to offer a unique value proposition with LVMS by integrating Subprovisioner, enabling OCP customers to use shared block storage seamlessly. +- I want to offer a unique value proposition with LVMS by integrating KubeSAN, enabling OCP customers to use shared block storage seamlessly. - I want to ensure that the solution meets the needs of our enterprise customers, providing high availability, resiliency, and performance for their critical workloads. - I want to manage the roadmap and release cycles effectively, ensuring that each phase of the project is delivered on time and meets quality standards. ### Risks and Mitigations - There is a risk of increased maintenance burden by integrating a new CSI driver into LVMS without gaining traction - - tested separately in the Subprovisioner project as pure CSI Driver similar to TopoLVM and within LVMS with help of QE + - tested separately in the KubeSAN project as pure CSI Driver similar to TopoLVM and within LVMS with help of QE - we will not GA the solution until we have a clear understanding of the maintenance burden. The solution will stay in TechPreview until then. -- There is a risk that Subprovisioner is so different from TopoLVM that behavior changes can not be accomodated in the current CRD +- There is a risk that KubeSAN is so different from TopoLVM that behavior changes can not be accomodated in the current CRD - we will scrap this effort for integration and look for alternative solutions if the integration is not possible with reasonable effort. -- There is a risk that Subprovisioner will break easily as its a really young project - - we will not GA the solution until we have a clear understanding of the stability of the Subprovisioner project. The solution will stay in TechPreview until then. +- There is a risk that KubeSAN will break easily as its a really young project + - we will not GA the solution until we have a clear understanding of the stability of the KubeSAN project. The solution will stay in TechPreview until then. ## Proposal -The proposal is to extend the LVMCluster CRD with a new deviceClass policy field that can be used to provision shared storage via Subprovisioner. -We will use this field as a hook in lvm-operator, our orchestrating operator, to provision shared storage via Subprovisioner instead of TopoLVM. -Whenever LVMCluster discovers a new deviceClass with the Subprovisioner associated policy, it will create a new CSI driver deployment for Subprovisioner and configure it to use the shared storage deviceClass. -As such, it will handover the provisioning of shared storage to the Subprovisioner CSI driver. Also internal engineering such as sanlock orchestration will be managed by the driver. +The proposal is to extend the LVMCluster CRD with a new deviceClass policy field that can be used to provision shared storage via KubeSAN. +We will use this field as a hook in lvm-operator, our orchestrating operator, to provision shared storage via KubeSAN instead of TopoLVM. 
+Whenever LVMCluster discovers a new deviceClass with the KubeSAN associated policy, it will create a new CSI driver deployment for KubeSAN and configure it to use the shared storage deviceClass. +As such, it will handover the provisioning of shared storage to the KubeSAN CSI driver. Also internal engineering such as sanlock orchestration will be managed by the driver. ### Workflow Description -#### Subprovisioner instantiation via LVMCluster +#### KubeSAN instantiation via LVMCluster -1. The user is informed of the intended use case of Subprovisioner, and decides to use it for its multi-node capabilities before provisioning Storage +1. The user is informed of the intended use case of KubeSAN, and decides to use it for its multi-node capabilities before provisioning Storage 2. The user configures LVMCluster with non-default values for the Volume Group and the deviceClass policy field -3. The lvm-operator detects the new deviceClass policy field and creates a new CSI driver deployment for Subprovisioner. -4. The Subprovisioner CSI driver is configured to use the shared storage deviceClass, initializes the global lock space, and starts provisioning shared storage. -5. The user can now provision shared storage via Subprovisioner on the OpenShift Container Platform cluster. +3. The lvm-operator detects the new deviceClass policy field and creates a new CSI driver deployment for KubeSAN. +4. The KubeSAN CSI driver is configured to use the shared storage deviceClass, initializes the global lock space, and starts provisioning shared storage. +5. The user can now provision shared storage via KubeSAN on the OpenShift Container Platform cluster. 6. The user can also provision regular TopoLVM deviceClasses side-by-side with shared storage deviceClasses in the same cluster. Then, TopoLVM gets provisioned side-by-side. ### API Extensions @@ -255,11 +255,11 @@ API scheme for `LVMCluster` CR: #### Design Details on Volume Group Orchestration and Management via vgmanager -The `vgmanager` component will be responsible for managing volume groups (VGs) and coordinating the orchestration between TopoLVM and Subprovisioner CSI drivers. This includes: +The `vgmanager` component will be responsible for managing volume groups (VGs) and coordinating the orchestration between TopoLVM and KubeSAN CSI drivers. This includes: 1. **Detection and Configuration**: - Detecting devices that match the `DeviceSelector` criteria specified in the `LVMCluster` CR. - - Configuring volume groups based on the `DeviceAccessPolicy` (either `shared` for Subprovisioner or `local` for TopoLVM). + - Configuring volume groups based on the `DeviceAccessPolicy` (either `shared` for KubeSAN or `local` for TopoLVM). - Ensuring that shared volume groups are correctly initialized and managed across multiple nodes. 2. **Dynamic Provisioning**: @@ -294,9 +294,9 @@ The status reporting will include: - Include status of node-local VGs and any issues detected. 3. **CSI Driver Status**: - - Provide status updates on the CSI drivers (both TopoLVM and Subprovisioner) deployed in the cluster. + - Provide status updates on the CSI drivers (both TopoLVM and KubeSAN) deployed in the cluster. - Include information on driver health, performance metrics, and any incidents. - - Ideally, subprovisioner implements Volume Health Monitoring CSI calls. + - Ideally, kubesan implements Volume Health Monitoring CSI calls. 4. **Event Logging**: - Maintain detailed logs of all events related to VG management and CSI driver operations. 
@@ -307,12 +307,12 @@ The status reporting will include: - Increased complexity in managing both node-local and shared storage. - Potential for increased maintenance burden with the integration of a new CSI driver. -- Risks associated with the stability and maturity of the Subprovisioner project. +- Risks associated with the stability and maturity of the KubeSAN project. - Complex testing matrix and shared volume group use cases can be hard to debug / troubleshoot. ### Topology Considerations -* The primary use case for Subprovisioner is to enable shared storage across multiple nodes. This capability is critical for environments where high availability and data redundancy are required. +* The primary use case for KubeSAN is to enable shared storage across multiple nodes. This capability is critical for environments where high availability and data redundancy are required. * Ensure that all nodes in the cluster can access the shared storage devices consistently and reliably. This may involve configuring network settings and storage paths appropriately. #### Hypershift / Hosted Control Planes @@ -325,14 +325,14 @@ LVMS can be installed on standalone clusters, but the shared storage provisionin #### Single-node Deployments or MicroShift -* While LVMS can be installed on single-node deployments and MicroShift, the shared storage provisioning feature enabled by Subprovisioner is designed for multi-node environments. Single-node setups can still use local storage provisioning through TopoLVM. -* MicroShift deployments will include the Subprovisioner binaries but will not use shared storage provisioning due to the single-node nature of MicroShift. +* While LVMS can be installed on single-node deployments and MicroShift, the shared storage provisioning feature enabled by KubeSAN is designed for multi-node environments. Single-node setups can still use local storage provisioning through TopoLVM. +* MicroShift deployments will include the KubeSAN binaries but will not use shared storage provisioning due to the single-node nature of MicroShift. ## Test Plan - **Integration Tests**: - - Update existing LVMS integration tests to include scenarios for shared storage provisioning with Subprovisioner. - - Ensure that device detection and VG management are functioning correctly with both TopoLVM and Subprovisioner. + - Update existing LVMS integration tests to include scenarios for shared storage provisioning with KubeSAN. + - Ensure that device detection and VG management are functioning correctly with both TopoLVM and KubeSAN. - QE will be extending the existing test suites to include shared storage provisioning and synchronization tests. - **E2E Tests**: @@ -379,30 +379,30 @@ N/A ## Upgrade / Downgrade Strategy - **Upgrade**: - - Ensure that upgrades are seamless with no downtime for existing workloads. Migrating to a subprovisioner enabled version is a no-break operation - - Test upgrade paths thoroughly to ensure compatibility and data integrity. The subprovisioner to topolvm (or vice versa) switch should be excluded and forbidden explicitly. + - Ensure that upgrades are seamless with no downtime for existing workloads. Migrating to a kubesan enabled version is a no-break operation + - Test upgrade paths thoroughly to ensure compatibility and data integrity. The kubesan to topolvm (or vice versa) switch should be excluded and forbidden explicitly. - The "default" deviceClass cannot be changed as well and changeing from shared to local or vice versa is not supported without resetting the LVMCluster. 
- New deviceClasses with the shared policy should be able to be added to existing LVMClusters without affecting existing deviceClasses. - **Downgrade**: - - Allow safe downgrades by maintaining backward compatibility. Downgrading from a subprovisioner enabled version to a purely topolvm enabled version should be a no-break operation for the topolvm part. For the subprovisioner part, the operator should ensure that the shared VGs can be cleaned up manually + - Allow safe downgrades by maintaining backward compatibility. Downgrading from a kubesan enabled version to a purely topolvm enabled version should be a no-break operation for the topolvm part. For the kubesan part, the operator should ensure that the shared VGs can be cleaned up manually - Provide rollback mechanisms and detailed instructions to revert to previous versions. Ensure that downgrades do not result in data loss or service interruptions. The operator should ensure that the shared VGs can be cleaned up manually. - Ensure that downgrades do not result in data loss or service interruptions. The operator should ensure that the shared VGs can be cleaned up without data loss on other device classes. ## Version Skew Strategy -- Ensure compatibility between different versions of LVMS and the integrated Subprovisioner CSI driver. +- Ensure compatibility between different versions of LVMS and the integrated KubeSAN CSI driver. - Implement version checks and compatibility checks in the `vgmanager` component. - - Ensure that the operator can handle version skew between the LVMS operator and the Subprovisioner CSI driver where required. + - Ensure that the operator can handle version skew between the LVMS operator and the KubeSAN CSI driver where required. - Provide clear guidelines on how to manage version skew and perform upgrades in a controlled manner. - - One version of LVMS should be able to handle one version of the Subprovisioner CSI driver. + - One version of LVMS should be able to handle one version of the KubeSAN CSI driver. - Document supported version combinations and any known issues with version mismatches. - Provide clear guidelines on how to manage version skew and perform upgrades in a controlled manner. ## Operational Aspects of API Extensions -The integration of the Subprovisioner CSI driver into LVMS introduces several new API extensions, primarily within the LVMCluster CRD. These extensions include new fields for the deviceClass policy, specifically designed to support shared storage provisioning. The operational aspects of these API extensions are as follows: +The integration of the KubeSAN CSI driver into LVMS introduces several new API extensions, primarily within the LVMCluster CRD. These extensions include new fields for the deviceClass policy, specifically designed to support shared storage provisioning. The operational aspects of these API extensions are as follows: * Configuration and Management: * Administrators can configure shared storage by setting the DeviceAccessPolicy field in the DeviceClass section of the LVMCluster CRD to shared. @@ -413,7 +413,7 @@ The integration of the Subprovisioner CSI driver into LVMS introduces several ne * The vgmanager component will validate device paths and ensure that they are consistent across all nodes in the cluster. 
* Dynamic Provisioning: - * When a shared device class is configured, the operator will dynamically create and manage the corresponding Subprovisioner CSI driver deployment, ensuring that the shared storage is properly initialized and synchronized across nodes. + * When a shared device class is configured, the operator will dynamically create and manage the corresponding KubeSAN CSI driver deployment, ensuring that the shared storage is properly initialized and synchronized across nodes. Monitoring and Reporting: * The status of the shared storage, including health and capacity metrics, will be reported through the LVMCluster CRD status fields. @@ -421,7 +421,7 @@ Monitoring and Reporting: ## Support Procedures -Regular product support for LVMS will continue to be established through the LVMS team. In addition, Subprovisioner will receive upstream issues through consumption in the LVMS project and will serve as a repackaging customer for the Subprovisioner project. +Regular product support for LVMS will continue to be established through the LVMS team. In addition, KubeSAN will receive upstream issues through consumption in the LVMS project and will serve as a repackaging customer for the KubeSAN project. ## Security Considerations @@ -430,13 +430,13 @@ Regular product support for LVMS will continue to be established through the LVM - Implement RBAC policies to restrict access to VGs and CSI drivers based on user roles and permissions. - Ensure that shared VGs are only accessible by nodes that are authorized to access them. - **CVE Scanning**: - - Ensure that the Subprovisioner CSI driver is regularly scanned for vulnerabilities and that any identified issues are addressed promptly. - - Implement a process for CVE scanning and remediation for the Subprovisioner CSI driver. - - Fixes for CVEs should be handled in a dedicated midstream openshift/subprovisioner for critical CVEs when Red Hat decides to no longer solely own the project. Until then, the fixes will be handled by the Red Hat team and a midstream is optional. + - Ensure that the KubeSAN CSI driver is regularly scanned for vulnerabilities and that any identified issues are addressed promptly. + - Implement a process for CVE scanning and remediation for the KubeSAN CSI driver. + - Fixes for CVEs should be handled in a dedicated midstream openshift/kubesan for critical CVEs when Red Hat decides to no longer solely own the project. Until then, the fixes will be handled by the Red Hat team and a midstream is optional. ## Implementation Milestones -- **Phase 1**: Initial design and prototyping. Basic integration with Subprovisioner and updates to the LVMCluster CR. +- **Phase 1**: Initial design and prototyping. Basic integration with KubeSAN and updates to the LVMCluster CR. - **Phase 2**: Development of `vgmanager` functionalities for VG orchestration and management. Integration and E2E testing. - **Phase 3**: Performance testing, bug fixes, and documentation. Preparing for Alpha release. - **Phase 4**: Developer Preview release with comprehensive manual and QE testing. Gathering user feedback and making improvements. @@ -449,4 +449,4 @@ Regular product support for LVMS will continue to be established through the LVM - Continue using TopoLVM exclusively for local storage provisioning. - Evaluate and integrate other CSI drivers that support shared storage. - Develop a custom CSI driver to meet the specific needs of LVMS and OpenShift. -- Move Subprovisioner to CNV and package it in a separate product. 
+- Move KubeSAN to CNV and package it in a separate product. From b169404a3fe2cb565a3b7d4e15d7acb659eed429 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jakob=20M=C3=B6ller?= Date: Thu, 11 Jul 2024 08:32:46 +0200 Subject: [PATCH 45/53] chore: KubeSAN Cleanup Routines and small nits --- ...ubesan-csi-driver-integration-into-lvms.md | 29 +++++++++++++++++-- 1 file changed, 27 insertions(+), 2 deletions(-) diff --git a/enhancements/local-storage/kubesan-csi-driver-integration-into-lvms.md b/enhancements/local-storage/kubesan-csi-driver-integration-into-lvms.md index 1ac9fbd9fb..02b99ea283 100644 --- a/enhancements/local-storage/kubesan-csi-driver-integration-into-lvms.md +++ b/enhancements/local-storage/kubesan-csi-driver-integration-into-lvms.md @@ -111,6 +111,7 @@ As a Product Manager: - we will scrap this effort for integration and look for alternative solutions if the integration is not possible with reasonable effort. - There is a risk that KubeSAN will break easily as its a really young project - we will not GA the solution until we have a clear understanding of the stability of the KubeSAN project. The solution will stay in TechPreview until then. + - we will use community health as a gatekeeper for GA. We will need solid upstream support for the project and that is only possible through a community. ## Proposal @@ -303,6 +304,28 @@ The status reporting will include: - Ensure that any significant events (such as failovers, recoveries, and maintenance actions) are logged and reported. +#### Design Details for Rollback Routines + +In case of a failure during the provisioning of shared storage, the operator should be able to roll back the changes and clean up the shared VGs without data loss. At the same time, existing PVCs and PVs should be unaffected by the rollback as much as possible. + +Luckily, due to the CSI design, the PVCs and PVs are not directly affected by the driver for operation, meaning that once the mount procedure has been completed, +no further action is required from the driver to keep the PVCs and PVs operational. This means that the rollback can be done without affecting the PVCs and PVs. + +However, if for any reason the shared VGs need to be cleaned up, the operator should be able to do so without affecting the existing PVCs and PVs. This can be achieved by the following steps: + +1. **Rollback Procedure**: + - The operator can remove the shared VGs from the LVMCluster CR to ensure that they are not recreated. This is fundamentally the same as removing a deviceClass + from the CR currently via TopoLVM, however the daemonset running on the node may bail out due to potential data loss incurred via force removal. + In this case, the manual rollback can be performed. + - The operator can then delete the KubeSAN CSI driver deployment to prevent any further provisioning of shared storage. +2. **Manual Rollback Procedure with a broken shared volume group**: + - The node administrator can manually delete the shared VGs using the `vgremove` command on each node. In case of lock contention due to the failure state, + it is possible to forcefully circumvent the lock by using the `--force` and ` --ignorelockingfailure` flag on the `vgremove` command. + This allows nodes that no longer achieve quorum through sanlock to recover. It is possible to do this for every node and to restart a procedure from scratch. + - The operator can then delete the shared VGs from the LVMCluster CR to ensure that they are not recreated. 
+ - The operator can then delete the KubeSAN CSI driver deployment to prevent any further provisioning of shared storage. + + ### Drawbacks - Increased complexity in managing both node-local and shared storage. @@ -371,6 +394,8 @@ LVMS can be installed on standalone clusters, but the shared storage provisionin - Positive feedback from initial users. - Full documentation, including troubleshooting guides and best practices. - Full LVMS Support Lifecycle + - Healthy community around the KubeSAN project, upstream contributions are seamless similar to TopoLVM + - We have a contribution model in place for the KubeSAN project, and the project is in a healthy state with regular releases and active maintainers. ### Removing a deprecated feature @@ -385,9 +410,9 @@ N/A - New deviceClasses with the shared policy should be able to be added to existing LVMClusters without affecting existing deviceClasses. - **Downgrade**: - - Allow safe downgrades by maintaining backward compatibility. Downgrading from a kubesan enabled version to a purely topolvm enabled version should be a no-break operation for the topolvm part. For the kubesan part, the operator should ensure that the shared VGs can be cleaned up manually + - Allow safe downgrades by maintaining backward compatibility. Downgrading from a kubesan enabled version to a purely topolvm enabled version should be a no-break operation for the topolvm part. For the kubesan part, the operator should ensure that the shared VGs can be cleaned up manually in case of failure. - Provide rollback mechanisms and detailed instructions to revert to previous versions. Ensure that downgrades do not result in data loss or service interruptions. - The operator should ensure that the shared VGs can be cleaned up manually. + The operator should ensure that the shared VGs can be cleaned up manually. (more details on rollback routines are in the design details section) - Ensure that downgrades do not result in data loss or service interruptions. The operator should ensure that the shared VGs can be cleaned up without data loss on other device classes. ## Version Skew Strategy From ce40073a18bdc093501b145597e445013af61abf Mon Sep 17 00:00:00 2001 From: Jon Cope Date: Thu, 11 Jul 2024 16:51:34 -0500 Subject: [PATCH 46/53] init --- .../disabling-storage-components.md | 218 ++++++++++++++++++ 1 file changed, 218 insertions(+) create mode 100644 enhancements/microshift/disabling-storage-components.md diff --git a/enhancements/microshift/disabling-storage-components.md b/enhancements/microshift/disabling-storage-components.md new file mode 100644 index 0000000000..f2dbc044ee --- /dev/null +++ b/enhancements/microshift/disabling-storage-components.md @@ -0,0 +1,218 @@ +--- +title: disabling-storage-components +authors: + - "@copejon" +reviewers: + - "@pacevedom: MicroShift team-lead" + - "@jerpeter1, Edge Enablement Staff Engineer" + - "@jakobmoellerdev Edge Enablement LVMS Engineer" + - "@suleymanakbas91, Edge Enablement LVMS Engineer" +approvers: + - "@jerpeter1, Edge Enablement Staff Engineer" +api-approvers: + - "None" +creation-date: 2024-07-11 +last-updated: 2024-07-11 +tracking-link: + - https://issues.redhat.com/browse/OCPSTRAT-856 +see-also: + - "enhancements/microshift/microshift-default-csi-plugin.md" +--- + +# Disabling Storage Components + +## Summary + +MicroShift is a small form-factor, single-node OpenShift targeting IoT and Edge Computing use cases characterized by +tight resource constraints, unpredictable network connectivity, and single-tenant workloads. 
See
+[kubernetes-for-device-edge.md](./kubernetes-for-device-edge.md) for more detail.
+
+Out of the box, MicroShift includes the LVMS CSI driver and CSI snapshotter. The LVM Operator is not included with the
+platform, in adherence with the project's [guiding principles](./kubernetes-for-device-edge.md#goals). See the
+[MicroShift Default CSI Plugin](microshift-default-csi-plugin.md) proposal for a more in-depth explanation of the
+storage provider and reasons for its integration into MicroShift. Configuration of the driver is exposed via a config
+file at /etc/microshift/lvmd.yaml. The manifests required to deploy and configure the driver and snapshotter are baked
+into the MicroShift binary during compilation and are deployed during runtime.
+
+LVMD is the node-side component of LVMS and is configured via a config file. If LVMS is started with no LVMD config, the
+process will crash, causing a CrashLoopBackOff and requiring user intervention. If the file `/etc/microshift/lvmd.yaml`
+does not exist, MicroShift will attempt to generate a config in memory and provide it to LVMD via config map. If it
+cannot create the config, it will skip instantiation of the LVMS and CSI components.
+
+Thus, it is already _technically_ possible to disable the storage components, but only via this esoteric functionality.
+Users should be provided a clean UX for this feature and not be forced to learn the inner workings of MicroShift.
+
+## Motivation
+
+There are a variety of reasons a user may not want to deploy LVMS. Firstly, MicroShift is designed to run on hosts with as
+little as 2Gb of memory. Users operating on such small form factors are going to be resource-conscious and seek to limit
+unnecessary consumption as much as possible. At idle, the LVMS and CSI components consume roughly 250Mb of memory, about
+8% of a 2Gb system. Providing a user-facing API to disable LVMS or CSI snapshotting is therefore a must for MicroShift.
+
+### User Stories
+
+- A user is operating MicroShift on a small form-factor machine with cluster workloads that require persistent storage
+  but do not require volume snapshotting.
+- A user is operating MicroShift on a small form-factor machine, is reducing resource consumption wherever possible, and
+  does not have a requirement for persistent storage.
+
+### Goals
+
+- Enhance the MicroShift config API to support selective deployment of LVMS and CSI Snapshotting.
+- Do not make backwards-incompatible changes to the MicroShift config API.
+- Do not endanger or make inaccessible persistent data on upgraded clusters.
+
+### Non-Goals
+
+- Provide a generalized framework to support potential future alternatives to LVMS.
+- Generically enable installing CSI components without enabling LVMS.
+
+## Proposal
+
+### Workflow Description
+
+**_Installation with CSI Driver and Snapshotting (Default)_**
+
+1. User installs MicroShift. A MicroShift and LVMD config are not provided.
+2. MicroShift starts, detects no config, and falls back to the default MicroShift configuration (LVMS and CSI on by default).
+3. MicroShift checks to determine if the host is compatible with LVMS. If it is not, an error is logged,
+LVMS and CSI installation is skipped, and installation continues. (This is the current behavior in MicroShift.)
+4. MicroShift attempts to dynamically generate the LVMD config. If it cannot, an error is logged, LVMS and CSI
+installation is skipped, and installation continues.
+5. If all checks pass, LVMS and CSI are deployed.
+
+**_Installation without CSI Driver and Snapshotting_**
+
+1. User installs MicroShift.
User installs MicroShift. +2. User specifies a MicroShift config sets fields to disable LVMS and CSI Snapshots. +3. MicroShift starts and detects the provided config. +4. LVMS and CSI snapshot components are not deployed. + +**_Post-Start Installation, LVMS only_** + +1. User has already installed and started MicroShift service. +2. User later determines there is a requirement for persistent storage, but not snapshotting. +3. User edits the MicroShift config, enabling LVMS only. +4. User restarts MicroShift service. +5. MicroShift starts and detects the provided config. +6. MicroShift performs LVMS startup checks. +7. If all checks pass, LVMS is deployed. Else if checks fail, an error is logged, LVMS deployment is skipped, +and startup continues. + +**_Complete Uninstallation, with Data removal_** + +> NOTE: Installation is not reversible. Deleting LVMS or the CSI snapshotter while there are still storage volumes risks +orphaning volumes. It is the user's responsibility to manually destroy data volumes and/or snapshots before +uninstalling. + +1. User has already deployed a cluster with LVMS and CSI Snapshotting installed. +2. User has deployed cluster workloads with persistent storage. +3. User decides that LVMS and snapshotting are no longer needed. +4. User takes actions to back up, wipe, or otherwise ensure data will not be irretrievable. +5. User stops workloads with mounted storage volumes. +6. User deletes VolumeSnapshots and waits for deletion of VolumeSnapshotContent objects to verify. The deletion process +cannot happen after the CSI Snapshotter is deleted. +7. User delete PersistentVolumeClaims and waits for deletion of PersistentVolumes. The deletion process +cannot happen after LVMS is deleted. +8. User deletes the relevant LVMS and Snapshotter cluster API resources: + ```shell + $ oc delete -n kube-system deployment.apps/csi-snapshot-controller deployment.apps/csi-snapshot-webhook + $ oc delete -n openshift-storage daemonset.apps/topolvm-node + $ oc delete -n openshift-storage deployment.apps/topolvm-controller + $ oc delete -n openshift-storage configmaps/lvmd + $ oc delete storageclasses.storage.k8s.io/topolvm-provisioner + ``` +9. Component deletion completes. On restart, MicroShift will not deploy LVMS or the snapshotter. + +### API Extensions + +- MicroShift config will be extended from the root with a field called `storage`, which will have 2 subfields. + - `.storage.driver`: **ENUM**, type is preferable to leave room for future supported drivers + - One of ["None|none", "lvms"] + - Default, nil, empty: ["lvms"] + - `.storage.csi-snapshot:` **BOOL** + +```yaml +storage: + driver: ENUM + csi-snapshot: BOOL +``` + +### Topology Considerations + +#### Single-node Deployments or MicroShift + +The changes proposed here only affect MicroShift. + +### Implementation Details/Notes/Constraints + +- We should not remove MicroShift's logic for dynamically generating LVMD default values in the absence of a + user-provided config. Thus, these checks will be performed only if LVMS is enabled. + +### Risks and Mitigations + +- N/A + +### Drawbacks + +- This design adds a field to a user facing API specific to a non-MicroShift component. If MicroShift were to shift towards +being agnostic of the storage provider, this field would have to continue to be supported for existing users and +deprecated for a reasonable time before being removed. + +## Test Plan + +Test scenarios will be written to validate the combinations of states the additional fields correlate to their desired +outcomes. 
Unit tests will be written as necessary.
+
+## Graduation Criteria
+
+### GA
+
+- Ability to utilize the enhancement end to end
+- End user documentation, relative API stability
+- Sufficient test coverage
+- Sufficient time for feedback
+- Available by default
+- User facing documentation created in [openshift-docs](https://github.com/openshift/openshift-docs/)
+
+## Upgrade / Downgrade Strategy
+
+MicroShift gracefully handles config fields it does not recognize. Downgrading from a system with these config fields
+will not break MicroShift startup.
+
+For clusters that disabled the storage components, a downgrade would result in these components being deployed. This
+will not break the user's cluster but will result in additional resource overhead. This situation can't be avoided, as
+even if the user scaled down the storage component replica sets, restarting or rebooting MicroShift would cause the
+resources to be re-applied, thus scaling the sets back up.
+
+LVMS installation is not automatically reversible, so upgrades and downgrades do not present a danger to the user's data.
+Upgrading a cluster with LVMS to a cluster with LVMS set to disabled will not cause the storage components to be deleted.
+
+## Version Skew Strategy
+
+LVMS is versioned separately from MicroShift. However, MicroShift pins the version of LVMS during rebasing. This creates
+a loose coupling between the two projects, but does not pose a risk of incompatibility. LVMS runs in cluster workloads
+and thus does not directly interact with MicroShift internals.
+
+LVMS does not interact with the MicroShift config API, so it will not be affected by this change.
+
+## Operational Aspects of API Extensions
+
+With only LVMS deployed, the cluster overhead is increased by roughly 150Mb and 0.1% CPU. Deployment of the
+csi-snapshotter consumes an additional 50Mb and >0.1% CPU. These are taken at system idle and represent the lowest
+likely resource consumption.
+
+## Support Procedures
+
+- N/A
+
+## Alternatives
+
+Do Nothing. Continue to use the MicroShift or LVMD config to determine deployment of LVMS and/or snapshotting. Directing
+the user to leverage low-level processes to achieve this goal is an anti-pattern and would provide a poor UX.
+
+Install LVMS and CSI via rpm, as is done with other cluster components. MicroShift is strongly opinionated towards LVMS,
+with significant logic for managing some of the LVMS life cycle and default configs. If LVMS manifests and container
+images were installed via an RPM, the logic for handling dynamic defaults would need to be implemented elsewhere, either as a %post-install stage or as a separate system service. Neither approach is suitable for the goal of this EP, and
+both would explode the complexity of managing and testing LVMS and CSI. Because of this, it should be considered in a
+separate EP, if at all.
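For illustration, a minimal sketch of the user-facing configuration proposed in the API Extensions section of this patch. The field names (`.storage.driver`, `.storage.csi-snapshot`) are the ones defined here; the concrete values and comments are assumptions for the example rather than normative defaults, and later patches in this series refine the field shapes.

```yaml
# /etc/microshift/config.yaml (illustrative sketch, not normative)
# Omitting the storage stanza entirely keeps today's default behavior:
# LVMS and the CSI snapshot components are deployed.
storage:
  driver: none        # one of "none"/"None"/"lvms"; empty or unset defaults to "lvms"
  csi-snapshot: false # assumed semantics: false skips deployment of the CSI snapshot components
```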
From d4b0366b83d2aace3ce02f072bb3f1abaa652b1c Mon Sep 17 00:00:00 2001 From: Jon Cope Date: Fri, 12 Jul 2024 12:49:06 -0500 Subject: [PATCH 47/53] linter fixes --- .../disabling-storage-components.md | 21 +++++++++++++++++++ 1 file changed, 21 insertions(+) diff --git a/enhancements/microshift/disabling-storage-components.md b/enhancements/microshift/disabling-storage-components.md index f2dbc044ee..29a8d66a19 100644 --- a/enhancements/microshift/disabling-storage-components.md +++ b/enhancements/microshift/disabling-storage-components.md @@ -140,6 +140,14 @@ storage: ### Topology Considerations +#### Hypershift / Hosted Control Planes + +- N/A + +#### Standalone Clusters + +- N/A + #### Single-node Deployments or MicroShift The changes proposed here only affect MicroShift. @@ -166,6 +174,14 @@ outcomes. Unit tests will be written as necessary. ## Graduation Criteria +### Dev Preview -> Tech Preview + +- N/A + +### Tech Preview -> GA + +- N/A + ### GA - Ability to utilize the enhancement end to end @@ -175,6 +191,11 @@ outcomes. Unit tests will be written as necessary. - Available by default - User facing documentation created in [openshift-docs](https://github.com/openshift/openshift-docs/) +### Removing a deprecated feature + +- Announce deprecation and support policy of the existing feature +- Deprecate the feature + ## Upgrade / Downgrade Strategy MicroShift gracefully handles config fields it does not recognize. Downgrading from a system with these config fields From 86b54f32e2d9bcfa1ad2483b92d604cd618aadc7 Mon Sep 17 00:00:00 2001 From: Jon Cope Date: Tue, 16 Jul 2024 16:26:43 -0500 Subject: [PATCH 48/53] added config step to uninstall hopefully clarified the csi field under api extensions --- .../disabling-storage-components.md | 27 +++++++++++-------- 1 file changed, 16 insertions(+), 11 deletions(-) diff --git a/enhancements/microshift/disabling-storage-components.md b/enhancements/microshift/disabling-storage-components.md index 29a8d66a19..ab34048313 100644 --- a/enhancements/microshift/disabling-storage-components.md +++ b/enhancements/microshift/disabling-storage-components.md @@ -71,7 +71,7 @@ does not have a requirement for persistent storage. ### Workflow Description -**_Installation with CSI Driver and Snapshotting (Default)_** +**_Installation with LVMS and Snapshotting (Default)_** 1. User installs MicroShift. A MicroShift and LVMD config are not provided. 2. MicroShift starts and detects no config and falls back to default MicroShift configuration (LVMS and CSI on by default). @@ -81,7 +81,7 @@ LVMS and CSI installation is skipped and install continues. (This it the current installation is skipped and install continues. 5. If all checks pass, LVMS and CSI are deployed. -**_Installation without CSI Driver and Snapshotting_** +**_Installation without LVMS and Snapshotting_** 1. User installs MicroShift. 2. User specifies a MicroShift config sets fields to disable LVMS and CSI Snapshots. @@ -108,13 +108,14 @@ uninstalling. 1. User has already deployed a cluster with LVMS and CSI Snapshotting installed. 2. User has deployed cluster workloads with persistent storage. 3. User decides that LVMS and snapshotting are no longer needed. -4. User takes actions to back up, wipe, or otherwise ensure data will not be irretrievable. -5. User stops workloads with mounted storage volumes. -6. User deletes VolumeSnapshots and waits for deletion of VolumeSnapshotContent objects to verify. The deletion process +4. User edits `/etc/microshift/config.yaml`, setting `.storage.driver: none`. 
+5. User takes actions to back up, wipe, or otherwise ensure data will not be irretrievable. +6. User stops workloads with mounted storage volumes. +7. User deletes VolumeSnapshots and waits for deletion of VolumeSnapshotContent objects to verify. The deletion process cannot happen after the CSI Snapshotter is deleted. -7. User delete PersistentVolumeClaims and waits for deletion of PersistentVolumes. The deletion process +8. User delete PersistentVolumeClaims and waits for deletion of PersistentVolumes. The deletion process cannot happen after LVMS is deleted. -8. User deletes the relevant LVMS and Snapshotter cluster API resources: +9. User deletes the relevant LVMS and Snapshotter cluster API resources: ```shell $ oc delete -n kube-system deployment.apps/csi-snapshot-controller deployment.apps/csi-snapshot-webhook $ oc delete -n openshift-storage daemonset.apps/topolvm-node @@ -122,20 +123,24 @@ cannot happen after LVMS is deleted. $ oc delete -n openshift-storage configmaps/lvmd $ oc delete storageclasses.storage.k8s.io/topolvm-provisioner ``` -9. Component deletion completes. On restart, MicroShift will not deploy LVMS or the snapshotter. +10. Component deletion completes. On restart, MicroShift will not deploy LVMS or the snapshotter. ### API Extensions - MicroShift config will be extended from the root with a field called `storage`, which will have 2 subfields. - `.storage.driver`: **ENUM**, type is preferable to leave room for future supported drivers - One of ["None|none", "lvms"] - - Default, nil, empty: ["lvms"] - - `.storage.csi-snapshot:` **BOOL** + - An empty value or null field defaults to deploying LVMS. This is because the driver is already. + - `.storage.with-csi-components:` **Array**. Empty array or null value defaults to not deploying additional CSI + components. + - Excepted values are: ['csi-snapshot-controller'] + - Even though it's the most common csi components, the csi-driver should not be part of this list because it is +required, and usually deployed, by the storage provider. ```yaml storage: driver: ENUM - csi-snapshot: BOOL + with-csi-components: ARRAY ``` ### Topology Considerations From fb1b2509e7db9ce7f55f6cb9b86486908d560546 Mon Sep 17 00:00:00 2001 From: Jon Cope Date: Mon, 22 Jul 2024 10:54:05 -0500 Subject: [PATCH 49/53] reflect mini-lvms stats in operational aspects --- enhancements/microshift/disabling-storage-components.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/enhancements/microshift/disabling-storage-components.md b/enhancements/microshift/disabling-storage-components.md index ab34048313..f05b9c4a0d 100644 --- a/enhancements/microshift/disabling-storage-components.md +++ b/enhancements/microshift/disabling-storage-components.md @@ -224,9 +224,8 @@ LVMS does not interact with the MicroShift config API, so it will not be affecte ## Operational Aspects of API Extensions -With only LVMS deployed, the cluster overhead is increased by roughly 150Mb and 0.1% CPU. Deployment of the -csi-snapshotter consumes an additional 50Mb and >0.1% CPU. These are taken at system idle and represent the lowest -likely resource consumption. +With LVMS and the CSI components deployed, the cluster overhead is increased by roughly 50Mb and 0.1% CPU. Though +comparatively small by OCP standards, 50Mb of memory is a non-trivial amount on far-edge devices. 
## Support Procedures From 40cc364ff6180379ee232667347052d9ebe31f24 Mon Sep 17 00:00:00 2001 From: Jon Cope Date: Mon, 22 Jul 2024 12:20:26 -0500 Subject: [PATCH 50/53] fixed truncated sentence clarified user uninstall actions --- enhancements/microshift/disabling-storage-components.md | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/enhancements/microshift/disabling-storage-components.md b/enhancements/microshift/disabling-storage-components.md index f05b9c4a0d..528d4773cb 100644 --- a/enhancements/microshift/disabling-storage-components.md +++ b/enhancements/microshift/disabling-storage-components.md @@ -109,7 +109,7 @@ uninstalling. 2. User has deployed cluster workloads with persistent storage. 3. User decides that LVMS and snapshotting are no longer needed. 4. User edits `/etc/microshift/config.yaml`, setting `.storage.driver: none`. -5. User takes actions to back up, wipe, or otherwise ensure data will not be irretrievable. +5. User takes steps to back up, and then erase, or otherwise ensure that data cannot be recovered. 6. User stops workloads with mounted storage volumes. 7. User deletes VolumeSnapshots and waits for deletion of VolumeSnapshotContent objects to verify. The deletion process cannot happen after the CSI Snapshotter is deleted. @@ -130,7 +130,11 @@ cannot happen after LVMS is deleted. - MicroShift config will be extended from the root with a field called `storage`, which will have 2 subfields. - `.storage.driver`: **ENUM**, type is preferable to leave room for future supported drivers - One of ["None|none", "lvms"] - - An empty value or null field defaults to deploying LVMS. This is because the driver is already. + - An empty value or null field defaults to deploying LVMS. This preserves MicroShift's default installation workflow + as it currently is, which in turn ensures API compatibility for clusters that are upgraded from 4.16. If the + null/empty values defaulted to disabling LVMS, the effect would not be seen on the cluster until an LVMS or CSI + component were deleted and the cluster is restarted. This creates a kind of hidden behavior that the user may not + be aware of or want. - `.storage.with-csi-components:` **Array**. Empty array or null value defaults to not deploying additional CSI components. - Excepted values are: ['csi-snapshot-controller'] From 6d59db8daba4bbebb150000fb15a1d5f8f832f13 Mon Sep 17 00:00:00 2001 From: Jon Cope Date: Mon, 22 Jul 2024 13:49:50 -0500 Subject: [PATCH 51/53] optional step for workloads that can be run without persistent storage clarified user story to call out persistent storage is applicable to cluster workloads only --- enhancements/microshift/disabling-storage-components.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/enhancements/microshift/disabling-storage-components.md b/enhancements/microshift/disabling-storage-components.md index 528d4773cb..844014b0c1 100644 --- a/enhancements/microshift/disabling-storage-components.md +++ b/enhancements/microshift/disabling-storage-components.md @@ -54,7 +54,7 @@ unnecessary consumption as much as possible. At idle, the LVMS and CSI component - A user is operating MicroShift on a small-form factor machine with cluster workloads that require persistent storage but do not require volume snapshotting. - A user is operating MicroShift on a small-form factor machine, is reducing resource consumption wherever possible, and -does not have a requirement for persistent storage. 
+does not have a requirement for persistent cluster-workload storage. ### Goals @@ -111,6 +111,9 @@ uninstalling. 4. User edits `/etc/microshift/config.yaml`, setting `.storage.driver: none`. 5. User takes steps to back up, and then erase, or otherwise ensure that data cannot be recovered. 6. User stops workloads with mounted storage volumes. + 1. (Optional) If workloads can be run without persistent storage and the user wishes to do so: User + recreates the workload manifest(s) and specifies another provider, an emptyDir or hostpath volume, or no + storage at all. 7. User deletes VolumeSnapshots and waits for deletion of VolumeSnapshotContent objects to verify. The deletion process cannot happen after the CSI Snapshotter is deleted. 8. User delete PersistentVolumeClaims and waits for deletion of PersistentVolumes. The deletion process From 75b067f018b082dc8373f043f911d3f879abf520 Mon Sep 17 00:00:00 2001 From: Jon Cope Date: Mon, 22 Jul 2024 15:35:43 -0500 Subject: [PATCH 52/53] corrected old metrics data point --- enhancements/microshift/disabling-storage-components.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/enhancements/microshift/disabling-storage-components.md b/enhancements/microshift/disabling-storage-components.md index 844014b0c1..103d538632 100644 --- a/enhancements/microshift/disabling-storage-components.md +++ b/enhancements/microshift/disabling-storage-components.md @@ -46,8 +46,8 @@ Users should be provided a clean UX for this feature and not be forced to learn There are variety of reasons a user may not want to deploy LVMS. Firstly, MicroShift is designed to run on hosts with as little as 2Gb of memory. Users operating on such small form factors are going to be resource conscious and seek to limit -unnecessary consumption as much as possible. At idle, the LVMS and CSI components consume roughly 250Mb of memory, about -8% of a 2Gb system. Providing a user-facing API to disable LVMS or CSI snapshotting is therefore a must for MicroShift. +unnecessary consumption as much as possible. At idle, the LVMS and CSI components consume roughly 50Mi of memory. +Providing a user-facing API to disable LVMS or CSI snapshotting is therefore a must for MicroShift. ### User Stories From 39a380177ea6c195c36ac92c8135b617f401bd52 Mon Sep 17 00:00:00 2001 From: Jon Cope Date: Tue, 23 Jul 2024 10:54:57 -0500 Subject: [PATCH 53/53] Update enhancements/microshift/disabling-storage-components.md Co-authored-by: Patryk Matuszak --- enhancements/microshift/disabling-storage-components.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/enhancements/microshift/disabling-storage-components.md b/enhancements/microshift/disabling-storage-components.md index 103d538632..6c431908e6 100644 --- a/enhancements/microshift/disabling-storage-components.md +++ b/enhancements/microshift/disabling-storage-components.md @@ -140,7 +140,7 @@ cannot happen after LVMS is deleted. be aware of or want. - `.storage.with-csi-components:` **Array**. Empty array or null value defaults to not deploying additional CSI components. - - Excepted values are: ['csi-snapshot-controller'] + - Expected values are: ['csi-snapshot-controller'] - Even though it's the most common csi components, the csi-driver should not be part of this list because it is required, and usually deployed, by the storage provider.
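Taken together, the series settles on the config surface described in the final API Extensions text: `.storage.driver` as an enum and `.storage.with-csi-components` as an array whose only expected value is `csi-snapshot-controller`. The sketches below are illustrative assumptions of how that surface could be exercised, not normative documentation; the inline comments restate the defaulting behavior described above.

```yaml
# /etc/microshift/config.yaml — deploy LVMS and the CSI snapshot controller
storage:
  driver: lvms                  # empty or unset also defaults to deploying LVMS
  with-csi-components:
    - csi-snapshot-controller   # only expected value at this time
```

```yaml
# /etc/microshift/config.yaml — deploy neither LVMS nor any additional CSI components
storage:
  driver: none                  # "None" is also accepted
  with-csi-components: []       # empty or unset: no additional CSI components are deployed
```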