diff --git a/keps/prod-readiness/sig-storage/4049.yaml b/keps/prod-readiness/sig-storage/4049.yaml
new file mode 100644
index 000000000000..cb368981a9c8
--- /dev/null
+++ b/keps/prod-readiness/sig-storage/4049.yaml
@@ -0,0 +1,3 @@
kep-number: 4049
alpha:
  approver: "@soltysh"

diff --git a/keps/sig-storage/4049-storage-capacity-scoring-of-nodes-for-dynamic-provisioning/README.md b/keps/sig-storage/4049-storage-capacity-scoring-of-nodes-for-dynamic-provisioning/README.md
new file mode 100644
index 000000000000..2c6077b6ec98
--- /dev/null
+++ b/keps/sig-storage/4049-storage-capacity-scoring-of-nodes-for-dynamic-provisioning/README.md
@@ -0,0 +1,1028 @@

# KEP-4049: Storage Capacity Scoring of Nodes for Dynamic Provisioning

- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
- [Design Details](#design-details)
  - [Modify stateData to be able to store StorageCapacity](#modify-statedata-to-be-able-to-store-storagecapacity)
  - [Get the capacity of nodes for dynamic provisioning](#get-the-capacity-of-nodes-for-dynamic-provisioning)
  - [Scoring of nodes for dynamic provisioning](#scoring-of-nodes-for-dynamic-provisioning)
  - [Conditions for scoring static or dynamic provisioning](#conditions-for-scoring-static-or-dynamic-provisioning)
  - [Test Plan](#test-plan)
    - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit tests](#unit-tests)
    - [Integration tests](#integration-tests)
    - [e2e tests](#e2e-tests)
  - [Graduation Criteria](#graduation-criteria)
  - [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
  - [Monitoring Requirements](#monitoring-requirements)
  - [Dependencies](#dependencies)
  - [Scalability](#scalability)
  - [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)

## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
  - [ ] e2e Tests for all Beta API Operations (endpoints)
  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
  - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
- [ ] (R) Graduation criteria is in place
  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] (R) Production readiness review completed
- [ ] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
[kubernetes/website]: https://git.k8s.io/website

## Summary

This KEP proposes a way to score nodes for dynamic provisioning of PVs, based on the storage capacity tracked by the VolumeBinding plugin.
By considering the amount of free space that nodes have, the scheduler can place pods on the node that has either the most or the least free space.

## Motivation

Storage capacity needs to be considered when:

- we want to resize a node-local PV after it is scheduled. In this case, we need to select a node with as much free space as possible.
- we want to pack volumes onto nodes with little free space so that the number of nodes in use is kept as small as possible.

### Goals

- To modify the scoring logic so that it takes dynamic provisioning into account, in addition to the current behavior that considers only static provisioning.

### Non-Goals

- To change how nodes are scored for static provisioning.

## Proposal

- Node scores based on available space can be taken into account when performing dynamic provisioning.

Cluster admins can configure the scoring logic using a new field in [`VolumeBindingArgs`](https://github.com/kubernetes/kubernetes/blob/1bb62cd27506f86d4b3f71a61a78e892aa2dbca1/pkg/scheduler/apis/config/types_pluginargs.go#L146-L169) of `kubescheduler.config.k8s.io`. The scoring logic is global for the whole cluster, and we propose two values:

- Prefer a node with the least capacity.
- Prefer a node with the maximum capacity.

Considering the common scenario of local storage, we want to leave room for volume expansion after a node is allocated. The default setting is therefore to prefer a node with the maximum capacity.
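
To make the proposed configuration surface concrete, here is a minimal sketch of what the addition to `VolumeBindingArgs` could look like. The type name `StorageCapacityScoringMode`, the constant names, and the field itself are hypothetical placeholders for discussion; the actual field name and values would be settled during API review.

```go
// Package config sketches a possible extension of the kube-scheduler
// VolumeBindingArgs API. Every name below that does not already exist in
// kubescheduler.config.k8s.io is hypothetical.
package config

// StorageCapacityScoringMode selects how nodes are scored for dynamically
// provisioned volumes (hypothetical type name).
type StorageCapacityScoringMode string

const (
	// Prefer the node with the least capacity that still fits the claim.
	LeastCapacity StorageCapacityScoringMode = "LeastCapacity"
	// Prefer the node with the maximum capacity (proposed default).
	MaximumCapacity StorageCapacityScoringMode = "MaximumCapacity"
)

// VolumeBindingArgs holds the arguments of the VolumeBinding plugin.
// Existing fields are elided; only the hypothetical addition is shown.
type VolumeBindingArgs struct {
	// StorageCapacityScoringMode is the cluster-wide scoring policy for
	// dynamic provisioning (hypothetical new field).
	StorageCapacityScoringMode StorageCapacityScoringMode
}
```

A cluster admin would set this single, cluster-wide value in the kube-scheduler configuration alongside the existing `VolumeBindingArgs` fields such as `Shape`.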

### User Stories (Optional)

#### Story 1

We want to leave room for volume expansion after node allocation. In this case, we want to allocate the node that has the largest amount of free space.

#### Story 2

We want to reduce the number of nodes as much as possible to reduce costs when using a cloud environment.
In this case, we want to allocate the node that has the least free space that is still sufficient for the request.

### Notes/Constraints/Caveats (Optional)

## Design Details

We modify the existing VolumeBinding plugin to score nodes for dynamic provisioning.

### Modify stateData to be able to store StorageCapacity

To score nodes for dynamic provisioning, we modify the `PodVolumes` struct stored in `stateData`.

The current `stateData` struct is as follows:

```go
type stateData struct {
	skip bool
	boundClaims []*v1.PersistentVolumeClaim
	claimsToBind []*v1.PersistentVolumeClaim
	allBound bool
	podVolumesByNode map[string]*PodVolumes
	sync.Mutex
}
```

By making the following change to `PodVolumes`, a `CSIStorageCapacity` can be stored for each dynamically provisioned claim.

```diff
+ type DynamicProvision struct {
+ 	PVC      *v1.PersistentVolumeClaim
+ 	Capacity *storagev1.CSIStorageCapacity
+ }

 type PodVolumes struct {
 	StaticBindings []*BindingInfo
- 	DynamicProvisions []*v1.PersistentVolumeClaim
+ 	DynamicProvisions []*DynamicProvision
 }
```

### Get the capacity of nodes for dynamic provisioning

Add a `*storagev1.CSIStorageCapacity` to the return values of the `volumeBinder.hasEnoughCapacity` method. For dynamic provisioning, the returned object is what is later stored in the `DynamicProvision.Capacity` field.

```diff
- func (b *volumeBinder) hasEnoughCapacity(provisioner string, claim *v1.PersistentVolumeClaim, storageClass *storagev1.StorageClass, node *v1.Node) (bool, error) {
+ func (b *volumeBinder) hasEnoughCapacity(provisioner string, claim *v1.PersistentVolumeClaim, storageClass *storagev1.StorageClass, node *v1.Node) (bool, *storagev1.CSIStorageCapacity, error) {
 	quantity, ok := claim.Spec.Resources.Requests[v1.ResourceStorage]
 	if !ok {
 		// No capacity to check for.
- 		return true, nil
+ 		return true, nil, nil
 	}

 	// Only enabled for CSI drivers which opt into it.
 	driver, err := b.csiDriverLister.Get(provisioner)
 	if err != nil {
 		if apierrors.IsNotFound(err) {
 			// Either the provisioner is not a CSI driver or the driver does not
 			// opt into storage capacity scheduling. Either way, skip
 			// capacity checking.
- 			return true, nil
+ 			return true, nil, nil
 		}
- 		return false, err
+ 		return false, nil, err
 	}
 	if driver.Spec.StorageCapacity == nil || !*driver.Spec.StorageCapacity {
- 		return true, nil
+ 		return true, nil, nil
 	}

 	// Look for a matching CSIStorageCapacity object(s).
 	// TODO (for beta): benchmark this and potentially introduce some kind of lookup structure (https://github.com/kubernetes/enhancements/issues/1698#issuecomment-654356718).
 	capacities, err := b.csiStorageCapacityLister.List(labels.Everything())
 	if err != nil {
- 		return false, err
+ 		return false, nil, err
 	}

 	sizeInBytes := quantity.Value()
 	for _, capacity := range capacities {
 		if capacity.StorageClassName == storageClass.Name &&
 			capacitySufficient(capacity, sizeInBytes) &&
 			b.nodeHasAccess(node, capacity) {
 			// Enough capacity found.
- 			return true, nil
+ 			return true, capacity, nil
 		}
 	}

 	// TODO (?): this doesn't give any information about which pools were considered and why
 	// they had to be rejected. Log that above? But that might be a lot of log output...
 	klog.V(4).InfoS("Node has no accessible CSIStorageCapacity with enough capacity for PVC",
 		"node", klog.KObj(node), "PVC", klog.KObj(claim), "size", sizeInBytes, "storageClass", klog.KObj(storageClass))
- 	return false, nil
+ 	return false, nil, nil
 }
```
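
The diffs above show the new struct and the new return value separately. The following self-contained sketch illustrates how the two could fit together during the scheduler's volume checks, using simplified stand-in types; the `findCapacity` helper and all field shapes here are illustrative, not the real scheduler code.

```go
package main

import "fmt"

// Simplified stand-ins for the Kubernetes API types used above.
type PersistentVolumeClaim struct {
	Name         string
	RequestedGiB int64
}

type CSIStorageCapacity struct {
	StorageClassName string
	CapacityGiB      int64
}

// DynamicProvision mirrors the struct proposed in this KEP.
type DynamicProvision struct {
	PVC      *PersistentVolumeClaim
	Capacity *CSIStorageCapacity
}

// findCapacity is a hypothetical helper playing the role of hasEnoughCapacity:
// it reports whether the node can satisfy the claim and, if so, which
// CSIStorageCapacity object matched.
func findCapacity(claim *PersistentVolumeClaim, class string, capacities []CSIStorageCapacity) (bool, *CSIStorageCapacity) {
	for i := range capacities {
		c := &capacities[i]
		if c.StorageClassName == class && c.CapacityGiB >= claim.RequestedGiB {
			return true, c
		}
	}
	return false, nil
}

func main() {
	claims := []*PersistentVolumeClaim{
		{Name: "data-0", RequestedGiB: 50},
		{Name: "logs-0", RequestedGiB: 10},
	}
	// Capacity objects this node has access to, as published by the CSI driver.
	nodeCapacities := []CSIStorageCapacity{{StorageClassName: "fast-local", CapacityGiB: 200}}

	// Record the matched capacity next to each claim so that the Score
	// extension point can read it later without a second lookup.
	var provisions []*DynamicProvision
	for _, claim := range claims {
		ok, capacity := findCapacity(claim, "fast-local", nodeCapacities)
		if !ok {
			fmt.Println("node filtered out: not enough capacity for", claim.Name)
			return
		}
		provisions = append(provisions, &DynamicProvision{PVC: claim, Capacity: capacity})
	}
	fmt.Printf("node accepted with %d dynamic provisions\n", len(provisions))
}
```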

### Scoring of nodes for dynamic provisioning

The `Score` method in the current VolumeBinding plugin scores nodes considering only static provisioning. The scoring applies to every entry in `podVolumes.StaticBindings`.

This KEP adds scoring of nodes for dynamic provisioning to the `Score` method of the VolumeBinding plugin. The scoring applies to every entry in `podVolumes.DynamicProvisions` whose `Capacity` is not `nil`.

Scoring for dynamic provisioning is executed only if there are no `StaticBindings`. In other words, if there is only static provisioning, or both static and dynamic provisioning, scoring is done as it is today for static provisioning. If there is only dynamic provisioning, the following values are set in `classResources` and passed to the `scorer` function:

- `Requested: provision.PVC.Spec.Resources.Requests[v1.ResourceName(v1.ResourceStorage)]`
- `Capacity: CSIStorageCapacity`

By doing this, we can calculate scores for nodes for dynamic provisioning in a way that is based on the `Shape` setting of `VolumeBindingArgs` and takes into account the amount of free space the nodes have.

```diff
 // Score invoked at the score extension point.
 func (pl *VolumeBinding) Score(ctx context.Context, cs *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
 	if pl.scorer == nil {
 		return 0, nil
 	}
 	state, err := getStateData(cs)
 	if err != nil {
 		return 0, framework.AsStatus(err)
 	}
 	podVolumes, ok := state.podVolumesByNode[nodeName]
 	if !ok {
 		return 0, nil
 	}
- 	// group by storage class
 	classResources := make(classResourceMap)
- 	for _, staticBinding := range podVolumes.StaticBindings {
- 		class := staticBinding.StorageClassName()
- 		storageResource := staticBinding.StorageResource()
- 		if _, ok := classResources[class]; !ok {
- 			classResources[class] = &StorageResource{
- 				Requested: 0,
- 				Capacity:  0,
- 			}
- 		}
- 		classResources[class].Requested += storageResource.Requested
- 		classResources[class].Capacity += storageResource.Capacity
- 	}
+ 	if len(podVolumes.StaticBindings) != 0 {
+ 		// group static binding volumes by storage class
+ 		for _, staticBinding := range podVolumes.StaticBindings {
+ 			class := staticBinding.StorageClassName()
+ 			storageResource := staticBinding.StorageResource()
+ 			if _, ok := classResources[class]; !ok {
+ 				classResources[class] = &StorageResource{
+ 					Requested: 0,
+ 					Capacity:  0,
+ 				}
+ 			}
+ 			classResources[class].Requested += storageResource.Requested
+ 			classResources[class].Capacity += storageResource.Capacity
+ 		}
+ 	} else {
+ 		// group dynamically provisioned volumes by storage class
+ 		for _, provision := range podVolumes.DynamicProvisions {
+ 			if provision.Capacity == nil {
+ 				continue
+ 			}
+ 			class := *provision.PVC.Spec.StorageClassName
+ 			if _, ok := classResources[class]; !ok {
+ 				classResources[class] = &StorageResource{
+ 					Requested: 0,
+ 					Capacity:  0,
+ 				}
+ 			}
+ 			requestedQty := provision.PVC.Spec.Resources.Requests[v1.ResourceName(v1.ResourceStorage)]
+ 			classResources[class].Requested += requestedQty.Value()
+ 			classResources[class].Capacity += provision.Capacity.Capacity.Value()
+ 		}
+ 	}
 	return pl.scorer(classResources), nil
 }
```
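
To illustrate how the `scorer` can turn these per-class values into a node score, the sketch below interpolates a score from `Shape`-style utilization points. It is a stand-alone approximation of requested-to-capacity scoring, not the plugin's actual scorer, and the shape values chosen here are an assumption used to mimic the proposed default of preferring the node with the maximum capacity.

```go
package main

import "fmt"

// StorageResource mirrors the per-StorageClass values grouped by Score above
// (simplified for illustration).
type StorageResource struct {
	Requested int64
	Capacity  int64
}

// shapePoint is a simplified stand-in for a utilization shape point:
// utilization is a percentage, score is on a 0-10 scale.
type shapePoint struct {
	utilization int64
	score       int64
}

// scoreForUtilization linearly interpolates between shape points, which is
// conceptually how Shape-based scoring maps utilization to a score.
func scoreForUtilization(points []shapePoint, util int64) int64 {
	for i, p := range points {
		if util <= p.utilization {
			if i == 0 {
				return p.score
			}
			prev := points[i-1]
			span := p.utilization - prev.utilization
			return prev.score + (p.score-prev.score)*(util-prev.utilization)/span
		}
	}
	return points[len(points)-1].score
}

func main() {
	// Assumed shape for the proposed default ("prefer a node with the maximum
	// capacity"): score 10 at 0% utilization and 0 at 100%, so nodes with more
	// free space relative to the request score higher.
	shape := []shapePoint{{utilization: 0, score: 10}, {utilization: 100, score: 0}}

	// Requested size of the dynamically provisioned PVC vs. the capacity the
	// node reports through CSIStorageCapacity, per storage class.
	classes := map[string]StorageResource{
		"fast-local": {Requested: 50 << 30, Capacity: 200 << 30}, // 50 GiB of 200 GiB
	}

	var total int64
	for _, r := range classes {
		util := r.Requested * 100 / r.Capacity
		total += scoreForUtilization(shape, util)
	}
	fmt.Println("node score:", total/int64(len(classes))) // 8 for 25% utilization
}
```

With the opposite shape (score 0 at 0% utilization and 10 at 100%), the same mechanism would prefer the node with the least capacity that still fits the request.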

Users can select the scoring logic from the following options in `VolumeBindingArgs`. The same scoring logic applies to all Pods and PVCs in the cluster.

- (a) Prefer a node with the least capacity.
- (b) Prefer a node with the maximum capacity.

Considering the common scenario of local storage, we want to leave room for volume expansion after a node is allocated. The default setting is to prefer a node with the maximum capacity.

### Conditions for scoring static or dynamic provisioning

In the `Score` function, the score is calculated in the existing way (taking only static provisioning into account) if at least one PVC is statically provisioned. Otherwise, the score is calculated from dynamic provisioning.

Implementation idea:

```diff
 func (pl *VolumeBinding) Score(ctx context.Context, cs *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
 	...

+ 	if len(static) != 0 {
+ 		return static_score, nil // Same value as the current method
+ 	} else {
+ 		return dynamic_score, nil // Proposed in this KEP
+ 	}
- 	return pl.scorer(classResources), nil
 }
```

### Test Plan

[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

##### Prerequisite testing updates

Nothing in particular.

##### Unit tests

The following unit tests are planned:

- Are the scores assigned to nodes for dynamic provisioning appropriate for the amount of free space?
- Do the free-space-based score for dynamic provisioning and the existing `StaticBindings` score both work as expected?

##### Integration tests

The scoring function will be tested in `test/integration/volumescheduling/storage_capacity_scoring_test.go`.

##### e2e tests

The following e2e tests are planned:

- When only static provisioning is used, or a mixture of static and dynamic provisioning is used:
  - Do the existing tests still pass?
- When only dynamic provisioning is used:
  - Is the Pod placed on the node with the largest available space by default?
  - When `VolumeBindingArgs` is set to "Prefer a node with the maximum capacity", is the Pod placed on the node with the largest available space?
  - When `VolumeBindingArgs` is set to "Prefer a node with the least capacity", is the Pod placed on the node that meets the requested size but has the smallest available space?
  - Does Pod placement fail if no node meets the requested size?
  - When the Pod is recreated, is it placed on a node as expected above?

### Graduation Criteria

TBD

### Version Skew Strategy

Nothing in particular.

## Production Readiness Review Questionnaire

### Feature Enablement and Rollback

###### How can this feature be enabled / disabled in a live cluster?

- [X] Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: StorageCapacityScoring
  - Components depending on the feature gate: kube-scheduler

###### Does enabling the feature change any default behavior?

Yes, the scheduling behavior changes if this feature is enabled.

###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, this feature can be disabled after it has been enabled by setting the feature gate to false again. In doing so, the scoring for VolumeBinding will revert to the current method. This change won't affect the behavior of existing Pods.

###### What happens if we reenable the feature if it was previously rolled back?

Re-enabling the feature from a rolled-back state will result in scheduling that considers dynamic provisioning. There will be no impact on existing running Pods.

###### Are there any tests for feature enablement/disablement?

Yes. We will add unit tests with and without the feature gate enabled.

### Rollout, Upgrade and Rollback Planning

###### How can a rollout or rollback fail? Can it impact already running workloads?

Turning the feature gate on or off only changes scheduling scoring, so there is no possibility of impacting workloads that are already running.

###### What specific metrics should inform a rollback?

A spike in the metric `schedule_attempts_total{result="error|unschedulable"}` after this feature gate is enabled.

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Not applicable, yet.

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No, it isn't.

### Monitoring Requirements

###### How can an operator determine if the feature is in use by workloads?

If enabled, this feature applies to all workloads that use PVCs with delayed volume binding. A non-zero value of the metric `plugin_execution_duration_seconds{plugin="VolumeBinding",extension_point="Score"}` is also a sign that this feature is in use.
Unfortunately, there is no way to distinguish whether only static provisioning is being considered (the current behavior) or dynamic provisioning is also being considered (the new behavior).

###### How can someone using this feature know that it is working for their instance?

With the default setting, Pods that use only dynamically provisioned PVCs will be scheduled to nodes with more available capacity.

- [ ] Events
  - Event Reason:
- [ ] API .status
  - Condition name:
  - Other field:
- [ ] Other (treat as last resort)
  - Details:

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

Nothing in particular.

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Nothing in particular.

- [ ] Metrics
  - Metric name:
  - [Optional] Aggregation method:
  - Components exposing the metric:
- [ ] Other (treat as last resort)
  - Details:

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

Nothing in particular.

### Dependencies

###### Does this feature depend on any specific services running in the cluster?

Yes, it depends on the kube-scheduler, where the VolumeBinding plugin runs.

### Scalability

###### Will enabling / using this feature result in any new API calls?

No.

###### Will enabling / using this feature result in introducing new API types?

No.

###### Will enabling / using this feature result in any new calls to the cloud provider?

No.

###### Will enabling / using this feature result in increasing size or count of the existing API objects?

No.

###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Yes, it may affect the time taken by scheduling.

###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

No.

### Troubleshooting

###### How does this feature react if the API server and/or etcd is unavailable?

The behavior in such cases does not change. This proposal only modifies one of the plugins in the kube-scheduler.

###### What are other known failure modes?

Not applicable, yet.

###### What steps should be taken if SLOs are not being met to determine the problem?

Check the kube-scheduler logs.

## Implementation History

Not applicable, yet.

## Infrastructure Needed (Optional)

diff --git a/keps/sig-storage/4049-storage-capacity-scoring-of-nodes-for-dynamic-provisioning/kep.yaml b/keps/sig-storage/4049-storage-capacity-scoring-of-nodes-for-dynamic-provisioning/kep.yaml
new file mode 100644
index 000000000000..11df29a279bb
--- /dev/null
+++ b/keps/sig-storage/4049-storage-capacity-scoring-of-nodes-for-dynamic-provisioning/kep.yaml
@@ -0,0 +1,45 @@
title: Storage Capacity Scoring of Nodes for Dynamic Provisioning
kep-number: 4049
authors:
  - "@cupnes"
owning-sig: sig-storage
participating-sigs:
status: TBD
creation-date: TBD
reviewers:
  - "@xing-yang"
  - "@jsafrane"
approvers:
  - "@xing-yang"
  - "@jsafrane"

see-also:
  - "/keps/sig-storage/1845-prioritization-on-volume-capacity"
replaces:

# The target maturity stage in the current dev cycle for this KEP.
stage: alpha

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.32"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
  alpha: "v1.32"
  beta: TBD
  stable: TBD

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
feature-gates:
  - name: StorageCapacityScoring
    components:
      - kube-scheduler
disable-supported: true

# The following PRR answers are required at beta release
metrics:
  - schedule_attempts_total
  - plugin_execution_duration_seconds