Splunk Operator: Autoscaling Issue #1352

Closed

nathan-bowman opened this issue Jul 15, 2024 · 6 comments
nathan-bowman commented Jul 15, 2024

Please select the type of request

Bug

Tell us more

Describe the Problem
I'm following the details here for pod autoscaling. It seems that spec.replicas is a mandatory field, but the HorizontalPodAutoscaler docs recommend that you remove spec.replicas from the target manifest:

When an HPA is enabled, it is recommended that the value of spec.replicas of the Deployment and / or StatefulSet be removed from their manifest(s).

Error I receive when I remove spec.replicas:

the HPA controller was unable to get the target's current scale: Internal error occurred: the spec replicas field ".spec.replicas" does not exist
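
For context, this error comes from the CRD's scale subresource, which is wired to a fixed replicas path. A rough sketch of that wiring is below; the exact paths used by the Splunk Operator CRDs are an assumption on my part, based only on the error text.

# sketch of a CRD scale subresource (paths assumed, not copied from the operator)
subresources:
  scale:
    specReplicasPath: .spec.replicas           # the path the HPA reads; it errors when the field is absent
    statusReplicasPath: .status.readyReplicas  # assumed; the actual status path may differ
    labelSelectorPath: .status.selector        # assumed; what the HPA uses to find pods for metrics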

Expected behavior
One should be able to remove spec.replicas from the Splunk CR indexerclusters.enterprise.splunk.com (and probably other CRs...) so that the HorizontalPodAutoscaler can manage spec.replicas.

Splunk setup on K8S
AWS EKS v1.29, with Splunk Operator 2.5.2.

Last thing to note: I'm using the autoscaling/v2 apiVersion.

Reproduction/Testing steps
idx-cluster.yaml:

---
apiVersion: enterprise.splunk.com/v4
kind: IndexerCluster
metadata:
  name: idx-cluster
  finalizers:
  - enterprise.splunk.com/delete-pvc
spec:
  imagePullPolicy: IfNotPresent
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: affinityNodeGroup
            operator: In
            values:
            - splunk-nodegroup-indexers
  tolerations:
  - key: splunk-indexers
    value: "true"
    effect: NoExecute
  serviceAccount: splunk-enterprise-serviceaccount
  resources:
    limits:
      cpu: 15
      memory: 60G
    requests:
      cpu: 13
      memory: 57G
  #replicas: 3
  clusterManagerRef:
    name: cm
  licenseManagerRef:
    name: lm
  monitoringConsoleRef:
    name: mc

  etcVolumeStorageConfig:
    ephemeralStorage: false
    storageCapacity: 10Gi
    storageClassName: gp3

  varVolumeStorageConfig:
    ephemeralStorage: false
    storageCapacity: 800Gi
    storageClassName: topolvm-local-ssd

HorizontalPodAutoscaler yaml:

---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: idx-cluster-autoscaler
spec:
  scaleTargetRef:
    apiVersion: enterprise.splunk.com/v4
    kind: IndexerCluster
    name: idx-cluster
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
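
With spec.replicas commented out as above, the failure shows up in the HPA's conditions. A quick way to see it (the namespace is an assumption; adjust to wherever the CR lives):

kubectl -n splunk-enterprise describe hpa idx-cluster-autoscaler
# expect an AbleToScale=False condition with reason FailedGetScale and the
# ".spec.replicas" does not exist message quoted above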

K8s environment
k8s v1.29

nathan-bowman (Author) commented:

I'm not entirely sure if this will mess things up, but I got autoscaling to work by pointing the HPA at the IndexerCluster's downstream StatefulSet:

---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: idx-cluster-autoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: splunk-idx-cluster-indexer
  minReplicas: 3
  maxReplicas: 15
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 25

Is this correct?

nathan-bowman (Author) commented Jul 18, 2024

Going down the rabbit hole...

It looks like my HPA isn't gathering metrics for the target:

NAME                     REFERENCE                    TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
idx-cluster-autoscaler   IndexerCluster/idx-cluster   <unknown>/25%   3         15        3          18h

Additional digging into the metrics-server logs shows lots of scrape errors:

E0717 23:45:19.725688       1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.28.96:10250/metrics/resource\": remote error: tls: internal error" node="ip-192-168-28-96.us-west-2.compute.internal"
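
As a generic sanity check that metrics-server is serving metrics at all (nothing Splunk-specific here; the namespace is assumed):

kubectl top nodes
kubectl top pods -n splunk-enterprise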

I think this is related to a recent issue posted in the official metrics-server repo, and an associated PR.

I'm not totally sure, though... Other HPAs in my EKS clusters seem to work fine...

Edit: To clarify, the HPA reports target metrics when I point it at the StatefulSet, but not when I point it at kind: IndexerCluster.
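
One thing worth checking is whether the IndexerCluster's scale subresource actually exposes a pod selector, since the HPA relies on the scale object's status.selector to find pods for resource metrics. A sketch, with the namespace and resource path assumed from the manifests above:

kubectl get --raw \
  "/apis/enterprise.splunk.com/v4/namespaces/splunk-enterprise/indexerclusters/idx-cluster/scale" \
  | jq '{replicas: .spec.replicas, selector: .status.selector}'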

nathan-bowman (Author) commented:

Adding more info...

# kubectl get --raw /apis/enterprise.splunk.com/v4/ | jq '.resources[] | select(.name=="indexerclusters/scale")'
{
  "name": "indexerclusters/scale",
  "singularName": "",
  "namespaced": true,
  "group": "autoscaling",
  "version": "v1",
  "kind": "Scale",
  "verbs": [
    "get",
    "patch",
    "update"
  ]
}

Is the scale subresource tied to autoscaling/v1? I'm using autoscaling/v2 for my HPA.

nathan-bowman (Author) commented:

I tried autoscaling/v1 and have the same issue 👎

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  annotations:
    autoscaling.alpha.kubernetes.io/conditions: >-
      [{"type":"AbleToScale","status":"False","lastTransitionTime":"2024-07-18T19:46:25Z","reason":"FailedGetScale","message":"the
      HPA controller was unable to get the target's current scale: Internal
      error occurred: the spec replicas field \".spec.replicas\" does not
      exist"}]
  creationTimestamp: '2024-07-18T19:46:10Z'
  labels:
    app.kubernetes.io/instance: backend-staging-splunk-enterprise
  name: idx-cluster-autoscaler
  namespace: splunk-enterprise
  resourceVersion: '171496564'
  uid: e655b8c2-a04a-48c6-b882-1b4bd29fa2f4
spec:
  maxReplicas: 15
  minReplicas: 3
  scaleTargetRef:
    apiVersion: enterprise.splunk.com/v4
    kind: IndexerCluster
    name: idx-cluster
  targetCPUUtilizationPercentage: 25
status:
  currentReplicas: 0
  desiredReplicas: 0

nathan-bowman (Author) commented:

I worked with Splunk support on this, and they say that, despite the Kubernetes docs recommending otherwise, you must hardcode .spec.replicas on the CR to get autoscaling working.
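
In other words, the field commented out in the manifest above has to stay, e.g.:

spec:
  replicas: 3   # initial value; the HPA then changes it through the scale subresource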

Since I use ArgoCD, I had to set ignoreDifferences on the CR to stop it from showing up as out of sync.
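
For anyone else doing the same, the entry looks roughly like this (a sketch; the structure of your Argo CD Application is an assumption):

# in the Argo CD Application that manages the Splunk CRs
spec:
  ignoreDifferences:
  - group: enterprise.splunk.com
    kind: IndexerCluster
    jsonPointers:
    - /spec/replicas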

akondur (Collaborator) commented Sep 6, 2024

CSPL-2819
