Splunk Operator: Autoscaling Issue #1352

Closed

nathan-bowman opened this issue Jul 15, 2024 · 6 comments
nathan-bowman commented Jul 15, 2024

Please select the type of request

Bug

Tell us more

Describe the Problem
I'm following the details here for pod autoscaling. It seems that spec.replicas is a mandatory field, but the HorizontalPodAutoscaler docs recommend that you remove spec.replicas from the target manifest:

When an HPA is enabled, it is recommended that the value of spec.replicas of the Deployment and / or StatefulSet be removed from their manifest(s).

Error I receive when I remove spec.replicas:

the HPA controller was unable to get the target's current scale: Internal error occurred: the spec replicas field ".spec.replicas" does not exist
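
For context, this error comes from the CRD's scale subresource, which is wired to a fixed replicas path. A rough sketch of that wiring is below; the exact paths used by the Splunk Operator CRDs are an assumption on my part, based only on the error text.

# sketch of a CRD scale subresource (paths assumed, not copied from the operator)
subresources:
  scale:
    specReplicasPath: .spec.replicas           # the path the HPA reads; it errors when the field is absent
    statusReplicasPath: .status.readyReplicas  # assumed; the actual status path may differ
    labelSelectorPath: .status.selector        # assumed; what the HPA uses to find pods for metrics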

Expected behavior
One should be able to remove spec.replicas from the Splunk CR indexerclusters.enterprise.splunk.com (and probably other CRs...) so that the HorizontalPodAutoscaler can manage spec.replicas.

Splunk setup on K8S
AWS EKS v1.29, with Splunk Operator 2.5.2.

Last thing to note: I'm using the autoscaling/v2 apiVersion.

Reproduction/Testing steps
idx-cluster.yaml:

---
apiVersion: enterprise.splunk.com/v4
kind: IndexerCluster
metadata:
  name: idx-cluster
  finalizers:
  - enterprise.splunk.com/delete-pvc
spec:
  imagePullPolicy: IfNotPresent
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: affinityNodeGroup
            operator: In
            values:
            - splunk-nodegroup-indexers
  tolerations:
  - key: splunk-indexers
    value: "true"
    effect: NoExecute
  serviceAccount: splunk-enterprise-serviceaccount
  resources:
    limits:
      cpu: 15
      memory: 60G
    requests:
      cpu: 13
      memory: 57G
  #replicas: 3
  clusterManagerRef:
    name: cm
  licenseManagerRef:
    name: lm
  monitoringConsoleRef:
    name: mc

  etcVolumeStorageConfig:
    ephemeralStorage: false
    storageCapacity: 10Gi
    storageClassName: gp3

  varVolumeStorageConfig:
    ephemeralStorage: false
    storageCapacity: 800Gi
    storageClassName: topolvm-local-ssd

HorizontalPodAutoscaler yaml:

---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: idx-cluster-autoscaler
spec:
  scaleTargetRef:
    apiVersion: enterprise.splunk.com/v4
    kind: IndexerCluster
    name: idx-cluster
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
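
With spec.replicas commented out as above, the failure shows up in the HPA's conditions. A quick way to see it (the namespace is an assumption; adjust to wherever the CR lives):

kubectl -n splunk-enterprise describe hpa idx-cluster-autoscaler
# expect an AbleToScale=False condition with reason FailedGetScale and the
# ".spec.replicas" does not exist message quoted above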

K8s environment
k8s v1.29

nathan-bowman (Author) commented:

I'm not entirely sure if this will mess things up, but I got autoscaling to work by pointing the HPA at the IndexerCluster's downstream StatefulSet:

---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: idx-cluster-autoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: splunk-idx-cluster-indexer
  minReplicas: 3
  maxReplicas: 15
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 25

Is this correct?

nathan-bowman (Author) commented Jul 18, 2024

Going down the rabbit hole...

It looks like my HPA isn't gathering metrics for the target:

NAME                     REFERENCE                    TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
idx-cluster-autoscaler   IndexerCluster/idx-cluster   <unknown>/25%   3         15        3          18h

Additional digging into the metrics-server logs shows lots of scrape errors:

E0717 23:45:19.725688       1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.28.96:10250/metrics/resource\": remote error: tls: internal error" node="ip-192-168-28-96.us-west-2.compute.internal"
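
As a generic sanity check that metrics-server is serving metrics at all (nothing Splunk-specific here; the namespace is assumed):

kubectl top nodes
kubectl top pods -n splunk-enterprise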

I think this is related to a recent issue posted in the official metrics-server repo, and an associated PR.

I'm not totally sure, though... Other HPAs in my EKS clusters seem to work fine...

Edit: To clarify, the HPA reports target metrics when I point it at the StatefulSet, but not when I point it at kind: IndexerCluster.
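
One thing worth checking is whether the IndexerCluster's scale subresource actually exposes a pod selector, since the HPA relies on the scale object's status.selector to find pods for resource metrics. A sketch, with the namespace and resource path assumed from the manifests above:

kubectl get --raw \
  "/apis/enterprise.splunk.com/v4/namespaces/splunk-enterprise/indexerclusters/idx-cluster/scale" \
  | jq '{replicas: .spec.replicas, selector: .status.selector}'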

nathan-bowman (Author) commented:

Adding more info...

# kubectl get --raw /apis/enterprise.splunk.com/v4/ | jq '.resources[] | select(.name=="indexerclusters/scale")'
{
  "name": "indexerclusters/scale",
  "singularName": "",
  "namespaced": true,
  "group": "autoscaling",
  "version": "v1",
  "kind": "Scale",
  "verbs": [
    "get",
    "patch",
    "update"
  ]
}

Is the scale subresource tied to autoscaling/v1? I'm using autoscaling/v2 for my HPA.

nathan-bowman (Author) commented:

I tried autoscaling/v1 and have the same issue 👎

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  annotations:
    autoscaling.alpha.kubernetes.io/conditions: >-
      [{"type":"AbleToScale","status":"False","lastTransitionTime":"2024-07-18T19:46:25Z","reason":"FailedGetScale","message":"the
      HPA controller was unable to get the target's current scale: Internal
      error occurred: the spec replicas field \".spec.replicas\" does not
      exist"}]
  creationTimestamp: '2024-07-18T19:46:10Z'
  labels:
    app.kubernetes.io/instance: backend-staging-splunk-enterprise
  name: idx-cluster-autoscaler
  namespace: splunk-enterprise
  resourceVersion: '171496564'
  uid: e655b8c2-a04a-48c6-b882-1b4bd29fa2f4
spec:
  maxReplicas: 15
  minReplicas: 3
  scaleTargetRef:
    apiVersion: enterprise.splunk.com/v4
    kind: IndexerCluster
    name: idx-cluster
  targetCPUUtilizationPercentage: 25
status:
  currentReplicas: 0
  desiredReplicas: 0

nathan-bowman (Author) commented:

I worked with Splunk support on this, and they say that, despite the Kubernetes docs recommending otherwise, you must hardcode .spec.replicas on the CR to get autoscaling working.
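
In other words, the field commented out in the manifest above has to stay, e.g.:

spec:
  replicas: 3   # initial value; the HPA then changes it through the scale subresource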

Since I use ArgoCD, I had to set ignoreDifferences on the CR to stop it from showing up as out of sync.
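
For anyone else doing the same, the entry looks roughly like this (a sketch; the structure of your Argo CD Application is an assumption):

# in the Argo CD Application that manages the Splunk CRs
spec:
  ignoreDifferences:
  - group: enterprise.splunk.com
    kind: IndexerCluster
    jsonPointers:
    - /spec/replicas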

akondur (Collaborator) commented Sep 6, 2024

CSPL-2819
