
Upgrade from 2.9.3 to 2.12.6 approximately doubled network usage #20772

Open
nvatuan opened this issue Nov 13, 2024 · 1 comment
Labels
bug (Something isn't working) · more-information-needed (Further information is requested) · version:2.12 (Latest confirmed affected version is 2.12)

Comments


nvatuan commented Nov 13, 2024

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug. (not sure if applicable for this problem, but I attached our env info 🤔)
  • I've pasted the output of argocd version.

Describe the bug

We upgraded from 2.9 to 2.12 at the end of October and noticed high network usage immediately the next day. We let it run for 1~2 weeks and the usage stayed high, incurring significant cost for us, so we reverted back to 2.9. Two days after the revert, the cost appears to have returned to roughly what it was before.

  • I wonder what changes introduced across 2.10 -> 2.11 -> 2.12 could cause such high network usage?
  • I wonder if other users are experiencing the same problem?

To Reproduce

I will attach information about our environment:

  • We use AWS EKS, Kubernetes version 1.29
  • Argo CD version:
    • Application version: 2.9.3 before the upgrade, 2.12.6 after (we have since reverted back to 2.9.3)
    • Helm chart version: 5.52.1 (for 2.9) and 7.6.12 (for 2.12)
  • Our cluster is very large; we have around 600+ Argo CD Applications
  • Argo CD Helm chart values.yaml file:
repoServer:
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 8
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 70
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true
  resources:
    limits:
      cpu: 4
      memory: 4Gi
    requests:
      cpu: 2
      memory: 2Gi

server:
  extraArgs:
  - --insecure
  config:
    exec.enabled: "true"
  autoscaling:
    enabled: true
    minReplicas: 2
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true
  resources:
    limits:
      cpu: 1
      memory: 512Mi
    requests:
      cpu: 500m
      memory: 512Mi
  extensions:
    resources:
      limits:
        cpu: 200m
        memory: 256Mi
      requests:
        cpu: 100m
        memory: 128Mi

controller:
  replicas: 1
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true
  resources:
    limits:
      cpu: 14
      memory: 24Gi
    requests:
      cpu: 10
      memory: 16Gi
  env:
  # Setting for changing the client qps
  - name: ARGOCD_K8S_CLIENT_QPS
    value: "200"
  # Setting for changing the burst qps
  - name: ARGOCD_K8S_CLIENT_BURST
    value: "400"

redis-ha:
  enabled: true
  redis:
    resources:
      limits:
        cpu: 1
        memory: 8Gi
      requests:
        cpu: 500m
        memory: 8Gi

applicationSet:
  replicaCount: 2
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true
  resources:
    limits:
      cpu: 2
      memory: 2Gi
    requests:
      cpu: 1
      memory: 1Gi

## Argo Configs
configs:
  # Argo CD configuration parameters
  ## Ref: https://github.com/argoproj/argo-cd/blob/master/docs/operator-manual/argocd-cmd-params-cm.yaml
  params:
    # -- Create the argocd-cmd-params-cm configmap
    # If false, it is expected the configmap will be created by something else.
    create: true

    ## Controller Properties
    # https://argo-cd.readthedocs.io/en/stable/operator-manual/high_availability/#argocd-application-controller
    # -- Number of application status processors
    controller.status.processors: 50
    # -- Number of application operation processors
    controller.operation.processors: 25
    # -- Repo server RPC call timeout seconds.
    controller.repo.server.timeout.seconds: 300
    controller.kubectl.parallelism.limit: 100

    ## Repo-server properties
    # -- Limit on the number of concurrent manifest generation requests. Any value less than 1 means no limit.
    # The --parallelismlimit flag controls how many manifest generations run concurrently and helps avoid OOM kills.
    # https://argo-cd.readthedocs.io/en/stable/operator-manual/high_availability/#argocd-repo-server
    reposerver.parallelism.limit: 15

  cm:
    # -- Timeout to discover if a new manifests version got published to the repository
    timeout.reconciliation: 600s # 10 minutes
    exec.enabled: "true"

Expected behavior

  • We expected the network usage to remain the same, but in reality it doubled.
  • We doubted our gut feeling, so we let it run for 1~2 weeks, but the usage remained high.

Screenshots

  • Kubecost hourly network usage into namespace argocd:

[screenshot]

There is a dip from 11:00 AM - 2:00 AM; this is when we did the revert, and it is also an inactive time for our cluster.
After the downgrade, you can see the cost remained at only half of what it was before.

All of the network cost comes from the repo-server (see the sketch after these screenshots for one way to confirm the per-pod breakdown).

  • Here is our NAT Gateway usage:

[screenshot: Network usage over a larger span]

[screenshot: Network usage around the revert time]

Version

argocd: v2.9.3+6eba5be
  BuildDate: 2023-12-01T23:05:50Z
  GitCommit: 6eba5be864b7e031871ed7698f5233336dfe75c7
  GitTreeState: clean
  GoVersion: go1.21.3
  Compiler: gc
  Platform: linux/arm64
  • For Argo CD 2.9.3, we use Helm chart version 5.52.1
  • For Argo CD 2.12.6, we use Helm chart version 7.6.12

Logs

N/A

@nvatuan nvatuan added the bug Something isn't working label Nov 13, 2024
@andrii-korotkov-verkada andrii-korotkov-verkada added the version:2.12 Latest confirmed affected version is 2.12 label Nov 13, 2024
@andrii-korotkov-verkada (Contributor) commented:

Do you have a breakdown of traffic between Argo CD components, e.g. between the application controller and the repo server?
Do you have Argo CD metrics such as the number of reconciliations and the longest running processor duration?
Those would be very helpful for debugging.
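
A minimal sketch of PromQL queries that could surface those numbers, assuming the ServiceMonitors enabled in the values above are scraped by Prometheus; the workqueue name label values used here are assumptions and should be checked against what the controller actually exposes:

# Reconciliations per second across all Applications (argocd_app_reconcile is a histogram)
sum(rate(argocd_app_reconcile_count[5m]))

# Longest running processor duration on the application controller's workqueues
# ("app_reconciliation" / "app_operation_processing" queue names are assumptions)
max by (name) (workqueue_longest_running_processor_seconds{name=~"app_reconciliation|app_operation_processing"})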

@andrii-korotkov-verkada andrii-korotkov-verkada added the more-information-needed Further information is requested label Nov 14, 2024