
Upgrade from 2.9.3 to 2.12.6 approximately doubled network usage #20772

Open
nvatuan opened this issue Nov 13, 2024 · 1 comment
Labels
bug (Something isn't working) · more-information-needed (Further information is requested) · version:2.12 (Latest confirmed affected version is 2.12)

Comments


nvatuan commented Nov 13, 2024

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug. (not sure if applicable for this problem, but I attached our env info 🤔)
  • I've pasted the output of argocd version.

Describe the bug

We upgraded from 2.9 to 2.12 at the end of October and noticed high network usage immediately the next day. We let it run for 1~2 weeks and the usage stayed high, incurring significant cost for us, so we reverted back to 2.9. Two days after the revert, the cost appears to have returned to roughly what it was before.

  • I wonder what changes introduced across 2.10 -> 2.11 -> 2.12 could cause such high network usage?
  • I wonder if other users are experiencing the same problem?

To Reproduce

I will attach information about our environment:

  • We use AWS EKS, Kubernetes version 1.29
  • Argo CD version:
    • Application version: 2.9.3 before the upgrade, 2.12.6 after (we have since reverted back to 2.9.3)
    • Helm chart version: 5.52.1 (for 2.9) and 7.6.12 (for 2.12)
  • Our cluster is very large; we have around 600+ Argo CD Applications
  • Argo CD Helm chart values.yaml file:
repoServer:
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 8
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 70
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true
  resources:
    limits:
      cpu: 4
      memory: 4Gi
    requests:
      cpu: 2
      memory: 2Gi

server:
  extraArgs:
  - --insecure
  config:
    exec.enabled: "true"
  autoscaling:
    enabled: true
    minReplicas: 2
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true
  resources:
    limits:
      cpu: 1
      memory: 512Mi
    requests:
      cpu: 500m
      memory: 512Mi
  extensions:
    resources:
      limits:
        cpu: 200m
        memory: 256Mi
      requests:
        cpu: 100m
        memory: 128Mi

controller:
  replicas: 1
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true
  resources:
    limits:
      cpu: 14
      memory: 24Gi
    requests:
      cpu: 10
      memory: 16Gi
  env:
  # Setting for changing the client qps
  - name: ARGOCD_K8S_CLIENT_QPS
    value: "200"
  # Setting for changing the burst qps
  - name: ARGOCD_K8S_CLIENT_BURST
    value: "400"

redis-ha:
  enabled: true
  redis:
    resources:
      limits:
        cpu: 1
        memory: 8Gi
      requests:
        cpu: 500m
        memory: 8Gi

applicationSet:
  replicaCount: 2
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true
  resources:
    limits:
      cpu: 2
      memory: 2Gi
    requests:
      cpu: 1
      memory: 1Gi

## Argo Configs
configs:
  # Argo CD configuration parameters
  ## Ref: https://github.com/argoproj/argo-cd/blob/master/docs/operator-manual/argocd-cmd-params-cm.yaml
  params:
    # -- Create the argocd-cmd-params-cm configmap
    # If false, it is expected the configmap will be created by something else.
    create: true

    ## Controller Properties
    # https://argo-cd.readthedocs.io/en/stable/operator-manual/high_availability/#argocd-application-controller
    # -- Number of application status processors
    controller.status.processors: 50
    # -- Number of application operation processors
    controller.operation.processors: 25
    # -- Repo server RPC call timeout seconds.
    controller.repo.server.timeout.seconds: 300
    controller.kubectl.parallelism.limit: 100

    ## Repo-server properties
    # -- Limit on the number of concurrent manifest generation requests. Any value less than 1 means no limit.
    # The --parallelismlimit flag controls how many manifest generations run concurrently and helps avoid OOM kills.
    # https://argo-cd.readthedocs.io/en/stable/operator-manual/high_availability/#argocd-repo-server
    reposerver.parallelism.limit: 15

  cm:
    # -- Timeout to discover if a new manifests version got published to the repository
    timeout.reconciliation: 600s # 10 minutes
    exec.enabled: "true"

Expected behavior

  • We expected the network usage to remain the same, but in reality it doubled.
  • We doubted our gut feeling, so we let it run for 1~2 weeks, but the usage remained high.

Screenshots

  • Kubecost hourly network usage into namespace argocd:

[screenshot]

There is a dip from 11:00 AM - 2:00 AM; this is when we did the revert, and it is also an inactive time for our cluster.
After the downgrade, you can see the cost remained at only half of what it was before.

All of the network cost comes from the repo-server (see the sketch after these screenshots for one way to confirm the per-pod breakdown).

  • Here is our NAT Gateway usage:

[screenshot: Network usage over a larger span]

[screenshot: Network usage around the revert time]

Version

argocd: v2.9.3+6eba5be
  BuildDate: 2023-12-01T23:05:50Z
  GitCommit: 6eba5be864b7e031871ed7698f5233336dfe75c7
  GitTreeState: clean
  GoVersion: go1.21.3
  Compiler: gc
  Platform: linux/arm64
  • For Argo CD 2.9.3, we use Helm chart version 5.52.1
  • For Argo CD 2.12.6, we use Helm chart version 7.6.12

Logs

N/A

@nvatuan nvatuan added the bug Something isn't working label Nov 13, 2024
@andrii-korotkov-verkada andrii-korotkov-verkada added the version:2.12 Latest confirmed affected version is 2.12 label Nov 13, 2024
@andrii-korotkov-verkada (Contributor) commented:

Do you have a breakdown of traffic between Argo CD components, e.g. between the application controller and the repo server?
Do you have Argo CD metrics such as the number of reconciliations and the longest running processor duration?
Those would be very helpful for debugging.
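
A minimal sketch of PromQL queries that could surface those numbers, assuming the ServiceMonitors enabled in the values above are scraped by Prometheus; the workqueue name label values used here are assumptions and should be checked against what the controller actually exposes:

# Reconciliations per second across all Applications (argocd_app_reconcile is a histogram)
sum(rate(argocd_app_reconcile_count[5m]))

# Longest running processor duration on the application controller's workqueues
# ("app_reconciliation" / "app_operation_processing" queue names are assumptions)
max by (name) (workqueue_longest_running_processor_seconds{name=~"app_reconciliation|app_operation_processing"})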

@andrii-korotkov-verkada andrii-korotkov-verkada added the more-information-needed Further information is requested label Nov 14, 2024