Add longevity test plan and results (#1113)
Problem:
* We don't know if NGF can successfully process both control plane and
data plane transactions over a period of time much greater than in our
tests.
* We haven't yet tried to catch bugs that only appear over a long period
of time (like resource leaks).

Solution:
- Create a longevity test plan
- Run the test
- Document the results

CLOSES #956

Co-authored-by: bjee19 <[email protected]>
Co-authored-by: Saylor Berman <[email protected]>
3 people authored Oct 11, 2023
1 parent 704f8a8 commit 71d605e
Showing 14 changed files with 634 additions and 0 deletions.
1 change: 1 addition & 0 deletions .yamllint.yaml
@@ -41,6 +41,7 @@ rules:
      .github/workflows/
      deploy/manifests/nginx-gateway.yaml
      deploy/manifests/crds
      tests/longevity/manifests/cronjob.yaml
  new-line-at-end-of-file: enable
  new-lines: enable
  octal-values: disable
149 changes: 149 additions & 0 deletions tests/longevity/longevity.md
@@ -0,0 +1,149 @@
# Longevity Test

This document describes how we test NGF for longevity.

<!-- TOC -->

- [Longevity Test](#longevity-test)
  - [Goals](#goals)
  - [Test Environment](#test-environment)
  - [Steps](#steps)
    - [Start](#start)
    - [Check the Test is Running Correctly](#check-the-test-is-running-correctly)
    - [End](#end)
  - [Analyze](#analyze)
  - [Results](#results)

<!-- TOC -->

## Goals

- Ensure that NGF successfully processes both control plane and data plane transactions over a period of time much
  greater than in our other tests.
- Catch bugs that could only appear over a period of time (like resource leaks).

## Test Environment

- A Kubernetes cluster with 3 nodes on GKE
  - Node: e2-medium (2 vCPU, 4GB memory)
  - GKE logging enabled.
  - GKE Cloud Monitoring with managed Prometheus service enabled, collecting:
    - system metrics.
    - kube state metrics - pods, deployments.
- Tester VMs on Google Cloud:
  - Configuration:
    - Debian
    - Installed packages: tmux, wrk
  - Location - same zone as the Kubernetes cluster.
  - First VM - for sending HTTP traffic
  - Second VM - for sending HTTPS traffic
- NGF
  - Deployment with 1 replica
  - Exposed via a Service of type LoadBalancer with a private IP
  - Gateway with two listeners - HTTP and HTTPS
- Two apps:
  - Coffee - 3 replicas
  - Tea - 3 replicas
- Two HTTPRoutes:
  - Coffee (HTTP)
  - Tea (HTTPS)

## Steps

### Start

Test duration - 4 days.

1. Create a Kubernetes cluster on GKE.
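
   For example, a cluster matching the Test Environment above could be created like this (the cluster name and zone
   are placeholders, not part of this commit):

   ```shell
   # 3 nodes of machine type e2-medium, as described in the Test Environment section
   gcloud container clusters create longevity-test --num-nodes 3 --machine-type e2-medium --zone us-central1-a
   ```
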
2. Deploy NGF.
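
   A sketch of installing NGF from the manifests referenced elsewhere in this commit (these paths appear in the
   `.yamllint.yaml` ignore list; run from the repository root):

   ```shell
   kubectl apply -f deploy/manifests/crds
   kubectl apply -f deploy/manifests/nginx-gateway.yaml
   ```
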
3. Expose NGF via a LoadBalancer Service with the `"networking.gke.io/load-balancer-type":"Internal"` annotation to
   allocate an internal load balancer.
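
   A minimal sketch of such a Service, applied via a heredoc (the name, namespace, selector label, and ports here are
   assumptions based on a default NGF install, not part of this commit):

   ```shell
   kubectl apply -f - <<EOF
   apiVersion: v1
   kind: Service
   metadata:
     name: nginx-gateway
     namespace: nginx-gateway
     annotations:
       networking.gke.io/load-balancer-type: "Internal"  # allocates a private (internal) IP
   spec:
     type: LoadBalancer
     selector:
       app: nginx-gateway  # assumed NGF pod label
     ports:
     - name: http
       port: 80
       targetPort: 80
     - name: https
       port: 443
       targetPort: 443
   EOF
   ```
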
4. Apply the manifests, which will:
   1. Deploy the coffee and tea backends.
   2. Configure HTTP and HTTPS listeners on the Gateway.
   3. Expose coffee via the HTTP listener and tea via the HTTPS listener.
   4. Create two CronJobs that periodically trigger rolling restarts of the backends:
      1. Coffee - every minute for an hour, every 6 hours.
      2. Tea - every minute for an hour, every 6 hours, 3 hours apart from coffee.
   5. Configure Prometheus on GKE to pick up NGF metrics.

   ```shell
   kubectl apply -f files
   ```
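
   After applying, the two CronJobs defined in cronjob.yaml should exist; a quick check:

   ```shell
   kubectl get cronjob coffee-rollout-mgr tea-rollout-mgr
   ```
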

5. In the Tester VMs, update `/etc/hosts` to add an entry with the External IP of the NGF Service (`10.128.0.10` in
   this case):

   ```text
   10.128.0.10 cafe.example.com
   ```

6. In the Tester VMs, start a tmux session (this is needed so that any launched command keeps running even if you
   disconnect from the VM):

   ```shell
   tmux
   ```

7. In the first VM, start wrk, sending coffee traffic over HTTP for 4 days (2 threads, 100 connections, 96-hour
   duration):

   ```shell
   wrk -t2 -c100 -d96h http://cafe.example.com/coffee
   ```

8. In the second VM, start wrk, sending tea traffic over HTTPS for 4 days (2 threads, 100 connections, 96-hour
   duration):

   ```shell
   wrk -t2 -c100 -d96h https://cafe.example.com/tea
   ```

Notes:

- The updated coffee and tea backends in cafe.yaml include extra configuration for zero-downtime upgrades, so that
  wrk in the Tester VMs doesn't get 502s from NGF during the rolling restarts. Based on
  https://learnk8s.io/graceful-shutdown.
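
  The relevant part of cafe.yaml (added in full later in this commit) is the readiness probe plus a `preStop` sleep,
  which keeps a terminating pod serving in-flight requests while it is being taken out of rotation:

  ```yaml
  readinessProbe:
    httpGet:
      path: /
      port: 8080
  lifecycle:
    preStop:
      exec:
        command: ["/bin/sleep", "15"]
  ```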

### Check the Test is Running Correctly

Check that you don't see any errors:

1. Check that GKE exports NGF pod logs to Google Cloud Operations Logging and Prometheus metrics to Google Cloud
   Monitoring.
2. Check that traffic is flowing - look at the NGINX access logs in Google Cloud Operations Logging.
3. Check that the CronJobs can run by triggering them manually:

```shell
kubectl create job --from=cronjob/coffee-rollout-mgr coffee-test
kubectl create job --from=cronjob/tea-rollout-mgr tea-test
```
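
If the jobs succeed, the coffee and tea Deployments should roll. One way to confirm (standard kubectl, not part of
this commit):

```shell
kubectl get jobs
kubectl rollout status deployment/coffee
kubectl rollout status deployment/tea
```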

In case of errors, double-check that you prepared the environment and launched the test correctly.

### End

- Remove the CronJobs.
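
  For example (the CronJob names come from cronjob.yaml in this commit):

  ```shell
  kubectl delete cronjob coffee-rollout-mgr tea-rollout-mgr
  ```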

## Analyze

- Traffic
  - Tester VMs (clients)
    - When wrk stops, it prints its results upon termination. To connect to the tmux session with wrk,
      run `tmux attach -t 0`.
    - Check for errors, latency, RPS.
- Logs
  - Check the logs for errors in Google Cloud Operations Logging.
    - NGF
    - NGINX
- Check metrics in Google Cloud Monitoring for the NGF pod:
  - CPU usage
    - NGINX container
    - NGF container
  - Memory usage
    - NGINX container
    - NGF container
  - NGINX metrics
    - Reloads
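
As a quick spot-check of the reload counter without the Cloud Monitoring UI, you can scrape NGF's Prometheus endpoint
directly; this sketch assumes the default `nginx-gateway` namespace, deployment name, and metrics port 9113, which may
differ in your deployment:

```shell
kubectl -n nginx-gateway port-forward deploy/nginx-gateway 9113:9113 &
curl -s http://localhost:9113/metrics | grep -i reload
```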

## Results

- [1.0.0](results/1.0.0/1.0.0.md)
37 changes: 37 additions & 0 deletions tests/longevity/manifests/cafe-routes.yaml
@@ -0,0 +1,37 @@
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: coffee
spec:
  parentRefs:
  - name: gateway
    sectionName: http
  hostnames:
  - "cafe.example.com"
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /coffee
    backendRefs:
    - name: coffee
      port: 80
---
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: tea
spec:
  parentRefs:
  - name: gateway
    sectionName: https
  hostnames:
  - "cafe.example.com"
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /tea
    backendRefs:
    - name: tea
      port: 80
8 changes: 8 additions & 0 deletions tests/longevity/manifests/cafe-secret.yaml
@@ -0,0 +1,8 @@
apiVersion: v1
kind: Secret
metadata:
  name: cafe-secret
type: kubernetes.io/tls
data:
  tls.crt: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUNzakNDQVpvQ0NRQzdCdVdXdWRtRkNEQU5CZ2txaGtpRzl3MEJBUXNGQURBYk1Sa3dGd1lEVlFRRERCQmoKWVdabExtVjRZVzF3YkdVdVkyOXRNQjRYRFRJeU1EY3hOREl4TlRJek9Wb1hEVEl6TURjeE5ESXhOVEl6T1ZvdwpHekVaTUJjR0ExVUVBd3dRWTJGbVpTNWxlR0Z0Y0d4bExtTnZiVENDQVNJd0RRWUpLb1pJaHZjTkFRRUJCUUFECmdnRVBBRENDQVFvQ2dnRUJBTHFZMnRHNFc5aStFYzJhdnV4Q2prb2tnUUx1ek10U1Rnc1RNaEhuK3ZRUmxIam8KVzFLRnMvQVdlS25UUStyTWVKVWNseis4M3QwRGtyRThwUisxR2NKSE50WlNMb0NEYUlRN0Nhck5nY1daS0o4Qgo1WDNnVS9YeVJHZjI2c1REd2xzU3NkSEQ1U2U3K2Vab3NPcTdHTVF3K25HR2NVZ0VtL1Q1UEMvY05PWE0zZWxGClRPL051MStoMzROVG9BbDNQdTF2QlpMcDNQVERtQ0thaEROV0NWbUJQUWpNNFI4VERsbFhhMHQ5Z1o1MTRSRzUKWHlZWTNtdzZpUzIrR1dYVXllMjFuWVV4UEhZbDV4RHY0c0FXaGRXbElweHlZQlNCRURjczN6QlI2bFF1OWkxZAp0R1k4dGJ3blVmcUVUR3NZdWxzc05qcU95V1VEcFdJelhibHhJZVVDQXdFQUFUQU5CZ2txaGtpRzl3MEJBUXNGCkFBT0NBUUVBcjkrZWJ0U1dzSnhLTGtLZlRkek1ISFhOd2Y5ZXFVbHNtTXZmMGdBdWVKTUpUR215dG1iWjlpbXQKL2RnWlpYVE9hTElHUG9oZ3BpS0l5eVVRZVdGQ2F0NHRxWkNPVWRhbUloOGk0Q1h6QVJYVHNvcUNOenNNLzZMRQphM25XbFZyS2lmZHYrWkxyRi8vblc0VVNvOEoxaCtQeDljY0tpRDZZU0RVUERDRGh1RUtFWXcvbHpoUDJVOXNmCnl6cEJKVGQ4enFyM3paTjNGWWlITmgzYlRhQS82di9jU2lyamNTK1EwQXg4RWpzQzYxRjRVMTc4QzdWNWRCKzQKcmtPTy9QNlA0UFlWNTRZZHMvRjE2WkZJTHFBNENCYnExRExuYWRxamxyN3NPbzl2ZzNnWFNMYXBVVkdtZ2todAp6VlZPWG1mU0Z4OS90MDBHUi95bUdPbERJbWlXMGc9PQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
  tls.key: LS0tLS1CRUdJTiBQUklWQVRFIEtFWS0tLS0tCk1JSUV2UUlCQURBTkJna3Foa2lHOXcwQkFRRUZBQVNDQktjd2dnU2pBZ0VBQW9JQkFRQzZtTnJSdUZ2WXZoSE4KbXI3c1FvNUtKSUVDN3N6TFVrNExFeklSNS9yMEVaUjQ2RnRTaGJQd0ZuaXAwMFBxekhpVkhKYy92TjdkQTVLeApQS1VmdFJuQ1J6YldVaTZBZzJpRU93bXF6WUhGbVNpZkFlVjk0RlAxOGtSbjl1ckV3OEpiRXJIUncrVW51L25tCmFMRHF1eGpFTVBweGhuRklCSnYwK1R3djNEVGx6TjNwUlV6dnpidGZvZCtEVTZBSmR6N3Rid1dTNmR6MHc1Z2kKbW9RelZnbFpnVDBJek9FZkV3NVpWMnRMZllHZWRlRVJ1VjhtR041c09va3R2aGxsMU1udHRaMkZNVHgySmVjUQo3K0xBRm9YVnBTS2NjbUFVZ1JBM0xOOHdVZXBVTHZZdFhiUm1QTFc4SjFINmhFeHJHTHBiTERZNmpzbGxBNlZpCk0xMjVjU0hsQWdNQkFBRUNnZ0VBQnpaRE50bmVTdWxGdk9HZlFYaHRFWGFKdWZoSzJBenRVVVpEcUNlRUxvekQKWlV6dHdxbkNRNlJLczUyandWNTN4cU9kUU94bTNMbjNvSHdNa2NZcEliWW82MjJ2dUczYnkwaVEzaFlsVHVMVgpqQmZCcS9UUXFlL2NMdngvSkczQWhFNmJxdFRjZFlXeGFmTmY2eUtpR1dzZk11WVVXTWs4MGVJVUxuRmZaZ1pOCklYNTlSOHlqdE9CVm9Sa3hjYTVoMW1ZTDFsSlJNM3ZqVHNHTHFybmpOTjNBdWZ3ZGRpK1VDbGZVL2l0K1EvZkUKV216aFFoTlRpNVFkRWJLVStOTnYvNnYvb2JvandNb25HVVBCdEFTUE05cmxFemIralQ1WHdWQjgvLzRGY3VoSwoyVzNpcjhtNHVlQ1JHSVlrbGxlLzhuQmZ0eVhiVkNocVRyZFBlaGlPM1FLQmdRRGlrR3JTOTc3cjg3Y1JPOCtQClpoeXltNXo4NVIzTHVVbFNTazJiOTI1QlhvakpZL2RRZDVTdFVsSWE4OUZKZnNWc1JRcEhHaTFCYzBMaTY1YjIKazR0cE5xcVFoUmZ1UVh0UG9GYXRuQzlPRnJVTXJXbDVJN0ZFejZnNkNQMVBXMEg5d2hPemFKZUdpZVpNYjlYTQoybDdSSFZOcC9jTDlYbmhNMnN0Q1lua2Iwd0tCZ1FEUzF4K0crakEyUVNtRVFWNXA1RnRONGcyamsyZEFjMEhNClRIQ2tTazFDRjhkR0Z2UWtsWm5ZbUt0dXFYeXNtekJGcnZKdmt2eUhqbUNYYTducXlpajBEdDZtODViN3BGcVAKQWxtajdtbXI3Z1pUeG1ZMXBhRWFLMXY4SDNINGtRNVl3MWdrTWRybVJHcVAvaTBGaDVpaGtSZS9DOUtGTFVkSQpDcnJjTzhkUVp3S0JnSHA1MzRXVWNCMVZibzFlYStIMUxXWlFRUmxsTWlwRFM2TzBqeWZWSmtFb1BZSEJESnp2ClIrdzZLREJ4eFoyWmJsZ05LblV0YlhHSVFZd3lGelhNcFB5SGxNVHpiZkJhYmJLcDFyR2JVT2RCMXpXM09PRkgKcmppb21TUm1YNmxhaDk0SjRHU0lFZ0drNGw1SHhxZ3JGRDZ2UDd4NGRjUktJWFpLZ0w2dVJSSUpBb0dCQU1CVApaL2p5WStRNTBLdEtEZHUrYU9ORW4zaGxUN3hrNXRKN3NBek5rbWdGMU10RXlQUk9Xd1pQVGFJbWpRbk9qbHdpCldCZ2JGcXg0M2ZlQ1Z4ZXJ6V3ZEM0txaWJVbWpCTkNMTGtYeGh3ZEVteFQwVit2NzZGYzgwaTNNYVdSNnZZR08KditwVVovL0F6UXdJcWZ6dlVmV2ZxdStrMHlhVXhQOGNlcFBIRyt0bEFvR0FmQUtVVWhqeFU0Ym5vVzVwVUhKegpwWWZXZXZ5TW54NWZyT2VsSmRmNzlvNGMvMHhVSjh1eFBFWDFkRmNrZW96dHNpaVFTNkN6MENRY09XVWxtSkRwCnVrdERvVzM3VmNSQU1BVjY3NlgxQVZlM0UwNm5aL2g2Tkd4Z28rT042Q3pwL0lkMkJPUm9IMFAxa2RjY1NLT3kKMUtFZlNnb1B0c1N1eEpBZXdUZmxDMXc9Ci0tLS0tRU5EIFBSSVZBVEUgS0VZLS0tLS0K
81 changes: 81 additions & 0 deletions tests/longevity/manifests/cafe.yaml
@@ -0,0 +1,81 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coffee
spec:
  replicas: 3
  selector:
    matchLabels:
      app: coffee
  template:
    metadata:
      labels:
        app: coffee
    spec:
      containers:
      - name: coffee
        image: nginxdemos/nginx-hello:plain-text
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /
            port: 8080
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sleep", "15"]
---
apiVersion: v1
kind: Service
metadata:
  name: coffee
spec:
  ports:
  - port: 80
    targetPort: 8080
    protocol: TCP
    name: http
  selector:
    app: coffee
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tea
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tea
  template:
    metadata:
      labels:
        app: tea
    spec:
      containers:
      - name: tea
        image: nginxdemos/nginx-hello:plain-text
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /
            port: 8080
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sleep", "15"]
---
apiVersion: v1
kind: Service
metadata:
  name: tea
spec:
  ports:
  - port: 80
    targetPort: 8080
    protocol: TCP
    name: http
  selector:
    app: tea
92 changes: 92 additions & 0 deletions tests/longevity/manifests/cronjob.yaml
@@ -0,0 +1,92 @@
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rollout-mgr
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: rollout-mgr
  namespace: default
rules:
- apiGroups:
  - "apps"
  resources:
  - deployments
  verbs:
  - patch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: rollout-mgr
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: rollout-mgr
subjects:
- kind: ServiceAccount
  name: rollout-mgr
  namespace: default
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: coffee-rollout-mgr
  namespace: default
spec:
  schedule: "* */6 * * *" # every minute for an hour, every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: rollout-mgr
          containers:
          - name: coffee-rollout-mgr
            image: curlimages/curl:8.3.0
            imagePullPolicy: IfNotPresent
            command:
            - /bin/sh
            - -c
            args:
            - |
              TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
              RESTARTED_AT=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
              curl -X PATCH -s -k -v \
                -H "Authorization: Bearer $TOKEN" \
                -H "Content-type: application/merge-patch+json" \
                --data-raw "{\"spec\": {\"template\": {\"metadata\": {\"annotations\": {\"kubectl.kubernetes.io/restartedAt\": \"$RESTARTED_AT\"}}}}}" \
                "https://kubernetes/apis/apps/v1/namespaces/default/deployments/coffee?fieldManager=kubectl-rollout" 2>&1
          restartPolicy: OnFailure
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: tea-rollout-mgr
  namespace: default
spec:
  schedule: "* 3,9,15,21 * * *" # every minute for an hour, every 6 hours, 3 hours apart from coffee
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: rollout-mgr
          containers:
          - name: tea-rollout-mgr
            image: curlimages/curl:8.3.0
            imagePullPolicy: IfNotPresent
            command:
            - /bin/sh
            - -c
            args:
            - |
              TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
              RESTARTED_AT=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
              curl -X PATCH -s -k -v \
                -H "Authorization: Bearer $TOKEN" \
                -H "Content-type: application/merge-patch+json" \
                --data-raw "{\"spec\": {\"template\": {\"metadata\": {\"annotations\": {\"kubectl.kubernetes.io/restartedAt\": \"$RESTARTED_AT\"}}}}}" \
                "https://kubernetes/apis/apps/v1/namespaces/default/deployments/tea?fieldManager=kubectl-rollout" 2>&1
          restartPolicy: OnFailure