Add longevity test plan and results (#1113)
Problem:
* We don't know if NGF can successfully process both control plane and
data plane transactions over a period of time much greater than in our
tests.
* We haven't yet tried to catch bugs that only appear over a long period
of time (like resource leaks).

Solution:
- Create a longevity test plan
- Run the test
- Document the results

CLOSES #956

Co-authored-by: bjee19 <[email protected]>
Co-authored-by: Saylor Berman <[email protected]>
3 people authored Oct 11, 2023
1 parent 704f8a8 commit 71d605e
Showing 14 changed files with 634 additions and 0 deletions.
1 change: 1 addition & 0 deletions .yamllint.yaml
@@ -41,6 +41,7 @@ rules:
      .github/workflows/
      deploy/manifests/nginx-gateway.yaml
      deploy/manifests/crds
      tests/longevity/manifests/cronjob.yaml
  new-line-at-end-of-file: enable
  new-lines: enable
  octal-values: disable
149 changes: 149 additions & 0 deletions tests/longevity/longevity.md
@@ -0,0 +1,149 @@
# Longevity Test

This document describes how we test NGF for longevity.

<!-- TOC -->

- [Longevity Test](#longevity-test)
  - [Goals](#goals)
  - [Test Environment](#test-environment)
  - [Steps](#steps)
    - [Start](#start)
    - [Check the Test is Running Correctly](#check-the-test-is-running-correctly)
    - [End](#end)
  - [Analyze](#analyze)
  - [Results](#results)

<!-- TOC -->

## Goals

- Ensure that NGF successfully processes both control plane and data plane transactions over a period of time much
  greater than in our other tests.
- Catch bugs that could only appear over a period of time (like resource leaks).

## Test Environment

- A Kubernetes cluster with 3 nodes on GKE
  - Node: e2-medium (2 vCPU, 4GB memory)
  - GKE logging enabled.
  - GKE Cloud Monitoring with managed Prometheus service enabled, collecting:
    - system metrics.
    - kube state metrics - pods, deployments.
- Tester VMs on Google Cloud:
  - Configuration:
    - Debian
    - Installed packages: tmux, wrk
  - Location - same zone as the Kubernetes cluster.
  - First VM - for sending HTTP traffic
  - Second VM - for sending HTTPS traffic
- NGF
  - Deployment with 1 replica
  - Exposed via a Service of type LoadBalancer with a private IP
  - Gateway with two listeners - HTTP and HTTPS
- Two apps:
  - Coffee - 3 replicas
  - Tea - 3 replicas
- Two HTTPRoutes:
  - Coffee (HTTP)
  - Tea (HTTPS)

## Steps

### Start

Test duration - 4 days.

1. Create a Kubernetes cluster on GKE.
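
   For example, a cluster matching the Test Environment above could be created like this (the cluster name and zone
   are placeholders, not part of this commit):

   ```shell
   # 3 nodes of machine type e2-medium, as described in the Test Environment section
   gcloud container clusters create longevity-test --num-nodes 3 --machine-type e2-medium --zone us-central1-a
   ```
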
2. Deploy NGF.
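
   A sketch of installing NGF from the manifests referenced elsewhere in this commit (these paths appear in the
   `.yamllint.yaml` ignore list; run from the repository root):

   ```shell
   kubectl apply -f deploy/manifests/crds
   kubectl apply -f deploy/manifests/nginx-gateway.yaml
   ```
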
3. Expose NGF via a LoadBalancer Service with the `"networking.gke.io/load-balancer-type":"Internal"` annotation to
   allocate an internal load balancer.
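
   A minimal sketch of such a Service, applied via a heredoc (the name, namespace, selector label, and ports here are
   assumptions based on a default NGF install, not part of this commit):

   ```shell
   kubectl apply -f - <<EOF
   apiVersion: v1
   kind: Service
   metadata:
     name: nginx-gateway
     namespace: nginx-gateway
     annotations:
       networking.gke.io/load-balancer-type: "Internal"  # allocates a private (internal) IP
   spec:
     type: LoadBalancer
     selector:
       app: nginx-gateway  # assumed NGF pod label
     ports:
     - name: http
       port: 80
       targetPort: 80
     - name: https
       port: 443
       targetPort: 443
   EOF
   ```
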
4. Apply the manifests, which will:
   1. Deploy the coffee and tea backends.
   2. Configure HTTP and HTTPS listeners on the Gateway.
   3. Expose coffee via the HTTP listener and tea via the HTTPS listener.
   4. Create two CronJobs that periodically trigger rolling restarts of the backends:
      1. Coffee - every minute for an hour, every 6 hours.
      2. Tea - every minute for an hour, every 6 hours, 3 hours apart from coffee.
   5. Configure Prometheus on GKE to pick up NGF metrics.

   ```shell
   kubectl apply -f files
   ```
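
   After applying, the two CronJobs defined in cronjob.yaml should exist; a quick check:

   ```shell
   kubectl get cronjob coffee-rollout-mgr tea-rollout-mgr
   ```
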

5. In the Tester VMs, update `/etc/hosts` to add an entry with the External IP of the NGF Service (`10.128.0.10` in
   this case):

   ```text
   10.128.0.10 cafe.example.com
   ```

6. In the Tester VMs, start a tmux session (this is needed so that any launched command keeps running even if you
   disconnect from the VM):

   ```shell
   tmux
   ```

7. In the first VM, start wrk, sending coffee traffic over HTTP for 4 days (2 threads, 100 connections, 96-hour
   duration):

   ```shell
   wrk -t2 -c100 -d96h http://cafe.example.com/coffee
   ```

8. In the second VM, start wrk, sending tea traffic over HTTPS for 4 days (2 threads, 100 connections, 96-hour
   duration):

   ```shell
   wrk -t2 -c100 -d96h https://cafe.example.com/tea
   ```

Notes:

- The updated coffee and tea backends in cafe.yaml include extra configuration for zero-downtime upgrades, so that
  wrk in the Tester VMs doesn't get 502s from NGF during the rolling restarts. Based on
  https://learnk8s.io/graceful-shutdown.
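
  The relevant part of cafe.yaml (added in full later in this commit) is the readiness probe plus a `preStop` sleep,
  which keeps a terminating pod serving in-flight requests while it is being taken out of rotation:

  ```yaml
  readinessProbe:
    httpGet:
      path: /
      port: 8080
  lifecycle:
    preStop:
      exec:
        command: ["/bin/sleep", "15"]
  ```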

### Check the Test is Running Correctly

Check that you don't see any errors:

1. Check that GKE exports NGF pod logs to Google Cloud Operations Logging and Prometheus metrics to Google Cloud
   Monitoring.
2. Check that traffic is flowing - look at the NGINX access logs in Google Cloud Operations Logging.
3. Check that the CronJobs can run by triggering them manually:

```shell
kubectl create job --from=cronjob/coffee-rollout-mgr coffee-test
kubectl create job --from=cronjob/tea-rollout-mgr tea-test
```
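
If the jobs succeed, the coffee and tea Deployments should roll. One way to confirm (standard kubectl, not part of
this commit):

```shell
kubectl get jobs
kubectl rollout status deployment/coffee
kubectl rollout status deployment/tea
```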

In case of errors, double-check that you prepared the environment and launched the test correctly.

### End

- Remove the CronJobs.
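
  For example (the CronJob names come from cronjob.yaml in this commit):

  ```shell
  kubectl delete cronjob coffee-rollout-mgr tea-rollout-mgr
  ```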

## Analyze

- Traffic
  - Tester VMs (clients)
    - When wrk stops, it prints its results upon termination. To connect to the tmux session with wrk,
      run `tmux attach -t 0`.
    - Check for errors, latency, RPS.
- Logs
  - Check the logs for errors in Google Cloud Operations Logging.
    - NGF
    - NGINX
- Check metrics in Google Cloud Monitoring for the NGF pod:
  - CPU usage
    - NGINX container
    - NGF container
  - Memory usage
    - NGINX container
    - NGF container
  - NGINX metrics
    - Reloads
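
As a quick spot-check of the reload counter without the Cloud Monitoring UI, you can scrape NGF's Prometheus endpoint
directly; this sketch assumes the default `nginx-gateway` namespace, deployment name, and metrics port 9113, which may
differ in your deployment:

```shell
kubectl -n nginx-gateway port-forward deploy/nginx-gateway 9113:9113 &
curl -s http://localhost:9113/metrics | grep -i reload
```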

## Results

- [1.0.0](results/1.0.0/1.0.0.md)
37 changes: 37 additions & 0 deletions tests/longevity/manifests/cafe-routes.yaml
@@ -0,0 +1,37 @@
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: coffee
spec:
  parentRefs:
  - name: gateway
    sectionName: http
  hostnames:
  - "cafe.example.com"
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /coffee
    backendRefs:
    - name: coffee
      port: 80
---
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: tea
spec:
  parentRefs:
  - name: gateway
    sectionName: https
  hostnames:
  - "cafe.example.com"
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /tea
    backendRefs:
    - name: tea
      port: 80
8 changes: 8 additions & 0 deletions tests/longevity/manifests/cafe-secret.yaml
@@ -0,0 +1,8 @@
apiVersion: v1
kind: Secret
metadata:
  name: cafe-secret
type: kubernetes.io/tls
data:
  tls.crt: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUNzakNDQVpvQ0NRQzdCdVdXdWRtRkNEQU5CZ2txaGtpRzl3MEJBUXNGQURBYk1Sa3dGd1lEVlFRRERCQmoKWVdabExtVjRZVzF3YkdVdVkyOXRNQjRYRFRJeU1EY3hOREl4TlRJek9Wb1hEVEl6TURjeE5ESXhOVEl6T1ZvdwpHekVaTUJjR0ExVUVBd3dRWTJGbVpTNWxlR0Z0Y0d4bExtTnZiVENDQVNJd0RRWUpLb1pJaHZjTkFRRUJCUUFECmdnRVBBRENDQVFvQ2dnRUJBTHFZMnRHNFc5aStFYzJhdnV4Q2prb2tnUUx1ek10U1Rnc1RNaEhuK3ZRUmxIam8KVzFLRnMvQVdlS25UUStyTWVKVWNseis4M3QwRGtyRThwUisxR2NKSE50WlNMb0NEYUlRN0Nhck5nY1daS0o4Qgo1WDNnVS9YeVJHZjI2c1REd2xzU3NkSEQ1U2U3K2Vab3NPcTdHTVF3K25HR2NVZ0VtL1Q1UEMvY05PWE0zZWxGClRPL051MStoMzROVG9BbDNQdTF2QlpMcDNQVERtQ0thaEROV0NWbUJQUWpNNFI4VERsbFhhMHQ5Z1o1MTRSRzUKWHlZWTNtdzZpUzIrR1dYVXllMjFuWVV4UEhZbDV4RHY0c0FXaGRXbElweHlZQlNCRURjczN6QlI2bFF1OWkxZAp0R1k4dGJ3blVmcUVUR3NZdWxzc05qcU95V1VEcFdJelhibHhJZVVDQXdFQUFUQU5CZ2txaGtpRzl3MEJBUXNGCkFBT0NBUUVBcjkrZWJ0U1dzSnhLTGtLZlRkek1ISFhOd2Y5ZXFVbHNtTXZmMGdBdWVKTUpUR215dG1iWjlpbXQKL2RnWlpYVE9hTElHUG9oZ3BpS0l5eVVRZVdGQ2F0NHRxWkNPVWRhbUloOGk0Q1h6QVJYVHNvcUNOenNNLzZMRQphM25XbFZyS2lmZHYrWkxyRi8vblc0VVNvOEoxaCtQeDljY0tpRDZZU0RVUERDRGh1RUtFWXcvbHpoUDJVOXNmCnl6cEJKVGQ4enFyM3paTjNGWWlITmgzYlRhQS82di9jU2lyamNTK1EwQXg4RWpzQzYxRjRVMTc4QzdWNWRCKzQKcmtPTy9QNlA0UFlWNTRZZHMvRjE2WkZJTHFBNENCYnExRExuYWRxamxyN3NPbzl2ZzNnWFNMYXBVVkdtZ2todAp6VlZPWG1mU0Z4OS90MDBHUi95bUdPbERJbWlXMGc9PQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
  tls.key: LS0tLS1CRUdJTiBQUklWQVRFIEtFWS0tLS0tCk1JSUV2UUlCQURBTkJna3Foa2lHOXcwQkFRRUZBQVNDQktjd2dnU2pBZ0VBQW9JQkFRQzZtTnJSdUZ2WXZoSE4KbXI3c1FvNUtKSUVDN3N6TFVrNExFeklSNS9yMEVaUjQ2RnRTaGJQd0ZuaXAwMFBxekhpVkhKYy92TjdkQTVLeApQS1VmdFJuQ1J6YldVaTZBZzJpRU93bXF6WUhGbVNpZkFlVjk0RlAxOGtSbjl1ckV3OEpiRXJIUncrVW51L25tCmFMRHF1eGpFTVBweGhuRklCSnYwK1R3djNEVGx6TjNwUlV6dnpidGZvZCtEVTZBSmR6N3Rid1dTNmR6MHc1Z2kKbW9RelZnbFpnVDBJek9FZkV3NVpWMnRMZllHZWRlRVJ1VjhtR041c09va3R2aGxsMU1udHRaMkZNVHgySmVjUQo3K0xBRm9YVnBTS2NjbUFVZ1JBM0xOOHdVZXBVTHZZdFhiUm1QTFc4SjFINmhFeHJHTHBiTERZNmpzbGxBNlZpCk0xMjVjU0hsQWdNQkFBRUNnZ0VBQnpaRE50bmVTdWxGdk9HZlFYaHRFWGFKdWZoSzJBenRVVVpEcUNlRUxvekQKWlV6dHdxbkNRNlJLczUyandWNTN4cU9kUU94bTNMbjNvSHdNa2NZcEliWW82MjJ2dUczYnkwaVEzaFlsVHVMVgpqQmZCcS9UUXFlL2NMdngvSkczQWhFNmJxdFRjZFlXeGFmTmY2eUtpR1dzZk11WVVXTWs4MGVJVUxuRmZaZ1pOCklYNTlSOHlqdE9CVm9Sa3hjYTVoMW1ZTDFsSlJNM3ZqVHNHTHFybmpOTjNBdWZ3ZGRpK1VDbGZVL2l0K1EvZkUKV216aFFoTlRpNVFkRWJLVStOTnYvNnYvb2JvandNb25HVVBCdEFTUE05cmxFemIralQ1WHdWQjgvLzRGY3VoSwoyVzNpcjhtNHVlQ1JHSVlrbGxlLzhuQmZ0eVhiVkNocVRyZFBlaGlPM1FLQmdRRGlrR3JTOTc3cjg3Y1JPOCtQClpoeXltNXo4NVIzTHVVbFNTazJiOTI1QlhvakpZL2RRZDVTdFVsSWE4OUZKZnNWc1JRcEhHaTFCYzBMaTY1YjIKazR0cE5xcVFoUmZ1UVh0UG9GYXRuQzlPRnJVTXJXbDVJN0ZFejZnNkNQMVBXMEg5d2hPemFKZUdpZVpNYjlYTQoybDdSSFZOcC9jTDlYbmhNMnN0Q1lua2Iwd0tCZ1FEUzF4K0crakEyUVNtRVFWNXA1RnRONGcyamsyZEFjMEhNClRIQ2tTazFDRjhkR0Z2UWtsWm5ZbUt0dXFYeXNtekJGcnZKdmt2eUhqbUNYYTducXlpajBEdDZtODViN3BGcVAKQWxtajdtbXI3Z1pUeG1ZMXBhRWFLMXY4SDNINGtRNVl3MWdrTWRybVJHcVAvaTBGaDVpaGtSZS9DOUtGTFVkSQpDcnJjTzhkUVp3S0JnSHA1MzRXVWNCMVZibzFlYStIMUxXWlFRUmxsTWlwRFM2TzBqeWZWSmtFb1BZSEJESnp2ClIrdzZLREJ4eFoyWmJsZ05LblV0YlhHSVFZd3lGelhNcFB5SGxNVHpiZkJhYmJLcDFyR2JVT2RCMXpXM09PRkgKcmppb21TUm1YNmxhaDk0SjRHU0lFZ0drNGw1SHhxZ3JGRDZ2UDd4NGRjUktJWFpLZ0w2dVJSSUpBb0dCQU1CVApaL2p5WStRNTBLdEtEZHUrYU9ORW4zaGxUN3hrNXRKN3NBek5rbWdGMU10RXlQUk9Xd1pQVGFJbWpRbk9qbHdpCldCZ2JGcXg0M2ZlQ1Z4ZXJ6V3ZEM0txaWJVbWpCTkNMTGtYeGh3ZEVteFQwVit2NzZGYzgwaTNNYVdSNnZZR08KditwVVovL0F6UXdJcWZ6dlVmV2ZxdStrMHlhVXhQOGNlcFBIRyt0bEFvR0FmQUtVVWhqeFU0Ym5vVzVwVUhKegpwWWZXZXZ5TW54NWZyT2VsSmRmNzlvNGMvMHhVSjh1eFBFWDFkRmNrZW96dHNpaVFTNkN6MENRY09XVWxtSkRwCnVrdERvVzM3VmNSQU1BVjY3NlgxQVZlM0UwNm5aL2g2Tkd4Z28rT042Q3pwL0lkMkJPUm9IMFAxa2RjY1NLT3kKMUtFZlNnb1B0c1N1eEpBZXdUZmxDMXc9Ci0tLS0tRU5EIFBSSVZBVEUgS0VZLS0tLS0K
81 changes: 81 additions & 0 deletions tests/longevity/manifests/cafe.yaml
@@ -0,0 +1,81 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coffee
spec:
  replicas: 3
  selector:
    matchLabels:
      app: coffee
  template:
    metadata:
      labels:
        app: coffee
    spec:
      containers:
      - name: coffee
        image: nginxdemos/nginx-hello:plain-text
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /
            port: 8080
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sleep", "15"]
---
apiVersion: v1
kind: Service
metadata:
  name: coffee
spec:
  ports:
  - port: 80
    targetPort: 8080
    protocol: TCP
    name: http
  selector:
    app: coffee
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tea
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tea
  template:
    metadata:
      labels:
        app: tea
    spec:
      containers:
      - name: tea
        image: nginxdemos/nginx-hello:plain-text
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /
            port: 8080
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sleep", "15"]
---
apiVersion: v1
kind: Service
metadata:
  name: tea
spec:
  ports:
  - port: 80
    targetPort: 8080
    protocol: TCP
    name: http
  selector:
    app: tea
92 changes: 92 additions & 0 deletions tests/longevity/manifests/cronjob.yaml
@@ -0,0 +1,92 @@
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rollout-mgr
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: rollout-mgr
  namespace: default
rules:
- apiGroups:
  - "apps"
  resources:
  - deployments
  verbs:
  - patch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: rollout-mgr
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: rollout-mgr
subjects:
- kind: ServiceAccount
  name: rollout-mgr
  namespace: default
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: coffee-rollout-mgr
  namespace: default
spec:
  schedule: "* */6 * * *" # every minute for an hour, every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: rollout-mgr
          containers:
          - name: coffee-rollout-mgr
            image: curlimages/curl:8.3.0
            imagePullPolicy: IfNotPresent
            command:
            - /bin/sh
            - -c
            args:
            - |
              TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
              RESTARTED_AT=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
              curl -X PATCH -s -k -v \
                -H "Authorization: Bearer $TOKEN" \
                -H "Content-type: application/merge-patch+json" \
                --data-raw "{\"spec\": {\"template\": {\"metadata\": {\"annotations\": {\"kubectl.kubernetes.io/restartedAt\": \"$RESTARTED_AT\"}}}}}" \
                "https://kubernetes/apis/apps/v1/namespaces/default/deployments/coffee?fieldManager=kubectl-rollout" 2>&1
          restartPolicy: OnFailure
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: tea-rollout-mgr
  namespace: default
spec:
  schedule: "* 3,9,15,21 * * *" # every minute for an hour, every 6 hours, 3 hours apart from coffee
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: rollout-mgr
          containers:
          - name: tea-rollout-mgr
            image: curlimages/curl:8.3.0
            imagePullPolicy: IfNotPresent
            command:
            - /bin/sh
            - -c
            args:
            - |
              TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
              RESTARTED_AT=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
              curl -X PATCH -s -k -v \
                -H "Authorization: Bearer $TOKEN" \
                -H "Content-type: application/merge-patch+json" \
                --data-raw "{\"spec\": {\"template\": {\"metadata\": {\"annotations\": {\"kubectl.kubernetes.io/restartedAt\": \"$RESTARTED_AT\"}}}}}" \
                "https://kubernetes/apis/apps/v1/namespaces/default/deployments/tea?fieldManager=kubectl-rollout" 2>&1
          restartPolicy: OnFailure