This document explains how to install and run the project on a local Lima cluster.
To set up GPG commit signing:

- Download gpg here
- Generate a new gpg key
- Add the gpg key to your GitHub account
- Tell git about your signing key
- Automatically sign all commits
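As an illustration, a minimal sketch of the last two steps, assuming a key has already been generated (the key ID below is a placeholder, replace it with your own):

```sh
# Find the long ID of the GPG key you want to sign with
gpg --list-secret-keys --keyid-format=long

# Tell git about your signing key and sign all commits automatically
git config --global user.signingkey 3AA5C34371567BD2
git config --global commit.gpgsign true
```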
To get started, we need to have the following software installed:
- docker
- lima >= v0.14.0
- golangci-lint
- Kubebuilder Prerequisites (go, docker, kubectl, kubebuilder, controller-gen)
- helm
- yamlfmt
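A quick way to confirm the main tools are on your PATH (the version flags shown are the common ones; adjust if your installation differs):

```sh
docker --version
limactl --version
golangci-lint --version
go version
kubectl version --client
helm version
```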
In order to run the unit tests with `make test`, you'll need to install envtest with the following commands:
```sh
export ENVTEST_ASSETS_DIR="/usr/local/kubebuilder"
mkdir -p ${ENVTEST_ASSETS_DIR}
test -f ${ENVTEST_ASSETS_DIR}/setup-envtest.sh || curl -sSLo ${ENVTEST_ASSETS_DIR}/setup-envtest.sh https://raw.githubusercontent.com/kubernetes-sigs/controller-runtime/v0.8.3/hack/setup-envtest.sh
source ${ENVTEST_ASSETS_DIR}/setup-envtest.sh; fetch_envtest_tools ${ENVTEST_ASSETS_DIR}; setup_envtest_env ${ENVTEST_ASSETS_DIR};
```
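As a quick sanity check afterwards (this assumes the controller-runtime setup script placed the control-plane binaries under `${ENVTEST_ASSETS_DIR}/bin`):

```sh
ls ${ENVTEST_ASSETS_DIR}/bin   # should list the fetched envtest binaries
make test
```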
NOTE: (experimental) you can start the stack using cgroups v2 by running `CGROUPS=v2 make lima-all`. Some features of the chaos-controller may not work in this mode as of today.
NOTE: sometimes the instance can hang waiting for SSH for some reason. When that happens, the best bet is to run `make lima-stop` and then run `make lima-all` again.
Once you have installed the above requirements, run the `make lima-all` command to spin up a local stack. This command will run the following targets:

- `make lima-start` to create the lima VM with containerd and Kubernetes (backed by k3s)
- `make lima-kubectx` to add the lima Kubernetes cluster config to your local configs and switch to the lima context
- `make lima-install-cert-manager` to install cert-manager
- `make lima-push-all` to build and push the chaos-controller images
- `make lima-install` to render and apply the chaos-controller helm chart
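Once `make lima-all` completes, a quick way to check the stack came up (assuming the kube context switch succeeded; the cert-manager namespace is the chart's default and may differ in your setup):

```sh
limactl list                      # the lima VM should be Running
kubectl get nodes                 # the k3s node should be Ready
kubectl get pods -n cert-manager  # cert-manager components should be Running
```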
Once the instance is started, you can log into it using either the `lima` command or its longer form `limactl shell <$LIMA_INSTANCE>`.

We are not using `default` as our instance name on lima anymore. The alias `lima` can still be used as before to quickly jump into your instance shell (e.g. `lima uname -a`).
More specifically:

```
Usage: lima [COMMAND...]
lima is an alias for "limactl shell default".
The instance name ("default") can be changed by specifying $LIMA_INSTANCE.
The shell and initial workdir inside the instance can be specified via $LIMA_SHELL
and $LIMA_WORKDIR.
See `limactl shell --help` for further information.
```
Just define `export LIMA_INSTANCE=$(whoami | tr "." "-")` in your `.zshrc` or similar.
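For instance, with that variable set, the alias targets your named instance (a minimal sketch):

```sh
# In ~/.zshrc or similar
export LIMA_INSTANCE=$(whoami | tr "." "-")

# The alias then runs commands inside that instance
lima uname -a   # equivalent to: limactl shell ${LIMA_INSTANCE} uname -a
```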
In case you have a Datadog account and want to install the Datadog Agent into your local cluster to retrieve logs/metrics/traces from it:
NB: it is recommended to properly tag/isolate your local workload from your PROD workload; check with your Datadog account admin how to adapt tagging accordingly and confirm which configuration should be applied to your Datadog Agent.
```sh
security add-generic-password -a ${USER} -s staging_datadog_api_key -w
security add-generic-password -a ${USER} -s staging_datadog_app_key -w
# security delete-generic-password -a ${USER} -s staging_datadog_api_key
# security delete-generic-password -a ${USER} -s staging_datadog_app_key

# in your .zshrc or similar you can then do:
export STAGING_DATADOG_API_KEY=$(security find-generic-password -a ${USER} -s staging_datadog_api_key -w)
export STAGING_DATADOG_APP_KEY=$(security find-generic-password -a ${USER} -s staging_datadog_app_key -w)

# choose the appropriate site here: https://docs.datadoghq.com/getting_started/site/
# only for `make open-dd`
export STAGING_DD_SITE=https://app.datadoghq.com
```
- Run `make lima-install-datadog-agent`
- Run `make open-dd` to open your default browser to the infrastructure page of your host
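To confirm the agent came up, a simple cross-namespace look works (the namespace and labels depend on how the chart is installed, so this sketch avoids assuming either):

```sh
kubectl get pods --all-namespaces | grep -i datadog
```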
To deploy changes made to the controller code or chart, run the `make lima-redeploy` command, which will run the following targets:

- `make lima-push-all` to build the chaos-controller images locally and push them into the k3s cluster
- `make lima-install` to render and apply the chaos-controller helm chart
- `make lima-restart` to restart the chaos-controller manager pod
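After a redeploy, you can confirm the manager pod was restarted (the `chaos-engineering` namespace is the one referenced elsewhere in this document; adjust if yours differs):

```sh
kubectl -n chaos-engineering get pods
```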
The samples directory contains sample data which can be used to test your changes. `demo.yaml` contains testing resources you can apply directly to your cluster. First, create the `chaos-demo` namespace, then bring up the demo pods:
```sh
kubectl apply -f examples/namespace.yaml
kubectl apply -f examples/demo.yaml
```
To see whether the curls are succeeding, tail the pod's logs with kubectl:

```sh
kubectl -n chaos-demo logs -f -l app=demo-curl -c curl
```
Once you have defined your test manifest, run:

```sh
kubectl apply -f examples/<manifest>.yaml
```

NB: a good example to start with is `kubectl apply -f examples/network_drop.yaml`, which will block all outgoing traffic from the `demo-curl` pod.

To remove the disruption, run:

```sh
kubectl delete -f examples/<manifest>.yaml
```

(e.g. `kubectl delete -f examples/network_drop.yaml`)
See the development guide for more robust documentation and tips!
Once you have installed the standard requirements, you may want to use `longhorn` as a storage class in order to reliably test disk throttling locally.
NB: to detect whether you need `longhorn` or not, simply install a disk disruption, e.g. `kubectl apply -f examples/disk_pressure_read.yaml`. If the `chaos-engineering/chaos-disk-pressure-read-XXXXX` pod is NOT in status `Running` and outputs logs similar to the following, you will need to install `longhorn` as explained below:

```
error initializing the disk pressure injector","disruptionName":"disk-pressure-read","disruptionNamespace":"chaos-demo","targetName":"demo-curl-588bd4ffc8-q5wnk","targetNodeName":"lima","error":"error initializing disk informer: error executing ls command: exit status 2\noutput: ls: cannot access '/dev/disk/by-label/data-volume': No such file or directory\n
```

Delete the failing disk disruption before proceeding to the next sections:
```sh
kubectl delete -f examples/disk_pressure_read.yaml
kubectl patch \
  --type json \
  --patch='[ { "op": "remove", "path": "/metadata/finalizers" } ]' \
  -n chaos-engineering pods/<disruption-pod-name>
```
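`<disruption-pod-name>` is left as a placeholder; one way to find it, assuming the pod name pattern shown above:

```sh
kubectl -n chaos-engineering get pods | grep chaos-disk-pressure-read
```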
First, install `longhorn`:

```sh
make install-longhorn
```
Wait for the `longhorn` components to be `Running` (this may take several minutes):

```sh
kubectl get pods -n longhorn-system
```
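Alternatively, you can block until the pods report ready (a sketch; the 10-minute timeout is an arbitrary assumption):

```sh
kubectl wait --for=condition=Ready pods --all -n longhorn-system --timeout=600s
```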
Then, if you already created `examples/demo.yaml`, delete and re-create it:

```sh
kubectl delete -f examples/demo.yaml
kubectl apply -f examples/demo.yaml
```

This ensures the PVC is created with the proper default storage class, which has changed following the longhorn installation.
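As an optional check (assuming the demo PVC is named `demo`, as in the error message below, and lives in the `chaos-demo` namespace):

```sh
kubectl -n chaos-demo get pvc demo -o jsonpath='{.spec.storageClassName}{"\n"}'
```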
NB: if you encounter an error like `Error from server (Forbidden): error when creating "examples/demo.yaml": persistentvolumeclaims "demo" is forbidden: Internal error occurred: 2 default StorageClasses were found`, you may want to edit the storage classes and keep a single default one: run `kubectl edit storageClass`, look for the `storageclass.kubernetes.io/is-default-class` annotation and remove it from the storage classes that are not `longhorn`. Once done, re-apply `demo`: `kubectl apply -f examples/demo.yaml`
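A non-interactive sketch of the same fix; `<old-default>` is a placeholder for the storage class that should no longer be the default (on k3s this is typically `local-path`, but verify with `kubectl get storageclass`):

```sh
kubectl patch storageclass <old-default> \
  -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "false"}}}'
```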
Once the demo pods are running, you can look at throttling in the dedicated containers:
- To validate read throttling, look at the `read-file` container logs: `kubectl logs --follow --timestamps --prefix --since 1m --tail 20 -c read-file -n chaos-demo -l app=demo-curl`
- Then apply read pressure to the pod: `kubectl apply -f examples/disk_pressure_read.yaml`
- Once the disruption is applied (the chaos-engineering pod status is Running), you should witness the slower reading speed (limited to 1.0 kB/s), then remove it: `kubectl delete -f examples/disk_pressure_read.yaml`
- Following the disruption deletion, the reading speed should be back to normal
- To validate write throttling, look at the `write-file` container logs: `kubectl logs --follow --timestamps --prefix --since 1m --tail 20 -c write-file -n chaos-demo -l app=demo-curl`
- Then apply write pressure to the pod: `kubectl apply -f examples/disk_pressure_write.yaml`
- Once the disruption is applied (the chaos-engineering pod status is Running), you should witness the slower writing speed (limited to 1.0 kB/s), then remove it: `kubectl delete -f examples/disk_pressure_write.yaml`
- Following the disruption deletion, the writing speed should be back to normal
The gRPC disruption cannot be tested on the nginx client/server pods. To modify and test gRPC disruption code, visit the dogfood README.md and the dogfood CONTRIBUTING.md documents.
The project contains end-to-end tests which are meant to run against a real Kubernetes cluster. You can run them easily with the `make e2e-test` command. Please ensure that the following requirements are met before running the command:

- you deployed your changes locally (see Deploying local changes to Lima)
- your Kubernetes context is set to Lima (`kubectx lima` for instance)

The end-to-end tests will create a set of dummy pods in the `default` namespace of the Lima cluster that will be used to inject different kinds of failures and make assertions against those injections.
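For instance, a typical run against a freshly deployed stack might look like this (assuming kubectx is installed and the lima context is named `lima`):

```sh
kubectx lima
kubectl get nodes   # confirm you are pointing at the lima cluster
make e2e-test
```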
In case you have a Datadog account and want to push the test results to it, you can do the following:

- Create an API key here
- Store it securely and add it to your `.zshrc`:
```sh
security add-generic-password -a ${USER} -s datadog_api_key -w
# security delete-generic-password -a ${USER} -s datadog_api_key

# in your .zshrc or similar you can then do:
export DATADOG_API_KEY=$(security find-generic-password -a ${USER} -s datadog_api_key -w)
```
- Install `datadog-ci` by running `make install-datadog-ci`
- Run the tests: `make test && make e2e-test`
- Go to your test services in Datadog
3rd-party references and licenses are kept in the LICENSE-3rdparty.csv file. This file has been generated by the tasks/thirdparty.py script. If any vendor is updated, added or removed, this file must be updated as well.
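A likely way to regenerate it is to run the script directly (an assumption: it may instead be wired up as a make or invoke target in your setup, and may require arguments):

```sh
python3 tasks/thirdparty.py
```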