Deployment fails when etcd servers are not members of kube_control_plane #11682

jctoussaint · 2024-11-02T13:19:09Z

What happened?

The task Gen_certs | Gather node certs fails with this message:

ok: [k8s-mst1 -> k8s-etcd1(192.168.0.21)]
ok: [k8s-mst2 -> k8s-etcd1(192.168.0.21)]
fatal: [k8s-worker1 -> k8s-etcd1(192.168.0.21)]: FAILED! => {"changed": false, "cmd": "set -o pipefail && tar cfz - -C /etc/ssl/etcd/ssl ca.pem node-k8s-worker1.pem node-k8s-worker1-key.pem | base64 --wrap=0", "delta": "0:00:00.048485", "end": "2024-11-02 11:57:31.981817", "msg": "non-zero return code", "rc": 2, "start": "2024-11-02 11:57:31.933332", "stderr": "tar: node-k8s-worker1.pem : stat impossible: Aucun fichier ou dossier de ce type

The files node-k8s-worker1.pem and node-k8s-worker1-key.pem exist neither on k8s-worker1 nor on k8s-etcd1. (The French tar message translates to "cannot stat: No such file or directory".)
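To see which node certificates are absent before rerunning the playbook, a small check along these lines can help. It is only a sketch: `missing_node_certs` is a hypothetical helper name, and the only assumption taken from the error above is the `node-<host>.pem` / `node-<host>-key.pem` naming.

```shell
# missing_node_certs is a hypothetical helper, not part of Kubespray.
# Usage: missing_node_certs CERT_DIR HOST...
# Prints every expected node-<host>.pem / node-<host>-key.pem that is absent.
missing_node_certs() {
  dir=$1
  shift
  for host in "$@"; do
    for f in "node-${host}.pem" "node-${host}-key.pem"; do
      [ -f "$dir/$f" ] || echo "$f"
    done
  done
}
```

For example, on k8s-etcd1: `missing_node_certs /etc/ssl/etcd/ssl k8s-worker1 k8s-worker2 k8s-worker3`. An empty output means all files are present.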

What did you expect to happen?

On k8s-etcd1, the files node-k8s-worker1.pem and node-k8s-worker1-key.pem should exist.

How can we reproduce it (as minimally and precisely as possible)?

Use 3 dedicated etcd servers (hosts in the etcd group that are not in kube_control_plane).

Deploy with this command:

source ~/ansible-kubespray/bin/activate
cd kubespray
ansible-playbook -f 10 -i inventory/homecluster/inventory.ini --become --become-user=root cluster.yml -e 'unsafe_show_logs=True'

OS

Linux 6.1.0-26-amd64 x86_64
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"

Version of Ansible

ansible [core 2.16.12]
config file = /home/me/kubespray/ansible.cfg
configured module search path = ['/home/me/kubespray/library']
ansible python module location = /home/me/ansible-kubespray/lib/python3.11/site-packages/ansible
ansible collection location = /home/me/.ansible/collections:/usr/share/ansible/collections
executable location = /home/me/ansible-kubespray/bin/ansible
python version = 3.11.2 (main, Aug 26 2024, 07:20:54) [GCC 12.2.0] (/home/me/ansible-kubespray/bin/python3)
jinja version = 3.1.4
libyaml = True

Version of Python

Python 3.11.2

Version of Kubespray (commit)

e5bdb3b

Network plugin used

cilium

Full inventory with variables

[all]
k8s-mst1    ansible_host=192.168.0.11
k8s-mst2    ansible_host=192.168.0.12
k8s-etcd1   ansible_host=192.168.0.21 etcd_member_name=etcd1
k8s-etcd2   ansible_host=192.168.0.22 etcd_member_name=etcd2
k8s-etcd3   ansible_host=192.168.0.23 etcd_member_name=etcd3
k8s-worker1 ansible_host=192.168.0.31
k8s-worker2 ansible_host=192.168.0.32
k8s-worker3 ansible_host=192.168.0.33

[kube_control_plane]
k8s-mst1
k8s-mst2

[etcd]
k8s-etcd1
k8s-etcd2
k8s-etcd3

[kube_node]
k8s-worker1
k8s-worker2
k8s-worker3

[calico_rr]

[k8s_cluster:children]
kube_control_plane
kube_node
calico_rr

Command used to invoke ansible

ansible-playbook -f 10 -i inventory/homecluster/inventory.ini --become --become-user=root cluster.yml -e 'unsafe_show_logs=True'

Output of ansible run

ok: [k8s-mst1 -> k8s-etcd1(192.168.0.21)]
ok: [k8s-mst2 -> k8s-etcd1(192.168.0.21)]
fatal: [k8s-worker1 -> k8s-etcd1(192.168.0.21)]: FAILED! => {"changed": false, "cmd": "set -o pipefail && tar cfz - -C /etc/ssl/etcd/ssl ca.pem node-k8s-worker1.pem node-k8s-worker1-key.pem | base64 --wrap=0", "delta": "0:00:00.048485", "end": "2024-11-02 11:57:31.981817", "msg": "non-zero return code", "rc": 2, "start": "2024-11-02 11:57:31.933332", "stderr": "tar: node-k8s-worker1.pem : stat impossible: Aucun fichier ou dossier de ce type

Anything else we need to know

I worked around this issue as follows:

Create the worker certificates on k8s-etcd1:

# on k8s-etcd1
HOSTS=k8s-worker1 /usr/local/bin/etcd-scripts/make-ssl-etcd.sh -f /etc/ssl/etcd/openssl.conf -d /etc/ssl/etcd/ssl/
HOSTS=k8s-worker2 /usr/local/bin/etcd-scripts/make-ssl-etcd.sh -f /etc/ssl/etcd/openssl.conf -d /etc/ssl/etcd/ssl/
HOSTS=k8s-worker3 /usr/local/bin/etcd-scripts/make-ssl-etcd.sh -f /etc/ssl/etcd/openssl.conf -d /etc/ssl/etcd/ssl/

Deploy only etcd (with --tags=etcd):

ansible-playbook -f 10 -i inventory/homecluster/inventory.ini --become --become-user=root cluster.yml -e 'unsafe_show_logs=True' --tags=etcd

Rerun the full deployment without --tags=etcd.
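The three per-worker commands in the first step can be folded into a loop. This is a sketch of the same workaround, not a fix: the script path, config file, and output directory are copied from the commands above, while `make_worker_certs` and `DRY_RUN` are hypothetical names introduced here.

```shell
# Paths copied from the workaround commands; run on k8s-etcd1 as root.
SCRIPT=/usr/local/bin/etcd-scripts/make-ssl-etcd.sh
CONF=/etc/ssl/etcd/openssl.conf
CERT_DIR=/etc/ssl/etcd/ssl/

# make_worker_certs is a hypothetical helper name.
# With DRY_RUN=1 it only prints the commands it would run.
make_worker_certs() {
  for host in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "HOSTS=$host $SCRIPT -f $CONF -d $CERT_DIR"
    else
      HOSTS="$host" "$SCRIPT" -f "$CONF" -d "$CERT_DIR"
    fi
  done
}
```

Setting `DRY_RUN=1` and calling `make_worker_certs k8s-worker1 k8s-worker2 k8s-worker3` prints the commands for review before anything is executed.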


VannTen · 2024-11-09T13:40:15Z

Is that reproducible with a setup like this:

[kube_control_plane]
node-1

[etcd]
node-1
node-2
node-3

[kube_node]
node-1
node-2
node-3
node-4

?

(This is the node-etcd-client setup, which is tested in CI, so if it does not catch this kind of thing we need to tweak it.)

jctoussaint · 2024-11-10T16:58:15Z

I'll test it.

But I think it will work because node-1 is in kube_control_plane and etcd.

jctoussaint · 2024-11-10T21:03:30Z

It worked on the first try:

PLAY RECAP *****************************************************************************************************************
k8s-test1                  : ok=697  changed=154  unreachable=0    failed=0    skipped=1084 rescued=0    ignored=3   
k8s-test2                  : ok=561  changed=121  unreachable=0    failed=0    skipped=673  rescued=0    ignored=2   
k8s-test3                  : ok=561  changed=121  unreachable=0    failed=0    skipped=673  rescued=0    ignored=2   
k8s-test4                  : ok=512  changed=104  unreachable=0    failed=0    skipped=669  rescued=0    ignored=1

VannTen · 2024-11-13T18:25:01Z

Hmm, it looks like the conditions are:
- separate etcd / master
- nodes are etcd clients (e.g., calico using the etcd store)
- maybe node != control plane? Not sure about this one.

It would be helpful if you could test that; otherwise I'll start a PR with that as a new test case when I can.
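The first suspected condition (etcd hosts separate from the control plane) can be checked directly against an inventory file. The sketch below assumes a flat INI inventory like the ones in this thread; `group_hosts` and `etcd_overlaps_control_plane` are hypothetical helper names, and the other suspected conditions are not covered.

```shell
# group_hosts GROUP INVENTORY: print the host names listed under [GROUP]
# (first field of each non-empty, non-comment line in that section).
group_hosts() {
  awk -v g="[$1]" '
    /^\[/ { in_group = ($1 == g); next }
    in_group && NF && $1 !~ /^#/ { print $1 }
  ' "$2"
}

# etcd_overlaps_control_plane INVENTORY: succeed (exit 0) if at least one
# host appears in both [etcd] and [kube_control_plane].
etcd_overlaps_control_plane() {
  inv=$1
  for h in $(group_hosts etcd "$inv"); do
    if group_hosts kube_control_plane "$inv" | grep -qxF -- "$h"; then
      return 0
    fi
  done
  return 1
}
```

With the reporter's inventory the check fails (no shared host), while the CI-style layout above, where node-1 is in both groups, passes. For example: `etcd_overlaps_control_plane inventory/homecluster/inventory.ini || echo "etcd is separate from the control plane"`.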

jctoussaint added the kind/bug label on Nov 2, 2024
