VXLAN not working when tunnel address is borrowed #6160

Closed
cyclinder opened this issue May 31, 2022 · 11 comments · Fixed by #9662

Comments

@cyclinder
Contributor

cyclinder commented May 31, 2022

Expected Behavior

The IP of the vxlan.calico interface should not be assigned from another node's block.

Current Behavior

I understand that each node should have at least one block, but when IPs are scarce or the cluster has many nodes, a newly joined node may not get a full block of its own. In that case the new node's vxlan.calico IP may be assigned from another node's block, which makes the new node's VXLAN IP unreachable and causes DNS queries from pods on the new node to time out.
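A quick way to check whether a node's tunnel address is borrowed is to compare the node's VXLAN tunnel annotation with the IPAM block it was allocated from. A minimal sketch, assuming calicoctl is configured against the cluster; <node-name> and <tunnel-ip> are placeholders:

# Read the node's VXLAN tunnel address from the Calico node annotation
kubectl get node <node-name> -o yaml | grep IPv4VXLANTunnelAddr

# Check which IPAM block that address was allocated from, and to which node
calicoctl ipam show --ip=<tunnel-ip>
calicoctl ipam show --show-blocks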

Possible Solution

Steps to Reproduce (for bugs)

  1. Create a four-node cluster:
[root@dce-10-29-12-122 ~]# kubectl get nodes -o wide
NAME               STATUS   ROLES             AGE   VERSION    INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION           CONTAINER-RUNTIME
dce-10-29-12-122   Ready    master,registry   8h    v1.18.20   10.29.12.122   <none>        CentOS Linux 7 (Core)   3.10.0-1160.el7.x86_64   docker://20.10.7
dce-10-29-12-123   Ready    infrastructure    8h    v1.18.20   10.29.12.123   <none>        CentOS Linux 7 (Core)   3.10.0-1160.el7.x86_64   docker://20.10.7
dce-10-29-12-124   Ready    infrastructure    8h    v1.18.20   10.29.12.124   <none>        CentOS Linux 7 (Core)   3.10.0-1160.el7.x86_64   docker://20.10.7
dce-10-29-12-125   Ready    <none>            8h    v1.18.20   10.29.12.125   <none>        CentOS Linux 7 (Core)   3.10.0-1160.el7.x86_64   docker://20.10.7
  2. Create the default-ipv4-ippool:
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  blockSize: 28
  cidr: 172.29.0.0/26
  ipipMode: Never
  natOutgoing: true
  nodeSelector: all()
  vxlanMode: Always

This means there are at most four blocks. With four nodes, everything works fine.

  3. Now I join a new node (dce-10-29-12-112). Since there are no spare blocks, the new node's VXLAN IP is allocated from another node's block:
[root@dce-10-29-12-122 ~]# kubectl get nodes -w
NAME               STATUS     ROLES             AGE   VERSION
dce-10-29-12-112   NotReady   <none>            42s   v1.18.20
dce-10-29-12-122   Ready      master,registry   9h    v1.18.20
dce-10-29-12-123   Ready      infrastructure    8h    v1.18.20
dce-10-29-12-124   Ready      infrastructure    8h    v1.18.20
dce-10-29-12-125   Ready      <none>            8h    v1.18.20

dce-10-29-12-112   Ready      master,registry   9h    v1.18.20
[root@dce-10-29-12-112 ~]# ip a show vxlan.calico
56: vxlan.calico: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
    link/ether 66:ec:93:99:93:da brd ff:ff:ff:ff:ff:ff
    inet 172.29.0.62/32 brd 172.29.0.62 scope global vxlan.calico
       valid_lft forever preferred_lft forever
    inet6 fe80::64ec:93ff:fe99:93da/64 scope link
       valid_lft forever preferred_lft forever
[root@dce-10-29-12-122 ~]# calicoctl ipam show --show-blocks
+----------+---------------------------------------+-----------+------------+------------------+
| GROUPING |                 CIDR                  | IPS TOTAL | IPS IN USE |     IPS FREE     |
+----------+---------------------------------------+-----------+------------+------------------+
| IP Pool  | 172.29.0.0/26                         |        64 | 10 (16%)   | 54 (84%)         |
| Block    | 172.29.0.0/28                         |        16 | 1 (6%)     | 15 (94%)         |
| Block    | 172.29.0.16/28                        |        16 | 1 (6%)     | 15 (94%)         |
| Block    | 172.29.0.32/28                        |        16 | 2 (12%)    | 14 (88%)         |
| Block    | 172.29.0.48/28                        |        16 | 6 (38%)    | 10 (62%)         |
| IP Pool  | fdff:ffff:ffff:ffff::/96              | 4.295e+09 | 6 (0%)     | 4.295e+09 (100%) |

172.29.0.62 belongs to block 172.29.0.48/28, i.e. another node's block.

  4. Ping the VXLAN IP of the new node from an old node; it fails (see the route check sketched after this list):
[root@dce-10-29-12-122 ~]# ping 172.29.0.62
PING 172.29.0.62 (172.29.0.62) 56(84) bytes of data.
^C
--- 172.29.0.62 ping statistics ---
6 packets transmitted, 0 received, 100% packet loss, time 5001ms
  5. DNS queries fail in a pod on the new node:

test-112 is a test pod created on the newly joined node (dce-10-29-12-112).
test-125 is a test pod created on an old node (dce-10-29-12-125).

[root@dce-10-29-12-122 ~]# kubectl get po -o wide
NAME             READY   STATUS              RESTARTS   AGE     IP              NODE               NOMINATED NODE   READINESS GATES
test-112         1/1     Running             0          2d17h   172.29.0.60     dce-10-29-12-112   <none>           <none>
test-125         1/1     Running             0          2d17h   172.29.0.36     dce-10-29-12-125   <none>           <none>
[root@dce-10-29-12-122 ~]# kubectl exec -it test-125 sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl kubectl exec [POD] -- [COMMAND] instead.
/ # nslookup kubernetes.default.svc.cluster.local
Server:		172.31.0.10
Address:	172.31.0.10:53


Name:	kubernetes.default.svc.cluster.local
Address: 172.31.0.1

/ # exit
[root@dce-10-29-12-122 ~]# kubectl exec -it test2-112 sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl kubectl exec [POD] -- [COMMAND] instead.
/ # nslookup kubernetes.default.svc.cluster.local
;; connection timed out; no servers could be reached

/ #
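To see why the ping in step 4 fails, the route an old node picks for the borrowed address can be checked directly. A minimal sketch using the addresses from the outputs above (exact output will vary):

# Which route would an old node use to reach the borrowed tunnel IP?
# The best match is the /28 block route, which points at the block's
# affine node rather than at the new node that holds 172.29.0.62.
ip route get 172.29.0.62

# Confirm which node the address was actually assigned to
calicoctl ipam show --ip=172.29.0.62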

Context

Your Environment

  • Calico version
  • Orchestrator version (e.g. kubernetes, mesos, rkt):
  • Operating System and version:
  • Link to your project (optional):
@cyclinder
Contributor Author

/kind bug

@cyclinder
Contributor Author

friendly ping :) @caseydavenport

@caseydavenport
Member

Hey sorry for the delay, have been out of the office for a bit.

which may result in the new node's vxlan.calico IP being assigned from another node's block, which makes the new node's VXLAN IP unreachable and causes DNS queries from pods on the new node to time out.

I think this is a bug - there's no reason at a networking level that the IP needs to be from within the block on that node.

@caseydavenport
Member

Same symptom being described here: #5595

@caseydavenport caseydavenport changed the title Can the IP of vxlan.calico be assigned from the blocks of other nodes? VXLAN not working when tunnel address is borrowed Jun 9, 2022
@cyclinder
Contributor Author

cyclinder commented Jun 9, 2022

Thanks for your reply! @caseydavenport

Our users use a CIDR mask of /20 and a blockSize of 26, which means there are at most 64 blocks, while the number of nodes in the cluster exceeds 64. Nodes added after the 64th therefore do not get a full block of their own, and VXLAN on those newly added nodes does not work. I think this is a serious problem with a big impact on the scalability of the k8s cluster. For now we can only change the blockSize to 28 (because of the user's environment, the CIDR is not adjustable); a quick calculation of the block counts is sketched at the end of this comment.

Looking at the source code, I found that the logic for assigning the tunnel IP is the same as the logic for assigning pod IPs; I think some distinction should be made here.

I have the following two suggestions for this problem:

  • When assigning IPs to vxlan.calico, if there are no extra blocks, an error should be returned

  • We should emphasize this point in the official documentation
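To make the block math concrete: the maximum number of blocks (and hence the number of nodes that can get an affine block) is 2^(blockSize - poolPrefixLength). Plain shell arithmetic, just to illustrate the numbers above:

# /20 pool with /26 blocks: 2^(26-20) = 64 blocks
echo $(( 1 << (26 - 20) ))   # prints 64
# /26 pool with /28 blocks (the repro in this issue): 2^(28-26) = 4 blocks
echo $(( 1 << (28 - 26) ))   # prints 4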

@caseydavenport
Member

There shouldn't be a reason that Calico can't use a borrowed IP for the tunnel address. There is likely another fix that needs to be made rather than limiting the tunnel address in the way you described; that wouldn't fix the underlying problem of limiting the cluster size to 64 nodes (any node past the number of blocks in the cluster would end up non-functional, without a tunnel address).

@biqiangwu

When there are not enough blocks, a /31 micro-block could be carved out of another node's block, and the tunnel IP assigned from that micro-block. The problem only affects pods using hostNetwork on the new node communicating with non-hostNetwork pods on other nodes; it does not affect other communication scenarios. So we just need to solve the tunnel IP routing problem, by adding a route like "x.xx.xx.xx/31 via tunlIP dev ifcfg-tunl". More nodes could then be supported with relatively small changes.

@caseydavenport
Member

The problem only affects pods using hostNetwork on the new node communicating with non-hostNetwork pods on other nodes; it does not affect other communication scenarios. So we just need to solve the tunnel IP routing problem, by adding a route like "x.xx.xx.xx/31 via tunlIP dev ifcfg-tunl".

Ah, yes this makes sense. We're not programming a return route that tells pods where the tunnel address is (normally that is handled by the route for the block itself).
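To make that concrete with the addresses from the repro above (a sketch only; the exact routes Calico programs are not shown in this issue and may differ):

# On an old node, inspect the routes programmed over the VXLAN device:
ip route show dev vxlan.calico
# The longest-prefix match for 172.29.0.62 is the 172.29.0.48/28 block route,
# which points at the node with affinity for that block. A more specific
# /32 return route for the borrowed tunnel address, pointing at the node
# that borrowed it, is the piece that is missing.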

@biqiangwu

The problem only affects pods using hostNetwork on the new node communicating with non-hostNetwork pods on other nodes; it does not affect other communication scenarios. So we just need to solve the tunnel IP routing problem, by adding a route like "x.xx.xx.xx/31 via tunlIP dev ifcfg-tunl".

Ah, yes this makes sense. We're not programming a return route that tells pods where the tunnel address is (normally that is handled by the route for the block itself).

OK, then I'll start working on a change along those lines.

@sedflix

sedflix commented Oct 27, 2022

I'm facing exactly the same issue, i.e. pods using the host network on a node cannot communicate with non-host-network pods on other nodes, while other communication scenarios are not affected.

Link to details of one of the ipam blocks: https://gist.github.com/sedflix/95bc34ee4a4fcde98ae93993708c864e

Setup:

  • EKS 1.21
  • Calico 3.20.3 (and now 3.23.3)
  • Using VXLAN
  • CALICO_IPV4POOL_BLOCK_SIZE is 26
  • We have several node sizes. Max Pod varies from 48 to 100s.
  • We are using k8s as the data store.

Within 15 minutes, we added approximately 130 nodes while using Calico 3.20. Within 30 minutes we removed those 130 nodes. This was done twice.
Our pod CIDR is 192.168.0.0/18, which allows 16,384 IPs. Our block size is 26, so each block has 64 IPs and we can have at most 256 blocks at a time. We reached that limit twice, i.e. the number of allocated blocks reached 256. Currently we have more than 200 borrowed IPs; the block owners are nodes that no longer exist, and the IPs are borrowed by live nodes.

@caseydavenport
Member

I think this is a candidate fix: #9662
