VXLAN not working when tunnel address is borrowed #6160

Closed
cyclinder opened this issue May 31, 2022 · 11 comments · Fixed by #9662

Comments

@cyclinder
Contributor

cyclinder commented May 31, 2022

Expected Behavior

The IP of the vxlan.calico interface should not be assigned from another node's block.

Current Behavior

I understand that each node should have at least one block, but when IPs are scarce or the cluster has many nodes, a newly joined node may not get a full block of its own. In that case the new node's vxlan.calico IP may be assigned from another node's block, which makes the new node's VXLAN IP unreachable and causes DNS queries from pods on the new node to time out.
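A quick way to check whether a node's tunnel address is borrowed is to compare the node's VXLAN tunnel annotation with the IPAM block it was allocated from. A minimal sketch, assuming calicoctl is configured against the cluster; <node-name> and <tunnel-ip> are placeholders:

# Read the node's VXLAN tunnel address from the Calico node annotation
kubectl get node <node-name> -o yaml | grep IPv4VXLANTunnelAddr

# Check which IPAM block that address was allocated from, and to which node
calicoctl ipam show --ip=<tunnel-ip>
calicoctl ipam show --show-blocks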

Possible Solution

Steps to Reproduce (for bugs)

  1. Create a four-node cluster:
[root@dce-10-29-12-122 ~]# kubectl get nodes -o wide
NAME               STATUS   ROLES             AGE   VERSION    INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION           CONTAINER-RUNTIME
dce-10-29-12-122   Ready    master,registry   8h    v1.18.20   10.29.12.122   <none>        CentOS Linux 7 (Core)   3.10.0-1160.el7.x86_64   docker://20.10.7
dce-10-29-12-123   Ready    infrastructure    8h    v1.18.20   10.29.12.123   <none>        CentOS Linux 7 (Core)   3.10.0-1160.el7.x86_64   docker://20.10.7
dce-10-29-12-124   Ready    infrastructure    8h    v1.18.20   10.29.12.124   <none>        CentOS Linux 7 (Core)   3.10.0-1160.el7.x86_64   docker://20.10.7
dce-10-29-12-125   Ready    <none>            8h    v1.18.20   10.29.12.125   <none>        CentOS Linux 7 (Core)   3.10.0-1160.el7.x86_64   docker://20.10.7
  2. Create the default-ipv4-ippool:
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  blockSize: 28
  cidr: 172.29.0.0/26
  ipipMode: Never
  natOutgoing: true
  nodeSelector: all()
  vxlanMode: Always

This means there are at most four blocks. With four nodes, everything works fine.

  3. Now I join a new node (dce-10-29-12-112). Since there are no spare blocks, the new node's VXLAN IP is allocated from another node's block:
[root@dce-10-29-12-122 ~]# kubectl get nodes -w
NAME               STATUS     ROLES             AGE   VERSION
dce-10-29-12-112   NotReady   <none>            42s   v1.18.20
dce-10-29-12-122   Ready      master,registry   9h    v1.18.20
dce-10-29-12-123   Ready      infrastructure    8h    v1.18.20
dce-10-29-12-124   Ready      infrastructure    8h    v1.18.20
dce-10-29-12-125   Ready      <none>            8h    v1.18.20

dce-10-29-12-112   Ready      master,registry   9h    v1.18.20
[root@dce-10-29-12-112 ~]# ip a show vxlan.calico
56: vxlan.calico: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
    link/ether 66:ec:93:99:93:da brd ff:ff:ff:ff:ff:ff
    inet 172.29.0.62/32 brd 172.29.0.62 scope global vxlan.calico
       valid_lft forever preferred_lft forever
    inet6 fe80::64ec:93ff:fe99:93da/64 scope link
       valid_lft forever preferred_lft forever
[root@dce-10-29-12-122 ~]# calicoctl ipam show --show-blocks
+----------+---------------------------------------+-----------+------------+------------------+
| GROUPING |                 CIDR                  | IPS TOTAL | IPS IN USE |     IPS FREE     |
+----------+---------------------------------------+-----------+------------+------------------+
| IP Pool  | 172.29.0.0/26                         |        64 | 10 (16%)   | 54 (84%)         |
| Block    | 172.29.0.0/28                         |        16 | 1 (6%)     | 15 (94%)         |
| Block    | 172.29.0.16/28                        |        16 | 1 (6%)     | 15 (94%)         |
| Block    | 172.29.0.32/28                        |        16 | 2 (12%)    | 14 (88%)         |
| Block    | 172.29.0.48/28                        |        16 | 6 (38%)    | 10 (62%)         |
| IP Pool  | fdff:ffff:ffff:ffff::/96              | 4.295e+09 | 6 (0%)     | 4.295e+09 (100%) |

172.29.0.62 belongs to block 172.29.0.48/28, i.e. another node's block.

  4. Ping the VXLAN IP of the new node from an old node; it fails (see the route check sketched after this list):
[root@dce-10-29-12-122 ~]# ping 172.29.0.62
PING 172.29.0.62 (172.29.0.62) 56(84) bytes of data.
^C
--- 172.29.0.62 ping statistics ---
6 packets transmitted, 0 received, 100% packet loss, time 5001ms
  5. DNS queries fail in a pod on the new node:

test-112 is a test pod created on the newly joined node (dce-10-29-12-112).
test-125 is a test pod created on an old node (dce-10-29-12-125).

[root@dce-10-29-12-122 ~]# kubectl get po -o wide
NAME             READY   STATUS              RESTARTS   AGE     IP              NODE               NOMINATED NODE   READINESS GATES
test-112         1/1     Running             0          2d17h   172.29.0.60     dce-10-29-12-112   <none>           <none>
test-125         1/1     Running             0          2d17h   172.29.0.36     dce-10-29-12-125   <none>           <none>
[root@dce-10-29-12-122 ~]# kubectl exec -it test-125 sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl kubectl exec [POD] -- [COMMAND] instead.
/ # nslookup kubernetes.default.svc.cluster.local
Server:		172.31.0.10
Address:	172.31.0.10:53


Name:	kubernetes.default.svc.cluster.local
Address: 172.31.0.1

/ # exit
[root@dce-10-29-12-122 ~]# kubectl exec -it test2-112 sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl kubectl exec [POD] -- [COMMAND] instead.
/ # nslookup kubernetes.default.svc.cluster.local
;; connection timed out; no servers could be reached

/ #
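To see why the ping in step 4 fails, the route an old node picks for the borrowed address can be checked directly. A minimal sketch using the addresses from the outputs above (exact output will vary):

# Which route would an old node use to reach the borrowed tunnel IP?
# The best match is the /28 block route, which points at the block's
# affine node rather than at the new node that holds 172.29.0.62.
ip route get 172.29.0.62

# Confirm which node the address was actually assigned to
calicoctl ipam show --ip=172.29.0.62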

Context

Your Environment

  • Calico version
  • Orchestrator version (e.g. kubernetes, mesos, rkt):
  • Operating System and version:
  • Link to your project (optional):
@cyclinder
Contributor Author

/kind bug

@cyclinder
Contributor Author

friendly ping :) @caseydavenport

@caseydavenport
Member

Hey sorry for the delay, have been out of the office for a bit.

which may result in the new node's vxlan.calico IP being assigned from another node's block, which makes the new node's VXLAN IP unreachable and causes DNS queries from pods on the new node to time out.

I think this is a bug - there's no reason at a networking level that the IP needs to be from within the block on that node.

@caseydavenport
Member

Same symptom being described here: #5595

@caseydavenport caseydavenport changed the title Can the IP of vxlan.calico be assigned from the blocks of other nodes? VXLAN not working when tunnel address is borrowed Jun 9, 2022
@cyclinder
Contributor Author

cyclinder commented Jun 9, 2022

Thanks for your reply! @caseydavenport

Our users use a CIDR mask of /20 and a blockSize of 26, which means there are at most 64 blocks, while the number of nodes in the cluster exceeds 64. Nodes added after the 64th therefore do not get a full block of their own, and VXLAN on those newly added nodes does not work. I think this is a serious problem with a big impact on the scalability of the k8s cluster. For now we can only change the blockSize to 28 (because of the user's environment, the CIDR is not adjustable); a quick calculation of the block counts is sketched at the end of this comment.

Looking at the source code, I found that the logic for assigning the tunnel IP is the same as the logic for assigning pod IPs; I think some distinction should be made here.

I have the following two suggestions for this problem:

  • When assigning IPs to vxlan.calico, if there are no extra blocks, an error should be returned

  • We should emphasize this point in the official documentation
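To make the block math concrete: the maximum number of blocks (and hence the number of nodes that can get an affine block) is 2^(blockSize - poolPrefixLength). Plain shell arithmetic, just to illustrate the numbers above:

# /20 pool with /26 blocks: 2^(26-20) = 64 blocks
echo $(( 1 << (26 - 20) ))   # prints 64
# /26 pool with /28 blocks (the repro in this issue): 2^(28-26) = 4 blocks
echo $(( 1 << (28 - 26) ))   # prints 4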

@caseydavenport
Member

There shouldn't be a reason that Calico can't use a borrowed IP for the tunnel address. There is likely another fix that needs to be made rather than limiting the tunnel address in the way you described; that wouldn't fix the underlying problem of limiting the cluster size to 64 nodes (any node past the number of blocks in the cluster would end up non-functional, without a tunnel address).

@biqiangwu

When there are not enough blocks, a /31 micro-block could be carved out of another node's block, and the tunnel IP assigned from that micro-block. The problem only affects pods using hostNetwork on the new node communicating with non-hostNetwork pods on other nodes; it does not affect other communication scenarios. So we just need to solve the tunnel IP routing problem, by adding a route like "x.xx.xx.xx/31 via tunlIP dev ifcfg-tunl". More nodes could then be supported with relatively small changes.

@caseydavenport
Member

The problem only affects pods using hostNetwork on the new node communicating with non-hostNetwork pods on other nodes; it does not affect other communication scenarios. So we just need to solve the tunnel IP routing problem, by adding a route like "x.xx.xx.xx/31 via tunlIP dev ifcfg-tunl".

Ah, yes this makes sense. We're not programming a return route that tells pods where the tunnel address is (normally that is handled by the route for the block itself).
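To make that concrete with the addresses from the repro above (a sketch only; the exact routes Calico programs are not shown in this issue and may differ):

# On an old node, inspect the routes programmed over the VXLAN device:
ip route show dev vxlan.calico
# The longest-prefix match for 172.29.0.62 is the 172.29.0.48/28 block route,
# which points at the node with affinity for that block. A more specific
# /32 return route for the borrowed tunnel address, pointing at the node
# that borrowed it, is the piece that is missing.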

@biqiangwu

The problem only affects pods using hostNetwork on the new node communicating with non-hostNetwork pods on other nodes; it does not affect other communication scenarios. So we just need to solve the tunnel IP routing problem, by adding a route like "x.xx.xx.xx/31 via tunlIP dev ifcfg-tunl".

Ah, yes this makes sense. We're not programming a return route that tells pods where the tunnel address is (normally that is handled by the route for the block itself).

OK, then I'll start working on a change along those lines.

@sedflix

sedflix commented Oct 27, 2022

I'm facing exactly the same issue, i.e. pods using the host network on a node cannot communicate with non-host-network pods on other nodes, while other communication scenarios are not affected.

Link to details of one of the ipam blocks: https://gist.github.com/sedflix/95bc34ee4a4fcde98ae93993708c864e

Setup:

  • EKS 1.21
  • Calico 3.20.3 (and now 3.23.3)
  • Using VXLAN
  • CALICO_IPV4POOL_BLOCK_SIZE is 26
  • We have several node sizes. Max Pod varies from 48 to 100s.
  • We are using k8s as the data store.

Within 15 minutes, we added approximately 130 nodes while using Calico 3.20. Within 30 minutes we removed those 130 nodes. This was done twice.
Our pod CIDR is 192.168.0.0/18, which allows 16,384 IPs. Our block size is 26, so each block has 64 IPs and we can have at most 256 blocks at a time. We reached that limit twice, i.e. the number of allocated blocks reached 256. Currently we have more than 200 borrowed IPs; the block owners are nodes that no longer exist, and the IPs are borrowed by live nodes.

@caseydavenport
Member

I think this is a candidate fix: #9662
