Calico-node report Socket error: bind: Cannot assign requested address in BGP mode after a while #9549

Open
lubronzhan opened this issue Dec 3, 2024 · 5 comments

@lubronzhan
Contributor

Expected Behavior

In BGP mode, the calico-node pod stays ready and keeps running steadily.

Current Behavior

In BGP mode, the calico-node pods suddenly end up in a not-ready state:

NAMESPACE                 NAME                                                                 READY   STATUS             RESTARTS           AGE
kube-system               calico-node-82tl8                                                    0/1     Running            0                  176d
kube-system               calico-node-xkp28                                                    0/1     Running            0                  176d

Describing the calico-node pod shows that BIRD is not ready:

  Warning  Unhealthy  16s (x87982 over 8d)  kubelet  (combined from similar events): Readiness probe failed: 2024-11-14 10:15:28.682 [INFO][25772] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.16.133.58,10.16.133.30,10.16.133.57,10.16.133.104,10.16.133.35,10.16.133.33,10.16.133.98,10.16.133.15,10.16.133.94,10.16.133.31,10.16.133.110,10.16.133.51,10.16.133.12,10.16.133.95,10.16.133.22,10.16.133.114,10.16.133.21,10.16.133.62,192.168.100.120

Inside the calico-node pod, the log shows "Cannot assign requested address":

2024-11-22T09:04:41.313642354Z stdout F 2024-11-22 09:04:41.313 [INFO][105] monitor-addresses/autodetection_methods.go 103: Using autodetected IPv4 address on interface eth0: 10.16.133.61/25
2024-11-22T09:04:42.31246517Z stdout F bird: Mesh_10_16_133_32: Socket error: bind: Cannot assign requested address
2024-11-22T09:04:42.312483175Z stdout F bird: Mesh_10_16_133_57: Socket error: bind: Cannot assign requested address
2024-11-22T09:04:42.312485683Z stdout F bird: Mesh_10_16_133_30: Socket error: bind: Cannot assign requested address
2024-11-22T09:04:42.31248767Z stdout F bird: Mesh_10_16_133_114: Socket error: bind: Cannot assign requested address
2024-11-22T09:04:42.312489591Z stdout F bird: Mesh_10_16_133_21: Socket error: bind: Cannot assign requested address
2024-11-22T09:04:42.312491874Z stdout F bird: Mesh_10_16_133_35: Socket error: bind: Cannot assign requested address
2024-11-22T09:04:42.312493849Z stdout F bird: Mesh_10_16_133_95: Socket error: bind: Cannot assign requested address
2024-11-22T09:04:42.312495686Z stdout F bird: Mesh_10_16_133_62: Socket error: bind: Cannot assign requested address
2024-11-22T09:04:42.312497574Z stdout F bird: Mesh_10_16_133_22: Socket error: bind: Cannot assign requested address
2024-11-22T09:04:42.312499358Z stdout F bird: Mesh_10_16_133_58: Socket error: bind: Cannot assign requested address

Ran calico-node show-status and it didn't report any issues with BGP:

bird v4 status

+-------+-------------------+-----------------+---------------------+---------------------+---------------------+
| READY |      VERSION      |     ROUTEID     |     SERVERTIME      |      LASTBOOT       |    LASTRECONFIG     |
+-------+-------------------+-----------------+---------------------+---------------------+---------------------+
| true  | v0.3.3+birdv1.6.8 | 192.168.100.120 | 2024-11-26 11:33:23 | 2024-05-21 11:53:29 | 2024-11-14 10:17:53 |
+-------+-------------------+-----------------+---------------------+---------------------+---------------------+

bird v4 BGP peers
+---------------+-----------+-------+------------+----------+
| PEER ADDRESS  | PEER TYPE | STATE |   SINCE    | BGPSTATE |
+---------------+-----------+-------+------------+----------+
| 10.16.133.58  | Mesh      | start | 2024-11-13 | Active   |
| 10.16.133.30  | Mesh      | start | 2024-11-13 | Active   |
| 10.16.133.57  | Mesh      | start | 2024-11-13 | Active   |
| 10.16.133.104 | Mesh      | start | 2024-11-13 | Active   |
| 10.16.133.35  | Mesh      | start | 2024-11-13 | Active   |
| 10.16.133.33  | Mesh      | start | 2024-11-13 | Active   |
| 10.16.133.98  | Mesh      | start | 2024-11-13 | Active   |
| 10.16.133.15  | Mesh      | start | 2024-11-13 | Active   |
| 10.16.133.94  | Mesh      | start | 2024-11-13 | Active   |
| 10.16.133.31  | Mesh      | start | 2024-11-13 | Active   |
| 10.16.133.110 | Mesh      | start | 2024-11-13 | Active   |
| 10.16.133.51  | Mesh      | start | 2024-11-13 | Active   |
| 10.16.133.12  | Mesh      | start | 2024-11-13 | Active   |
| 10.16.133.95  | Mesh      | start | 2024-11-13 | Active   |
| 10.16.133.22  | Mesh      | start | 2024-11-13 | Active   |
| 10.16.133.114 | Mesh      | start | 2024-11-13 | Active   |
| 10.16.133.21  | Mesh      | start | 2024-11-13 | Active   |
| 10.16.133.62  | Mesh      | start | 2024-11-13 | Active   |
| 10.16.133.32  | Mesh      | start | 2024-11-14 | Active   |
+---------------+-----------+-------+------------+----------+

bird v4 routes
+------------------+-------------+-----------------+-------------+---------+
|   DESTINATION    |   GATEWAY   |      IFACE      | LEARNEDFROM | PRIMARY |
+------------------+-------------+-----------------+-------------+---------+
| 0.0.0.0/0        | 10.16.133.1 | eth0            | kernel1     | *       |
| 100.98.72.226/32 | N/A         | cali7fba7a35b74 | kernel1     | *       |
| 100.98.72.192/26 | N/A         | blackhole       | static1     | *       |
| 100.98.72.192/32 | N/A         | tunl0           | direct1     | *       |
| 10.16.133.1/32   | N/A         | eth0            | kernel1     | *       |
| 10.16.133.0/25   | N/A         | eth0            | direct1     | *       |
| 100.98.72.209/32 | N/A         | cali00b3d1e6329 | kernel1     | *       |
| 100.98.72.211/32 | N/A         | califfe4e2ef490 | kernel1     | *       |
+------------------+-------------+-----------------+-------------+---------+

Possible Solution

Restarting the pod works around the issue.

Steps to Reproduce (for bugs)

Context

How can we debug this issue further? The old Calico logs have been rotated.

Your Environment

  • Calico version: 3.24.1
  • Calico dataplane (iptables, windows etc.):
  • Orchestrator version (e.g. kubernetes, mesos, rkt): k8s 1.24.10
  • Operating System and version: photon-3
  • Link to your project (optional):
@mazdakn
Member

mazdakn commented Dec 3, 2024

@lubronzhan Calico 3.24 and k8s 1.24 are pretty old. Please try with newer releases like 3.28 or 3.29.

Also, we need a complete Calico node log to be able to identify the root cause.

@lubronzhan
Contributor Author

lubronzhan commented Dec 5, 2024

Hi @mazdakn
The node reporting the error has the IP 10.16.133.61. Its log is attached below. It is not very useful because the older content has been rotated, so the log only starts on 11-21.
10.16.133.61.zip

On another node, we observed the .61 node leaving the mesh on 11-13:

2024-11-13T17:38:28.858693543Z stdout F bird: Reconfiguring
2024-11-13T17:38:28.858697447Z stdout F bird: device1: Reconfigured
2024-11-13T17:38:28.858699621Z stdout F bird: direct1: Reconfigured
2024-11-13T17:38:28.858701973Z stdout F bird: Mesh_10_16_133_30: Reconfigured
2024-11-13T17:38:28.858704066Z stdout F bird: Mesh_10_16_133_58: Reconfigured
2024-11-13T17:38:28.858706229Z stdout F bird: Mesh_10_16_133_57: Reconfigured
2024-11-13T17:38:28.858708354Z stdout F bird: Mesh_10_16_133_104: Reconfigured
2024-11-13T17:38:28.858710456Z stdout F bird: Mesh_10_16_133_35: Reconfigured
2024-11-13T17:38:28.858716929Z stdout F bird: Mesh_10_16_133_33: Reconfigured
2024-11-13T17:38:28.858719083Z stdout F bird: Mesh_10_16_133_98: Reconfigured
2024-11-13T17:38:28.858721071Z stdout F bird: Mesh_10_16_133_15: Reconfigured
2024-11-13T17:38:28.858723026Z stdout F bird: Mesh_10_16_133_94: Reconfigured
2024-11-13T17:38:28.858725197Z stdout F bird: Mesh_10_16_133_31: Reconfigured
2024-11-13T17:38:28.858727255Z stdout F bird: Mesh_10_16_133_110: Reconfigured
2024-11-13T17:38:28.85872948Z stdout F bird: Mesh_10_16_133_51: Reconfigured
2024-11-13T17:38:28.858731689Z stdout F bird: Mesh_10_16_133_12: Reconfigured
2024-11-13T17:38:28.858733756Z stdout F bird: Mesh_10_16_133_95: Reconfigured
2024-11-13T17:38:28.858738155Z stdout F bird: Mesh_10_16_133_22: Reconfigured
2024-11-13T17:38:28.858740247Z stdout F bird: Mesh_192_168_100_128: Reconfigured
2024-11-13T17:38:28.858742473Z stdout F bird: Mesh_10_16_133_21: Reconfigured
2024-11-13T17:38:28.858744639Z stdout F bird: Removing protocol Mesh_10_16_133_61
2024-11-13T17:38:28.858753007Z stdout F bird: Mesh_10_16_133_61: Shutting down
2024-11-13T17:38:28.858764103Z stdout F bird: Mesh_10_16_133_61: State changed to stop
2024-11-13T17:38:28.858765973Z stdout F bird: Mesh_10_16_133_62: Reconfigured
2024-11-13T17:38:28.858767744Z stdout F bird: Adding protocol Mesh_192_168_100_120
2024-11-13T17:38:28.858769717Z stdout F bird: Mesh_192_168_100_120: Initializing
2024-11-13T17:38:28.858771571Z stdout F bird: Mesh_192_168_100_120: Starting
2024-11-13T17:38:28.858775549Z stdout F bird: Mesh_192_168_100_120: State changed to start
2024-11-13T17:38:28.858866008Z stdout F bird: Mesh_10_16_133_61: State changed to down
2024-11-13T17:38:28.858868881Z stdout F bird: Reconfigured

After that, there is nothing more about Mesh_10_16_133_61 in the other node's logs.

It looks like something happened on the .61 node that caused Calico to crash there, so the other nodes were informed that this node had left the mesh.
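As a rough aid for this kind of log reading, the sketch below pulls the "Adding protocol" and "Removing protocol" events out of a BIRD log so that mesh membership changes stand out from the many "Reconfigured" lines. The function name `mesh_changes` is hypothetical, and the code assumes the log line format shown in the excerpt above and IPv4-style `Mesh_a_b_c_d` protocol names. In the excerpt, the removal of Mesh_10_16_133_61 and the addition of Mesh_192_168_100_120 appear in the same reconfiguration.

```python
import re
from typing import List, Tuple

# Matches lines such as "... bird: Removing protocol Mesh_10_16_133_61".
EVENT_RE = re.compile(r"bird: (Adding|Removing) protocol (Mesh_\S+)")


def mesh_changes(log_text: str) -> List[Tuple[str, str, str]]:
    """Return (timestamp, action, peer_ip) for every mesh peer added or removed."""
    events = []
    for line in log_text.splitlines():
        match = EVENT_RE.search(line)
        if not match:
            continue
        # The container runtime timestamp is the first whitespace-separated field.
        timestamp = line.split(None, 1)[0]
        action, protocol = match.groups()
        # Assumption: protocol names encode an IPv4 address with underscores.
        peer_ip = protocol[len("Mesh_"):].replace("_", ".")
        events.append((timestamp, action, peer_ip))
    return events


if __name__ == "__main__":
    sample = (
        "2024-11-13T17:38:28.858744639Z stdout F bird: Removing protocol Mesh_10_16_133_61\n"
        "2024-11-13T17:38:28.858765973Z stdout F bird: Mesh_10_16_133_62: Reconfigured\n"
        "2024-11-13T17:38:28.858767744Z stdout F bird: Adding protocol Mesh_192_168_100_120\n"
    )
    for event in mesh_changes(sample):
        print(event)
```

Running the same function over the logs of several nodes and sorting by timestamp would show whether every peer dropped the node at the same moment.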

@lubronzhan
Contributor Author

Hi @mazdakn , sorry to interrupt, did you get a chance to review the log? Thanks

@caseydavenport
Member

@lubronzhan have you tried upgrading to a modern version of Calico? Regardless I think we won't be able to make any fixes to such an old version.

bind: Cannot assign requested address

I wonder if it's a problem with the local interfaces on that node somehow. I think this error can also mean there is another process bound to that address? Is there any other software on that node that might be bound to the same IP / port?
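For reference, on Linux the two situations raised here normally surface as different errno values: `EADDRNOTAVAIL` ("Cannot assign requested address") when the address is not configured on any local interface, and `EADDRINUSE` ("Address already in use") when another socket already holds the address and port. The sketch below is a minimal probe that could be run on the affected node to tell them apart. The helper name `probe_bind` is hypothetical, and which source address BIRD actually tries to bind is an assumption to verify against the generated BIRD config.

```python
import errno
import socket


def probe_bind(ip: str, port: int = 0) -> str:
    """Try to bind a TCP socket to ip:port and classify the result."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((ip, port))
        return "ok"
    except OSError as exc:
        if exc.errno == errno.EADDRNOTAVAIL:
            # The IP is not assigned to any local interface.
            return "address-not-local"
        if exc.errno == errno.EADDRINUSE:
            # Another socket already owns this IP and port.
            return "address-in-use"
        if exc.errno in (errno.EACCES, errno.EPERM):
            # Ports below 1024 usually need elevated privileges.
            return "permission-denied"
        return f"other: {exc.strerror}"
    finally:
        sock.close()


if __name__ == "__main__":
    # Port 0 lets the kernel pick a free port, so only the address is tested.
    for candidate in ("10.16.133.61", "192.168.100.120"):
        print(candidate, probe_bind(candidate))
```

Probing with port 0 isolates the address check. Passing 179 instead would also reveal a port conflict, but typically requires root.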

| 10.16.133.57 | Mesh | start | 2024-11-13 | Active |

BGP states are a bit tedious: this means the connections are not Established, which is what a functioning session would show. Active doesn't mean that they are functioning.
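To make the point about states concrete, here is a small sketch that parses a "bird v4 BGP peers" table like the one in the issue description and lists every peer whose BGPSTATE is anything other than Established. The state names are the six BGP finite-state-machine states from RFC 4271. The function names and the five-column layout are assumptions based on the table pasted above, not part of any Calico tooling.

```python
from typing import Dict, List

# BGP finite-state-machine states as defined in RFC 4271.
BGP_FSM_STATES = ("Idle", "Connect", "Active", "OpenSent", "OpenConfirm", "Established")

COLUMNS = ("address", "peer_type", "state", "since", "bgp_state")


def parse_peer_table(table: str) -> List[Dict[str, str]]:
    """Parse the ASCII peers table into one dict per peer row."""
    peers = []
    for line in table.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Border lines and blanks yield one cell; the header row is skipped by name.
        if len(cells) != len(COLUMNS) or cells[0] == "PEER ADDRESS":
            continue
        peers.append(dict(zip(COLUMNS, cells)))
    return peers


def not_established(peers: List[Dict[str, str]]) -> List[str]:
    """Return the addresses of peers whose session is not in the Established state."""
    return [p["address"] for p in peers if p["bgp_state"] != "Established"]


if __name__ == "__main__":
    sample = """
+---------------+-----------+-------+------------+----------+
| PEER ADDRESS  | PEER TYPE | STATE |   SINCE    | BGPSTATE |
+---------------+-----------+-------+------------+----------+
| 10.16.133.58  | Mesh      | start | 2024-11-13 | Active   |
| 10.16.133.30  | Mesh      | start | 2024-11-13 | Active   |
+---------------+-----------+-------+------------+----------+
"""
    print(not_established(parse_peer_table(sample)))
```

Only Established indicates a session that is exchanging routes. In Active, the router is still trying to set up the TCP connection to the peer.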

@lubronzhan
Contributor Author

Hi @caseydavenport thanks for replying. We are trying to find the root cause first.

I wonder if it's a problem with the local interfaces on that node somehow. I think this error can also mean there is another process bound to that address? Is there any other software on that node that might be bound to the same IP / port?

Yeah that could be one possibility.
