Graceful Restart capability could not be set when the BGP connection was being established. #9575

Open
promacanthus opened this issue Dec 9, 2024 · 1 comment

@promacanthus

Expected Behavior

When we add a new node to the cluster, calico-node starts on that node, establishes the BGP connection, and sends an Open message to negotiate capabilities. At this point we expect the Graceful Restart capability to be advertised with Flag: 0x80 (Preserve forwarding state). However, BIRD sometimes sends Flag: 0x00 instead, and Graceful Restart is then not enabled for that session.
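For readers unfamiliar with the field in question, here is a minimal sketch of how the Graceful Restart capability value (capability code 64) can be decoded, assuming the layout from RFC 4724: a 16-bit word holding 4 restart flag bits and a 12-bit restart time, followed by one 4-byte entry (AFI, SAFI, flags) per address family. The 0x80 bit in the per-address-family flags is the Forwarding State bit that Wireshark labels "Preserve forwarding state". All names below (`decode_graceful_restart`, `AddressFamilyEntry`, and so on) are illustrative and are not taken from BIRD or Calico, and the sample bytes are hand-built rather than copied from the capture in this issue.

```python
import struct
from typing import List, NamedTuple

F_BIT = 0x80      # Forwarding State bit in the per-address-family flags
R_BIT = 0x8000    # Restart State bit in the leading 16-bit word


class AddressFamilyEntry(NamedTuple):
    afi: int
    safi: int
    flags: int

    @property
    def forwarding_preserved(self) -> bool:
        """True when the sender claims to keep forwarding state (flags & 0x80)."""
        return bool(self.flags & F_BIT)


class GracefulRestart(NamedTuple):
    restart_state: bool
    restart_time: int                 # seconds, 12-bit field
    families: List[AddressFamilyEntry]


def decode_graceful_restart(value: bytes) -> GracefulRestart:
    """Decode the value part of a Graceful Restart capability."""
    if len(value) < 2 or (len(value) - 2) % 4 != 0:
        raise ValueError("malformed Graceful Restart capability value")
    (head,) = struct.unpack_from("!H", value, 0)
    families = [
        AddressFamilyEntry(*struct.unpack_from("!HBB", value, offset))
        for offset in range(2, len(value), 4)
    ]
    return GracefulRestart(bool(head & R_BIT), head & 0x0FFF, families)


if __name__ == "__main__":
    # Hypothetical values: restart time 120 s, IPv4 unicast (AFI 1, SAFI 1).
    expected = decode_graceful_restart(bytes.fromhex("007800010180"))
    failing = decode_graceful_restart(bytes.fromhex("007800010100"))
    for label, cap in (("expected", expected), ("failing", failing)):
        for fam in cap.families:
            print(label, hex(fam.flags), fam.forwarding_preserved)
```

In this sketch the two sample values differ only in the last byte, which mirrors the 0x80 versus 0x00 difference described above.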

Current Behavior

Approximately 20% of nodes send the Open Message with Graceful Restart capability Flag: 0x00.

Our cluster has several thousand nodes, so this is a serious problem.

Possible Solution

Restarting the calico-node Pod and reestablishing the BGP connection allows the Graceful Restart capability to be set.

Steps to Reproduce (for bugs)

  1. Capture BGP traffic on the interface by running tcpdump -i bond0 -n -vv 'port 179' -w data.pcap.
  2. Open the capture in Wireshark and inspect the Graceful Restart capability in the Open message.
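With thousands of nodes, checking each Open message by hand in Wireshark gets tedious. The sketch below shows one way to pull the per-address-family Graceful Restart flags out of a single raw BGP OPEN message. It assumes the classic RFC 4271 OPEN layout (19-byte header, then version, AS, hold time, BGP identifier, and a one-byte optional parameters length) and does not handle the extended optional parameters format. It also assumes the BGP message bytes have already been extracted from the TCP stream, for example by copying them from Wireshark as hex. The function names and the `sample_open` helper are hypothetical, and the sample message is synthetic.

```python
import struct

MSG_OPEN = 1
PARAM_CAPABILITIES = 2
CAP_GRACEFUL_RESTART = 64


def iter_capabilities(message: bytes):
    """Yield (code, value) for every capability in one BGP OPEN message."""
    if len(message) < 29:
        raise ValueError("too short for a BGP OPEN message")
    _length, msg_type = struct.unpack_from("!HB", message, 16)
    if msg_type != MSG_OPEN:
        raise ValueError("not an OPEN message")
    params = message[29:29 + message[28]]   # byte 28 is Opt Parm Len
    pos = 0
    while pos + 2 <= len(params):
        p_type, p_len = params[pos], params[pos + 1]
        body = params[pos + 2:pos + 2 + p_len]
        pos += 2 + p_len
        if p_type != PARAM_CAPABILITIES:
            continue
        cpos = 0
        while cpos + 2 <= len(body):
            code, c_len = body[cpos], body[cpos + 1]
            yield code, body[cpos + 2:cpos + 2 + c_len]
            cpos += 2 + c_len


def graceful_restart_af_flags(message: bytes) -> list:
    """Return the flags byte of each address family in the GR capability."""
    flags = []
    for code, value in iter_capabilities(message):
        if code == CAP_GRACEFUL_RESTART:
            # Skip the 2-byte flags/time word; each entry is AFI(2) SAFI(1) flags(1).
            flags.extend(value[off + 3] for off in range(2, len(value) - 3, 4))
    return flags


def sample_open(af_flags: int) -> bytes:
    """Build a synthetic OPEN with one GR entry (IPv4 unicast); test data only."""
    gr_value = struct.pack("!HHBB", 120, 1, 1, af_flags)
    cap = bytes([CAP_GRACEFUL_RESTART, len(gr_value)]) + gr_value
    param = bytes([PARAM_CAPABILITIES, len(cap)]) + cap
    body = struct.pack("!BHH4sB", 4, 64512, 90, bytes([10, 0, 0, 1]), len(param)) + param
    return b"\xff" * 16 + struct.pack("!HB", 19 + len(body), MSG_OPEN) + body


if __name__ == "__main__":
    for af_flags in (0x80, 0x00):
        found = graceful_restart_af_flags(sample_open(af_flags))
        print([hex(f) for f in found], all(f & 0x80 for f in found))
```

An empty result from `graceful_restart_af_flags` would mean the capability was absent or listed no address families, which is a different condition from the 0x00 flag reported here.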

Context

Failed State: [image]

Successful State: [image]

Your Environment

  • Calico version: v3.27.2
  • Calico dataplane (iptables, windows etc.): iptables
  • Orchestrator version (e.g. kubernetes, mesos, rkt): kubernetes
  • Operating System and version: Linux 5.10.134-16.3.an8.x86_64 GNU/Linux
@caseydavenport
Member

I'm not really sure what might trigger BIRD to send that flag, to be honest. Any thoughts?

We do have GR testing, and I don't think it has ever hit this. Functionally, graceful restart seems to work in our environments. I wonder if there is some quirk of your BGP environment / ToR that triggers this behavior.
