Expected Behavior
When a node in the cluster is down abnormally or isolated by its security group, and other nodes are then added to the cluster or recover from being down, Calico BGP route establishment on those nodes takes a long time, about 4 minutes. I expect that when nodes are added to the cluster or recover from being down, Calico BGP route establishment is not affected by the unreachable node.
Current Behavior
Calico BGP route establishment takes 4 minutes.
Possible Solution
Modify the BIRD source code.
Steps to Reproduce (for bugs)
1. For example, modify the proto/bgp/bgp.c file with the following code:
static void
bgp_sock_err(sock *sk, int err)
{
  struct bgp_conn *conn = sk->data;
  struct bgp_proto *p = conn->bgp;

  /*
   * This error hook may be called either asynchronously from main
   * loop, or synchronously from sk_send(). But sk_send() is called
   * only from bgp_tx() and bgp_kick_tx(), which are both called
   * asynchronously from main loop. Moreover, they end if err hook is
   * called. Therefore, we could suppose that it is always called
   * asynchronously.
   */

  bgp_store_error(p, conn, BE_SOCKET, err);

  if (err)
    BGP_TRACE(D_EVENTS, "Connection lost (%M)", err);
  else
    BGP_TRACE(D_EVENTS, "Connection closed");

  /*
   * xc add code start
   */
  if (err == ECONNREFUSED || err == EHOSTUNREACH) {
    log(L_INFO "The link error message is Connection refused or No route to host, clear the host lock");
    proto_graceful_restart_unlock(&p->p);
  }
  /*
   * xc add code end
   */

  if ((conn->state == BS_ESTABLISHED) && p->gr_ready)
    bgp_handle_graceful_restart(p);

  bgp_conn_enter_idle_state(conn);
}
Context
Your Environment
Calico version 3.29.1
Orchestrator version 1.32
Operating System and version: linux
I only modified the BIRD source code and did not change any BGP configuration. I added a log line in the proto_graceful_restart_unlock function, and its output is shown in the image below. The net effect of the change is that when the cluster contains a node that is unreachable over the network, BIRD can still complete the graceful restart quickly instead of waiting for the 240 s timeout.
bird code before adjustment:
After code adjustment:
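The effect described in this comment can be sketched as a small standalone model. This is not BIRD code: `struct gr_state`, `is_peer_unreachable`, `gr_unlock`, `on_sock_err`, and `gr_wait_needed` are hypothetical names, and the single lock counter is a simplifying assumption about how the recovery wait is tracked. It only illustrates the idea that a peer failing with ECONNREFUSED or EHOSTUNREACH gives up its lock at once, so recovery does not have to sit out the 240 s timeout.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Assumed recovery timeout in seconds; 240 matches the wait reported in this issue. */
#define GR_WAIT_SECONDS 240

/* Simplified model (assumption): one lock per peer that has not converged yet. */
struct gr_state {
  int locks;
};

/* True for the two socket errors that the patch treats as "peer is gone". */
bool
is_peer_unreachable(int err)
{
  return err == ECONNREFUSED || err == EHOSTUNREACH;
}

/* Release one lock; returns true when no locks remain and recovery can finish. */
bool
gr_unlock(struct gr_state *gr)
{
  assert(gr->locks > 0);
  gr->locks--;
  return gr->locks == 0;
}

/* Patched error path: drop the peer's lock only for unreachable-peer errors. */
bool
on_sock_err(struct gr_state *gr, int err)
{
  if (is_peer_unreachable(err))
    return gr_unlock(gr);
  return false;
}

/* Remaining wait: none once every lock is released, otherwise the full timeout. */
int
gr_wait_needed(const struct gr_state *gr)
{
  return gr->locks == 0 ? 0 : GR_WAIT_SECONDS;
}
```

In this model, with two peers holding locks, an ETIMEDOUT error leaves both locks in place and the full wait applies, while ECONNREFUSED on one peer and EHOSTUNREACH on the other release both locks and the wait drops to zero.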
I believe it's intentional as part of graceful restart that we wait a certain period of time before giving up. Won't changing this window impact the reliability of graceful restart in genuine failure scenarios?
During the GR, routing in the data-plane should still be functional as existing routes won't be removed.
I think the correct solution here is that if a node is down, it is removed from the cluster so that Calico doesn't attempt to establish BGP with it.