calico bird component source code optimization #115

Open
xuchuan-666 opened this issue Dec 13, 2024 · 3 comments

xuchuan-666 commented Dec 13, 2024

Expected Behavior

When a node in the cluster goes down abnormally or is isolated by a security group, and a new node is then added to the cluster (or another downed node is brought back), Calico BGP route establishment on that node takes a long time: about 4 minutes. I expect that adding a node or recovering a downed node does not delay Calico BGP route establishment, even while other nodes are unreachable.

Current Behavior

Calico BGP route establishment takes 4 minutes.

Possible Solution

Modify the bird source code.

Steps to Reproduce (for bugs)

1. For example, modify bgp_sock_err() in proto/bgp/bgp.c as follows:

static void
bgp_sock_err(sock *sk, int err)
{
  struct bgp_conn *conn = sk->data;
  struct bgp_proto *p = conn->bgp;

  /*
   * This error hook may be called either asynchronously from main
   * loop, or synchronously from sk_send().  But sk_send() is called
   * only from bgp_tx() and bgp_kick_tx(), which are both called
   * asynchronously from main loop. Moreover, they end if err hook is
   * called. Therefore, we could suppose that it is always called
   * asynchronously.
   */

  bgp_store_error(p, conn, BE_SOCKET, err);

  if (err)
    BGP_TRACE(D_EVENTS, "Connection lost (%M)", err);
  else
    BGP_TRACE(D_EVENTS, "Connection closed");

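  /*
   * xc add code start:
   * if the peer refused the connection or is unreachable, release this
   * protocol's graceful restart lock so that recovery does not wait for
   * the full timeout.
   */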
  if (err == ECONNREFUSED || err == EHOSTUNREACH) {
    log(L_INFO "The link error message is Connection refused or No route to host, clear the host lock");
    proto_graceful_restart_unlock(&p->p);
  }
  /*
   * xc add code end
   */

  if ((conn->state == BS_ESTABLISHED) && p->gr_ready)
    bgp_handle_graceful_restart(p);

  bgp_conn_enter_idle_state(conn);
}

Context

Your Environment

  • Calico version 3.29.1
  • Orchestrator version 1.32
  • Operating System and version: linux

MichalFupso (Contributor) commented

Hi @xuchuan-666, could you please share logs from calico-node and any bgp configuration you changed?

xuchuan-666 (Author) commented

I only modified the bird source code and did not change any BGP configuration. I added a log statement in proto_graceful_restart_unlock; its output is shown in the image below. The effect of the modification is that when the cluster contains a node that is unreachable over the network, bird can still complete graceful restart quickly instead of waiting for the 240 s timeout.
image

bird code before adjustment:
image
bird code after adjustment:
image
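
The behaviour described above can be illustrated with a small model. This is a simplified sketch, not BIRD's actual implementation: `gr_model`, `gr_unlock`, `gr_tick`, and `simulate` are hypothetical names, and the times at which peers release their locks (1 s, 2 s, 3 s) are invented for illustration. Only the 240 s wait comes from this thread. The model assumes that recovery ends either when every peer has released its lock or when the wait time expires, whichever comes first.

```c
#include <assert.h>
#include <stdbool.h>

#define GR_WAIT_SECONDS 240  /* timeout quoted in this thread */
#define MAX_PEERS 8

/* Hypothetical, simplified model; not BIRD's real data structures. */
struct gr_model {
  bool locked[MAX_PEERS];  /* true while peer i still holds a recovery lock */
  int locks;               /* number of outstanding locks */
  int finished_at;         /* second at which recovery ended, -1 if pending */
};

static void gr_init(struct gr_model *m, int peers)
{
  assert(peers >= 0 && peers <= MAX_PEERS);
  m->locks = peers;
  m->finished_at = -1;
  for (int i = 0; i < MAX_PEERS; i++)
    m->locked[i] = (i < peers);
}

/* Release one peer's lock; recovery ends as soon as no locks remain. */
static void gr_unlock(struct gr_model *m, int peer, int now)
{
  if (peer < 0 || peer >= MAX_PEERS || !m->locked[peer])
    return;
  m->locked[peer] = false;
  m->locks--;
  if (m->locks == 0 && m->finished_at < 0)
    m->finished_at = now;
}

/* Timer check: once the wait time has elapsed, recovery ends regardless. */
static void gr_tick(struct gr_model *m, int now)
{
  if (m->finished_at < 0 && now >= GR_WAIT_SECONDS)
    m->finished_at = now;
}

/*
 * Three peers; peer 2 is unreachable. If release_on_error is true, the
 * socket error seen for peer 2 at second 3 releases its lock (the idea
 * behind the patch); otherwise its lock is held until the timeout.
 */
int simulate(bool release_on_error)
{
  struct gr_model m;
  gr_init(&m, 3);
  for (int now = 0; now <= GR_WAIT_SECONDS && m.finished_at < 0; now++) {
    if (now == 1) gr_unlock(&m, 0, now);  /* healthy peer 0 done */
    if (now == 2) gr_unlock(&m, 1, now);  /* healthy peer 1 done */
    if (now == 3 && release_on_error) gr_unlock(&m, 2, now);
    gr_tick(&m, now);
  }
  return m.finished_at;
}
```

In this model, `simulate(false)` returns 240 and `simulate(true)` returns 3. The model says nothing about whether releasing the lock early is safe when the peer is only briefly unreachable.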

caseydavenport (Member) commented

I believe it's intentional as part of graceful restart that we wait a certain period of time before giving up. Won't changing this window impact the reliability of graceful restart in genuine failure scenarios?

During the GR, routing in the data-plane should still be functional as existing routes won't be removed.

I think the correct solution here is that if a node is down, it is removed from the cluster so that Calico doesn't attempt to establish BGP with it.
