Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Voq chassis orchagent crash with 34K routes #3329

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

saksarav-nokia
Copy link
Contributor

What I did
Don't add the remote system neighbor if the same neighbor exists.
Why I did it
The IMM has two asics and has 2 port channels in each asic and 2 port members in each port channel.
The ip address is configured on each port channel and bgp is enabled. The neighbor and routes are learned on these port channel.
In sonic-mgmt pc suite, the test case po-update removes the port members from one of the port channel, removes the ip address configured on that port channel, creates new port channel, adds the same port members to the new port channel, adds the same ip address to the new port channel.
In the remote asic, before all the routes learned on the old port channel are removed by routeOrch, orchagent trries to remove the neighbor and nexthop for the old portchannel. But since the routes are pending, the old nexthop and neighbor are not removed. Then the neighbor and nexthop for the new port channel are being added. If the neighbor is learned on remote system port in remote asic, the nexthop is added with alias as inband port's alias, so the key (ip,alias) is same for both old nexthop and new nexthop. When the new nexthop is added , it calls hasNextHop function to check if the nexthop with (ip-address, alias) as key and since the old nexthop is not removed yet, the hasNextHop returns true, however the assert(!hasNextHop) does n't trigger the crash. So addNextHop function replace the old nexthop with old rif-id with new nexthop with new old rif-id in the nexthop map. Then after all the routes learned on old port channel is removed, the old neighbor and old nexthop are being removed. Sine the old nexthop was replaced with new nexthop, when orchagent tries to delete the old nexthop, it actually deletes the new nexthop from SAI. Then when it tries to remove the old neighbor, SAI returns error since orchagent removed the new nexthop from SAI instead of old nexthop and old neighbor is still referenced by the old nexthop in SAI. So orchagent crashes when SAI returns error.
How I verified it
Ran pc and voq suite and verified it passes.
Details if related

@saksarav-nokia saksarav-nokia force-pushed the saksara-nokia-orchagent-crash-fix branch from 0dd3c64 to db2ee71 Compare October 17, 2024 14:06
@saksarav-nokia saksarav-nokia force-pushed the saksara-nokia-orchagent-crash-fix branch 5 times, most recently from 3fc02c7 to 23dbe33 Compare November 7, 2024 17:27
@saksarav-nokia saksarav-nokia force-pushed the saksara-nokia-orchagent-crash-fix branch 6 times, most recently from 1a47e76 to a244a11 Compare November 12, 2024 19:09
@saksarav-nokia
Copy link
Contributor Author

This PR fixes the issue sonic-net/sonic-buildimage#20507 which is introduced by #3269

@saksarav-nokia saksarav-nokia force-pushed the saksara-nokia-orchagent-crash-fix branch from 1f88894 to a5d3baa Compare November 15, 2024 17:35
@saksarav-nokia
Copy link
Contributor Author

@arlakshm for your viz

remote_neigh = nkey
break

assert remote_neigh != "", "Remote neigh not found in ASIC_DB"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the remote neigh entry will be present. Should we check if the remote neighbor entry is still old neighbor? also should we add a check to see if the nexthop is not updated because of addneighbor?

remote_neigh = nkey
break

assert remote_neigh != "", "Remote neigh not found in ASIC_DB"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add a check to see we have new nexthop for new neighbor?

@saksarav-nokia saksarav-nokia force-pushed the saksara-nokia-orchagent-crash-fix branch from 066ab71 to f12e81d Compare November 15, 2024 21:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants