[active-standby] Add oscillation logic when there is no heartbeat on both sides #221

Merged: 8 commits merged into sonic-net:master on Nov 10, 2023

@zjswhhh zjswhhh commented Nov 7, 2023

Description of PR

Summary:
Fixes # (issue)
This PR covers an extreme edge case in which neither ToR of a dual-ToR pair can receive ICMP heartbeats. The goal is to oscillate the mux direction between the two sides so that, if one side recovers, the mux direction has a chance to park on that side.

Note that toggles in an unhealthy scenario like this will cause longer disruption than in a healthy scenario.

sign-off: Jing Zhang [email protected]

Type of change

  • Bug fix
  • New feature
  • Doc/Design
  • Unit test

Approach

What is the motivation for this PR?

To prevent the mux direction from staying parked on one side while heartbeats are missing.

Work item tracking
  • Microsoft ADO (number only):
    25367027

TODO: a separate PR will add an option to disable oscillation or increase the interval.

How did you do it?

  1. Start a 5-minute timer on the active side when the heartbeat is missing.
  2. When the timer expires, if the heartbeat is still missing and the side is still active, toggle to standby.
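
The two steps above can be sketched in a few lines of Python. This is only an illustration of the described logic, not the linkmgrd implementation (which is C++): the `MuxPort` class and its method names are hypothetical, the 300-second interval is the 5 minutes stated above, and the `Timed_Oscillation` string is taken from the switch cause seen in the logs. The conditions are checked only when the timer expires, as in step 2.

```python
from dataclasses import dataclass
from typing import Optional

OSCILLATION_INTERVAL_SEC = 300.0  # 5 minutes, per the PR description


@dataclass
class MuxPort:
    """Hypothetical model of one mux port on one ToR (not linkmgrd's classes)."""
    is_active: bool
    heartbeat_ok: bool = True
    timer_deadline: Optional[float] = None  # None means no timer is running
    last_switch_cause: Optional[str] = None

    def on_heartbeat_missing(self, now: float) -> None:
        # Step 1: start the timer on the active side when heartbeat is missing.
        self.heartbeat_ok = False
        if self.is_active and self.timer_deadline is None:
            self.timer_deadline = now + OSCILLATION_INTERVAL_SEC

    def on_heartbeat_received(self) -> None:
        self.heartbeat_ok = True

    def on_timer_tick(self, now: float) -> bool:
        """Return True if the port toggled to standby on this tick."""
        if self.timer_deadline is None or now < self.timer_deadline:
            return False
        self.timer_deadline = None
        # Step 2: toggle only if heartbeat is still missing and still active.
        if self.is_active and not self.heartbeat_ok:
            self.is_active = False
            self.last_switch_cause = "Timed_Oscillation"
            return True
        return False
```

With both ToRs running this logic and neither receiving heartbeats, whichever side is active gives up the mux after the interval, which is what lets the direction move back and forth until one side recovers.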

How did you verify/test it?

Ran the change on lab devices. Oscillation happened as expected (with the interval adjusted to 1 minute on the lab devices):

Nov  7 01:09:43.225217 TOR-B NOTICE mux#linkmgrd: DbInterface.cpp:237 handlePostSwitchCause: Ethernet124: post last switch cause Timed_Oscillation
Nov  7 01:11:59.883266 TOR-B NOTICE mux#linkmgrd: DbInterface.cpp:237 handlePostSwitchCause: Ethernet124: post last switch cause Timed_Oscillation
Nov  7 01:14:49.997030 TOR-B NOTICE mux#linkmgrd: DbInterface.cpp:237 handlePostSwitchCause: Ethernet124: post last switch cause Timed_Oscillation
Nov  7 01:17:45.068746 TOR-B NOTICE mux#linkmgrd: DbInterface.cpp:237 handlePostSwitchCause: Ethernet124: post last switch cause Timed_Oscillation
Nov  7 01:19:57.332493 TOR-B NOTICE mux#linkmgrd: DbInterface.cpp:237 handlePostSwitchCause: Ethernet124: post last switch cause Timed_Oscillation
Nov  7 01:22:14.209417 TOR-B NOTICE mux#linkmgrd: DbInterface.cpp:237 handlePostSwitchCause: Ethernet124: post last switch cause Timed_Oscillation
Nov  7 01:24:30.938922 TOR-B NOTICE mux#linkmgrd: DbInterface.cpp:237 handlePostSwitchCause: Ethernet124: post last switch cause Timed_Oscillation
Nov  7 01:26:47.640098 TOR-B NOTICE mux#linkmgrd: DbInterface.cpp:237 handlePostSwitchCause: Ethernet124: post last switch cause Timed_Oscillation
Nov  7 01:29:04.379680 TOR-B NOTICE mux#linkmgrd: DbInterface.cpp:237 handlePostSwitchCause: Ethernet124: post last switch cause Timed_Oscillation
Nov  7 01:31:13.777881 TOR-B NOTICE mux#linkmgrd: DbInterface.cpp:237 handlePostSwitchCause: Ethernet124: post last switch cause Timed_Oscillation

Nov  7 01:08:41.815651 TOR-A NOTICE mux#linkmgrd: DbInterface.cpp:237 handlePostSwitchCause: Ethernet124: post last switch cause Timed_Oscillation
Nov  7 01:10:59.213299 TOR-A NOTICE mux#linkmgrd: DbInterface.cpp:237 handlePostSwitchCause: Ethernet124: post last switch cause Timed_Oscillation
Nov  7 01:13:16.606989 TOR-A NOTICE mux#linkmgrd: DbInterface.cpp:237 handlePostSwitchCause: Ethernet124: post last switch cause Timed_Oscillation
Nov  7 01:16:12.141991 TOR-A NOTICE mux#linkmgrd: DbInterface.cpp:237 handlePostSwitchCause: Ethernet124: post last switch cause Timed_Oscillation
Nov  7 01:18:53.202499 TOR-A NOTICE mux#linkmgrd: DbInterface.cpp:237 handlePostSwitchCause: Ethernet124: post last switch cause Timed_Oscillation
Nov  7 01:21:09.851763 TOR-A NOTICE mux#linkmgrd: DbInterface.cpp:237 handlePostSwitchCause: Ethernet124: post last switch cause Timed_Oscillation
Nov  7 01:23:26.517250 TOR-A NOTICE mux#linkmgrd: DbInterface.cpp:237 handlePostSwitchCause: Ethernet124: post last switch cause Timed_Oscillation
Nov  7 01:25:43.215731 TOR-A NOTICE mux#linkmgrd: DbInterface.cpp:237 handlePostSwitchCause: Ethernet124: post last switch cause Timed_Oscillation
Nov  7 01:27:59.887524 TOR-A NOTICE mux#linkmgrd: DbInterface.cpp:237 handlePostSwitchCause: Ethernet124: post last switch cause Timed_Oscillation
Nov  7 01:30:10.159067 TOR-A NOTICE mux#linkmgrd: DbInterface.cpp:237 handlePostSwitchCause: Ethernet124: post last switch cause Timed_Oscillation
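
To check the spacing between toggles in logs like the ones above, a small script can extract the timestamps of the `Timed_Oscillation` entries and compute the gaps. This is a minimal sketch: the function names are hypothetical, only the first three TOR-B lines are embedded, and the year 2023 is an assumption taken from the PR date because the syslog timestamps do not carry a year.

```python
from datetime import datetime

# First three TOR-B lines copied from the log excerpt in this PR.
LOG = """\
Nov  7 01:09:43.225217 TOR-B NOTICE mux#linkmgrd: DbInterface.cpp:237 handlePostSwitchCause: Ethernet124: post last switch cause Timed_Oscillation
Nov  7 01:11:59.883266 TOR-B NOTICE mux#linkmgrd: DbInterface.cpp:237 handlePostSwitchCause: Ethernet124: post last switch cause Timed_Oscillation
Nov  7 01:14:49.997030 TOR-B NOTICE mux#linkmgrd: DbInterface.cpp:237 handlePostSwitchCause: Ethernet124: post last switch cause Timed_Oscillation
"""


def oscillation_times(log_text, year=2023):
    """Return the timestamps of all Timed_Oscillation entries in log_text."""
    times = []
    for line in log_text.splitlines():
        if "post last switch cause Timed_Oscillation" not in line:
            continue
        # The first three whitespace-separated fields are month, day, time.
        stamp = " ".join(line.split()[:3])
        times.append(
            datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S.%f")
        )
    return times


def gaps_seconds(times):
    """Return the gap in seconds between each pair of consecutive timestamps."""
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    for gap in gaps_seconds(oscillation_times(LOG)):
        print(f"{gap:.1f} s")
```

For the three lines embedded here, the gaps are roughly 137 and 170 seconds.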

Any platform specific information?

Documentation

yxieca
yxieca previously approved these changes Nov 7, 2023
@lolyu lolyu left a comment

Can we have UT for this?

src/common/MuxConfig.h (outdated, resolved)

@lolyu lolyu left a comment

LGTM

lolyu commented Nov 10, 2023

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

@zjswhhh zjswhhh merged commit 489f6ce into sonic-net:master Nov 10, 2023
9 checks passed
@zjswhhh zjswhhh deleted the oscillation_master branch November 10, 2023 20:40
StormLiangMS pushed a commit to sonic-net/sonic-mgmt that referenced this pull request Mar 5, 2024
…case (#11878)

Description of PR
Summary:
Fixes # (issue)
On the latest master image, the test_cacl_application_dualtor case can fail with the following error:
Failed: Missing expected iptables rules: {'-A DHCP -m mark --mark 0x67005 -j DROP'}

This is caused by the oscillation logic: sonic-net/sonic-linkmgrd#221

Because no icmp_responder is running, the mux starts to flap. If it flaps between the expected_dhcp_rules_for_standby fixture and
the actual iptables check, unexpected DHCP iptables rules can appear and cause the test case to fail.

What is the motivation for this PR?
Fix Unexpected DHCP iptables rules for test_cacl_application_dualtor

How did you do it?
test_cacl_application_dualtor is meant to verify DHCP iptables rules on dualtor, not oscillation.
Before the dualtor case runs, set the mux mode to manual:
sudo config mux mode manual all
In test teardown, set the mux mode back to auto:
sudo config mux mode auto all
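
The setup/teardown pairing described above can be sketched as a context manager. This is not the sonic-mgmt change itself: `mux_mode_manual` and the `run_command` callable are hypothetical stand-ins for whatever the test framework uses to run a shell command on the DUT, and only the two `config mux mode` commands come from the commit message.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def mux_mode_manual(run_command: Callable[[str], None]) -> Iterator[None]:
    """Hold the mux in manual mode for the duration of the with-block.

    run_command is a hypothetical callable that executes a shell command
    on the device under test.
    """
    run_command("sudo config mux mode manual all")
    try:
        yield
    finally:
        # Restore auto mode even if the test body raises.
        run_command("sudo config mux mode auto all")


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    with mux_mode_manual(print):
        print("... run the iptables checks here ...")
```

Putting the restore step in `finally` means the testbed is returned to auto mode even when the iptables assertion fails, so later tests are not left with the mux pinned in manual mode.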

How did you verify/test it?
Run cacl/test_cacl_application.py::test_cacl_application_dualtor on dualtor testbed against master image.

Any platform specific information?
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request May 6, 2024
…case (sonic-net#11878)

mssonicbld pushed a commit to sonic-net/sonic-mgmt that referenced this pull request May 6, 2024
…case (#11878)
