Network bonding configuration not working with fail_over_mac=follow #4605
Bonded network configurations with mode=active-backup and fail_over_mac=follow do not work because of a race when activating network profiles. activate_nm_connections() attempts to activate every generated profile that is not currently in the "activated" state. Because autoconnect-slaves is set, activating br-ex automatically activates the bond and all of its slaves, and their state is "activating" until they become active. The "activating" state is not tested for, therefore some of the subordinate profiles may be activated multiple times, causing a race in the bonding driver and leaving the bond incorrectly configured. Link: openshift#4605 Signed-off-by: David Wilder <[email protected]>
The attached kernel traces show the NetworkManager interaction with the bonding driver when configure-ovs.sh is run. bonding-trace-fixed.txt was captured with the patch installed. bond-trace-broken.txt shows how the MACs of the slaves are left set to the same value.
With bonded network configurations, slave interfaces are implicitly activated after br-ex is explicitly activated. This implicit activation can take a number of seconds. If one and only one slave is explicitly activated during this time, the bonding driver may set the same MAC address on both slaves, which causes the bond to fail when the option fail_over_mac=follow is set. This change gives bond slaves up to 5 seconds to implicitly activate, preventing the issue. Link: openshift#4605 Signed-off-by: David Wilder <[email protected]>
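The "up to 5 seconds" grace period described above could be sketched as a simple polling loop. This is only an illustration, not the code from the actual commit: get_conn_state and wait_for_activation are hypothetical helper names, and the one-second polling interval is an assumption.

```shell
# Sketch only: poll a connection's state until it is "activated" or a timeout expires.
# get_conn_state and wait_for_activation are hypothetical names, not from configure-ovs.sh.
get_conn_state() {
  nmcli -g GENERAL.STATE conn show "$1" 2>/dev/null
}

# wait_for_activation CONN [TIMEOUT_SECONDS]
# Returns 0 once the connection reports "activated", 1 if the timeout expires first.
wait_for_activation() {
  local conn="$1"
  local timeout="${2:-5}"
  local waited=0
  while [ "$waited" -lt "$timeout" ]; do
    if [ "$(get_conn_state "$conn")" = "activated" ]; then
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  # One last check after the final sleep.
  [ "$(get_conn_state "$conn")" = "activated" ]
}
```

A caller might run `wait_for_activation enP32807p1s0-slave-ovs-clone 5` and fall back to an explicit `nmcli conn up` only when it returns non-zero.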
Bonded network configurations with mode=active-backup and fail_over_mac=follow are not
functioning due to a race in /var/usrlocal/bin/configure-ovs.sh.
Steps:
NetworkManager Profiles: (/etc/NetworkManager/system-connections)
cat bond0.nmconnection
[connection]
id=bond0
type=bond
autoconnect-priority=-100
autoconnect-retries=1
interface-name=bond0
multi-connect=1
[bond]
fail_over_mac=follow
mode=active-backup
[ipv4]
method=manual
address=192.168.42.6/24,192.168.42.1
dns=192.168.42.1
[ipv6]
dhcp-timeout=90
method=auto
cat enP32807p1s0.nmconnection
[connection]
id=enP32807p1s0
type=ethernet
autoconnect-priority=-100
autoconnect-retries=1
interface-name=enP32807p1s0
master=bond0
multi-connect=1
slave-type=bond
wait-device-timeout=60000
cat enP32807p1s0.nmconnection.backup
[connection]
id=enP32807p1s0
type=ethernet
autoconnect-priority=-100
autoconnect-retries=1
interface-name=enP32807p1s0
master=bond0
multi-connect=1
slave-type=bond
wait-device-timeout=60000
When the node is booted, during the initial start-up of the configuration (before ovs-configuration.service has run), the bonded configuration works fine.
ip a s
2: enP32807p1s0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
link/ether 32:f4:c4:ec:23:00 brd ff:ff:ff:ff:ff:ff
3: enP49154p1s0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
link/ether 32:f4:ca:4e:53:01 brd ff:ff:ff:ff:ff:ff
8: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 32:f4:c4:ec:23:00 brd ff:ff:ff:ff:ff:ff
inet 192.168.42.6/24 brd 192.168.42.255 scope global noprefixroute bond0
valid_lft forever preferred_lft forever
inet6 fe80::30f4:c4ff:feec:2300/64 scope link noprefixroute
valid_lft forever preferred_lft forever
......
However, after ovs-configuration.service has run, the network is no longer functioning.
2: enP32807p1s0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
link/ether 32:f4:c4:ec:23:00 brd ff:ff:ff:ff:ff:ff
3: enP49154p1s0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
link/ether 32:f4:c4:ec:23:00 brd ff:ff:ff:ff:ff:ff permaddr 32:f4:ca:4e:53:01
8: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovs-system state UP group default qlen 1000
link/ether 32:f4:c4:ec:23:00 brd ff:ff:ff:ff:ff:ff
9: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 32:f4:c4:ec:23:00 brd ff:ff:ff:ff:ff:ff
inet 192.168.42.6/24 brd 192.168.42.255 scope global noprefixroute br-ex
valid_lft forever preferred_lft forever
inet6 fe80::30f4:c4ff:feec:2300/64 scope link noprefixroute
valid_lft forever preferred_lft forever
At this point the MACs of the bond's slaves (enP32807p1s0, enP49154p1s0) are the same. The purpose of fail_over_mac=follow is to ensure that the MACs are not the same, so this prevents the bond from functioning. It initially appeared to be a problem with the bonding driver. After tracing the calls NetworkManager makes to the bonding driver, I discovered that the root of the problem is in configure-ovs.sh.
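The duplicate-MAC condition can be checked without reading the `ip a s` output by eye. The sketch below relies on the standard bonding sysfs files; list_slave_macs and duplicate_macs are illustrative names and are not part of configure-ovs.sh.

```shell
# Sketch: report MAC addresses shared by more than one slave of a bond.
# list_slave_macs and duplicate_macs are illustrative helper names.

# Prints the current MAC of each slave of the given bond, one per line.
list_slave_macs() {
  local bond="$1"
  local slave
  for slave in $(cat "/sys/class/net/$bond/bonding/slaves"); do
    cat "/sys/class/net/$slave/address"
  done
}

# Reads one MAC per line on stdin and prints each MAC that occurs more than once.
duplicate_macs() {
  sort | uniq -d
}
```

Running `list_slave_macs bond0 | duplicate_macs` on an affected node should print the shared address, and print nothing when each slave has its own MAC.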
The function activate_nm_connections() attempts to activate all of its generated profiles that are not currently in the "activated" state. In my case the following profiles are activated one at a time, in this order:
br-ex, ovs-if-phys0, enP32807p1s0-slave-ovs-clone, enP49154p1s0-slave-ovs-clone, ovs-if-br-ex
However, the generated profiles have autoconnect-slaves set. When br-ex is activated, the state of ovs-if-phys0, enP32807p1s0-slave-ovs-clone and enP49154p1s0-slave-ovs-clone therefore changes to "activating". Because only the "activated" state is checked, these profiles may be activated again. As the list is walked, some of the profiles automatically go from "activating" to "activated", and those are not activated a second time. Only some of the slaves are therefore explicitly re-activated, which leaves the bond in an unpredictable state. The bonding traces show why both slave interfaces end up with the same MAC.
My fix is to skip profiles that are in either the "activating" or the "activated" state.
--- configure-ovs.sh 2024-09-20 15:29:03.160536239 -0700
+++ configure-ovs.sh.patched 2024-09-20 15:33:38.040336032 -0700
@@ -575,8 +575,8 @@
# But set the entry in master_interfaces to true if this is a slave
# Also set autoconnect to yes
local active_state=$(nmcli -g GENERAL.STATE conn show "$conn")
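Since the diff above is cut off after its context lines, the following is a minimal sketch of the check described, not the actual patch. needs_activation and activate_if_needed are hypothetical helper names; the only calls assumed are `nmcli -g GENERAL.STATE conn show` (already used in the script) and `nmcli conn up`.

```shell
# Sketch: decide whether a generated profile still needs an explicit "nmcli conn up".
# needs_activation and activate_if_needed are hypothetical names, not from configure-ovs.sh.

# Returns 0 (true) when the profile should be activated explicitly.
needs_activation() {
  local state="$1"
  case "$state" in
    activated|activating) return 1 ;;  # already up, or being brought up implicitly
    *) return 0 ;;                     # empty or any other state: activate explicitly
  esac
}

activate_if_needed() {
  local conn="$1"
  local active_state
  active_state=$(nmcli -g GENERAL.STATE conn show "$conn")
  if needs_activation "$active_state"; then
    nmcli conn up "$conn"
  fi
}
```

Keeping the state test in its own function makes the decision easy to exercise without a running NetworkManager.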
Additional environment details (platform, options, etc.):
Environment: IBM Power-VM
Kernel: 5.14.0-284.82.1.el9_2.ppc64le
oc version
Network interface: Mellanox Technologies ConnectX Family mlx5Gen Virtual Functions (SR-IOV).