Network bonding configuration not working with fail_over_mac=follow #4605

Open
djlwilder opened this issue Sep 20, 2024 · 4 comments · May be fixed by #4609

Comments

@djlwilder

Bonded network configurations with mode=active-backup and fail_over_mac=follow are not
functioning due to a race in /var/usrlocal/bin/configure-ovs.sh.

Steps:
NetworkManager Profiles: (/etc/NetworkManager/system-connections)

cat bond0.nmconnection

[connection]
id=bond0
type=bond
autoconnect-priority=-100
autoconnect-retries=1
interface-name=bond0
multi-connect=1
[bond]
fail_over_mac=follow
mode=active-backup
[ipv4]
method=manual
address=192.168.42.6/24,192.168.42.1
dns=192.168.42.1
[ipv6]
dhcp-timeout=90
method=auto

cat enP32807p1s0.nmconnection

[connection]
id=enP32807p1s0
type=ethernet
autoconnect-priority=-100
autoconnect-retries=1
interface-name=enP32807p1s0
master=bond0
multi-connect=1
slave-type=bond
wait-device-timeout=60000

cat enP32807p1s0.nmconnection.backup

[connection]
id=enP32807p1s0
type=ethernet
autoconnect-priority=-100
autoconnect-retries=1
interface-name=enP32807p1s0
master=bond0
multi-connect=1
slave-type=bond
wait-device-timeout=60000

When the node boots, during the initial start-up of the configuration (before ovs-configuration.service has run), the bonded configuration works fine.

ip a s

2: enP32807p1s0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
link/ether 32:f4:c4:ec:23:00 brd ff:ff:ff:ff:ff:ff
3: enP49154p1s0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
link/ether 32:f4:ca:4e:53:01 brd ff:ff:ff:ff:ff:ff
8: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 32:f4:c4:ec:23:00 brd ff:ff:ff:ff:ff:ff
inet 192.168.42.6/24 brd 192.168.42.255 scope global noprefixroute bond0
valid_lft forever preferred_lft forever
inet6 fe80::30f4:c4ff:feec:2300/64 scope link noprefixroute
valid_lft forever preferred_lft forever
......

However, after ovs-configuration.service has run, the network is no longer functioning.

2: enP32807p1s0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
link/ether 32:f4:c4:ec:23:00 brd ff:ff:ff:ff:ff:ff
3: enP49154p1s0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
link/ether 32:f4:c4:ec:23:00 brd ff:ff:ff:ff:ff:ff permaddr 32:f4:ca:4e:53:01
8: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovs-system state UP group default qlen 1000
link/ether 32:f4:c4:ec:23:00 brd ff:ff:ff:ff:ff:ff
9: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 32:f4:c4:ec:23:00 brd ff:ff:ff:ff:ff:ff
inet 192.168.42.6/24 brd 192.168.42.255 scope global noprefixroute br-ex
valid_lft forever preferred_lft forever
inet6 fe80::30f4:c4ff:feec:2300/64 scope link noprefixroute
valid_lft forever preferred_lft forever

At this point the MACs of the bond's slaves (enP32807p1s0, enP49154p1s0) are the same. The purpose of fail_over_mac=follow is to ensure the MACs are never the same, so the duplicate address prevents the bond from functioning. This initially appeared to be a problem with the bonding driver, but after tracing the calls NetworkManager makes to the bonding driver I found that the root of the problem is in configure-ovs.sh.

The function activate_nm_connections() attempts to activate all of its generated profiles that are not currently in the "activated" state. In my case the following profiles are activated one at a time in this order:
br-ex, ovs-if-phys0, enP32807p1s0-slave-ovs-clone, enP49154p1s0-slave-ovs-clone, ovs-if-br-ex

However, the generated profiles have autoconnect-slaves set. Therefore, when br-ex is activated, the state of ovs-if-phys0, enP32807p1s0-slave-ovs-clone and enP49154p1s0-slave-ovs-clone changes to "activating". Because the script only checks for the "activated" state, these profiles may be activated again. As the list is walked, some of the profiles automatically move from "activating" to "activated"; those interfaces are not activated a second time, while the others are. This leaves the bond in an unpredictable state. The bonding traces show why both slave interfaces end up with the same MAC.

My fix is to check for either the "activating" or the "activated" state.

--- configure-ovs.sh 2024-09-20 15:29:03.160536239 -0700
+++ configure-ovs.sh.patched 2024-09-20 15:33:38.040336032 -0700
@@ -575,8 +575,8 @@
         # But set the entry in master_interfaces to true if this is a slave
         # Also set autoconnect to yes
         local active_state=$(nmcli -g GENERAL.STATE conn show "$conn")
-        if [ "$active_state" == "activated" ]; then
-          echo "Connection $conn already activated"
+        if [ "$active_state" == "activated" ] || [ "$active_state" == "activating" ]; then
+          echo "Connection $conn already activated or activating"
           if $is_slave; then
             master_interfaces[$master_interface]=true
           fi

Additional environment details (platform, options, etc.):
Environment: IBM Power-VM
Kernel: 5.14.0-284.82.1.el9_2.ppc64le

oc version

Client Version: 4.15.30
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: 4.15.30
Kubernetes Version: v1.28.12+0c3c368

Network interface: Mellanox Technologies ConnectX Family mlx5Gen Virtual Functions (SR-IOV).

@djlwilder
Author

--- configure-ovs.sh	2024-09-20 15:29:03.160536239 -0700
+++ configure-ovs.sh.patched	2024-09-20 15:33:38.040336032 -0700
@@ -575,8 +575,8 @@
         # But set the entry in master_interfaces to true if this is a slave
         # Also set autoconnect to yes
         local active_state=$(nmcli -g GENERAL.STATE conn show "$conn")
-        if [ "$active_state" == "activated" ]; then
-          echo "Connection $conn already activated"
+        if [ "$active_state" == "activated" ] || [ "$active_state" == "activating" ]; then
+          echo "Connection $conn already activated or activating"
           if $is_slave; then
             master_interfaces[$master_interface]=true
           fi
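
The condition in the patch above could also be factored into a small helper so the two states are tested in one place. This is only a sketch: the function name `is_active_or_activating` is hypothetical and does not exist in configure-ovs.sh, and it assumes the state string comes from the same `nmcli -g GENERAL.STATE conn show` call the script already uses.

```shell
#!/bin/bash
# Hypothetical helper: succeed when the given connection state means the
# profile is already up ("activated") or on its way up ("activating").
# Any other value, including an empty string, is treated as "not active".
is_active_or_activating() {
  case "$1" in
    activated|activating) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage sketch inside the activation loop ($conn as in the patch above):
#   active_state=$(nmcli -g GENERAL.STATE conn show "$conn")
#   if is_active_or_activating "$active_state"; then
#     echo "Connection $conn already activated or activating"
#   fi
```

A `case` statement keeps the comparison readable if further states ever need to be treated the same way.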

djlwilder pushed a commit to djlwilder/machine-config-operator that referenced this issue Sep 23, 2024
Bonded network configurations with mode=active-backup and
fail_over_mac=follow are not functioning due to a race when
activating network profiles. activate_nm_connections() attempts
to activate all its generated profiles that are not currently
in the "active" state. As autoconnect-slaves is set, once
br-ex is activated the bond and all its slaves are automatically
activated. Their state is set to "activating" until they become
active. The "activating" state is not tested for, therefore some of
the subordinate profiles may be activated multiple times, causing a
race in the bonding driver and incorrectly configuring the bond.

Link: openshift#4605
Signed-off-by: David Wilder <[email protected]>
@djlwilder
Author

The attached kernel traces show the NetworkManager interaction with the bonding driver when configure-ovs.sh is run. bonding-trace-fixed.txt was captured with the patch installed. bond-trace-broken.txt shows how the MACs of the slaves are left set to the same value.

bonding-trace-fixed.txt
bond-trace-broken.txt

djlwilder pushed a commit to djlwilder/machine-config-operator that referenced this issue Dec 9, 2024
With bonded network configurations, slave interfaces are
implicitly activated after br-ex is explicitly activated. This
implicit activation can take a number of seconds. During this
time, if one and only one slave is explicitly activated, the bonding
driver may set the same MAC address on both slaves. This
causes the bond to fail when option fail_over_mac=follow is set.
This change gives bond slaves up to 5 seconds to implicitly
activate, preventing the issue.

Link: openshift#4605
Signed-off-by: David Wilder <[email protected]>
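
The idea of giving slaves a bounded amount of time to activate implicitly can be sketched as a polling loop. This is not the code from the referenced commit, only an illustration of the approach. The function names `get_conn_state` and `wait_for_implicit_activation` are hypothetical, and the one-second polling interval is an assumption; only the 5-second limit comes from the commit message.

```shell
#!/bin/bash
# Assumption: the state is queried with the same nmcli call that
# configure-ovs.sh uses. It is wrapped in a function so it can be
# replaced with a stub when testing.
get_conn_state() {
  nmcli -g GENERAL.STATE conn show "$1"
}

# Hypothetical helper: poll once per second until the connection reports
# "activated" or the timeout (default 5 seconds) expires.
# Returns 0 if the connection activated in time, 1 otherwise.
wait_for_implicit_activation() {
  local conn="$1"
  local timeout="${2:-5}"
  local waited=0
  while [ "$waited" -lt "$timeout" ]; do
    if [ "$(get_conn_state "$conn")" = "activated" ]; then
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  # One last check after the final sleep.
  [ "$(get_conn_state "$conn")" = "activated" ]
}

# Usage sketch: only activate explicitly if the implicit activation
# did not finish in time.
#   if ! wait_for_implicit_activation "$conn" 5; then
#     nmcli conn up "$conn"
#   fi
```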
@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 30, 2024
@djlwilder
Author

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 30, 2024