-
Notifications
You must be signed in to change notification settings - Fork 413
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCPBUGS-44185: Race in configure-ovs.sh affects bonding interface configuration. #4609
base: master
Are you sure you want to change the base?
Conversation
Bonded network configurations with mode=active-backup and fail_over_mac=follow are not functioning due to a race when activating network profiles. activate_nm_connections() attempts to activate all its generated profiles that are not currently in the "active" state. As autoconnect-slaves is set, once br-ex is activated the bond and all its slaves are automatically activated. Their state is set to "activating" until they become active. The "activating" state is not tested for therefor some of the subordinate profiles maybe activated multiple times causing a race in the bonding driver and incorrectly configuring the bond. Link: openshift#4605 Signed-off-by: David Wilder <[email protected]>
Hi @djlwilder. Thanks for your PR. I'm waiting for a openshift member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/ok-to-test |
@djlwilder: This pull request references Jira Issue OCPBUGS-44185, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
/assign cybertron @cybertron would you be able to take a look? |
/test e2e-azure-ovn-upgrade-out-of-change |
Hi All, thanks for reviewing my PR. Please let me know if you need further clarification on the issue or the fix. The debug traces I attached to issue 4605 should help explain the interaction between the NetworkManager and the bonding driver when configure_ovs.sh is executed. |
/retest-required |
1 similar comment
/retest-required |
Can you help me understand? If the profiles that we want active are active, and the profiles we want inactive are inactive, how is this a race in configure-ovs? If the observable state from configure-ovs matches what it wants, isn't the race in NetworkManager or some other place? There is two things that are being taken care of here that would be at risk with this change:
Unwanted profiles generally refers to profiles generated from kargs or leftovers from a previous boot if configure-ovs is told to use There is also some other considerations:
Overall, this looks to me as not a race in configure-ovs so not something that we can absolutely fix in configure-ovs. If we are going to mitigate here, I would have two asks:
|
Let me describe what I found in more detail so we can get on the same page. In my setup activate_nm_connections is called with the following list of profiles: I have added some debug to show the state that NetworkManager reports for each profile as we process the list. Here is a log showing the issue: (1) At the start of activate_nm_connections() profiles have no state (shown as None) (2) After br-ex is activated (nmcli conn up), ovs-if-phys0, enP32807p1s0-slave-ovs-clone and enP49154p1s0-slave-ovs-clone have a new state of "activating". Presumably NM is working to move these profiles to "active". (3) As the state of "activating" is not checked ovs-if-phys0 and enP32807p1s0-slave-ovs-clone are activated as well (nmcli conn up). (4) However, by the time we check the state of enP49154p1s0-slave-ovs-clone is already activated so no action is taken. The timing here is critical, in this next test I added a "sleep 1" after br-ex is reported as active. DEBUG-ShowNMStates: DebugLoc=top-of activate_nm_connections conn=br-ex State=None Finally, here is debug output with my change to configure-ovs.sh (checking for state=activating). DEBUG-ShowNMStates: DebugLoc=top-of activate_nm_connections conn=br-ex State=None So the problem is not if the profiles are active or not its how you got there. I suspect that by setting autoconnect-slaves=1 we are allowing NM The next question is why this is a problem. The debug traces I attached to issue 4605 should help explain hoe this affects the bonding driver. Thank you for your help. |
Anytime the NM has set a "activating" state we may have lost control over activation order already.
I agree
I have not tested with kargs, will do.
I agree, we have no control how long a NM will keep a profile in
Are you saying if "nmcli con up ovs-if-phys0" is run some point after configure-ovs is run? I definitely need to test this. Although I only see it as an issue if we follow up by activating the slaves one at a time as the script will do. (see below for more on this).
I hope this better explain the problem. In my example the problem was triggered because one *-slave-ovs-clone was explicitly activated and the other was not this is due to the unpredictable delay between activating->activated. If the timing was such that neither or both were explicitly activated the outcome would be different. (see also note above) Here is another approach that might address your concerns:
Ok, what should I do if some profiles don't get activated? |
I guess this is my question as well and thus my statement that the race must be somewhere else as activating a profile multiple times should not be a problem. However this is not a question we can answer, rather the RHEL team should. Anyway, I recognize the need to avoid the problem here as much as we can.
We are following the same order as the implicit activation order, except for the slaves which are technically activated at the same time and we have observed races with that implicit activation as well in the past like: With this I am saying that you might be fixing one race and introducing another one we already dealt with in the past. And that is why root causing this to RHEL becomes relevant.
Ignore, I thought the slave was already activating before we got into activate_nm_connections. I understand better what is going on now.
Would this work for you:
|
I am all for finding root cause. In this case I think that root cause is a limitation in the bonding drive coupled with how network manager interacts with it. I can open a bugzilla against RHEL for this. If you agree I would like to continue the path of avoiding the issue as long as we don't introduce more problems. I like your latest change suggestion, this should prevent the problem I am seeing. Let me experiment with this and I will post code and results soon. |
…ion." This reverts commit 50817d1.
With bonded network configurations slaves interfaces will be implicitly activate after br-ex is explicitly activated. This implicit activation can take a number of seconds, during this time if one and only one slave is explicitly activated the bonding driver may set the same MAC address to both slaves. This will cause the bond to fail when option fail_over_mac=follow is set. This change gives bond slaves up to 5 seconds to implicitly activate preventing the issue. Link: openshift#4605 Signed-off-by: David Wilder <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: djlwilder The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
I updated my fork with the changes as we discussed and ran unit tests with bonding configured and fail_over_mac=follow. Platform: PowerVM
Unit tests:
Verified bond0 is functional and slave MACS have swapped. |
/retest-required |
/retest |
/test e2e-azure-ovn-upgrade-out-of-change |
/test e2e-gcp-op-ocl |
1 similar comment
/test e2e-gcp-op-ocl |
@djlwilder: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
@jcaamano Have you had a chance to look at my updated commit? Thank you. |
Bonded network configurations with mode=active-backup and fail_over_mac=follow are not functioning due to a race when activating network profiles. activate_nm_connections() attempts to activate all its generated profiles that are not currently in the "active" state. As autoconnect-slaves is set, once br-ex is activated the bond and all its slaves are automatically activated. Their state is set to "activating" until they become active. The "activating" state is not tested for therefor some of the subordinate profiles maybe activated multiple times causing a race in the bonding driver and incorrectly configuring the bond.
Link: #4605
Fixes: #4605
Please provide the following information:
Please see #4605
Testing: verify a 2 slave bond with mode=active-backup and fail_over_mac=follow converts to a working OVS config.
The slaves should not have the same MAC. Other configs should be tested as well.
pull request for inclusion in the changelog: Fix race in configure-ovs.sh that affects bonding interface configuration.