Traffic disturbance 2 minutes after node restart #11787

ljkiraly · 2024-04-12T08:32:11Z

Expected Behavior

A node restart should not have any impact on traffic between elements running on other nodes.

Current Behavior

Two minutes after a worker restart there was a traffic outage.

Failure Information

We cannot reproduce this on demand, but it fails often in nightly tests. Logs from a failed test run are in
traffic_outage_after_node_reboot_log.tar.gz

The node reboot is at:
[2024-04-05T13:45:03.923Z] robustness-node-restart-test.sh: Rebooting node: worker-pool1-1dn6k2vc-n121-vpod1-pnes8010-ipv4

Traffic was stopped between [2024-04-05T13:47:06.910Z] and [2024-04-05T13:48:12.064Z]
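To make the timing explicit, here is a minimal sketch that computes the delay between the reboot and the start of the outage, and the outage length, from the three timestamps quoted above. The function names are only illustrative and are not part of any NSM tooling.

```python
from datetime import datetime

# Timestamps copied verbatim from the log excerpts in this report (UTC).
REBOOT = "2024-04-05T13:45:03.923Z"
OUTAGE_START = "2024-04-05T13:47:06.910Z"
OUTAGE_END = "2024-04-05T13:48:12.064Z"


def parse_ts(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp with milliseconds and a trailing 'Z'."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")


def seconds_between(start: str, end: str) -> float:
    """Return the elapsed time from start to end in seconds."""
    return (parse_ts(end) - parse_ts(start)).total_seconds()


delay_after_reboot = seconds_between(REBOOT, OUTAGE_START)
outage_duration = seconds_between(OUTAGE_START, OUTAGE_END)

print(f"outage started {delay_after_reboot:.3f} s after the reboot")
print(f"outage lasted {outage_duration:.3f} s")
```

So the outage began roughly two minutes after the reboot command and lasted a little over one minute.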

Context

NSM Version: v1.13.0-rc2
The issue can be seen with NSM v1.12.1-rc.1 also.

denis-tingaikin · 2024-04-12T09:27:27Z

> NSM Version: v1.13.0-rc1
> The issue can be seen with NSM v1.12.1-rc.1 also.

Hm, as far as I know, we fixed something similar in v1.13.0.
Have you tried it on v1.13.0?

ljkiraly · 2024-04-12T11:38:54Z

> > NSM Version: v1.13.0-rc1
> > The issue can be seen with NSM v1.12.1-rc.1 also.
>
> Hm, as far as I know, we fixed something similar in v1.13.0. Have you tried it on v1.13.0?

The logs are from a test run with v1.13.0-rc1. Is there a difference between v1.13.0 and v1.13.0-rc1? I only mentioned NSM v1.12.1-rc.1 to clarify that this is not a new bug. It is considered a medium-priority issue.

denis-tingaikin · 2024-04-12T11:55:20Z

> It is considered a medium-priority issue.

OK, good that it's not critical.

> The logs are from a test run with v1.13.0-rc1. Is there a difference between v1.13.0 and v1.13.0-rc1?

Yes, there is a difference. We fixed a few bugs, such as #11372, in v1.13.0. v1.13.0-rc.1 doesn't contain the fix, but v1.13.0-rc.2 does.

ljkiraly · 2024-04-15T06:55:56Z

Ah, I missed the version, sorry: the logs are from a test run with NSM v1.13.0-rc2. Fixing in description.

NikitaSkrynnik · 2024-07-09T09:51:07Z

We've checked several NSM versions and all of them have the same problem:

v1.13.2-rc.1
v1.13.0
v1.12.0
v1.11.2

NikitaSkrynnik · 2024-07-22T03:04:36Z

Current State

We found several problems that may occur after restarting a node:

1. Ping doesn't work periodically (periods are of the same length)

This issue is related to some bugs in point2point IPAM. Here is the draft fix for this: networkservicemesh/sdk#1647

2. `begin` queues requests but never executes them

The bug is the same as in networkservicemesh/cmd-forwarder-vpp#1134

3. A lot of additional unused routes on clients

After several node restarts some of the clients can have additional routes. Example:

default via 10.244.2.1 dev eth0 
10.244.2.0/24 via 10.244.2.1 dev eth0  src 10.244.2.12 
10.244.2.1 dev eth0 scope link  src 10.244.2.12 
172.16.0.34 dev nsm-v4 
172.16.0.40 dev nsm-v4 
172.16.0.56 dev nsm-v4 
172.16.0.92 dev nsm-v4

Only one of these addresses can be pinged. This problem is also related to point2point IPAM.

4. Sometimes ping doesn't work for a while after node restart but after some time starts to work again

The cause of this behaviour is still unknown. It might be related to some events being missed after the node is restarted. Still in progress.
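For the extra routes described in item 3, a quick check could look like the sketch below. It parses a saved copy of the route listing shown above and collects every destination that is routed through the `nsm-v4` interface. The helper name `routes_on_device` is hypothetical, and the sketch only counts the routes; it does not determine which one is still reachable.

```python
from typing import List

# Route table copied from the example in item 3.
ROUTES = """\
default via 10.244.2.1 dev eth0
10.244.2.0/24 via 10.244.2.1 dev eth0  src 10.244.2.12
10.244.2.1 dev eth0 scope link  src 10.244.2.12
172.16.0.34 dev nsm-v4
172.16.0.40 dev nsm-v4
172.16.0.56 dev nsm-v4
172.16.0.92 dev nsm-v4
"""


def routes_on_device(route_table: str, device: str) -> List[str]:
    """Return the destination of every route whose 'dev' field equals device."""
    destinations = []
    for line in route_table.splitlines():
        fields = line.split()
        if "dev" not in fields:
            continue
        idx = fields.index("dev")
        # The interface name is the token that follows the 'dev' keyword.
        if idx + 1 < len(fields) and fields[idx + 1] == device:
            destinations.append(fields[0])
    return destinations


nsm_routes = routes_on_device(ROUTES, "nsm-v4")
print(f"{len(nsm_routes)} routes on nsm-v4: {', '.join(nsm_routes)}")
```

Since only one of these addresses answers ping, any count above one on this interface points to leftover routes from earlier connections.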

NikitaSkrynnik · 2024-07-23T03:16:29Z

Found the solution for the fourth bug. It's in point2point IPAM again. Here is the PR that fixes issues 1 and 4 for point2point IPAM: networkservicemesh/sdk#1647

NikitaSkrynnik · 2024-07-23T14:46:31Z

NSE Image with fixes: nikitaxored/cmd-nse-icmp-responder:ipam-fix

szvincze · 2024-07-24T11:38:11Z

@NikitaSkrynnik: We tested the node restart scenario with this image and it was successful each time, so it seems this fix solves the problem.

NikitaSkrynnik · 2024-08-27T15:34:17Z

@szvincze: to check whether this issue is resolved in v1.14.0-rc.1, you can pass the environment variable NSM_IPAM_POLICY=strict to the NSE. See the example here: https://github.com/networkservicemesh/deployments-k8s/pull/12259/files#diff-0b132d08281a44a9d5d126bb154725aa05b4a1057c07158fdba858653c513c7cR31-R32

szvincze · 2024-09-25T13:59:06Z

@NikitaSkrynnik: We have verified it in an environment where we evaluate NSM releases and use the NSE/NSC from those releases. There we previously had several issues: traffic disturbance after a worker node restart once the pods were back, a temporary traffic outage of more than 30 seconds for one NSE instance, and several outages on the other traffic instances. Based on our tests, we haven't observed these issues with the latest release.

However, @ljkiraly reported this issue from an environment where we use custom endpoints and clients, and there we unfortunately still experience the same behavior.

denis-tingaikin self-assigned this Apr 12, 2024

denis-tingaikin added this to Release v1.14.0 Apr 12, 2024

denis-tingaikin moved this to In Progress in Release v1.14.0 Apr 12, 2024

denis-tingaikin moved this from In Progress to Blocked in Release v1.14.0 Apr 12, 2024

denis-tingaikin added the bug Something isn't working label Apr 12, 2024

denis-tingaikin moved this from Blocked to Todo in Release v1.14.0 Apr 16, 2024

denis-tingaikin assigned NikitaSkrynnik Jul 5, 2024

NikitaSkrynnik moved this from Todo to In Progress in Release v1.14.0 Jul 5, 2024

denis-tingaikin removed their assignment Jul 9, 2024

NikitaSkrynnik mentioned this issue Jul 22, 2024

Add a new wrapper for point2point IPAM that filters invalid addresses and routes networkservicemesh/sdk#1647

Closed

9 tasks

Ex4amp1e moved this from In Progress to Blocked in Release v1.14.0 Jul 23, 2024

Ex4amp1e moved this from Blocked to In Progress in Release v1.14.0 Aug 14, 2024

This was referenced Aug 21, 2024

Add an ability to choose IPAM policy networkservicemesh/cmd-nse-icmp-responder#612

Merged

Add an example that shows how NSE's IPAM Policies work #12259

Merged

denis-tingaikin moved this from In Progress to Under review in Release v1.14.0 Aug 21, 2024
