
clusterresolver: Avoid blocking for subsequent resolver updates in test #7937

Open
wants to merge 6 commits into master

Conversation

arjan-bal (Contributor) commented Dec 17, 2024

Fixes: #7961

This test was added recently in #7858. In the reported failure, the stack trace indicates that a goroutine was blocked forever, but there were no failure logs. This change avoids writing to the resolver update channel when its buffer is full.

RELEASE NOTES: N/A

codecov bot commented Dec 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.95%. Comparing base (e8055ea) to head (8d114e6).
Report is 10 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7937      +/-   ##
==========================================
- Coverage   82.08%   81.95%   -0.13%     
==========================================
  Files         379      381       +2     
  Lines       38261    38535     +274     
==========================================
+ Hits        31406    31582     +176     
- Misses       5551     5630      +79     
- Partials     1304     1323      +19     

see 29 files with indirect coverage changes

Comment on lines 1235 to 1239
select {
case resolverUpdateCh <- ccs.ResolverState:
default:
	// Don't block forever in case of multiple updates.
}
Contributor
I have received comments in the past against performing non-deterministic writes to channels like this in tests, because it can lead to flakiness as well.

In the PR description, you mentioned that when the test failed, there were no logs. Were you able to repro this on g3 with logs enabled to figure out the exact cause of the problem?

Contributor Author
I was able to consistently repro the failure in g3. There is a race which causes duplicate resolver updates to be sent to the leaf round robin policy.

Normal flow

  1. clusterresolver starts the eds watch by calling resourceResolver.updateMechanisms:
    func (rr *resourceResolver) updateMechanisms(mechanisms []DiscoveryMechanism) {
  2. After the EDS watch is started, updateMechanisms attempts to send an update to the child balancers by calling generateLocked:
  3. generateLocked sees that EDS has not yet produced its first result, so the children are not updated:
    func (rr *resourceResolver) generateLocked(onDone xdsresource.OnDoneFunc) {
        var ret []priorityConfig
        for _, rDM := range rr.children {
            u, ok := rDM.r.lastUpdate()
            if !ok {
                // Don't send updates to parent until all resolvers have update to
                // send.
                onDone()
                return
            }
  4. When EDS produces its first result, it calls resourceResolver.onUpdate, which queues a call to generateLocked:
    func (rr *resourceResolver) onUpdate(onDone xdsresource.OnDoneFunc) {
        handleUpdate := func(context.Context) {
            rr.mu.Lock()
            rr.generateLocked(onDone)
            rr.mu.Unlock()
        }
        rr.serializer.ScheduleOr(handleUpdate, func() { onDone() })
    }
  5. This time generateLocked updates the child balancer with a new resolver state.

In this flow, only one update is sent to the round_robin balancer.

Exceptional flow

  1. clusterresolver starts the eds watch by calling resourceResolver.updateMechanisms:
    func (rr *resourceResolver) updateMechanisms(mechanisms []DiscoveryMechanism) {
  2. While updateMechanisms is still executing, EDS produces its first result and calls resourceResolver.onUpdate, which queues a call to generateLocked:
    func (rr *resourceResolver) onUpdate(onDone xdsresource.OnDoneFunc) {
        handleUpdate := func(context.Context) {
            rr.mu.Lock()
            rr.generateLocked(onDone)
            rr.mu.Unlock()
        }
        rr.serializer.ScheduleOr(handleUpdate, func() { onDone() })
    }
  3. updateMechanisms attempts to send an update to the child balancers by calling generateLocked:
  4. Since EDS has produced one result, generateLocked updates the child with new resolver state.
  5. The call to generateLocked queued in step 2 is executed, updating the child balancer again.

In this flow the leaf round robin gets two updates. Since the channel used to spy on resolver updates has capacity 1, the second write blocks indefinitely.
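To make the blocking concrete, the spy channel behaves like this (a minimal illustration, not the actual test code; firstUpdate and secondUpdate are placeholder values):

    resolverUpdateCh := make(chan resolver.State, 1)
    resolverUpdateCh <- firstUpdate  // fills the buffer and returns immediately
    resolverUpdateCh <- secondUpdate // buffer is full and nothing reads yet, so this send blocks forever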

Contributor
Thanks for the detailed explanation.

The fix you have here is the simplest one. But it still means that there is some non-determinism in the test. One way to handle this would be as follows:

  • In t.Run() change the order of the steps:
    • Start the management server, and override the OnStreamRequest to get notified when the EDS request specific to the test is being requested.
    • Create the xDS client.
    • Create the manual resolver and the grpc channel (and ask it to connect).
    • Wait for the EDS resource to be requested from the management server.
    • Now, configure the resource on the management server.
    • Make the RPC and verify that it succeeds.
    • Verify that the expected update is pushed to the child policy.

I understand this will result in more changes compared to yours, but I feel this gets rid of the non-determinism that comes out of dropping updates from the parent.
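A rough sketch of the notification part of that flow (the helper signatures and the type URL check below are assumptions based on how other xDS e2e tests are structured, not the exact code):

    // e2e is google.golang.org/grpc/internal/testutils/xds/e2e;
    // v3discoverypb is github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3.
    edsRequestedCh := make(chan struct{}, 1)
    mgmtServer := e2e.StartManagementServer(t, e2e.ManagementServerOptions{
        OnStreamRequest: func(_ int64, req *v3discoverypb.DiscoveryRequest) error {
            // Signal (at most once) when the client asks for ClusterLoadAssignment resources.
            if req.GetTypeUrl() == "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment" {
                select {
                case edsRequestedCh <- struct{}{}:
                default:
                }
            }
            return nil
        },
    })

    // ... create the xDS client, the manual resolver, and the grpc channel ...

    // Wait for the EDS resource to be requested before configuring it on mgmtServer.
    // ctx is the test's deadline context.
    select {
    case <-edsRequestedCh:
    case <-ctx.Done():
        t.Fatal("Timed out waiting for the EDS resource to be requested")
    }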

Contributor
Also, while you are here, if you could replace calls to net.Listen("tcp", "localhost:0") with calls to testutils.LocalTCPListener(), that would be great too. The implementation of the latter is exactly the same as the former in OSS, but in forge it uses the portpicker to pick a free port before calling net.Listen, because we have had flakes on forge in the past when using 0 for the port number.
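For reference, the swap is mechanical; a sketch (error messages are illustrative):

    // Before:
    lis, err := net.Listen("tcp", "localhost:0")
    if err != nil {
        t.Fatalf("net.Listen() failed: %v", err)
    }

    // After: testutils (google.golang.org/grpc/internal/testutils) picks a free
    // port via portpicker in forge before listening, avoiding port-0 flakes.
    lis, err := testutils.LocalTCPListener()
    if err != nil {
        t.Fatalf("testutils.LocalTCPListener() failed: %v", err)
    }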

Contributor Author
Also, while you are here, if you could replace calls to net.Listen("tcp", "localhost:0") with calls to testutils.LocalTCPListener()?

Done.

Contributor Author
I tried the suggested approach. It seems to solve the flakiness by delaying the creation of the round robin balancer. However, if the clusterresolver LB policy is created with no localities, it logs an error, causing the test to fail:

if b.child == nil {
    b.logger.Errorf("xds: received ExitIdle with no child balancer")
    break
}

If I provide empty localities in the initial xds resources, the child round robin is created and we end up with a similar problem of getting either 1 or 2 updates.

Contributor Author
To avoid the issue with determining the channel size, I switched to using a mutex instead of a channel. The round robin picker may still see 1 or 2 updates, but the test will pass regardless of which update is used for comparison since the updates are duplicates.
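Roughly, the mutex-based version looks like this (a sketch; variable names are illustrative, not the exact diff):

    var (
        mu               sync.Mutex
        gotResolverState resolver.State
    )
    // Inside the stub balancer registered by the test:
    UpdateClientConnState: func(bd *stub.BalancerData, ccs balancer.ClientConnState) error {
        mu.Lock()
        gotResolverState = ccs.ResolverState
        mu.Unlock()
        return bd.Data.(balancer.Balancer).UpdateClientConnState(ccs)
    },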

Contributor Author
@dfawley can you PTAL since @easwars is on leave.

@easwars easwars assigned arjan-bal and unassigned easwars Dec 17, 2024
@arjan-bal arjan-bal assigned easwars and unassigned arjan-bal Dec 18, 2024
@easwars easwars assigned arjan-bal and unassigned easwars Dec 20, 2024
@arjan-bal arjan-bal requested a review from dfawley December 23, 2024 09:19
@arjan-bal arjan-bal assigned dfawley and unassigned arjan-bal Dec 23, 2024
@@ -1232,7 +1233,9 @@ func (s) TestEDS_EndpointWithMultipleAddresses(t *testing.T) {
		bd.Data.(balancer.Balancer).Close()
	},
	UpdateClientConnState: func(bd *stub.BalancerData, ccs balancer.ClientConnState) error {
		resolverUpdateCh <- ccs.ResolverState
Member
I think this would be more concise and express the same thing if it used an atomic.Pointer instead of a separate mutex.
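For example (a sketch of the suggestion; names are illustrative):

    var gotResolverState atomic.Pointer[resolver.State]
    // Inside the stub balancer:
    UpdateClientConnState: func(bd *stub.BalancerData, ccs balancer.ClientConnState) error {
        state := ccs.ResolverState
        gotResolverState.Store(&state)
        return bd.Data.(balancer.Balancer).UpdateClientConnState(ccs)
    },
    // Later, the test reads gotResolverState.Load() and compares it against the
    // expected endpoints.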

Contributor Author
Changed to use an atomic pointer.

@dfawley dfawley assigned arjan-bal and unassigned dfawley Dec 23, 2024
@arjan-bal arjan-bal assigned dfawley and unassigned arjan-bal Dec 24, 2024
Development

Successfully merging this pull request may close these issues.

Flaky test: Test/EDS_EndpointWithMultipleAddresses