-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
clusterresolver: Avoid blocking for subsequent resolver updates in test #7937
base: master
Are you sure you want to change the base?
Changes from 1 commit
7662fd8
26e0bcc
e6dede4
e8dc2ed
bb960bf
8d114e6
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
@@ -1232,7 +1232,11 @@ func (s) TestEDS_EndpointWithMultipleAddresses(t *testing.T) { | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
bd.Data.(balancer.Balancer).Close() | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
}, | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
UpdateClientConnState: func(bd *stub.BalancerData, ccs balancer.ClientConnState) error { | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
resolverUpdateCh <- ccs.ResolverState | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
select { | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
case resolverUpdateCh <- ccs.ResolverState: | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
default: | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
// Don't block forever in case of multiple updates. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
} | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I received some comments against performing non-deterministic writes to channels like this in tests because it can lead to flakiness as well. In the PR description, you mentioned that when the test failed, there was no logs. Were you able to repro this on g3 with logs enabled to figure out the exact cause of the problem? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I was able to consistently repro the failure in g3. There is a race which causes duplicate resolver updates to be sent to the leaf round robin policy. Normal flow
In this flow, only one update is sent to the round_robin balancer. Exceptional flow
In this flow the leaf round robin gets two updates. Since the channel used to spy on resolver updates has capacity 1, the second write blocks indefinitely. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for the detailed explanation. The fix you have here is the simplest one. But it still means that there is some non-determinism in the test. One way to handle this would be as follows:
I understand this will result in more changes compared to yours, but I feel this gets rid of the non-determinism that comes out of dropping updates from the parent. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also, while you are here, if you could change calls to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Done. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I tried following the mentioned approach. It seems to solve the flakiness by delaying the creation of the round robin balancer. However, if the clusterresolver LB policy is created with no localities, it logs an error causing the test to fail. grpc-go/xds/internal/balancer/clusterresolver/clusterresolver.go Lines 329 to 332 in 063d352
If I provide empty localities in the initial xds resources, the child round robin is created and we end up with a similar problem of getting either 1 or 2 updates. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. To avoid the issue with determining the channel size, I switched to using a mutex instead of a channel. The round robin picker may still see 1 or 2 updates, but the test will pass regardless of which update is used for comparison since the updates are duplicates. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
return bd.Data.(balancer.Balancer).UpdateClientConnState(ccs) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
}, | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
}) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would be more concise and express the same thing if it used an
atomic.Pointer
instead of a separate mutex.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to use an atomic pointer.