Allow to send empty subscription and update version afterward #11264

Merged
ejona86 merged 12 commits into grpc:master
Jul 30, 2024

Conversation

lujiajing1126
Contributor

@lujiajing1126 lujiajing1126 commented Jun 5, 2024

Closes #11232

As described in the original issue:

If a CDS update revokes all subscribed resources, all EDS subscriptions will be cancelled.

adjustResourceSubscription issue

The first observation, from the istiod logs:

>>>>> CDS PUSH
2024-06-04T08:30:16.705222Z	info	ads	CDS: PUSH for node:e2e-service-consumer-base-5dd4cb9fbf-v6gjk.default resources:0 size:0B nonce:af2f0bd1-c785-4232-b381-de703b6e3e7d version:2024-06-04T08:30:16Z/7
>>>>> EDS PUSH immediately after CDS
2024-06-04T08:30:16.705384Z	info	ads	EDS: PUSH for node:e2e-service-consumer-base-5dd4cb9fbf-v6gjk.default resources:2 size:395B empty:0 cached:0/2 nonce:cea5b0b2-82d7-4c81-99cc-744ae9a7948e version:2024-06-04T08:30:16Z/7
2024-06-04T08:30:16.705464Z	debug	grpcgen	building lds for e2e-service-consumer-base-5dd4cb9fbf-v6gjk.default with filter:
map[e2e-service-provider.default.svc.cluster.local:{map[e2e-service-provider.default.svc.cluster.local:{}] map[80:{}]}]
2024-06-04T08:30:16.705535Z	info	ads	LDS: PUSH for node:e2e-service-consumer-base-5dd4cb9fbf-v6gjk.default resources:1 size:450B nonce:29cd5728-2464-48b7-ac40-1acb5bd2bc22 version:2024-06-04T08:30:16Z/7
2024-06-04T08:30:16.706144Z	info	ads	RDS: PUSH for node:e2e-service-consumer-base-5dd4cb9fbf-v6gjk.default resources:1 size:4.8kB nonce:1866c85b-4695-4b37-91b0-d9cb844b4181 version:2024-06-04T08:30:16Z/7
2024-06-04T08:30:16.710333Z	debug	ads	ADS:CDS: REQ e2e-service-consumer-base-5dd4cb9fbf-v6gjk.default-1 resources:2 nonce:af2f0bd1-c785-4232-b381-de703b6e3e7d version:2024-06-04T08:30:16Z/7 
2024-06-04T08:30:16.710475Z	debug	ads	ADS:CDS: ACK e2e-service-consumer-base-5dd4cb9fbf-v6gjk.default-1 2024-06-04T08:30:16Z/7 af2f0bd1-c785-4232-b381-de703b6e3e7d
2024-06-04T08:30:16.713459Z	debug	ads	ADS:EDS: REQ e2e-service-consumer-base-5dd4cb9fbf-v6gjk.default-1 resources:1 nonce:2dd9df3c-e64c-4a8e-9787-429e5dcf2719 version:2024-06-04T08:30:14Z/6 
2024-06-04T08:30:16.713508Z	debug	ads	ADS:EDS: REQ e2e-service-consumer-base-5dd4cb9fbf-v6gjk.default-1 Expired nonce received 2dd9df3c-e64c-4a8e-9787-429e5dcf2719, sent cea5b0b2-82d7-4c81-99cc-744ae9a7948e
>>>>>> NO resources:0 REQ log
2024-06-04T08:30:16.718041Z	debug	ads	ADS:LDS: REQ e2e-service-consumer-base-5dd4cb9fbf-v6gjk.default-1 resources:1 nonce:29cd5728-2464-48b7-ac40-1acb5bd2bc22 version:2024-06-04T08:30:16Z/7 
2024-06-04T08:30:16.718084Z	debug	ads	ADS:LDS: ACK e2e-service-consumer-base-5dd4cb9fbf-v6gjk.default-1 2024-06-04T08:30:16Z/7 29cd5728-2464-48b7-ac40-1acb5bd2bc22

If the EDS subscription list is empty, adjustResourceSubscription does not send a request, which is why no "resources:0" REQ appears in the log above.

resourceVersion update issue

If all resources of a type (e.g. EDS) are removed, the type's subscribedResourceTypeUrls entry is removed as well. This prevents the nonce from being updated by the upstream EDS PUSH.
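
The second symptom can be reproduced in a toy model. The sketch below is not the grpc-java implementation: `ForgetfulClient` and all of its methods are hypothetical names, and it assumes only the behaviour described above, namely that the per-type entry disappears with the last subscriber and that a response for a type that is no longer known is dropped before its nonce is recorded.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model (hypothetical names) of a client that forgets a type on last unsubscribe. */
final class ForgetfulClient {
  private final Map<String, Set<String>> subscribers = new HashMap<>();
  private final Set<String> subscribedTypeUrls = new HashSet<>();
  private final Map<String, String> lastNonce = new HashMap<>();

  void subscribe(String typeUrl, String name) {
    subscribers.computeIfAbsent(typeUrl, k -> new HashSet<>()).add(name);
    subscribedTypeUrls.add(typeUrl);
  }

  void unsubscribe(String typeUrl, String name) {
    Set<String> names = subscribers.get(typeUrl);
    if (names == null) {
      return;
    }
    names.remove(name);
    if (names.isEmpty()) {
      // Mirrors the cleanup discussed in this PR: the type entry goes away too.
      subscribers.remove(typeUrl);
      subscribedTypeUrls.remove(typeUrl);
    }
  }

  /** Returns true if the response was processed, false if it was ignored. */
  boolean onResponse(String typeUrl, String nonce) {
    if (!subscribedTypeUrls.contains(typeUrl)) {
      return false; // unknown type: response dropped, nonce never recorded
    }
    lastNonce.put(typeUrl, nonce);
    return true;
  }

  String lastNonce(String typeUrl) {
    return lastNonce.get(typeUrl);
  }
}
```

In this model, a push that arrives right after the last unsubscribe leaves the client holding the previous nonce, which matches the "Expired nonce received" line in the istiod log.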

linux-foundation-easycla bot commented Jun 5, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@@ -291,7 +291,6 @@ public void run() {
   }
   if (resourceSubscribers.get(type).isEmpty()) {
     resourceSubscribers.remove(type);
-    subscribedResourceTypeUrls.remove(type.typeUrl());
Member

We definitely want to clean up the resource. Nonces are per-resource? Then it seems when the client starts watches again the server should notice the lack of nonce. The issue might be instead that we aren't cleaning up AdsStream.respNonces?

Note that maybe we should do the new I/O you are causing in this PR, but maybe we allow sending the ACK even when subscribedResourceTypeUrls lacks the type.

Contributor

Instead of removing line 294, have it call a cleanup method on the subscriber.controlPlaneClient (if it isn't null) to remove the nonce. You'll have to create the cleanup method that you'll be calling.

Member

That sounds good to me. @lujiajing1126, if it turns out to be annoying to make that change, tell us and we'll see how we can help. Also, if you think that wouldn't fully address what you noticed, say so. I don't fully understand "adjustResourceSubscription issue;" it just looks like the same nonce issue to me.

Member

(Also, if you are uncertain about the changes, you can send them out before you update/fix any tests. A sort of early review.)

Contributor Author

> We definitely want to clean up the resource. Nonces are per-resource? Then it seems when the client starts watches again the server should notice the lack of nonce. The issue might be instead that we aren't cleaning up AdsStream.respNonces?

> Instead of removing line 294, have it call a cleanup method on the subscriber.controlPlaneClient (if it isn't null) to remove the nonce. You'll have to create the cleanup method that you'll be calling.

I agree with both of you. Instead of creating a cleanup method, I've merged cleanup logic into the existing adjustResourceSubscription method. PTAL

Contributor Author

> That sounds good to me. @lujiajing1126, if it turns out to be annoying to make that change, tell us and we'll see how we can help. Also, if you think that wouldn't fully address what you noticed, say so. I don't fully understand "adjustResourceSubscription issue;" it just looks like the same nonce issue to me.

Yes. Exactly

Contributor Author

> (Also, if you are uncertain about the changes, you can send them out before you update/fix any tests. A sort of early review.)

I tried to fix this issue based on the comment, without modifying the test cases.

Contributor

Please modify the failing test case to expect the nonce to be reset.

Contributor Author

> Please modify the failing test case to expect the nonce to be reset.

The test case has been fixed, with an additional helper to access the underlying private/package-private fields.

@ejona86 ejona86 requested a review from larry-safran June 5, 2024 16:30

@lujiajing1126
Contributor Author

Gentle Ping @larry-safran @ejona86

@larry-safran larry-safran added the kokoro:run label Jul 10, 2024
@grpc-kokoro grpc-kokoro removed the kokoro:run label Jul 10, 2024
@ejona86 ejona86 added the kokoro:run label Jul 10, 2024
@grpc-kokoro grpc-kokoro removed the kokoro:run label Jul 10, 2024
@ejona86
Member

ejona86 commented Jul 11, 2024

I'm not wild about the white box testing, especially since this issue was originally a problem for the control plane; it seems we should be able to test without exposing deep internals. I'd like to take a look and see if I can clean it up.

@ejona86 ejona86 added the TODO:release blocker label Jul 22, 2024
@ejona86
Member

ejona86 commented Jul 24, 2024

@lujiajing1126, I changed up the test. But while doing so I realized your problem was probably not caused by the nonce; that's only used for ACK/NACK reporting. The problem was more likely that version wasn't being cleared. Take a look at my changes and see if they seem right to you.

Re-requesting review from @larry-safran since clearing version is an important difference from before.

@ejona86 ejona86 requested a review from larry-safran July 24, 2024 19:05
Contributor

@larry-safran larry-safran left a comment

Cleaning up versions makes sense to me and I don't see any opportunity for harm.

@larry-safran larry-safran added the kokoro:run label Jul 24, 2024
@grpc-kokoro grpc-kokoro removed the kokoro:run label Jul 24, 2024
@lujiajing1126
Contributor Author

> @lujiajing1126, I changed up the test. But while doing so I realized your problem was probably not caused by the nonce; that's only used for ACK/NACK reporting. The problem was more likely that version wasn't being cleared. Take a look at my changes and see if they seem right to you.

> Re-requesting review from @larry-safran since clearing version is an important difference from before.

Thanks! LGTM

Version should be cleared as well.

@ejona86
Member

ejona86 commented Jul 25, 2024

Looking at the fix, I'm still not certain it is right. I'm seeing that we don't send an unsubscription for the last resource. Looking at the Go implementation, it seems it does send such a request. I'm thinking maybe we should send one last request to fully unsubscribe, and then delete the version+nonce, simply because we don't care about them any more. But it seems it shouldn't matter for correctness.

It was gnawing on me because we don't have a problem with keeping the version+nonce around when unsubscribing from a resource, if there are other resources still present. And that's because we send an updated request.

We really need to ask around for this, I think.

@lujiajing1126
Contributor Author

lujiajing1126 commented Jul 25, 2024

> Looking at the fix, I'm still not certain it is right. I'm seeing that we don't send an unsubscription for the last resource. Looking at the Go implementation, it seems it does send such a request. I'm thinking maybe we should send one last request to fully unsubscribe, and then delete the version+nonce, simply because we don't care about them any more. But it seems it shouldn't matter for correctness.

In my first commit, i.e. 037cdef, I did send an empty request to the xdsServer. As I understand it, this is required by the xDS protocol. To quote:

> Whenever the client receives a new response, it will send another request indicating whether or not the resources in the response were valid (see ACK/NACK and resource type instance version for details).

Besides, according to the Istio implementation, fully unsubscribing from resources would definitely help reduce memory consumption on the server side.

https://github.com/istio/istio/blob/6aa63cb62372ba9e79940df6895273725765b5bb/pkg/xds/server.go#L369
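
As a rough illustration of the quoted rule, the sketch below shows which values a follow-up request carries when the client accepts (ACK) or rejects (NACK) a response. The field names follow the xDS `DiscoveryRequest` message (`version_info`, `response_nonce`, `error_detail`); the class and method names are made up for this example and do not exist in grpc-java.

```java
/** Hypothetical helper showing which fields an ACK or a NACK request carries. */
final class AckNackSketch {
  static final class Reply {
    final String versionInfo;   // version_info of the follow-up request
    final String responseNonce; // response_nonce: always the nonce of the response
    final String errorDetail;   // error_detail: null for an ACK

    Reply(String versionInfo, String responseNonce, String errorDetail) {
      this.versionInfo = versionInfo;
      this.responseNonce = responseNonce;
      this.errorDetail = errorDetail;
    }

    boolean isNack() {
      return errorDetail != null;
    }
  }

  /**
   * Builds the follow-up request for one response.
   *
   * @param validationError null if the response was valid, otherwise the reason it was rejected
   */
  static Reply replyTo(
      String lastAcceptedVersion,
      String responseVersion,
      String responseNonce,
      String validationError) {
    if (validationError == null) {
      // ACK: adopt the version that came with the response.
      return new Reply(responseVersion, responseNonce, null);
    }
    // NACK: keep reporting the last version that was accepted.
    return new Reply(lastAcceptedVersion, responseNonce, validationError);
  }
}
```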

@ejona86
Member

ejona86 commented Jul 29, 2024

> In my first commit, i.e. 037cdef, I did send an empty request to the xdsServer.

I found the thread where it was suggested to remove it. I think the reasoning was just a bit mistaken. I've added it back. Also, I verified this won't be considered a wildcard watch.

All but one usage of getSubscribedResources() creates an empty collection when the method returns null. And in the one that doesn't, it is impossible for the result to be null. I purposefully chose not to change that as part of this CL to keep the bug fix to-the-point.
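
The behaviour discussed in this comment and the ones before it (send one last request with an empty resource list, then drop the per-type version and nonce) can be sketched as follows. This is a toy model under stated assumptions, not the merged change: `UnsubscribeSketch`, `Request`, and every method here are hypothetical, and requests are only recorded in a list instead of being written to a stream.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Toy model with hypothetical names; not the grpc-java ControlPlaneClient. */
final class UnsubscribeSketch {
  /** One outgoing request, as recorded by this sketch. */
  static final class Request {
    final String typeUrl;
    final List<String> names;
    final String version;
    final String nonce;

    Request(String typeUrl, List<String> names, String version, String nonce) {
      this.typeUrl = typeUrl;
      this.names = names;
      this.version = version;
      this.nonce = nonce;
    }
  }

  private final Map<String, TreeSet<String>> subscribed = new HashMap<>();
  private final Map<String, String> versions = new HashMap<>();
  private final Map<String, String> nonces = new HashMap<>();
  final List<Request> sent = new ArrayList<>();

  void subscribe(String typeUrl, String name) {
    subscribed.computeIfAbsent(typeUrl, k -> new TreeSet<>()).add(name);
    send(typeUrl);
  }

  void unsubscribe(String typeUrl, String name) {
    TreeSet<String> names = subscribed.get(typeUrl);
    if (names == null || !names.remove(name)) {
      return;
    }
    // Always tell the server, even when the list is now empty.
    send(typeUrl);
    if (names.isEmpty()) {
      // Nothing left for this type: drop all per-type state.
      subscribed.remove(typeUrl);
      versions.remove(typeUrl);
      nonces.remove(typeUrl);
    }
  }

  /** Accepts a response and ACKs it with the new version and nonce. */
  void onResponse(String typeUrl, String version, String nonce) {
    versions.put(typeUrl, version);
    nonces.put(typeUrl, nonce);
    send(typeUrl);
  }

  private void send(String typeUrl) {
    // A missing entry is treated as an empty subscription list.
    List<String> snapshot = new ArrayList<>();
    TreeSet<String> names = subscribed.get(typeUrl);
    if (names != null) {
      snapshot.addAll(names);
    }
    sent.add(new Request(typeUrl, snapshot,
        versions.getOrDefault(typeUrl, ""), nonces.getOrDefault(typeUrl, "")));
  }
}
```

With this shape, the final request after the last unsubscribe still carries the last version and nonce, and a later re-subscribe starts from an empty version. Note that the Jan 6, 2025 comment on this PR points to #11796, which changes nonce handling so that the nonce is remembered for the life of the ADS stream; the nonce removal above reflects only the discussion at this point in the thread.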

@lujiajing1126
Contributor Author

> In my first commit, i.e. 037cdef, I did send an empty request to the xdsServer.

> I found the thread where it was suggested to remove it. I think the reasoning was just a bit mistaken. I've added it back. Also, I verified this won't be considered a wildcard watch.

> All but one usage of getSubscribedResources() creates an empty collection when the method returns null. And in the one that doesn't, it is impossible for the result to be null. I purposefully chose not to change that as part of this CL to keep the bug fix to-the-point.

Thanks! LGTM

@ejona86 ejona86 added the kokoro:run label Jul 30, 2024
@grpc-kokoro grpc-kokoro removed the kokoro:run label Jul 30, 2024
@ejona86 ejona86 merged commit 448ec4f into grpc:master Jul 30, 2024
13 checks passed
@ejona86 ejona86 added the TODO:backport label and removed the TODO:release blocker label Jul 30, 2024
ejona86 pushed a commit to ejona86/grpc-java that referenced this pull request Aug 2, 2024
Otherwise, the server will continue sending updates and if we
re-subscribe to the last resource, the server won't re-send it. Also
completely remove the per-type state, as it could only add confusion.
ejona86 pushed a commit that referenced this pull request Aug 2, 2024
Otherwise, the server will continue sending updates and if we
re-subscribe to the last resource, the server won't re-send it. Also
completely remove the per-type state, as it could only add confusion.
@ejona86 ejona86 removed the TODO:backport label Aug 6, 2024
@larry-safran larry-safran mentioned this pull request Aug 14, 2024
@lujiajing1126 lujiajing1126 deleted the fix-xds-client branch September 24, 2024 09:05
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 24, 2024
@ejona86
Member

ejona86 commented Jan 6, 2025

#11796 fixes nonce handling; it needs to be remembered for the life of the ADS stream.
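
One possible reading of "remembered for the life of the ADS stream" is sketched below: the nonce map belongs to the stream object, unsubscribing never clears it, and only a replacement stream starts empty. This is an illustration with hypothetical names (`StreamNonceSketch`, `Stream`, `restart`), not a description of what #11796 actually changed.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: nonce state scoped to a stream rather than to subscriptions. */
final class StreamNonceSketch {
  /** Stand-in for one ADS stream; its nonces live and die with it. */
  static final class Stream {
    private final Map<String, String> nonces = new HashMap<>();

    void onResponse(String typeUrl, String nonce) {
      nonces.put(typeUrl, nonce);
    }

    /** Unsubscribing from every resource of a type leaves the nonce in place. */
    void onLastUnsubscribe(String typeUrl) {
      // Intentionally empty: the nonce is kept until the stream itself goes away.
    }

    String nonceFor(String typeUrl) {
      return nonces.getOrDefault(typeUrl, "");
    }
  }

  private Stream stream = new Stream();

  Stream current() {
    return stream;
  }

  /** A replacement stream starts with no nonces. */
  Stream restart() {
    stream = new Stream();
    return stream;
  }
}
```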

Development

Successfully merging this pull request may close these issues.

[xDS] NACK/ACK should be always reported even if subscribedResourceTypeUrls cannot be found
4 participants