OCPBUGS-47505: pkg/operator/credentialsrequest: hasRecentlySynced backoff for STS too #812
base: master
Conversation
@wking: This pull request references Jira Issue OCPBUGS-47505, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
Requesting review from QA contact. The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: wking
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Force-pushed from 7f2f807 to 5a40e2b
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:
@@ Coverage Diff @@
## master #812 +/- ##
==========================================
+ Coverage 47.01% 47.03% +0.02%
==========================================
Files 97 97
Lines 11873 11876 +3
==========================================
+ Hits 5582 5586 +4
Misses 5676 5676
+ Partials 615 614 -1
/test e2e-aws-manual-oidc
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cloud-credential-operator/812/pull-ci-openshift-cloud-credential-operator-master-e2e-aws-manual-oidc/1870223724137943040/artifacts/e2e-aws-manual-oidc/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
$ zgrep -h '"resource":"credentialsrequests"' kube-apiserver/*audit*.log.gz | jq -r '.verb + " " + (.responseStatus.code | tostring) + " " + (.objectRef | .resource + " " + .namespace) + " " + .user.username + " " + .userAgent' | sort | uniq -c | sort -n | tail -n3
147 get 200 credentialsrequests openshift-cloud-credential-operator system:serviceaccount:kube-system:generic-garbage-collector kube-controller-manager/v1.31.3 (linux/amd64) kubernetes/3c62f73/system:serviceaccount:kube-system:generic-garbage-collector
1176 get 200 credentialsrequests openshift-cloud-credential-operator system:serviceaccount:openshift-cluster-version:default cluster-version-operator/v0.0.0 (linux/amd64) kubernetes/$Format
7761 update 200 credentialsrequests openshift-cloud-credential-operator system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator cloud-credential-operator/v0.0.0 (linux/amd64) kubernetes/$Format

So still lots of updates.

/hold
/test e2e-aws-manual-oidc
Force-pushed from 2273526 to 9d55785
/test e2e-aws-manual-oidc
Force-pushed from 9d55785 to 58662ad
/test e2e-aws-manual-oidc
2 similar comments
/test e2e-aws-manual-oidc
/test e2e-aws-manual-oidc
Force-pushed from b7ac447 to 5d83037
/test e2e-aws-manual-oidc
Force-pushed from 7d7c50a to 166a88b
Getting closer :) The current issue seems to be:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cloud-credential-operator/812/pull-ci-openshift-cloud-credential-operator-master-e2e-aws-manual-oidc/1872474533076668416/artifacts/e2e-aws-manual-oidc/gather-extra/artifacts/pods/openshift-cloud-credential-operator_cloud-credential-operator-57d4b598f4-xnsqv_cloud-credential-operator.log | grep 'adding label selector\|ANDed.*ingress' | head -n3
time="2024-12-27T03:48:44Z" level=info msg="adding label selector cloudcredential.openshift.io/credentials-request=true to cache options for Secrets"
time="2024-12-27T03:48:52Z" level=info msg="The above are ANDed together to determine: lastsyncgeneration is current and lastsynctimestamp < 1h0m0s" NOT cloudCredsSecretUpdated=false NOT hasActiveFailureConditions=false NOT isInfrastructureUpdated=true NOT isStale=true cr.Status.Provisioned=false crSecretExists=false hasRecentlySynced=false infraResourceVersion=524 infraResourceVersionSynced=524 name=openshift-ingress
time="2024-12-27T03:48:55Z" level=info msg="The above are ANDed together to determine: lastsyncgeneration is current and lastsynctimestamp < 1h0m0s" NOT cloudCredsSecretUpdated=false NOT hasActiveFailureConditions=false NOT isInfrastructureUpdated=true NOT isStale=false cr.Status.Provisioned=true crSecretExists=false hasRecentlySynced=true infraResourceVersion=524 infraResourceVersionSynced=524 name=openshift-ingress
grep: write error: Broken pipe

because the Secret lacks the label:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cloud-credential-operator/812/pull-ci-openshift-cloud-credential-operator-master-e2e-aws-manual-oidc/1872474533076668416/artifacts/e2e-aws-manual-oidc/gather-must-gather/artifacts/must-gather.tar | tar -xOz registry-build09-ci-openshift-org-ci-op-dfvtff1y-stable-sha256-3c1ee39e49aa50c29669b65a9a9b84b0c777bc324477a6e898e5afd0359752b9/namespaces/openshift-ingress-operator/core/secrets.yaml | yaml2json | jq '.items[].metadata | select(.name == "cloud-credentials") | del(.managedFields)'
{
"creationTimestamp": "2024-12-27T03:43:22Z",
"name": "cloud-credentials",
"namespace": "openshift-ingress-operator",
"resourceVersion": "636",
"uid": "27c83668-eb5f-49f0-b2d3-7227d149ccbf"
}

But for some reason, the operator thinks a filtered watch is possible; see the filteredWatchPossible logic.
Force-pushed from 6a4d449 to 2aa54ca
/test e2e-aws-manual-oidc
1 similar comment
/test e2e-aws-manual-oidc
Force-pushed from b6570f4 to ebfa76e
/test e2e-aws-manual-oidc
Because "failed to process" is hard to debug unless you have the detailed error message explaining why processing failed.
Log the configured duration instead of the hard-coded "an hour ago". Catching up with 452bbc4 (add new credentials field for AWS Secrets, 2020-11-03, openshift#264), which created the syncPeriod variable. This pivot avoids the risk that a later change to syncPeriod leaves the log message out of sync with the new duration value.
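The idea of deriving the log text from the variable can be sketched in Go. The names syncPeriod and recentlySyncedMessage are illustrative stand-ins, not the operator's actual identifiers; only the message text matches the logs quoted elsewhere in this thread:

```go
package main

import (
	"fmt"
	"time"
)

// syncPeriod stands in for the operator's configurable back-off window.
const syncPeriod = time.Hour

// recentlySyncedMessage builds the log text from the variable, so a future
// change to syncPeriod can never leave a hard-coded "an hour ago" stale.
func recentlySyncedMessage() string {
	return fmt.Sprintf("lastsyncgeneration is current and lastsynctimestamp < %v", syncPeriod)
}

func main() {
	// time.Duration's %v formatting renders time.Hour as "1h0m0s".
	fmt.Println(recentlySyncedMessage())
}
```

With syncPeriod set to time.Hour this produces exactly the "lastsynctimestamp < 1h0m0s" wording seen in the CI logs above.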
…condition Adding a "NOT" to the logged cloudCredsSecretUpdated field, because the following 'if' condition is !cloudCredsSecretUpdated. The lack of "NOT" seems to have been an accidental oversight when the logging fields were added in 0a0d849 (Changes to address PR comments from Steve ~3d ago, 2023-06-27, openshift#542). I'm also adding isInfrastructureUpdated logging to catch up with cea55c6 (Added implementation for AWS Day2 Tag reconcilation Support, 2024-09-24, openshift#759), when it was added to the 'if' condition but overlooked in field logging.
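A minimal sketch of that logging convention, assuming a hypothetical conditionFields helper (the operator logs via structured fields; this only shows the "NOT "-prefixed keys mirroring negated terms of the 'if' condition):

```go
package main

import "fmt"

// conditionFields gives every input to the upcoming `if` a log field.
// Inputs that the condition negates are keyed with a "NOT " prefix, so the
// rendered log line reads like the condition itself, while the raw value
// is still what gets logged. Field names follow the CI logs quoted above.
func conditionFields(cloudCredsSecretUpdated, hasActiveFailureConditions, isInfrastructureUpdated bool) map[string]bool {
	return map[string]bool{
		"NOT cloudCredsSecretUpdated":    cloudCredsSecretUpdated,
		"NOT hasActiveFailureConditions": hasActiveFailureConditions,
		"NOT isInfrastructureUpdated":    isInfrastructureUpdated,
	}
}

func main() {
	// Map iteration order is unspecified; print count rather than order.
	fields := conditionFields(false, false, true)
	fmt.Println(len(fields))
}
```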
e9f9cc6 (Add & logic - new token CredReq.spec.cred* fields, 2023-06-27, openshift#542) created the STS-specific branch here, and shifted the pre-existing hasRecentlySynced check to the non-STS branch. But that's leading to hot update loops, as the reconciler bangs away bumping status.lastSyncTimestamp (which we've had since the initial cloud-cred operator pull request [1]). For example in this recent CI run [2]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cloud-credential-operator/789/pull-ci-openshift-cloud-credential-operator-master-e2e-aws-manual-oidc/1864768460005838848/artifacts/e2e-aws-manual-oidc/gather-extra/artifacts/pods/openshift-cloud-credential-operator_cloud-credential-operator-64bcbb54cc-vzs5s_cloud-credential-operator.log | grep -1 ingress | tail -n13
--
time="2024-12-05T21:44:49Z" level=info msg="status has changed, updating" controller=credreq cr=openshift-cloud-credential-operator/aws-ebs-csi-driver-operator secret=openshift-cluster-csi-drivers/ebs-cloud-credentials
time="2024-12-05T21:44:49Z" level=info msg="operator detects timed access token enabled cluster (STS, Workload Identity, etc.)" controller=credreq cr=openshift-cloud-credential-operator/openshift-ingress
time="2024-12-05T21:44:49Z" level=info msg="syncing credentials request" controller=credreq cr=openshift-cloud-credential-operator/openshift-ingress
time="2024-12-05T21:44:49Z" level=info msg="status has changed, updating" controller=credreq cr=openshift-cloud-credential-operator/openshift-ingress secret=openshift-ingress-operator/cloud-credentials
time="2024-12-05T21:44:49Z" level=info msg="reconciling clusteroperator status"
--
time="2024-12-05T21:44:51Z" level=info msg="reconciling clusteroperator status"
time="2024-12-05T21:44:52Z" level=info msg="operator detects timed access token enabled cluster (STS, Workload Identity, etc.)" controller=credreq cr=openshift-cloud-credential-operator/openshift-ingress
time="2024-12-05T21:44:52Z" level=info msg="syncing credentials request" controller=credreq cr=openshift-cloud-credential-operator/openshift-ingress
time="2024-12-05T21:44:52Z" level=info msg="reconciling clusteroperator status"
time="2024-12-05T21:44:52Z" level=info msg="status has changed, updating" controller=credreq cr=openshift-cloud-credential-operator/openshift-ingress secret=openshift-ingress-operator/cloud-credentials
time="2024-12-05T21:44:52Z" level=info msg="operator detects timed access token enabled cluster (STS, Workload Identity, etc.)" controller=credreq cr=openshift-cloud-credential-operator/openshift-machine-api-aws

With this commit, we'll perform the same back-off in the STS case too, to avoid flooding the Kube API server with status.lastSyncTimestamp updates.

[1]: openshift@a6d385a#diff-69794ca0db76a04660e3355ba9b824f34e7af1030d0a8114903d11847201c410R46
[2]: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cloud-credential-operator/789/pull-ci-openshift-cloud-credential-operator-master-e2e-aws-manual-oidc/1864768460005838848
…rsion for STS Before this commit, STS updates didn't bump lastSyncInfrastructureResourceVersion, so isInfrastructureUpdated was always false and the controller thought it needed to update the CredentialsRequest status every time. With this commit, STS CredentialsRequests will have lastSyncInfrastructureResourceVersion updated (just like non-STS requests), so we have a chance at 'lastsyncgeneration is current...' cool-off periods (as long as all the other cool-off conditions are also met).
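The mechanics of that fix can be modeled with a simplified stand-in for the real status struct (field name follows the lastSyncInfrastructureResourceVersion mentioned above; the helpers are illustrative, not the operator's API):

```go
package main

import "fmt"

// credReqStatus keeps only the field relevant to the fix.
type credReqStatus struct {
	LastSyncInfrastructureResourceVersion string
}

// isInfrastructureUpdated reports whether the Infrastructure object changed
// since the last sync. Before the fix, the STS path never recorded the
// version, so this stayed true forever and every reconcile wrote status.
func isInfrastructureUpdated(st credReqStatus, infraResourceVersion string) bool {
	return st.LastSyncInfrastructureResourceVersion != infraResourceVersion
}

// markSynced is the fix, sketched: record the observed version on the STS
// path too, just like the non-STS path already did.
func markSynced(st *credReqStatus, infraResourceVersion string) {
	st.LastSyncInfrastructureResourceVersion = infraResourceVersion
}

func main() {
	var st credReqStatus
	fmt.Println(isInfrastructureUpdated(st, "524")) // prints true: pre-fix STS behavior, every time
	markSynced(&st, "524")
	fmt.Println(isInfrastructureUpdated(st, "524")) // prints false: cool-off can now engage
}
```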
Force-pushed from ebfa76e to 81fd3d0
/test e2e-aws-manual-oidc
Force-pushed from 81fd3d0 to afc457a
/test e2e-aws-manual-oidc
Even when awsSTSIAMRoleARN is empty, we want the label so that pkg/cmd/operator's NewOperator's filteredWatchPossible label-selector can find these Secrets. Then the controller will notice if they're deleted (so it can update the CredentialsRequest status to point that out) or when they haven't been changed (so it can avoid "I can't find the Secret!" overly-frequent bumping in the hasRecentlySynced calculation, because it thinks crSecretExists=false). And we want the annotation, so it's clear why the Secret needs to exist (because of the annotation-referenced CredentialsRequest).

The risk here is that we might end up contending over label/annotation presence with the external controller that is populating the 'credentials' data inside the Secret. But the alternative of an unfiltered Secret informer in the client is still too resource-intensive, as described in the filteredWatchPossible comment and the a58a09c (*: use a filtered LIST + WATCH on Secrets for AWS STS, 2023-06-29, openshift#545) commit that added the filteredWatchPossible logic. Additional labels and annotations are properties that external controllers should be able to accept. For example, [1] has ArgoCD discussing:

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
spec:
  # (...)
  preservedFields:
    annotations: ["my-custom-annotation"]
    labels: ["my-custom-label"]

to ignore annotations and labels injected by external-to-ArgoCD controllers, which is what the CCO-specific annotation/label I'm touching now would be.

Moving to 48d6ccc (pkg/operator: correctly fetch CA for AWS minter, 2023-07-19, openshift#575)'s LiveClient avoids confusing CreateOrPatch. With the cached .Client, it would have:

1. Failed to retrieve an unlabeled Secret, because the externally-created Secret lacked the label that the Client's filteredWatchPossible informer is filtered on.
2. Thought that it should Create a new Secret.
3. Had that Create attempt fail on 'secrets "$NAME" already exists'.

With the LiveClient, that becomes:

1. Successfully retrieved an unlabeled Secret, with the uncached reader.
2. Thought that it should Patch the Secret.
3. Successfully Patched the Secret.
4. Once the Patch sets the label, future attempts to Get the Secret through the filtered informer cache will succeed.

[1]: https://argo-cd.readthedocs.io/en/release-2.13/operator-manual/applicationset/Controlling-Resource-Modification/#preserving-changes-made-to-an-applications-annotations-and-labels
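The cached-versus-live Get behavior that confuses CreateOrPatch can be modeled with a toy sketch. None of these names are controller-runtime APIs; "cached" stands in for the informer filtered on the filteredWatchPossible label, and "live" for an uncached reader:

```go
package main

import "fmt"

// getFunc models a client's Get: it reports whether a Secret is visible.
type getFunc func(name string) bool

// createOrPatchDecision sketches the choice create-or-patch logic makes
// based on whether Get found the object.
func createOrPatchDecision(get getFunc, name string) string {
	if get(name) {
		return "patch"
	}
	return "create" // would then collide with an existing unlabeled Secret
}

func main() {
	existing := map[string]bool{"cloud-credentials": true} // exists, but unlabeled
	labeled := map[string]bool{}                           // nothing carries the label yet

	cached := func(name string) bool { return labeled[name] }
	live := func(name string) bool { return existing[name] }

	// Cached reader: misses the unlabeled Secret, decides to Create, and
	// that Create fails on `secrets "cloud-credentials" already exists`.
	fmt.Println(createOrPatchDecision(cached, "cloud-credentials")) // prints create

	// Live reader: sees the Secret and Patches it; once the Patch adds the
	// label, later cached Gets succeed too.
	fmt.Println(createOrPatchDecision(live, "cloud-credentials")) // prints patch
}
```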
Force-pushed from afc457a to 68bd94e
/test e2e-aws-manual-oidc
@wking: The following test failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
The e2e-aws-manual-oidc run looks good to me:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cloud-credential-operator/812/pull-ci-openshift-cloud-credential-operator-master-e2e-aws-manual-oidc/1875391744414060544/artifacts/e2e-aws-manual-oidc/gather-extra/artifacts/pods/openshift-cloud-credential-operator_cloud-credential-operator-d785b46fc-9tll9_cloud-credential-operator.log | grep secret=openshift-ingress-operator/cloud-credentials
time="2025-01-04T04:21:17Z" level=info msg="status has changed, updating" controller=credreq cr=openshift-cloud-credential-operator/openshift-ingress-gcp secret=openshift-ingress-operator/cloud-credentials
time="2025-01-04T04:21:17Z" level=info msg="status has changed, updating" controller=credreq cr=openshift-cloud-credential-operator/openshift-ingress-azure secret=openshift-ingress-operator/cloud-credentials
time="2025-01-04T04:21:19Z" level=info msg="adding finalizer: cloudcredential.openshift.io/deprovision" controller=credreq cr=openshift-cloud-credential-operator/openshift-ingress secret=openshift-ingress-operator/cloud-credentials
time="2025-01-04T04:21:25Z" level=info msg="status has changed, updating" controller=credreq cr=openshift-cloud-credential-operator/openshift-ingress secret=openshift-ingress-operator/cloud-credentials
time="2025-01-04T04:21:28Z" level=info msg="status has changed, updating" controller=credreq cr=openshift-cloud-credential-operator/openshift-ingress secret=openshift-ingress-operator/cloud-credentials

So a few bumps early in the cluster's install phase, and then quiet :). Kube API server audit logs are also much quieter now than they were before, without the thousands of update calls:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cloud-credential-operator/812/pull-ci-openshift-cloud-credential-operator-master-e2e-aws-manual-oidc/1875391744414060544/artifacts/e2e-aws-manual-oidc/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
$ zgrep -h '"resource":"credentialsrequests"' kube-apiserver/*audit*.log.gz | jq -r '.verb + " " + (.responseStatus.code | tostring) + " " + (.objectRef | .resource + " " + .namespace) + " " + .user.username + " " + .userAgent' | sort | uniq -c | sort -n | tail -n3
49 get 200 credentialsrequests openshift-cloud-credential-operator system:serviceaccount:openshift-must-gather-7qfrg:default Go-http-client/2.0
196 get 200 credentialsrequests openshift-cloud-credential-operator system:serviceaccount:kube-system:generic-garbage-collector kube-controller-manager/v1.31.3 (linux/amd64) kubernetes/3c62f73/system:serviceaccount:kube-system:generic-garbage-collector
1225 get 200 credentialsrequests openshift-cloud-credential-operator system:serviceaccount:openshift-cluster-version:default cluster-version-operator/v0.0.0 (linux/amd64) kubernetes/$Format
Happy CI, showing the issue fixed there, so:

/hold cancel
/test e2e-gco-manual-oidc e2e-azure-manual-oidc
/test e2e-gcp-manual-oidc |