Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CCO-626: pkg/operator/utils: Log diff on CredentialsRequest status change #811

Conversation

wking
Copy link
Member

@wking wking commented Dec 20, 2024

status has changed, updating shows that something is changing. But without a diff, it's hard to figure out what. For example in this recent CI run:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cloud-credential-operator/789/pull-ci-openshift-cloud-credential-operator-master-e2e-aws-manual-oidc/1864768460005838848/artifacts/e2e-aws-manual-oidc/gather-extra/artifacts/pods/openshift-cloud-credential-operator_cloud-credential-operator-64bcbb54cc-vzs5s_cloud-credential-operator.log | grep -1 ingress | tail -n13
--
time="2024-12-05T21:44:49Z" level=info msg="status has changed, updating" controller=credreq cr=openshift-cloud-credential-operator/aws-ebs-csi-driver-operator secret=openshift-cluster-csi-drivers/ebs-cloud-credentials
time="2024-12-05T21:44:49Z" level=info msg="operator detects timed access token enabled cluster (STS, Workload Identity, etc.)" controller=credreq cr=openshift-cloud-credential-operator/openshift-ingress
time="2024-12-05T21:44:49Z" level=info msg="syncing credentials request" controller=credreq cr=openshift-cloud-credential-operator/openshift-ingress
time="2024-12-05T21:44:49Z" level=info msg="status has changed, updating" controller=credreq cr=openshift-cloud-credential-operator/openshift-ingress secret=openshift-ingress-operator/cloud-credentials
time="2024-12-05T21:44:49Z" level=info msg="reconciling clusteroperator status"
--
time="2024-12-05T21:44:51Z" level=info msg="reconciling clusteroperator status"
time="2024-12-05T21:44:52Z" level=info msg="operator detects timed access token enabled cluster (STS, Workload Identity, etc.)" controller=credreq cr=openshift-cloud-credential-operator/openshift-ingress
time="2024-12-05T21:44:52Z" level=info msg="syncing credentials request" controller=credreq cr=openshift-cloud-credential-operator/openshift-ingress
time="2024-12-05T21:44:52Z" level=info msg="reconciling clusteroperator status"
time="2024-12-05T21:44:52Z" level=info msg="status has changed, updating" controller=credreq cr=openshift-cloud-credential-operator/openshift-ingress secret=openshift-ingress-operator/cloud-credentials
time="2024-12-05T21:44:52Z" level=info msg="operator detects timed access token enabled cluster (STS, Workload Identity, etc.)" controller=credreq cr=openshift-cloud-credential-operator/openshift-machine-api-aws

Is that a hot loop? Is something really changing? Hard to debug.

With this commit, we'll log the change we make. It will increase the log level when there's a hot loop, but it will also make it easier for us to identify and fix hot loops, so overall log verbosity should go down (more text per update * orders of magnitude fewer updates).

@openshift-ci openshift-ci bot requested review from dlom and suhanime December 20, 2024 19:51
@wking wking force-pushed the log-diff-on-CredentialsRequest-status-change branch 2 times, most recently from 06797ff to e61e7f0 Compare December 20, 2024 20:42
Copy link

codecov bot commented Dec 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 47.01%. Comparing base (2395cbc) to head (056020e).
Report is 5 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #811   +/-   ##
=======================================
  Coverage   47.01%   47.01%           
=======================================
  Files          97       97           
  Lines       11873    11874    +1     
=======================================
+ Hits         5582     5583    +1     
  Misses       5676     5676           
  Partials      615      615           
Files with missing lines Coverage Δ
pkg/operator/utils/utils.go 83.13% <100.00%> (+0.06%) ⬆️

@wking
Copy link
Member Author

wking commented Dec 20, 2024

/test e2e-aws-manual-oidc

@wking
Copy link
Member Author

wking commented Dec 23, 2024

e2e-aws-manual-oidc:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cloud-credential-operator/811/pull-ci-openshift-cloud-credential-operator-master-e2e-aws-manual-oidc/1870219565347115008/artifacts/e2e-aws-manual-oidc/gather-extra/artifacts/pods/openshift-cloud-credential-operator_cloud-credential-operator-6f9c9d8865-5rnr2_cloud-credential-operator.log | gre
p 'due to diff' | tail -n1
time="2024-12-20T22:49:10Z" level=info msg="Updating CredentialsRequest openshift-cloud-credential-operator/openshift-cluster-api-aws status due to diff: \u00a0\u00a0v1.CredentialsRequestStatus{\n\u00a0\u00a0\tProvisioned:                             true,\n-\u00a0\tLastSyncTimestamp:                       s\"2024-12-20 22:49:07 +0000 UTC\",\n+\u00a0\tLastSyncTimestamp:                       s\"2024-12-20 22:49:10 +0000 UTC\",\n\u00a0\u00a0\tLastSyncGeneration:                      1,\n\u00a0\u00a0\tLastSyncCloudCredsSecretResourceVersion: \"\",\n\u00a0\u00a0\t... // 3 identical fields\n\u00a0\u00a0}\n" controller=credreq cr=openshift-cloud-credential-operator/openshift-cluster-api-aws secret=openshift-cluster-api/capa-manager-bootstrap-credentials

Shuffle around with jq to expand those escapes:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cloud-credential-operator/811/pull-ci-openshift-cloud-credential-operator-master-e2e-aws-manual-oidc/1870219565347115008/artifacts/e2e-aws-manual-oidc/gather-extra/artifacts/pods/openshift-cloud-credential-operator_cloud-credential-operator-6f9c9d8865-5rnr2_cloud-credential-operator.log | grep 'due to diff' | tail -n1 | sed 's/.*msg=//;s/ controller=credreq.*//' | jq -r .
Updating CredentialsRequest openshift-cloud-credential-operator/openshift-cluster-api-aws status due to diff:   v1.CredentialsRequestStatus{
        Provisioned:                             true,
-       LastSyncTimestamp:                       s"2024-12-20 22:49:07 +0000 UTC",
+       LastSyncTimestamp:                       s"2024-12-20 22:49:10 +0000 UTC",
        LastSyncGeneration:                      1,
        LastSyncCloudCredsSecretResourceVersion: "",
        ... // 3 identical fields
  }

So as expected (see discussion in #812), it's just the LastSyncTimestamp churn. Still, I think it's helpful to have these diffs logged to reduce the need for guesswork, and the CI run confirms that the diff-logging is working. I think we're good to go for this pull.

@jstuever
Copy link
Contributor

jstuever commented Jan 2, 2025

/assign

@wking wking force-pushed the log-diff-on-CredentialsRequest-status-change branch from 042bbc9 to 0bfbfe7 Compare January 2, 2025 19:17
@wking wking changed the title pkg/operator/utils: Log diff on CredentialsRequest status change CCO-626: pkg/operator/utils: Log diff on CredentialsRequest status change Jan 2, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 2, 2025
@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Jan 2, 2025

@wking: This pull request references CCO-626 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

status has changed, updating shows that something is changing. But without a diff, it's hard to figure out what. For example in this recent CI run:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cloud-credential-operator/789/pull-ci-openshift-cloud-credential-operator-master-e2e-aws-manual-oidc/1864768460005838848/artifacts/e2e-aws-manual-oidc/gather-extra/artifacts/pods/openshift-cloud-credential-operator_cloud-credential-operator-64bcbb54cc-vzs5s_cloud-credential-operator.log | grep -1 ingress | tail -n13
--
time="2024-12-05T21:44:49Z" level=info msg="status has changed, updating" controller=credreq cr=openshift-cloud-credential-operator/aws-ebs-csi-driver-operator secret=openshift-cluster-csi-drivers/ebs-cloud-credentials
time="2024-12-05T21:44:49Z" level=info msg="operator detects timed access token enabled cluster (STS, Workload Identity, etc.)" controller=credreq cr=openshift-cloud-credential-operator/openshift-ingress
time="2024-12-05T21:44:49Z" level=info msg="syncing credentials request" controller=credreq cr=openshift-cloud-credential-operator/openshift-ingress
time="2024-12-05T21:44:49Z" level=info msg="status has changed, updating" controller=credreq cr=openshift-cloud-credential-operator/openshift-ingress secret=openshift-ingress-operator/cloud-credentials
time="2024-12-05T21:44:49Z" level=info msg="reconciling clusteroperator status"
--
time="2024-12-05T21:44:51Z" level=info msg="reconciling clusteroperator status"
time="2024-12-05T21:44:52Z" level=info msg="operator detects timed access token enabled cluster (STS, Workload Identity, etc.)" controller=credreq cr=openshift-cloud-credential-operator/openshift-ingress
time="2024-12-05T21:44:52Z" level=info msg="syncing credentials request" controller=credreq cr=openshift-cloud-credential-operator/openshift-ingress
time="2024-12-05T21:44:52Z" level=info msg="reconciling clusteroperator status"
time="2024-12-05T21:44:52Z" level=info msg="status has changed, updating" controller=credreq cr=openshift-cloud-credential-operator/openshift-ingress secret=openshift-ingress-operator/cloud-credentials
time="2024-12-05T21:44:52Z" level=info msg="operator detects timed access token enabled cluster (STS, Workload Identity, etc.)" controller=credreq cr=openshift-cloud-credential-operator/openshift-machine-api-aws

Is that a hot loop? Is something really changing? Hard to debug.

With this commit, we'll log the change we make. It will increase the log level when there's a hot loop, but it will also make it easier for us to identify and fix hot loops, so overall log verbosity should go down (more text per update * orders of magnitude fewer updates).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@wking wking force-pushed the log-diff-on-CredentialsRequest-status-change branch from 0bfbfe7 to c0304f0 Compare January 2, 2025 19:33
@jstuever
Copy link
Contributor

jstuever commented Jan 2, 2025

/lgtm
/approve

@openshift-ci openshift-ci bot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jan 2, 2025
@wking
Copy link
Member Author

wking commented Jan 2, 2025

e2e-aws-ovn:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cloud-credential-operator/811/pull-ci-openshift-cloud-credential-operator-master-e2e-aws-ovn/1874902068603392000/artifacts/e2e-aws-ovn/gather-extra/artifacts/pods/openshift-cloud-credential-operator_cloud-credential-operator-6d54b8c4d6-f5szz_cloud-credential-operator.log | grep -A1 'Updating CredentialsRequest .* status' | head -n2
time="2025-01-02T21:24:30Z" level=info msg="Updating CredentialsRequest openshift-cloud-credential-operator/cloud-credential-operator-iam-ro status" controller=credreq cr=openshift-cloud-credential-operator/cloud-credential-operator-iam-ro secret=openshift-cloud-credential-operator/cloud-credential-operator-iam-ro-creds
time="2025-01-02T21:24:30Z" level=info msg="syncing credentials request" controller=credreq cr=openshift-cloud-credential-operator/openshift-cloud-network-config-controller-aws

So that's got the info line and not the debug line in this cluster with the default operator log level, as expected. And it looks like the namespace/name I'm adding to the message overlaps with the cr=... structured log data that the logger already has, probably from early in ReconcileCredentialsRequest.Reconcile. Possibly I'm just dense, and wasn't able to unpack cr= as CredentialsRequest=?

'status has changed, updating' shows that *something* is changing.
But without a diff, it's hard to figure out what.  For example in this
recent CI run [1]:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cloud-credential-operator/789/pull-ci-openshift-cloud-credential-operator-master-e2e-aws-manual-oidc/1864768460005838848/artifacts/e2e-aws-manual-oidc/gather-extra/artifacts/pods/openshift-cloud-credential-operator_cloud-credential-operator-64bcbb54cc-vzs5s_cloud-credential-operator.log | grep -1 ingress | tail -n13
  --
  time="2024-12-05T21:44:49Z" level=info msg="status has changed, updating" controller=credreq cr=openshift-cloud-credential-operator/aws-ebs-csi-driver-operator secret=openshift-cluster-csi-drivers/ebs-cloud-credentials
  time="2024-12-05T21:44:49Z" level=info msg="operator detects timed access token enabled cluster (STS, Workload Identity, etc.)" controller=credreq cr=openshift-cloud-credential-operator/openshift-ingress
  time="2024-12-05T21:44:49Z" level=info msg="syncing credentials request" controller=credreq cr=openshift-cloud-credential-operator/openshift-ingress
  time="2024-12-05T21:44:49Z" level=info msg="status has changed, updating" controller=credreq cr=openshift-cloud-credential-operator/openshift-ingress secret=openshift-ingress-operator/cloud-credentials
  time="2024-12-05T21:44:49Z" level=info msg="reconciling clusteroperator status"
  --
  time="2024-12-05T21:44:51Z" level=info msg="reconciling clusteroperator status"
  time="2024-12-05T21:44:52Z" level=info msg="operator detects timed access token enabled cluster (STS, Workload Identity, etc.)" controller=credreq cr=openshift-cloud-credential-operator/openshift-ingress
  time="2024-12-05T21:44:52Z" level=info msg="syncing credentials request" controller=credreq cr=openshift-cloud-credential-operator/openshift-ingress
  time="2024-12-05T21:44:52Z" level=info msg="reconciling clusteroperator status"
  time="2024-12-05T21:44:52Z" level=info msg="status has changed, updating" controller=credreq cr=openshift-cloud-credential-operator/openshift-ingress secret=openshift-ingress-operator/cloud-credentials
  time="2024-12-05T21:44:52Z" level=info msg="operator detects timed access token enabled cluster (STS, Workload Identity, etc.)" controller=credreq cr=openshift-cloud-credential-operator/openshift-machine-api-aws

Is that a hot loop?  Is something really changing?  Hard to debug.

With this commit, we'll log the change we make.  It will increase the
log level when there's a hot loop, but it will also make it easier for
us to identify and fix hot loops, so overall log verbosity should go
down (more text per update * orders of magnitude fewer updates).

We don't need to explicitly include the CredentialsRequest namespace
and name here, because earlier in the stack,
ReconcileCredentialsRequest.Reconcile sets that context with:

  logger := log.WithFields(log.Fields{
    "controller": controllerName,
    "cr":         fmt.Sprintf("%s/%s", request.NamespacedName.Namespace, request.NamespacedName.Name),
  })

as you can see in the
cr=openshift-cloud-credential-operator/openshift-ingress rendering in
the "status has changed" lines I quoted above.

[1]: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cloud-credential-operator/789/pull-ci-openshift-cloud-credential-operator-master-e2e-aws-manual-oidc/1864768460005838848
@wking wking force-pushed the log-diff-on-CredentialsRequest-status-change branch from c0304f0 to 5a86535 Compare January 3, 2025 17:25
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 3, 2025
I think it's worth addressing hotlooping issues, and that they'll be
rare enough that we don't have to worry too much about the log level
of the diff.  But Jeremiah is concerned about log volume, and we do
have a hot loop in [1] today that hasn't been fixed yet.  This commit
pushes the diff-rendering down to the debug level.  Users can set
cloudcredentials.operator.openshift.io spec.logLevel to Debug or
higher to see the diff.

[1]: https://issues.redhat.com/browse/OCPBUGS-47505
@wking wking force-pushed the log-diff-on-CredentialsRequest-status-change branch from 5a86535 to 056020e Compare January 3, 2025 17:32
@jstuever
Copy link
Contributor

jstuever commented Jan 3, 2025

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 3, 2025
Copy link
Contributor

openshift-ci bot commented Jan 3, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jstuever, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD ac90616 and 2 for PR HEAD 056020e in total

Copy link
Contributor

openshift-ci bot commented Jan 6, 2025

@wking: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 8160a5b into openshift:master Jan 6, 2025
12 checks passed
@wking wking deleted the log-diff-on-CredentialsRequest-status-change branch January 6, 2025 20:43
@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-cloud-credential-operator
This PR has been included in build ose-cloud-credential-operator-container-v4.19.0-202501070436.p0.g8160a5b.assembly.stream.el9.
All builds following this will include this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants