OCPBUGS-9685: daemon: Always remove pending deployment before we do updates #3599
Conversation
We hit a confusing failure in https://issues.redhat.com/browse/OCPBUGS-8113 where the MCD will get stuck if deploying the RT kernel fails, because the switch to the RT kernel operates from the *booted* deployment state, but by default rpm-ostree wants to operate from pending. Move up the "cleanup pending deployment on failure" `defer` to right before we do anything else.
The RT kernel switch logic operates from the *booted* deployment, not pending. I had in my head that the MCO always cleaned up pending, but due to another bug we didn't. There's no reason to leave this cleanup to a defer; do it before we do anything else. (But keep the defer because it's cleaner to *also* clean up if we fail.)
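To make the ordering concrete, here is a minimal Go sketch of the pattern described above. This is not the actual MCO code: applyOSChanges, doUpdate, and the stub removePendingDeployment are illustrative names, and the real helper shells out to rpm-ostree.

// Sketch only: illustrates "clean up pending first, and keep the deferred
// cleanup on failure". Names here are assumptions, not the MCO's real code.
package main

import "fmt"

// removePendingDeployment is a stand-in for the real helper, which discards
// any pending deployment (e.g. via `rpm-ostree cleanup -p`).
func removePendingDeployment() error { return nil }

// applyOSChanges sketches the fix: remove the pending deployment up front,
// and keep the deferred cleanup so a failed update also ends in a clean state.
func applyOSChanges(doUpdate func() error) (retErr error) {
	// The RT kernel switch operates on the *booted* deployment, so any
	// pending deployment must be removed before anything else happens.
	if err := removePendingDeployment(); err != nil {
		return fmt.Errorf("removing pending deployment: %w", err)
	}
	defer func() {
		// Also clean up on failure, so the next retry starts from the
		// booted deployment again instead of a stale pending one.
		if retErr != nil {
			if err := removePendingDeployment(); err != nil {
				retErr = fmt.Errorf("%w; cleanup also failed: %v", retErr, err)
			}
		}
	}()
	return doUpdate()
}

func main() {
	err := applyOSChanges(func() error {
		// ... package layering / RT kernel switch would happen here ...
		return nil
	})
	fmt.Println("update error:", err)
}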
I'm uncertain whether to call this the re-fixed version of the code for OCPBUGS-8113 - we have discovered a further underlying problem in https://issues.redhat.com/browse/OCPBUGS-9685 in that librpm is segfaulting. But... it does seem likely to me that fixing the retry loop will paper over whatever race condition (or possibly memory corruption? 😢) is leading librpm to segfault...
/payload-job periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-gcp-ovn-rt-upgrade
@cgwalters: This pull request references Jira Issue OCPBUGS-8113, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
Requesting review from QA contact. The bug has been updated to refer to the pull request using the external bug tracker.
OK upon reflection I think let's link this to OCPBUGS-8113.
@cgwalters: An error was encountered. No known errors were detected, please see the full error message for details.
could not create PullRequestPayloadQualificationRun: Post "https://172.30.0.1:443/apis/ci.openshift.io/v1/namespaces/ci/pullrequestpayloadqualificationruns": net/http: TLS handshake timeout
Please contact an administrator to resolve this issue.
Side note: There's a huge overlap between rpm-ostree's "pending deployment" logic and what the MCD is juggling externally to that with "current/pending/desired" machineconfig hashes. I strongly believe again that the right fix here is pushing config management down into the OS layer - then this bug would have never happened, because the system as a whole would always either be in state A or state B.
/payload-job periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-gcp-ovn-rt-upgrade
@cgwalters: trigger 1 job(s) for the /payload-(job|aggregate) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b22a21a0-be9a-11ed-8ca2-fef6efbc6fe7-0
/payload-aggregate periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-gcp-ovn-rt-upgrade
overall lgtm, should be a safe operation since we shouldn't be removing any ongoing updates. Quick question below
@@ -2120,6 +2120,28 @@ func (dn *CoreOSDaemon) applyLayeredOSChanges(mcDiff machineConfigDiff, oldConfi
    defer os.Remove(extensionsRepo)
}

// Always clean up pending, because the RT kernel switch logic below operates on booted,
// not pending.
if err := removePendingDeployment(); err != nil {
In what ways can cleanup -p fail? I assume it won't error if there isn't a pending deployment?
Yeah, it doesn't give an error. Tested it on a local machine:
$ rpm-ostree cleanup -p
Deployments unchanged.
$ echo $?
0
Right, it's idempotent. Also crucially, this code path is executed by CI, so if it didn't work, CI would fail.
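For illustration, here is a minimal Go sketch (an assumption, not the MCO's actual helper; cleanupPending is a hypothetical name) that relies on exactly this behavior: since `rpm-ostree cleanup -p` is idempotent and exits 0 when there is nothing to clean up, the caller only needs to handle a real failure.

// Sketch only: a thin wrapper around `rpm-ostree cleanup -p`.
package main

import (
	"fmt"
	"os/exec"
)

// cleanupPending runs `rpm-ostree cleanup -p`. Because the command is
// idempotent and exits 0 when there is no pending deployment ("Deployments
// unchanged."), a non-nil error here means rpm-ostree actually failed.
func cleanupPending() error {
	out, err := exec.Command("rpm-ostree", "cleanup", "-p").CombinedOutput()
	if err != nil {
		return fmt.Errorf("rpm-ostree cleanup -p: %w (output: %s)", err, out)
	}
	return nil
}

func main() {
	if err := cleanupPending(); err != nil {
		fmt.Println("cleanup failed:", err)
	}
}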
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: cgwalters, sinnykumari, yuqi-zhang. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Putting a hold as we are waiting for the payload test to finish: https://pr-payload-tests.ci.openshift.org/runs/ci/b22a21a0-be9a-11ed-8ca2-fef6efbc6fe7-0
@cgwalters: This pull request references Jira Issue OCPBUGS-9685, which is invalid.
The bug has been updated to refer to the pull request using the external bug tracker.
OK I've been digging into this more and I now believe indeed we have two distinct bugs; they just happened to fail with similar symptoms. I've retargeted this change at https://issues.redhat.com/browse/OCPBUGS-9685 because it is definitely aiming to fix the issue we saw in the aggregated periodic.
@cgwalters: This pull request references Jira Issue OCPBUGS-9685, which is valid. 3 validation(s) were run on this bug.
Requesting review from QA contact.
/retest-required
Ah yes! I found evidence of this working in the job; rpm-ostree segfaulted on one node, but we successfully retried. Look at this journal:
Yet, the MCD retry must have worked. Unfortunately we don't have the previous pod logs from the MCD, but the success of the payload job combined with this evidence leads me to believe the fix is working.
/skip
/cherrypick release-4.13
@cgwalters: once the present PR merges, I will cherry-pick it on top of release-4.13 in a new PR and assign it to you.
@cgwalters: Jira Issue OCPBUGS-9685: All pull requests linked via external trackers have merged. Jira Issue OCPBUGS-9685 has been moved to the MODIFIED state.
@cgwalters: The following tests failed:
@cgwalters: new pull request created: #3601
This is a followup to #3580 - we're fixing the case where deploying the RT kernel fails and we want to retry.
daemon: Move cleanup of pending deployment earlier
We hit a confusing failure in https://issues.redhat.com/browse/OCPBUGS-8113
where the MCD will get stuck if deploying the RT kernel fails, because
the switch to the RT kernel operates from the booted deployment
state, but by default rpm-ostree wants to operate from pending.
Move up the "cleanup pending deployment on failure" `defer` to right
before we do anything else.
daemon: Always remove pending deployment before we do updates
The RT kernel switch logic operates from the booted deployment,
not pending. I had in my head that the MCO always cleaned up
pending, but due to another bug we didn't.
There's no reason to leave this cleanup to a defer; do it
before we do anything else.
(But keep the defer because it's cleaner to also clean up if
we fail)