
OCPBUGS-9685: daemon: Always remove pending deployment before we do updates #3599

Merged
merged 2 commits on Mar 9, 2023

Conversation

cgwalters
Member

This is followup to #3580 - we're fixing the case where deploying the RT kernel fails and we want to retry.


daemon: Move cleanup of pending deployment earlier

We hit a confusing failure in https://issues.redhat.com/browse/OCPBUGS-8113
where the MCD will get stuck if deploying the RT kernel fails, because
the switch to the RT kernel operates from the booted deployment
state, but by default rpm-ostree wants to operate from pending.

Move up the "cleanup pending deployment on failure" defer to
right before we do anything else.


daemon: Always remove pending deployment before we do updates

The RT kernel switch logic operates from the booted deployment,
not pending. I had in my head that the MCO always cleaned up
pending, but due to another bug we didn't.

There's no reason to leave this cleanup to a defer; do it
before we do anything else.

(But keep the defer because it's cleaner to also cleanup if
we fail)


@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 9, 2023
@cgwalters
Member Author

I'm uncertain whether to call this the re-fixed version of the code for OCPBUGS-8113: we've discovered a further underlying problem in https://issues.redhat.com/browse/OCPBUGS-9685, namely that librpm is segfaulting.

But...it does seem likely to me that fixing the retry loop will paper over whatever race condition (or possibly memory corruption? 😢 ) is leading librpm to segfault...

/payload-job periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-gcp-ovn-rt-upgrade

@cgwalters cgwalters changed the title daemon: Always remove pending deployment before we do updates OCPBUGS-8113: daemon: Always remove pending deployment before we do updates Mar 9, 2023
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Mar 9, 2023
@openshift-ci-robot
Contributor

@cgwalters: This pull request references Jira Issue OCPBUGS-8113, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

The bug has been updated to refer to the pull request using the external bug tracker.


@cgwalters
Member Author

OK upon reflection I think let's link this to OCPBUGS-8113.

@openshift-ci openshift-ci bot requested a review from sergiordlr March 9, 2023 16:10
@openshift-ci
Contributor

openshift-ci bot commented Mar 9, 2023

@cgwalters: An error was encountered. No known errors were detected, please see the full error message for details.

Full error message. could not create PullRequestPayloadQualificationRun: Post "https://172.30.0.1:443/apis/ci.openshift.io/v1/namespaces/ci/pullrequestpayloadqualificationruns": net/http: TLS handshake timeout

Please contact an administrator to resolve this issue.

@cgwalters
Member Author

Side note: There's a huge overlap between rpm-ostree's "pending deployment" logic and what the MCD juggles externally to that with its "current/pending/desired" machineconfig hashes.

I again strongly believe the right fix here is pushing config management down into the OS layer. This bug would then never have happened, because the system as a whole would always be in either state A or state B.

@cgwalters
Member Author

/payload-job periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-gcp-ovn-rt-upgrade

@openshift-ci
Contributor

openshift-ci bot commented Mar 9, 2023

@cgwalters: trigger 1 job(s) for the /payload-(job|aggregate) command

  • periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-gcp-ovn-rt-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b22a21a0-be9a-11ed-8ca2-fef6efbc6fe7-0

@cgwalters
Member Author

/payload-aggregate periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-gcp-ovn-rt-upgrade

Contributor

@yuqi-zhang yuqi-zhang left a comment

overall lgtm, should be a safe operation since we shouldn't be removing any ongoing updates. Quick question below

@@ -2120,6 +2120,28 @@ func (dn *CoreOSDaemon) applyLayeredOSChanges(mcDiff machineConfigDiff, oldConfi
defer os.Remove(extensionsRepo)
}

// Always clean up pending, because the RT kernel switch logic below operates on booted,
// not pending.
if err := removePendingDeployment(); err != nil {
Contributor

In what ways can cleanup -p fail? I assume it won't error if there isn't a pending deployment?

Contributor

Yeah, it doesn't give an error; I tested it on a local machine:

$ rpm-ostree cleanup -p
Deployments unchanged.
$ echo $?
0

Member Author

Right, it's idempotent. Also crucially, this code path is executed by CI, so if it didn't work, CI would fail.
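
For illustration, here is a minimal Go sketch of how such a helper could shell out and why it is safe to call unconditionally. This is an assumption-laden sketch, not the real MCO helper: the binary is a parameter purely so the example runs without rpm-ostree installed (the real call would be to `rpm-ostree cleanup -p`, which exits 0 and prints "Deployments unchanged." when there is nothing to remove).

```go
package main

import (
	"fmt"
	"os/exec"
)

// runCleanup shells out to remove the pending deployment. Because the real
// command (`rpm-ostree cleanup -p`) exits 0 even when no pending deployment
// exists, the caller never needs to check for one first: the operation is
// idempotent and can simply be run unconditionally.
func runCleanup(binary string, args ...string) error {
	out, err := exec.Command(binary, args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("%s %v failed: %w: %s", binary, args, err, out)
	}
	return nil
}

func main() {
	// Stand-in for `rpm-ostree cleanup -p`: a command that always succeeds.
	// Calling it repeatedly is fine precisely because it is idempotent.
	for i := 0; i < 2; i++ {
		if err := runCleanup("true"); err != nil {
			fmt.Println("cleanup failed:", err)
			return
		}
	}
	fmt.Println("cleanup is safe to call unconditionally")
}
```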

Contributor

@sinnykumari sinnykumari left a comment

/lgtm


@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 9, 2023
@openshift-ci
Contributor

openshift-ci bot commented Mar 9, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, sinnykumari, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [cgwalters,sinnykumari,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sinnykumari
Contributor

Putting a hold on while we wait for the payload test to finish: https://pr-payload-tests.ci.openshift.org/runs/ci/b22a21a0-be9a-11ed-8ca2-fef6efbc6fe7-0
/hold
Once the payload test is green, this should be good to merge.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 9, 2023
@cgwalters cgwalters changed the title OCPBUGS-8113: daemon: Always remove pending deployment before we do updates OCPBUGS-9685: daemon: Always remove pending deployment before we do updates Mar 9, 2023
@openshift-ci-robot openshift-ci-robot added jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. and removed jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Mar 9, 2023
@openshift-ci-robot
Contributor

@cgwalters: This pull request references Jira Issue OCPBUGS-9685, which is invalid:

  • expected the bug to target the "4.14.0" version, but it targets "4.13.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


@cgwalters
Member Author

cgwalters commented Mar 9, 2023

OK, I've been digging into this more, and I now believe we do indeed have two distinct bugs; they just happened to fail with similar symptoms.

I've retargeted this change at https://issues.redhat.com/browse/OCPBUGS-9685
/jira refresh

because it is definitely aiming to fix the issue we saw in the aggregated periodic, which is
actually distinct from the bug linked to #3600

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 9, 2023
@openshift-ci-robot
Contributor

@cgwalters: This pull request references Jira Issue OCPBUGS-9685, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @rioliu-rh


@sdodson
Member

sdodson commented Mar 9, 2023

/retest-required

@cgwalters
Member Author

Ah yes! I found evidence of this working in the job; rpm-ostree segfaulted on one node, but we successfully retried. Look at this journal:

Mar 09 19:16:54.479002 ci-op-rclyt598-1b6f8-g2sjq-worker-c-mqd8w systemd-coredump[102033]: Process 85197 (rpm-ostree) of user 0 dumped core.
Mar 09 19:16:54.512241 ci-op-rclyt598-1b6f8-g2sjq-worker-c-mqd8w systemd[1]: rpm-ostreed.service: Main process exited, code=killed, status=11/SEGV

Yet, the MCD retry must have worked. Unfortunately we don't have the previous pod logs from the MCD, but the success of the payload job combined with this evidence leads me to
/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 9, 2023
@cgwalters
Member Author

/skip

@cgwalters
Member Author

/cherrypick release-4.13

@openshift-cherrypick-robot

@cgwalters: once the present PR merges, I will cherry-pick it on top of release-4.13 in a new PR and assign it to you.


@sdodson sdodson merged commit 539a2b1 into openshift:master Mar 9, 2023
@openshift-ci-robot
Contributor

@cgwalters: Jira Issue OCPBUGS-9685: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-9685 has been moved to the MODIFIED state.


@openshift-ci
Contributor

openshift-ci bot commented Mar 9, 2023

@cgwalters: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

  • ci/prow/okd-scos-e2e-aws-ovn (commit 112277e, not required): /test okd-scos-e2e-aws-ovn
  • ci/prow/okd-scos-e2e-gcp-ovn-upgrade (commit 112277e, not required): /test okd-scos-e2e-gcp-ovn-upgrade
  • ci/prow/e2e-hypershift (commit 112277e, not required): /test e2e-hypershift
  • ci/prow/e2e-alibabacloud-ovn (commit 112277e, not required): /test e2e-alibabacloud-ovn



@openshift-cherrypick-robot

@cgwalters: new pull request created: #3601

