
OCPBUGS-9685: daemon: Always remove pending deployment before we do updates #3599

Merged
merged 2 commits on Mar 9, 2023

Conversation

cgwalters
Member

This is followup to #3580 - we're fixing the case where deploying the RT kernel fails and we want to retry.


daemon: Move cleanup of pending deployment earlier

We hit a confusing failure in https://issues.redhat.com/browse/OCPBUGS-8113
where the MCD will get stuck if deploying the RT kernel fails, because
the switch to the RT kernel operates from the booted deployment
state, but by default rpm-ostree wants to operate from pending.

Move up the "cleanup pending deployment on failure" defer to
right before we do anything else.


daemon: Always remove pending deployment before we do updates

The RT kernel switch logic operates from the booted deployment,
not pending. I had in my head that the MCO always cleaned up
pending, but due to another bug we didn't.

There's no reason to leave this cleanup to a defer; do it
before we do anything else.

(But keep the defer because it's cleaner to also cleanup if
we fail)


@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 9, 2023
@cgwalters
Member Author

I'm uncertain whether to call this the re-fixed version of the code for OCPBUGS-8113: we've discovered a further underlying problem in https://issues.redhat.com/browse/OCPBUGS-9685, namely that librpm is segfaulting.

But...it does seem likely to me that fixing the retry loop will paper over whatever race condition (or possibly memory corruption? 😢 ) is leading librpm to segfault...

/payload-job periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-gcp-ovn-rt-upgrade

@cgwalters cgwalters changed the title daemon: Always remove pending deployment before we do updates OCPBUGS-8113: daemon: Always remove pending deployment before we do updates Mar 9, 2023
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Mar 9, 2023
@openshift-ci-robot
Contributor

@cgwalters: This pull request references Jira Issue OCPBUGS-8113, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

The bug has been updated to refer to the pull request using the external bug tracker.


@cgwalters
Member Author

OK upon reflection I think let's link this to OCPBUGS-8113.

@openshift-ci openshift-ci bot requested a review from sergiordlr March 9, 2023 16:10
@openshift-ci
Contributor

openshift-ci bot commented Mar 9, 2023

@cgwalters: An error was encountered. No known errors were detected, please see the full error message for details.

Full error message. could not create PullRequestPayloadQualificationRun: Post "https://172.30.0.1:443/apis/ci.openshift.io/v1/namespaces/ci/pullrequestpayloadqualificationruns": net/http: TLS handshake timeout

Please contact an administrator to resolve this issue.

@cgwalters
Member Author

Side note: There's a huge overlap between rpm-ostree's "pending deployment" logic and what the MCD juggles externally to that with its "current/pending/desired" machineconfig hashes.

I again strongly believe the right fix here is pushing config management down into the OS layer. This bug would then never have happened, because the system as a whole would always be in either state A or state B.

@cgwalters
Member Author

/payload-job periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-gcp-ovn-rt-upgrade

@openshift-ci
Contributor

openshift-ci bot commented Mar 9, 2023

@cgwalters: trigger 1 job(s) for the /payload-(job|aggregate) command

  • periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-gcp-ovn-rt-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b22a21a0-be9a-11ed-8ca2-fef6efbc6fe7-0

@cgwalters
Member Author

/payload-aggregate periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-gcp-ovn-rt-upgrade

Contributor

@yuqi-zhang yuqi-zhang left a comment

overall lgtm, should be a safe operation since we shouldn't be removing any ongoing updates. Quick question below

@@ -2120,6 +2120,28 @@ func (dn *CoreOSDaemon) applyLayeredOSChanges(mcDiff machineConfigDiff, oldConfi
defer os.Remove(extensionsRepo)
}

// Always clean up pending, because the RT kernel switch logic below operates on booted,
// not pending.
if err := removePendingDeployment(); err != nil {
Contributor

In what ways can cleanup -p fail? I assume it won't error if there isn't a pending deployment?

Contributor

Yeah, it doesn't give an error; I tested it on a local machine:

$ rpm-ostree cleanup -p
Deployments unchanged.
$ echo $?
0

Member Author

Right, it's idempotent. Also crucially, this code path is executed by CI, so if it didn't work, CI would fail.
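
For illustration, here is a minimal Go sketch of how such a helper could shell out and why it is safe to call unconditionally. This is an assumption-laden sketch, not the real MCO helper: the binary is a parameter purely so the example runs without rpm-ostree installed (the real call would be to `rpm-ostree cleanup -p`, which exits 0 and prints "Deployments unchanged." when there is nothing to remove).

```go
package main

import (
	"fmt"
	"os/exec"
)

// runCleanup shells out to remove the pending deployment. Because the real
// command (`rpm-ostree cleanup -p`) exits 0 even when no pending deployment
// exists, the caller never needs to check for one first: the operation is
// idempotent and can simply be run unconditionally.
func runCleanup(binary string, args ...string) error {
	out, err := exec.Command(binary, args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("%s %v failed: %w: %s", binary, args, err, out)
	}
	return nil
}

func main() {
	// Stand-in for `rpm-ostree cleanup -p`: a command that always succeeds.
	// Calling it repeatedly is fine precisely because it is idempotent.
	for i := 0; i < 2; i++ {
		if err := runCleanup("true"); err != nil {
			fmt.Println("cleanup failed:", err)
			return
		}
	}
	fmt.Println("cleanup is safe to call unconditionally")
}
```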

Contributor

@sinnykumari sinnykumari left a comment

/lgtm


@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 9, 2023
@openshift-ci
Contributor

openshift-ci bot commented Mar 9, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, sinnykumari, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [cgwalters,sinnykumari,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sinnykumari
Contributor

Putting a hold on while we wait for the payload test to finish: https://pr-payload-tests.ci.openshift.org/runs/ci/b22a21a0-be9a-11ed-8ca2-fef6efbc6fe7-0
/hold
Once the payload test is green, this should be good to merge.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 9, 2023
@cgwalters cgwalters changed the title OCPBUGS-8113: daemon: Always remove pending deployment before we do updates OCPBUGS-9685: daemon: Always remove pending deployment before we do updates Mar 9, 2023
@openshift-ci-robot openshift-ci-robot added jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. and removed jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Mar 9, 2023
@openshift-ci-robot
Contributor

@cgwalters: This pull request references Jira Issue OCPBUGS-9685, which is invalid:

  • expected the bug to target the "4.14.0" version, but it targets "4.13.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


@cgwalters
Member Author

cgwalters commented Mar 9, 2023

OK, I've been digging into this more, and I now believe we do indeed have two distinct bugs; they just happened to fail with similar symptoms.

I've retargeted this change at https://issues.redhat.com/browse/OCPBUGS-9685
/jira refresh

because it is definitely aiming to fix the issue we saw in the aggregated periodic, which is
actually distinct from the bug linked to #3600

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 9, 2023
@openshift-ci-robot
Contributor

@cgwalters: This pull request references Jira Issue OCPBUGS-9685, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @rioliu-rh


@sdodson
Member

sdodson commented Mar 9, 2023

/retest-required

@cgwalters
Member Author

Ah yes! I found evidence of this working in the job; rpm-ostree segfaulted on one node, but we successfully retried. Look at this journal:

Mar 09 19:16:54.479002 ci-op-rclyt598-1b6f8-g2sjq-worker-c-mqd8w systemd-coredump[102033]: Process 85197 (rpm-ostree) of user 0 dumped core.
Mar 09 19:16:54.512241 ci-op-rclyt598-1b6f8-g2sjq-worker-c-mqd8w systemd[1]: rpm-ostreed.service: Main process exited, code=killed, status=11/SEGV

Yet, the MCD retry must have worked. Unfortunately we don't have the previous pod logs from the MCD, but the success of the payload job combined with this evidence leads me to
/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 9, 2023
@cgwalters
Member Author

/skip

@cgwalters
Member Author

/cherrypick release-4.13

@openshift-cherrypick-robot

@cgwalters: once the present PR merges, I will cherry-pick it on top of release-4.13 in a new PR and assign it to you.


@sdodson sdodson merged commit 539a2b1 into openshift:master Mar 9, 2023
@openshift-ci-robot
Contributor

@cgwalters: Jira Issue OCPBUGS-9685: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-9685 has been moved to the MODIFIED state.


@openshift-ci
Contributor

openshift-ci bot commented Mar 9, 2023

@cgwalters: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

  • ci/prow/okd-scos-e2e-aws-ovn (commit 112277e, not required): /test okd-scos-e2e-aws-ovn
  • ci/prow/okd-scos-e2e-gcp-ovn-upgrade (commit 112277e, not required): /test okd-scos-e2e-gcp-ovn-upgrade
  • ci/prow/e2e-hypershift (commit 112277e, not required): /test e2e-hypershift
  • ci/prow/e2e-alibabacloud-ovn (commit 112277e, not required): /test e2e-alibabacloud-ovn



@openshift-cherrypick-robot

@cgwalters: new pull request created: #3601

