Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmdlib: bump supermin VM memory to 3G #2940

Closed
wants to merge 1 commit into from

Conversation

jlebon
Copy link
Member

@jlebon jlebon commented Jun 23, 2022

Something has changed recently which causes us to hit the ENOMEM issue
more easily now:

openshift/os#594 (comment)

Mid-term, we could rework the compose so that only the OCI archive is
pulled through 9p rather than a full pull-local. Long-term, the fix is
to stop using 9p.

But for now to unblock CI, let's just bump the VM memory to 3G which
should help.

jmarrero
jmarrero previously approved these changes Jun 23, 2022
Copy link
Member

@jmarrero jmarrero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

travier
travier previously approved these changes Jun 23, 2022
@jlebon
Copy link
Member Author

jlebon commented Jun 23, 2022

Mid-term, we could rework the compose so that only the OCI archive is
pulled through 9p rather than a full pull-local.

Started working on a patch for this.

@dustymabe
Copy link
Member

dustymabe commented Jun 23, 2022

Mid-term, we could rework the compose so that only the OCI archive is
pulled through 9p rather than a full pull-local. Long-term, the fix is
to stop using 9p.

Could we attach the ociarchive file directly as a block device (read-only) with a fixed size to the VM and access it directly (i.e. cat/pipe it to stdin of a process that will extract it)?

@jlebon
Copy link
Member Author

jlebon commented Jun 23, 2022

Mid-term, we could rework the compose so that only the OCI archive is
pulled through 9p rather than a full pull-local. Long-term, the fix is
to stop using 9p.

Could we attach the ociarchive file directly as a block device (read-only) with a fixed size to the VM and access it directly (i.e. cat/pipe it to stdin of a process that will extract it)?

You're talking about the image creation part, right? The issue we're hitting here is rpm-ostree compose time. Image creation actually already works that way (pushing the container only through 9p and extracting inside of there). The path I'm proposing is applying the same idea but in reverse for pulling out the OSTree commit.

@jlebon jlebon enabled auto-merge (rebase) June 23, 2022 13:23
@jlebon jlebon disabled auto-merge June 23, 2022 13:59
Something has changed recently which causes us to hit the ENOMEM issue
more easily now:

openshift/os#594 (comment)

Mid-term, we could rework the compose so that only the OCI archive is
pulled through 9p rather than a full `pull-local`. Long-term, the fix is
to stop using 9p.

But for now to unblock CI, let's just bump the VM memory to 4G which
should help.
@jlebon
Copy link
Member Author

jlebon commented Jun 23, 2022

OK this is useful. Prow failed with ENOMEM, so clearly it's not just a matter of throwing more RAM at it. Unless this regression we're facing has really made 9p more memory hungry. Anyway, I bumped it to 4G now just to see if that works, but let's not merge this yet. I'll try to get the OCI archive-based patch working.

@jlebon jlebon dismissed stale reviews from travier and jmarrero via 83ce8d9 June 23, 2022 14:05
@jlebon jlebon force-pushed the pr/bump-supermin-mem branch from 3593869 to 83ce8d9 Compare June 23, 2022 14:05
@jlebon
Copy link
Member Author

jlebon commented Jun 23, 2022

golangci/golangci-lint info checking GitHub for latest tag
golangci/golangci-lint crit unable to find '' - use 'latest' or see https://github.com/golangci/golangci-lint/releases for details 

Hmm, not sure what's going on here.

/retest

@openshift-ci
Copy link

openshift-ci bot commented Jun 23, 2022

@jlebon: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/images 83ce8d9 link true /test images
ci/prow/rhcos 83ce8d9 link true /test rhcos

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@cgwalters
Copy link
Member

OK this is useful. Prow failed with ENOMEM, so clearly it's not just a matter of throwing more RAM at it.

Yeah I should have highlighted this comment more. It's not that we're running out of memory exactly, it's that 9p specifically is trying to do something broken. We're provoking that bug by doing lots of little files.

Adding more memory only papers over it in that it makes it less likely for the kernel to try reclaiming inodes.

Only real fixes are:

  • use virtiofs and hope that's better (I think it is)
  • Actually fix 9p in the kernel to stop trying to allocate more memory in a reclaim path
  • Stop doing lots of little files over 9p

@jlebon
Copy link
Member Author

jlebon commented Jun 24, 2022

Closing in favour of #2946.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants