v2.4.0 grub error out of memory #1842

Closed
Ognian opened this issue Sep 21, 2023 · 44 comments
Labels
bug Something isn't working prio: high

Comments

@Ognian
Contributor

Ognian commented Sep 21, 2023

After installing from kairos-standard-opensuse-leap-amd64-generic-v2.4.0-k3sv1.26.6+k3s1.iso to /dev/mmcblk1 on an x86_64 machine (latte panda 3 d), I immediately get the following grub error:
image

@Ognian Ognian added the bug Something isn't working label Sep 21, 2023
@Ognian
Contributor Author

Ognian commented Sep 22, 2023

same for kairos-standard-opensuse-tumbleweed-amd64-generic-v2.4.0-k3sv1.26.6+k3s1.iso

@Itxaka
Member

Itxaka commented Sep 25, 2023

Umm, this could be related to the gfx mode set by grub. You may need to lower it manually, as we now set the gfxterm terminal to auto and it will try to use the highest mode available.

Maybe you can check with different gfxmode values?

@jimmykarily jimmykarily moved this from Incoming to Under review 🔍 in 🧙Issue tracking board Sep 25, 2023
@Itxaka Itxaka self-assigned this Sep 25, 2023
@Itxaka
Member

Itxaka commented Sep 25, 2023

@Itxaka
Member

Itxaka commented Sep 25, 2023

Seems like elementary also hit this at one point, which seems to confirm that this is a gfx issue: a really high gfx mode gets set, but the framebuffer is not big enough to display it. See elementary/installer#542

@Ognian
Contributor Author

Ognian commented Sep 25, 2023

Trying to change gfxmode from auto to 640x480, but it is weird:

  • adding set gfxmode=640x480 as the last line of the grub.cfg in the first partition -> doesn't help
  • changing auto to 640x480 in the grub.cfg in the grub2 dir of COS_STATE -> doesn't help either, but produces an interesting picture:
image

It does change something, but where should the change actually be made? Or is it needed in multiple places?

And actually why does it work from the usb stick and not after installing? I thought that the grub config is identical...
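
One way to approach the "where do I change it?" question is to list every grub.cfg on the installed disk that sets gfxmode. A minimal sketch, assuming GNU grep and assuming the installed partitions are already mounted somewhere; the function name and the /mnt/kairos path in the usage comment are hypothetical, not Kairos conventions:

```shell
#!/bin/sh
# List every line that sets gfxmode in any grub.cfg below a directory.
# Assumption: GNU grep (for -r and --include).
find_gfxmode() {
    grep -rnE --include='grub.cfg' '^[[:space:]]*set[[:space:]]+gfxmode=' "$1"
}

# Hypothetical usage, with the installed partitions mounted under /mnt/kairos:
#   find_gfxmode /mnt/kairos
```

Each hit is printed as file:line:content, which makes it easier to see whether more than one config sets the mode.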

@Ognian
Contributor Author

Ognian commented Sep 30, 2023

@Itxaka any news on this, any chance to be fixed in 2.4.1?

@Itxaka
Member

Itxaka commented Oct 3, 2023

@Ognian unfortunately no. As this requires a change to grub default values, we needed to push 2.4.1 to fix some issues before getting to work on this, as it requires extensive testing to find a good default.

@Ognian
Contributor Author

Ognian commented Oct 3, 2023

Tested with 2.4.1 same issue!
Noticed the following:
image

@Itxaka
Member

Itxaka commented Oct 3, 2023

Tested with 2.4.1 same issue! Noticed the following: image

Wait, so this means you are able to boot by manually setting the gfxmode, right? But then on reboot it ignores it unless you set it manually?

Seems like we need to look for a safe default for the resolution.

Those are just warnings being exposed. It happened before, but we were not logging them properly. It should not affect much; it's just nicer to have those fonts bundled :)

@Ognian
Contributor Author

Ognian commented Oct 3, 2023

I'll describe the process from the beginning:

  1. I'm downloading kairos-standard-opensuse-leap-amd64-generic-v2.4.1-k3sv1.26.6+k3s1.iso and burning it to a USB stick
  2. I'm inserting the stick and booting from it (latte panda delta 3 -> x86_64 with built-in eMMC). The stick boots and I get the QR code.
  3. I'm using the webui (ip:8080) to install on the built-in eMMC (/dev/mmcblk1), pasting my cloud_config and checking reboot
  4. When it restarts, I remove the USB stick so it tries to boot from the eMMC (sd card). Here the out of memory error of grub appears

The grub.cfg on the USB stick is much shorter than the one written by the installer on the eMMC (= sd card). The grub configuration on the USB stick always works, the one on the sd card never does.

I tried to modify the one on the sd card by inserting set gfxmode=640x480 at different places. It changes the behavior, BUT none of the attempts led to booting kairos...

@AndreyNikiforov

Also faced out of memory (OOM) issues when trying to install on an old Acer Aspire1 laptop (4G ram & mmc). Bisected to the loopback command that causes OOM. Suggestions on SO are to copy kernel & initrd from image to disk. Don't have progress as I am still learning grub...

@AndreyNikiforov

Also faced out of memory (OOM) issues when trying to install on an old Acer Aspire1 laptop (4G ram & mmc). Bisected to the loopback command that causes OOM. Suggestions on SO are to copy kernel & initrd from image to disk. Don't have progress as I am still learning grub...

Enabling debugging with set debug=all let me pass through loopback - different error (not OOM). In debug I noticed that the tpm module is used, so I turned off TPM in BIOS and kairos started successfully. Although I am unblocked, it is not clear what the root cause was. If it was indeed the lack of memory and TPM use just pushed it over the limit, then reducing the memory footprint makes sense: use text mode by default, test with large images, etc.

@Itxaka
Member

Itxaka commented Oct 5, 2023

I'll describe the process from the beginning:

1. I'm downloading kairos-standard-opensuse-leap-amd64-generic-v2.4.1-k3sv1.26.6+k3s1.iso and burning it to a USB stick
2. I'm inserting the stick and booting from it (latte panda delta 3 -> x86_64 with built-in eMMC). The stick boots and I get the QR code.
3. I'm using the webui (ip:8080) to install on the built-in eMMC (/dev/mmcblk1), pasting my cloud_config and checking reboot
4. When it restarts, I remove the USB stick so it tries to boot from the eMMC (sd card). Here the out of memory error of grub appears

The grub.cfg on the USB stick is much shorter than the one written by the installer on the eMMC (= sd card). The grub configuration on the USB stick always works, the one on the sd card never does.

I tried to modify the one on the sd card by inserting set gfxmode=640x480 at different places. It changes the behavior, BUT none of the attempts led to booting kairos...

Yep, this makes sense. Our grub.cfg for livecd does not set gfxmode, so in livecd/usb/live mode you do not hit this. It's only once you restart from the installed system that you hit this issue, as there we set gfxmode=auto.

Let me test this somehow. Maybe I can make virtualbox reproduce it by giving the video card a very low amount of RAM or something similar...

@Itxaka
Member

Itxaka commented Oct 5, 2023

Also faced out of memory (OOM) issues when trying to install on an old Acer Aspire1 laptop (4G ram & mmc). Bisected to the loopback command that causes OOM. Suggestions on SO are to copy kernel & initrd from image to disk. Don't have progress as I am still learning grub...

Enabling debugging with set debug=all let me pass through loopback - different error (not OOM). In debug I noticed that the tpm module is used, so I turned off TPM in BIOS and kairos started successfully. Although I am unblocked, it is not clear what the root cause was. If it was indeed the lack of memory and TPM use just pushed it over the limit, then reducing the memory footprint makes sense: use text mode by default, test with large images, etc.

Very weird, 4 GB of RAM should be more than enough for everything to load with no issues. After all, the kernel and initrd can't be more than 200 MB in any of the flavors...

Wondering if it's due to the modules or the gfx stuff in your case as well...

@Ognian
Contributor Author

Ognian commented Oct 5, 2023

So I disabled TPM from BIOS (Thanks @AndreyNikiforov !)
I did a clean install of 2.4.1 from USB.
On first boot of the internal eMMC:
image

image
pressed a key, booting continues
image image image
The above errors don't look scary to see... After this it looks like it works...

@Itxaka
Member

Itxaka commented Oct 5, 2023

Some comments found going through the grub bugtracker:

Finally I found a comment regarding the screen size and GRUB. Apparently the 4k graphics size eats half the available 200MB RAM from GRUBs allotment. Thus any initrd.img larger than 100MB won't load.

Looks like TPM module is indeed involved! rhboot/grub2#102

So https://github.com/rhboot/grub2/commit/635f85b016839b9aaecdecee69a2ee98edb3e0ab was supposed to allow initrds to be allocated over 4GB. However, initrds are also being verified by the verifiers framework, or rather the tpm "verifier" measures them this way.

This causes the verifiers framework to read the entire file into memory first using standard memory allocation to verify it and then release it again before our allocator gets a chance to load the size and allocate it. This is um bad.

So it makes sense that disabling tpm makes it work, as grub then doesn't try to load the initrd fully into memory to measure it.

So it seems to be a mix of several things:

  • gfxterm set to auto (This exhausts memory if output is a 4k stream)
  • tpm trying to load initrd fully into memory
  • grub not able to see and map the full memory needed

Have to think about this and check further in upstream grub to see if this has been fixed somewhere, but good catch, folks.

Thanks @Ognian for reporting this and @AndreyNikiforov for the hint about the TPM. This would have been a nightmare to track down otherwise!
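
For a rough sense of scale on the gfxterm point, the raw size of one framebuffer is width × height × bytes per pixel. A minimal sketch; the function name is made up for illustration, and how many such buffers GRUB actually allocates is not established in this thread:

```shell
#!/bin/sh
# Raw size in bytes of a single framebuffer for a WIDTHxHEIGHTxDEPTH mode.
# Assumption: DEPTH is bits per pixel and a multiple of 8.
fb_bytes() {
    echo $(( $1 * $2 * ($3 / 8) ))
}

fb_bytes 3840 2160 32   # 4k at 32 bpp    -> 33177600
fb_bytes 640 480 32     # 640x480 at 32 bpp -> 1228800
```

That is roughly 32 MiB per buffer at 4k versus a little over 1 MiB at 640x480, so the chosen mode can plausibly matter when GRUB's usable memory is small.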

@Itxaka
Member

Itxaka commented Oct 5, 2023

our kernel on core images is around 13 MB
our initrd on core images is around 92/96 MB

It kind of makes sense that we go over that mentioned 100 MB by setting the gfx mode to auto if it chooses a very high resolution...
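
To compare these sizes against a limit on a given image, the files can be measured directly. A minimal sketch: the function name and the /boot/vmlinuz path in the usage comment are assumptions for illustration, and which limit GRUB really applies is still an open question here.

```shell
#!/bin/sh
# Print the combined size of a kernel and an initrd in whole MiB (rounded up).
boot_payload_mib() {
    kernel_bytes=$(wc -c < "$1")
    initrd_bytes=$(wc -c < "$2")
    total=$(( kernel_bytes + initrd_bytes ))
    echo $(( (total + 1048575) / 1048576 ))
}

# Hypothetical usage inside an image:
#   boot_payload_mib /boot/vmlinuz /boot/initrd
```

Running it before and after a change to the initrd compression shows how much headroom the change actually buys.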

@Itxaka
Member

Itxaka commented Oct 5, 2023

Compressing the initramfs with zstd would gain us 4 extra MB, which is not much, but it's good enough to breathe I guess.

@Ognian does this happen with a non-k3s build? If it also happens, are you able to build a custom image with the --zstd flag on initrd creation to see if it alleviates the issue?

The patch is as follows, it's just one line:

diff --git a/Earthfile b/Earthfile
index b22b8c8..61eb545 100644
--- a/Earthfile
+++ b/Earthfile
@@ -441,7 +441,7 @@ base-image:
       IF [ -e "/usr/bin/dracut" ]
           # Regenerate initrd if necessary
           RUN --no-cache kernel=$(ls /lib/modules | head -n1) && depmod -a "${kernel}"
-          RUN --no-cache kernel=$(ls /lib/modules | head -n1) && dracut -f "/boot/initrd-${kernel}" "${kernel}" && ln -sf "initrd-${kernel}" /boot/initrd
+          RUN --no-cache kernel=$(ls /lib/modules | head -n1) && dracut --zstd -f "/boot/initrd-${kernel}" "${kernel}" && ln -sf "initrd-${kernel}" /boot/initrd
       END
     END

And then simply run earthly +iso --FLAVOR=opensuse-leap --VARIANT=standard --K3S_VERSION=v1.26.6 to generate an iso under build

@Itxaka
Member

Itxaka commented Oct 9, 2023

Umm, booting from master in 4k doesn't reproduce the issue, even with tpm. I'm wondering if it's a tpm implementation issue rather than a grub one. We don't ship the tpm module with grub as a module, so not sure if it's integrated into grub directly.

I think we need to rework the grub.cfg to not load gfxterm for now unless it's needed, as it's giving us a lot of headaches.

@jimmykarily
Contributor

We dropped gfxterm here: kairos-io/packages#473. Please give it a try; if the problem still occurs, feel free to re-open.

@github-project-automation github-project-automation bot moved this from Under review 🔍 to Done ✅ in 🧙Issue tracking board Oct 23, 2023
@jeffmhastings

jeffmhastings commented Nov 1, 2023

I'm running into the same problem using kairos-standard-ubuntu-22-lts-amd64-generic-v2.4.1-k3sv1.27.3+k3s1.iso. I also built from master, thinking that would pull in the changes from kairos-io/packages#473 (and I think it did, because my grub.cfg is now missing all the gfx stuff), but have the same result. I didn't have success disabling TPM either.

Edit: Disabling TPM and reinstalling gave me the same results as @Ognian (can't find regexp, boots after pressing a key). Anyway I'd definitely like to see this issue resolved (ideally without disabling TPM) so let me know if there's anything I can do to help.

@jimmykarily
Contributor

Up to now it seems that to reproduce this issue one needs:

  • gfxmode set to auto
  • a 4k monitor (to make the above use a high resolution). Maybe 2k will also trigger it, not sure
  • a TPM chip on the machine
  • uefi booting

and we are still missing something, because @Itxaka tried the above combination and couldn't reproduce it. His test was on qemu with virtual monitors though, so maybe that's the reason (although grub thought the resolution was 4k)

@mevatron

Up to now it seems that to reproduce this issue one needs:

  • gfxmode set to auto
  • a 4k monitor (to make the above use a high resolution). Maybe 2k will also trigger it, not sure
  • a TPM chip on the machine
  • uefi booting

I've looked for a way to disable TPM on the Surface Pro, but I don't think that is an available setting in its boot menu. What's the best way to test setting the gfxmode to a lower resolution in Kairos?

@jimmykarily
Contributor

I would try this (warning: not tested):

Hopefully that should set the gfxmode on the installed system's grub. You can of course check after installation by editing the grub menu again and looking for that option.

@tyzbit

tyzbit commented Nov 26, 2023

I know you said to use the live CD but I rebooted a node and tried running videoinfo in the GRUB prompt, it said the command was not found. I tried different combinations of set gfxmode= and set gfxpayload= in the custom one-time GRUB options and none of them prevented the error. It also seemed like none of them changed the video. For what it's worth, here's my config

@mevatron

mevatron commented Nov 29, 2023

I know you said to use the live CD but I rebooted a node and tried running videoinfo in the GRUB prompt, it said the command was not found. I tried different combinations of set gfxmode= and set gfxpayload= in the custom one-time GRUB options and none of them prevented the error. It also seemed like none of them changed the video. For what it's worth, here's my config

I noticed that videoinfo wasn't on the Kairos grub menu as well, but I downloaded the Ubuntu Server 22.04 ISO and that seemed to do the trick.

Unfortunately, lowering the resolution didn't work for me either =/
PXL_20231123_170856024 MP

@jimmykarily
Contributor

@santhoshdaivajna told me on Slack that they are seeing the same issue on an Intel NUC with 8 cpu/32G mem/>500G disk. We may be able to get access to a NUC to debug.

@mudler mudler moved this from Todo 🖊 to In Progress 🏃 in 🧙Issue tracking board Dec 1, 2023
@mudler mudler moved this from In Progress 🏃 to Todo 🖊 in 🧙Issue tracking board Dec 1, 2023
@mudler
Member

mudler commented Dec 1, 2023

This reminds me of https://bugs.launchpad.net/oem-priority/+bug/1842320/comments/125 - did we try setting gfxmode to 640x480?

@mudler
Member

mudler commented Dec 1, 2023

@mudler
Member

mudler commented Dec 1, 2023

maybe it's just the GRUB version causing issues here? @Ognian is that new to 2.4? we could cross check the GRUB versions to see if that's causing it

@Itxaka
Member

Itxaka commented Dec 1, 2023

We think the tumbleweed grub efi binary is responsible for this, and have reverted to using the leap one in kairos-io/packages#553

@mevatron

mevatron commented Dec 2, 2023

@Itxaka Thanks for looking into this! Will this also help the ubuntu flavors, or is this specific to opensuse?

@Itxaka
Member

Itxaka commented Dec 2, 2023

Should be for all, as we use the same grub artifacts for all of them

@Ognian
Contributor Author

Ognian commented Dec 2, 2023

Yes this was new with

maybe it's just the GRUB version causing issues here? @Ognian is that new to 2.4? we could cross check the GRUB versions to see if that's causing it

yes, this was newly introduced with 2.4.
I just tested with 2.4.1 and then upgraded to 2.4.2; the result with 2.4.2 is the same as with 2.4.1 and 2.4.0:
with TPM -> out of memory error; without TPM -> boots OK
The last version I tested where it worked was v2.2.1
The version I have now is:
KAIROS_PRETTY_NAME="kairos-standard-opensuse-leap-15.5 v2.4.2-k3sv1.28.2+k3s1"
and
sudo grub2-install --version
grub2-install (GRUB2) 2.06

Hope this helps.
Ognian

@mevatron

mevatron commented Dec 2, 2023 via email

@Ognian
Contributor Author

Ognian commented Dec 2, 2023

Unfortunately, the Surface Pro 7+ doesn't allow disabling TPM 😕 Is my next option to switch dracut to hostonly=yes, maybe? @Ognian are you running grub2-install inside the new container there? Thanks!

Yes

@mevatron

mevatron commented Dec 3, 2023

This reminds me of https://bugs.launchpad.net/oem-priority/+bug/1842320/comments/125 - did we try setting gfxmode to 640x480?

@mudler I've tried gfxmode=640x480x32 and gfxpayload=640x480x32, but unfortunately it didn't alleviate the OOM errors. I've also tried building from source with @Itxaka's recommendation of zstd, which apparently wasn't enough either. However, my builds from source + Auroraboot do not seem to change the resolution when I adjust grub settings via cloud_init, whereas the official Kairos images do. So, maybe a combination will work if I can get the source builds working 🤔

@mevatron

mevatron commented Dec 4, 2023

Just tested @alexander-bauer 's workaround of rmmod tpm on ubuntu-20.04 and it does indeed allow my system to boot, so seems to be related to TPM for me as well.

@mevatron

mevatron commented Dec 6, 2023

@alexander-bauer I found an option that is a bit more robust to remove the tpm module from the grub.cfg.

Create a Dockerfile:

Pick your favorite Kairos image (e.g., ubuntu:20.04).

FROM quay.io/kairos/ubuntu:20.04-standard-amd64-generic-v2.4.2-k3sv1.28.2-k3s1

RUN sed -i '/insmod regexp/a rmmod tpm' /etc/cos/grub.cfg

Build the image:

docker build -t tpm2workaround -f Dockerfile .

Deploy with auroraboot:

For example, generate an ISO:

docker run --rm -ti \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v $(pwd)/config.yaml:/config.yaml \
  -v $(pwd)/build:/tmp/auroraboot \
  quay.io/kairos/auroraboot \
  --set "container_image=docker://tpm2workaround" \
  --set "disable_http_server=true" \
  --set "disable_netboot=true" \
  --set "state_dir=/tmp/auroraboot" \
  --cloud-config /config.yaml

@Itxaka or @mudler might know of an easier way to override this using one of the cloud-init stages. I tried after-install-chroot and before-install, but neither of those seemed to work.

Hope that helps until we get a more permanent fix!
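
Before baking the sed line into an image, its effect can be checked on a scratch file. A minimal sketch, assuming GNU sed (the one-line form of the a command used in the Dockerfile above is a GNU extension); the function name is hypothetical and the sample grub.cfg content is invented for illustration, not the real Kairos file:

```shell
#!/bin/sh
# Apply the same edit as the Dockerfile to the file given as $1, in place.
# Assumption: GNU sed.
add_rmmod_tpm() {
    sed -i '/insmod regexp/a rmmod tpm' "$1"
}

# Try it on a throwaway file with made-up content.
scratch=$(mktemp)
printf 'insmod all_video\ninsmod regexp\nset timeout=5\n' > "$scratch"
add_rmmod_tpm "$scratch"
cat "$scratch"
rm -f "$scratch"
```

The rmmod tpm line should appear directly after insmod regexp. Note that the edit is not idempotent: running it twice on the same file adds a second rmmod tpm line.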

@Itxaka
Member

Itxaka commented Dec 6, 2023

Could also try the rc3 that we released yesterday to see if it fixes it, as we reverted the grub.efi to a different one which used to work!

@mudler
Member

mudler commented Dec 6, 2023

VirtualBox_reinstal tsest_06_12_2023_16_50_51

Here I can reproduce it as well with rc3 and VirtualBox (ubuntu image: kairos-ubuntu-22.04-standard-amd64-generic-v2.4.3-rc3-k3sv1.28.2+k3s1.iso)

@mudler
Member

mudler commented Dec 6, 2023

VirtualBox_reinstal tsest_06_12_2023_16_50_51

Here I can reproduce it as well with rc3 and VirtualBox (ubuntu image: kairos-ubuntu-22.04-standard-amd64-generic-v2.4.3-rc3-k3sv1.28.2+k3s1.iso)

seems it was just me - recreating the VM with more RAM did the trick

@mudler mudler moved this from Todo 🖊 to Under review 🔍 in 🧙Issue tracking board Dec 6, 2023
@mevatron

mevatron commented Dec 7, 2023

Could also try the rc3 that we released yesterday to see if it fixes it, as we reverted the grub.efi to a different one which used to work!

I tested with quay.io/kairos/ubuntu:20.04-standard-amd64-generic-v2.4.3-rc3-k3s1.28.2-1, and that worked for the Surface Pro 7+! Many thanks @Itxaka!

9 participants