[Bug] default device is not large enough for efa installer #6869
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This needs to be fixed, please don't close it.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Please do not close, it's not fixed. Thanks!
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Please do not close the issue.
I identified the issue and put in a PR, and it has largely been ignored. This (along with other bugs) still persists, and it's outside of my power to do anything. My team has a custom fork and branch with a bunch of fixes because nothing happens over here, and the tool is nonfunctional without them. :/
This was the PR for this particular issue, now closed: #6870
@vsoch I see, thanks!
Please see #6870 (comment) and reopen; we just lost an entire evening (and about half our experiment funds for the cluster sizes in question) doing nothing because some node came up without EFA.
I have reopened the issue. We'll look into adjusting the stale bot settings to avoid running into this again. |
What were you trying to accomplish?
We are using eksctl to create a node group of hpc6a and hpc7g instances (using #6743) with efaEnabled set to true. With our fixes in that PR, both instance types were working about 3 weeks ago. However, they recently stopped working, and debugging revealed that the root filesystem didn't have enough room for the EFA installer.
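For reference, the node group definition is roughly of this form (a sketch, not the exact config from our setup; the names, region, AZ, and instance size are illustrative, and efaEnabled is the flag in question):

```yaml
# Sketch of an eksctl ClusterConfig with an EFA-enabled HPC node group.
# Names, region, AZ, and capacity are placeholders.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: hpc-cluster
  region: us-east-2

nodeGroups:
  - name: hpc6a-efa
    instanceType: hpc6a.48xlarge
    desiredCapacity: 2
    availabilityZones: ["us-east-2b"]  # EFA node groups are typically pinned to a single AZ
    efaEnabled: true                   # the flag this issue is about
```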
At first I very naively increased the root volume size (yes, I made it huge just to rule this out!).
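Roughly, that change is just a volumeSize bump on the node group (a sketch; volumeSize is in GiB, and the value shown is illustrative rather than what was actually used):

```yaml
nodeGroups:
  - name: hpc6a-efa
    instanceType: hpc6a.48xlarge
    efaEnabled: true
    volumeSize: 500   # deliberately oversized root volume, just to rule out disk space
```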
But that didn't work - same issue. I looked closer at the cloud-init logs and realized that our UserData boot script (the one that installs EFA) actually runs before the cloud-init step that grows the partition and resizes the filesystem. For context, it looks like user data is one of the early scripts:
And then the growpart (and confirmation of the resize) happens about a minute later:
Specifically these two lines tell us that the filesystem is larger:
Of course that's useless here, because we always run the EFA installer before the disks are resized, with the default (original) size, so it will always fail. This has me puzzled as to why it sometimes does work; I can only guess there are differences in the default sizes, or perhaps, if these steps are not run serially, there is some kind of race between resizing the partition and installing EFA. Actually, I have the same log for a working node - let me check that out.
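As an aside, for anyone who wants to check this ordering on their own node, cloud-init can reconstruct the boot timeline from its logs (this assumes the standard cloud-init log locations on the AMI; paths may differ):

```bash
# Show per-stage/per-module timing, including growpart/resizefs and the
# user-data scripts, as recorded in /var/log/cloud-init.log.
cloud-init analyze show
cloud-init analyze blame   # same data, sorted by time spent

# Or grep the raw logs directly:
grep -n -E "growpart|resize" /var/log/cloud-init.log
less /var/log/cloud-init-output.log   # stdout/stderr of user-data scripts (incl. the EFA installer)
```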
For the working one, I can also see the resize, and here are the timestamps:
That verifies they were the same. But what happened earlier?
Well, that's even weirder - the script was successful that time, and it still ran earlier! I have no idea what could have happened, but I'm suspicious that the working script is smaller in size:
That suggests that the content of those scripts is not the same. I just tested a change to the init scripts that removes the .tar.gz, and at least on a first try both pods are running!
This could be chance, so I'm going to tear it down, try a larger cluster, and see if I have luck. If removing the tarball turns out to be enough of a temporary fix to free the filesystem room it needs, I will add it to my current ARM PR (which I do hope you consider reviewing soon!). I will update here with what I find.
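In case it helps anyone hitting the same wall before a proper fix lands, this is roughly what a more space-conscious EFA install step can look like in node user data (a sketch: the download URL and -y flag follow the public EFA installer instructions, but the free-space threshold and wait loop are illustrative guesses, not values taken from eksctl):

```bash
#!/usr/bin/env bash
# Sketch: install EFA without leaving the tarball and extracted tree on the
# (possibly small) root volume. The threshold and timing below are guesses.
set -euo pipefail

# Optionally give cloud-init's growpart/resize a chance to finish, so the
# installer isn't racing the root filesystem resize.
for _ in $(seq 1 30); do
    avail_kb=$(df --output=avail / | tail -1)
    if [ "${avail_kb}" -ge $((2 * 1024 * 1024)) ]; then  # ~2 GiB free
        break
    fi
    sleep 5
done

cd /tmp
curl -fsSL -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz
tar -xf aws-efa-installer-latest.tar.gz
(cd aws-efa-installer && ./efa_installer.sh -y)

# Clean up so the installer artifacts don't eat into the root volume.
rm -rf /tmp/aws-efa-installer-latest.tar.gz /tmp/aws-efa-installer
```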