[Bug] default device is not large enough for efa installer #6869
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This needs to be fixed, please don't close it.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Please do not close, it's not fixed. Thanks!
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Please do not close the issue.
I identified the issue and put in a PR, and it has largely been ignored. This (along with other bugs) still persists, and it's outside of my power to do anything. My team has a custom fork and branch with a bunch of fixes because nothing happens over here, and the tool is nonfunctional without them. :/
This was the PR for this particular issue, now closed: #6870
@vsoch I see, thanks!
Please see #6870 (comment) and reopen; we just lost an entire evening (and about half our experiment funds for the cluster sizes in question) doing nothing because some node came up without EFA.
I have reopened the issue. We'll look into adjusting the stale bot settings to avoid running into this again. |
What were you trying to accomplish?
We are using eksctl to create a node group of hpc6a and hpc7g instances (using #6743) with efaEnabled set to true. With our fixes in that PR, both instance types were working about 3 weeks ago. However, they recently stopped working, and debugging revealed that the root filesystem didn't have enough room for the EFA installer.
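For reference, the node group definition is roughly of this form (a sketch, not the exact config from our setup; the names, region, AZ, and instance size are illustrative, and efaEnabled is the flag in question):

```yaml
# Sketch of an eksctl ClusterConfig with an EFA-enabled HPC node group.
# Names, region, AZ, and capacity are placeholders.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: hpc-cluster
  region: us-east-2

nodeGroups:
  - name: hpc6a-efa
    instanceType: hpc6a.48xlarge
    desiredCapacity: 2
    availabilityZones: ["us-east-2b"]  # EFA node groups are typically pinned to a single AZ
    efaEnabled: true                   # the flag this issue is about
```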
At first I very naively increased the root volume size (yes, I made it huge just to rule this out!).
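Roughly, that change is just a volumeSize bump on the node group (a sketch; volumeSize is in GiB, and the value shown is illustrative rather than what was actually used):

```yaml
nodeGroups:
  - name: hpc6a-efa
    instanceType: hpc6a.48xlarge
    efaEnabled: true
    volumeSize: 500   # deliberately oversized root volume, just to rule out disk space
```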
But that didn't work - same issue. I looked closer at the cloud-init logs and realized that our UserData boot script (the one that installs EFA) actually runs before the cloud-init step that grows the partition and resizes the filesystem. For context, it looks like user data is one of the early scripts:
And then the growpart (and confirmation of the resize) happens about a minute later:
Specifically these two lines tell us that the filesystem is larger:
Of course that's useless here, because we always run the EFA installer before the disks are resized, with the default (original) size, so it will always fail. This has me puzzled as to why it sometimes does work; I can only guess there are differences in the default sizes, or perhaps, if these steps are not run serially, there is some kind of race between resizing the partition and installing EFA. Actually, I have the same log for a working node - let me check that out.
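As an aside, for anyone who wants to check this ordering on their own node, cloud-init can reconstruct the boot timeline from its logs (this assumes the standard cloud-init log locations on the AMI; paths may differ):

```bash
# Show per-stage/per-module timing, including growpart/resizefs and the
# user-data scripts, as recorded in /var/log/cloud-init.log.
cloud-init analyze show
cloud-init analyze blame   # same data, sorted by time spent

# Or grep the raw logs directly:
grep -n -E "growpart|resize" /var/log/cloud-init.log
less /var/log/cloud-init-output.log   # stdout/stderr of user-data scripts (incl. the EFA installer)
```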
For the working one, I can also see the resize, and here are the timestamps:
That verifies they were the same. But what happened earlier?
Well, that's even weirder - the script was successful that time, and it still ran earlier! I have no idea what could have happened, but I'm suspicious that the working script is smaller in size:
That suggests that the content of those scripts is not the same. I just tested a change to the init scripts that removes the .tar.gz, and at least on a first try both pods are running!
This could be chance, so I'm going to tear it down, try a larger cluster, and see if I have luck. If removing the tarball turns out to be enough of a temporary fix to free the filesystem room it needs, I will add it to my current ARM PR (which I do hope you consider reviewing soon!). I will update here with what I find.
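In case it helps anyone hitting the same wall before a proper fix lands, this is roughly what a more space-conscious EFA install step can look like in node user data (a sketch: the download URL and -y flag follow the public EFA installer instructions, but the free-space threshold and wait loop are illustrative guesses, not values taken from eksctl):

```bash
#!/usr/bin/env bash
# Sketch: install EFA without leaving the tarball and extracted tree on the
# (possibly small) root volume. The threshold and timing below are guesses.
set -euo pipefail

# Optionally give cloud-init's growpart/resize a chance to finish, so the
# installer isn't racing the root filesystem resize.
for _ in $(seq 1 30); do
    avail_kb=$(df --output=avail / | tail -1)
    if [ "${avail_kb}" -ge $((2 * 1024 * 1024)) ]; then  # ~2 GiB free
        break
    fi
    sleep 5
done

cd /tmp
curl -fsSL -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz
tar -xf aws-efa-installer-latest.tar.gz
(cd aws-efa-installer && ./efa_installer.sh -y)

# Clean up so the installer artifacts don't eat into the root volume.
rm -rf /tmp/aws-efa-installer-latest.tar.gz /tmp/aws-efa-installer
```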