Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add/hpc7g node arm support #6743

Merged
merged 3 commits into from
Jul 4, 2024

Conversation

vsoch
Copy link
Contributor

@vsoch vsoch commented Jun 28, 2023

Description

Problem: The new hpc7g images use the graviton2 processor (arm) but are not detected as such by eskctl. In addition, as we have been discussing in #6222, the daemon set for the efa device driver does not work with runAsNonRoot set to true. I believe this tweak is close, however the final step (I think) needs to also be to provide an ARM build for the driver itself. I believe this is proprietary code, so I wanted to ask here first about that. I was able to figure out the container entrypoint and output, in case that helps:

$ /usr/bin/efa-k8s-device-plugin 
2023/06/27 00:17:39 Fetching EFA devices.
2023/06/27 00:17:39 device: rdmap0s6,uverbs0,/sys/class/infiniband_verbs/uverbs0,/sys/class/infiniband/rdmap0s6

2023/06/27 00:17:39 EFA Device list: [{rdmap0s6 uverbs0 /sys/class/infiniband_verbs/uverbs0 /sys/class/infiniband/rdmap0s6}]
2023/06/27 00:17:39 Starting FS watcher.
2023/06/27 00:17:39 Starting OS watcher.
2023/06/27 00:17:39 device: rdmap0s6,uverbs0,/sys/class/infiniband_verbs/uverbs0,/sys/class/infiniband/rdmap0s6

2023/06/27 00:17:39 Starting to serve on /var/lib/kubelet/device-plugins/aws-efa-device-plugin.sock
2023/06/27 00:17:39 Registered device plugin with Kubelet

Note that online examples for efa (e.g., this repository) is not exactly what we want - the Dockerfile will build a container that can use EFA but not one that has that particular executable.

Let me know how you would like to proceed!

@@ -57,6 +57,7 @@ spec:
- g4dn.metal
- g5.48xlarge
- hpc6a.48xlarge
- hpc7g.16xlarge
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

V - lets update with all hpc7g types here since they all have EFA. Add hpc7g.8xlarge, hpc7g.4xlarge.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can do! I didn’t want to add them before being able to test. Any updates on if it might be possible to have an ARM build for the plugin?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! And per our discussion, I can offer to do a PR to add these images here https://github.com/aws-samples/efa-device-plugin-helm/blob/main/aws-efa-k8s-device-plugin after we finish up here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay opened, albeit we should not merge it! aws-samples/efa-device-plugin-helm#1

@vsoch
Copy link
Contributor Author

vsoch commented Jul 11, 2023

This is working now and can be reviewed. I changed nothing, and I have no idea why it's working. I think it might be related to the aws metadata (describe-instances) that sets conditions for the efa / network devices. If it wasn't completed/ready on the first tries, maybe that could lead to this outcome?

@vsoch
Copy link
Contributor Author

vsoch commented Dec 3, 2023

Why was this closed?

@cPu1 cPu1 added priority/important-longterm Important over the long term, but may not be currently staffed and/or may require multiple releases and removed stale labels Dec 4, 2023
@cPu1 cPu1 reopened this Dec 4, 2023
@cPu1
Copy link
Contributor

cPu1 commented Dec 4, 2023

Why was this closed?

@vsoch, apologies again, it was closed by the stale bot. I have reopened it now and applied a label that should prevent it from being automatically closed. Please give us some time to review it, the team is occupied with other deliverables.

@vsoch
Copy link
Contributor Author

vsoch commented Dec 4, 2023

Thank you!

@github-actions github-actions bot added the stale label Jan 4, 2024
@github-actions github-actions bot closed this Jan 9, 2024
@vsoch
Copy link
Contributor Author

vsoch commented Jan 9, 2024

Please re-open again, thank you!

@cPu1 cPu1 reopened this Jun 19, 2024
@vsoch vsoch force-pushed the add/hpc7g-node-arm-support branch from 2adbe41 to 85112d8 Compare June 19, 2024 21:09
@vsoch
Copy link
Contributor Author

vsoch commented Jun 19, 2024

Hi @cPu1 could you please give feedback to the CI errors? I'm seeing them show up in other PRs and it looks to be that an incorrect function signature is being used, for example:

  GetOutpostInstanceTypes(context.backgroundCtx,*outposts.GetOutpostInstanceTypesInput,func(*outposts.Options))
  		0: context.backgroundCtx{emptyCtx:context.emptyCtx{}}
  		1: &outposts.GetOutpostInstanceTypesInput{OutpostId:(*string)(0xc0000542c0), MaxResults:(*int32)(nil), NextToken:(*string)(nil), noSmithyDocumentSerde:document.NoSerde{}}
  		2: (func(*outposts.Options))(0x7ec320)

  The closest call I have is: 

  GetOutpostInstanceTypes(string,string)
  		0: "mock.Anything"
  		1: "mock.Anything"

and notably we don't touch the relevant code here. Is there another PR that is fixing these CI issues we should watch? Thanks!

@cPu1
Copy link
Contributor

cPu1 commented Jul 4, 2024

@vsoch could you please rebase with main?

@vsoch vsoch force-pushed the add/hpc7g-node-arm-support branch from 85112d8 to f95076b Compare July 4, 2024 19:02
@vsoch
Copy link
Contributor Author

vsoch commented Jul 4, 2024

All set!

@cPu1 cPu1 merged commit 231194c into eksctl-io:main Jul 4, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/improvement priority/important-longterm Important over the long term, but may not be currently staffed and/or may require multiple releases
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants