Dramatic slowdown (50x typical) when --with-production used with *-linux-arm8 on multi-node NVIDIA Grace Hopper runs #3850

Open
jobordner opened this issue Oct 30, 2024 · 2 comments


@jobordner

This is with a basic Enzo-E cosmology test problem on TACC Vista. Running with --with-production on multiple nodes seems to periodically trigger a 30s timeout(?). See the Projections traces comparing runs with and without --with-production. The slowdown is independent of the network build (both mpi and netlrts) and the compiler (gcc or nvc). It only occurs when multiple nodes are used.

@evanramos-nvidia

Hi James, could you try profiling the problematic case with NVIDIA Nsight Systems? (You may need to add -g to your --with-production build line to ensure the resulting binary has debug symbols.)
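A minimal sketch of what that could look like, written as a dry run that only prints the commands. The build target, the `enzo-e` executable name, the `input.in` file, and the `SLURM_PROCID` variable used to name per-rank reports are assumptions for illustration; substitute whatever your build and launcher actually use. Tracing `osrt` (OS runtime calls) alongside `mpi` is suggested here only because a periodic stall might show up as a long blocking call.

```shell
#!/bin/sh
# Dry run: assemble and print the commands instead of executing them.

# Assumed Charm++ target; the issue reports the same behaviour with netlrts-linux-arm8.
CHARM_TARGET="mpi-linux-arm8"

# Same production build line, with -g appended so the binary carries debug symbols.
BUILD_CMD="./build charm++ ${CHARM_TARGET} --with-production -g"

# Per-rank Nsight Systems profile. %q{VAR} expands an environment variable in the
# report name; SLURM_PROCID, enzo-e, and input.in are hypothetical placeholders.
NSYS_CMD="nsys profile --trace=osrt,mpi --output=report_rank%q{SLURM_PROCID} ./enzo-e input.in"

echo "$BUILD_CMD"
echo "$NSYS_CMD"
```

On a real run the second command would be prefixed with the site's usual parallel launcher so that each rank writes its own report.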

@jobordner
Author

Hi Evan,

I tried a basic trace but don't see anything obvious in the timeline, though I'm still learning to use Nsight Systems. The CPU does appear to be at 99% utilization throughout.

Do you have any suggestions for nsys parameters, or tips for analysing the existing traces? I have a page at CharmIssue3850 with a link to reports for the 256-core run, though it's 1.3GB. Single reports are available as well, with an explicit link to the first.

Thanks,
James
