Dramatic slowdown (50x typical) when --with-production used with *-linux-arm8 on multi-node NVIDIA Grace Hopper runs #3850

jobordner · 2024-10-30T01:39:32Z

This is with a basic Enzo-E cosmology test problem on TACC Vista. Running with --with-production on multiple nodes seems to periodically trigger a 30s timeout(?). See the Projections traces comparing runs with and without --with-production. The slowdown is independent of the network layer (both mpi and netlrts builds show it) and of the compiler (gcc or nvc). It occurs only when multiple nodes are used.


evanramos-nvidia · 2024-11-04T16:27:50Z

Hi James, could you try profiling the problematic case with NVIDIA Nsight Systems? (You may need to add -g to your --with-production build line to ensure the resulting binary has debug symbols.)
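For reference, a minimal dry-run sketch of what the rebuild and profiling commands might look like. The target `mpi-linux-arm8`, the `ibrun` launcher, the executable name `enzo-e`, the parameter file `test_cosmo.in`, and the report name are all assumptions for illustration, not taken from the actual setup; the script only prints the commands so they can be checked before running.

```shell
#!/bin/sh
# Dry-run sketch: builds the command lines as strings and prints them.
# All names below are assumptions; adjust them to the real build and run setup.

CHARM_TARGET="mpi-linux-arm8"   # assumed target; the issue covers *-linux-arm8
APP="./enzo-e"                  # hypothetical path to the application binary
PARAMS="test_cosmo.in"          # hypothetical parameter file

# Rebuild Charm++ with production settings plus debug symbols (-g).
BUILD_CMD="./build charm++ ${CHARM_TARGET} --with-production -g"

# Profile each rank with Nsight Systems, tracing OS runtime and MPI calls.
# %q{SLURM_PROCID} is expanded by nsys, giving one report file per rank.
PROFILE_CMD="ibrun nsys profile --trace=osrt,mpi --output=report-%q{SLURM_PROCID} ${APP} ${PARAMS}"

echo "$BUILD_CMD"
echo "$PROFILE_CMD"
```

Tracing OS runtime calls (`osrt`) is chosen here because a periodic stall that looks like a timeout may show up as a long blocking system or library call. That is a guess about where to look, not a diagnosis.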

jobordner · 2024-11-21T22:20:18Z

Hi Evan,

I tried a basic trace but don't see anything obvious in the timeline, although I'm still learning to use Nsight Systems. The CPU does appear to be at 99% utilization throughout.

Do you have any suggestions for nsys parameters, or tips for analysing the existing traces? I have a page at CharmIssue3850 with a link to reports for the 256-core run, though it's 1.3GB. Single reports are available as well, with an explicit link to the first.

Thanks,
James
