mpi-linux-arm8 build crashes on > 128 cores for hello/3darray #3840

jobordner · 2024-09-27T19:15:57Z

I cannot get any job with more than about 128 total cores to run on TACC's Vista Grace-Hopper system, including the 3darray test problem. The failure does not depend on the number of nodes or the cores per node individually: 16 nodes with 16 cores/node fails, as does 4 nodes with 64 cores/node, but 4 nodes with 16 cores/node runs. I'm using v8.0.1 with the "mpi-linux-arm8" build, and I see the same outcome with both the GNU and NVIDIA compilers. Output for the 4-node, 64 cores/node configuration is shown below:

Running as 256 OS processes: ./hello 256
charmrun> /usr/bin/setarch aarch64 -R mpirun -np 256 ./hello 256
Charm++> Running on MPI library: Open MPI v5.0.5, package: Open MPI [email protected] Distribution, ident: 5.0.5, repo rev: v5.0.5, Jul 22, 2024 (MPI standard: 3.1)
Charm++> Level of thread support used: MPI_THREAD_SINGLE (desired: MPI_THREAD_SINGLE)
Charm++> Running in non-SMP mode: 256 processes (PEs)
Converse/Charm++ Commit ID: v8.0.1
Charm++ built without optimization.
Do not use for performance benchmarking (build with --with-production to do so).
Charm++ built with internal error checking enabled.
Do not use for performance benchmarking (build without --enable-error-checking to do so).
[i618-032:899711:0:899822] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xa00028010060)
[i618-032:899684:0:899805] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xa00028010060)
[i618-032:899704:0:899791] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xa00028010060)
==== backtrace (tid: 899791) ====
 0  /opt/apps/ucx/1.17.0/lib/libucs.so.0(ucs_handle_error+0x288) [0x40003765eb98]
 1  /opt/apps/ucx/1.17.0/lib/libucs.so.0(+0x2ece8) [0x40003765ece8]
 2  /opt/apps/ucx/1.17.0/lib/libucs.so.0(+0x2f07c) [0x40003765f07c]
 3  linux-vdso.so.1(__kernel_rt_sigreturn+0) [0x4000363707f0]
 4  /opt/apps/nvidia24/openmpi/5.0.5/lib/libpmix.so.2(pmix_gds_shmem2_fetch+0xf0) [0x400037873530]
 5  /opt/apps/nvidia24/openmpi/5.0.5/lib/libpmix.so.2(+0x638b8) [0x4000377538b8]
 6  /opt/apps/nvidia24/openmpi/5.0.5/lib/libevent_core-2.1.so.7(+0x24244) [0x400037984244]
 7  /opt/apps/nvidia24/openmpi/5.0.5/lib/libevent_core-2.1.so.7(+0x23954) [0x400037983954]
 8  /opt/apps/nvidia24/openmpi/5.0.5/lib/libevent_core-2.1.so.7(event_base_loop+0x1d4) [0x40003797cf14]
 9  /opt/apps/nvidia24/openmpi/5.0.5/lib/libpmix.so.2(+0xb5b2c) [0x4000377a5b2c]
10  /lib64/libc.so.6(+0x82a38) [0x400037002a38]
11  /lib64/libc.so.6(+0x2bb9c) [0x400036fabb9c]
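
To make the pattern in the configurations above easier to see, here is a small Python sketch that tabulates the total process count for each reported run. The names `REPORTED`, `total_cores`, and `summarize` are illustrative only, and the outcomes are copied from the description above rather than measured by the script.

```python
# Configurations and outcomes as described in this report:
# (nodes, cores per node, ran successfully?)
REPORTED = [
    (16, 16, False),  # fails
    (4, 64, False),   # fails (the log above is from this run)
    (4, 16, True),    # runs
]


def total_cores(nodes, cores_per_node):
    """Total number of MPI processes (PEs) in non-SMP mode."""
    return nodes * cores_per_node


def summarize(runs):
    """Return (nodes, cores_per_node, total, ok) for each reported run."""
    return [(n, c, total_cores(n, c), ok) for n, c, ok in runs]


if __name__ == "__main__":
    for n, c, total, ok in summarize(REPORTED):
        status = "runs" if ok else "segfault"
        print(f"{n:>2} nodes x {c:>2} cores/node = {total:>3} processes: {status}")
```

Both failing runs total 256 processes and the working run totals 64, which is consistent with the failure tracking the total process count rather than either factor alone.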
