mpi-linux-arm8 build crashes on > 128 cores for hello/3darray #3840

Open
jobordner opened this issue Sep 27, 2024 · 0 comments


I cannot get any run with more than about 128 total cores to work on TACC's Vista Grace-Hopper system, including the hello/3darray test problem. The failure does not depend on the number of nodes or the cores per node individually: e.g. 16 nodes with 16 cores/node fails, as does 4 nodes with 64 cores/node, but 4 nodes with 16 cores/node runs. I'm using v8.0.1 with the "mpi-linux-arm8" build. The outcome is similar with both GNU and NVIDIA compilers. Output for the 4-node, 64 cores/node configuration is shown below:

Running as 256 OS processes: ./hello 256
charmrun> /usr/bin/setarch aarch64 -R mpirun -np 256 ./hello 256
Charm++> Running on MPI library: Open MPI v5.0.5, package: Open MPI [email protected] Distribution, ident: 5.0.5, repo rev: v5.0.5, Jul 22, 2024 (MPI standard: 3.1)
Charm++> Level of thread support used: MPI_THREAD_SINGLE (desired: MPI_THREAD_SINGLE)
Charm++> Running in non-SMP mode: 256 processes (PEs)
Converse/Charm++ Commit ID: v8.0.1
Charm++ built without optimization.
Do not use for performance benchmarking (build with --with-production to do so).
Charm++ built with internal error checking enabled.
Do not use for performance benchmarking (build without --enable-error-checking to do so).
[i618-032:899711:0:899822] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xa00028010060)
[i618-032:899684:0:899805] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xa00028010060)
[i618-032:899704:0:899791] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xa00028010060)
==== backtrace (tid: 899791) ====
 0  /opt/apps/ucx/1.17.0/lib/libucs.so.0(ucs_handle_error+0x288) [0x40003765eb98]
 1  /opt/apps/ucx/1.17.0/lib/libucs.so.0(+0x2ece8) [0x40003765ece8]
 2  /opt/apps/ucx/1.17.0/lib/libucs.so.0(+0x2f07c) [0x40003765f07c]
 3  linux-vdso.so.1(__kernel_rt_sigreturn+0) [0x4000363707f0]
 4  /opt/apps/nvidia24/openmpi/5.0.5/lib/libpmix.so.2(pmix_gds_shmem2_fetch+0xf0) [0x400037873530]
 5  /opt/apps/nvidia24/openmpi/5.0.5/lib/libpmix.so.2(+0x638b8) [0x4000377538b8]
 6  /opt/apps/nvidia24/openmpi/5.0.5/lib/libevent_core-2.1.so.7(+0x24244) [0x400037984244]
 7  /opt/apps/nvidia24/openmpi/5.0.5/lib/libevent_core-2.1.so.7(+0x23954) [0x400037983954]
 8  /opt/apps/nvidia24/openmpi/5.0.5/lib/libevent_core-2.1.so.7(event_base_loop+0x1d4) [0x40003797cf14]
 9  /opt/apps/nvidia24/openmpi/5.0.5/lib/libpmix.so.2(+0xb5b2c) [0x4000377a5b2c]
10  /lib64/libc.so.6(+0x82a38) [0x400037002a38]
11  /lib64/libc.so.6(+0x2bb9c) [0x400036fabb9c]
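
To make the pattern in the reported configurations explicit, here is a small sketch that tabulates them by total rank count. The three configurations and their outcomes come from the description above; the threshold of 128 is only the approximate figure given there ("about 128"), not a measured boundary, and the names `REPORTED`, `APPROX_LIMIT`, and `summarize` are just illustrative.

```python
# Configurations reported in this issue: (nodes, cores_per_node, ran_ok)
REPORTED = [
    (16, 16, False),  # fails
    (4, 64, False),   # fails (the log above is from this run)
    (4, 16, True),    # runs
]

# "About 128" per the report; an assumed, approximate threshold.
APPROX_LIMIT = 128


def total_ranks(nodes, cores_per_node):
    """Total number of MPI processes (PEs in non-SMP mode)."""
    return nodes * cores_per_node


def summarize(configs, limit=APPROX_LIMIT):
    """Return (nodes, cores_per_node, total, over_limit, ran_ok) per config."""
    rows = []
    for nodes, cpn, ran_ok in configs:
        total = total_ranks(nodes, cpn)
        rows.append((nodes, cpn, total, total > limit, ran_ok))
    return rows


if __name__ == "__main__":
    for nodes, cpn, total, over, ran_ok in summarize(REPORTED):
        print(f"{nodes:>3} nodes x {cpn:>3} cores/node = {total:>4} ranks  "
              f"over_limit={over}  ran_ok={ran_ok}")
```

In all three reported cases, a run succeeds exactly when the total rank count is at or below the approximate limit, which is consistent with the failure tracking total process count and not the node or per-node counts.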