Endpoint Resource Exhaustion in ULT Mode #1127

Open
markbrown314 opened this issue May 15, 2024 · 0 comments
@markbrown314
Collaborator

When running a ULT job with 1024 PEs across 16 nodes and 8 ABT threads, SOS fails to initialize the transport endpoint.

This was seen with, e.g., isx_micro.
The following warning is printed:
[0132] WARN: transport_ofi.c:621: bind_enable_ep_resources
[0132] fi_enable on endpoint failed
[0132] WARN: transport_ofi.c:1430: shmem_transport_ofi_ctx_init
[0132] context bind/enable endpoint failed (No space left on device)

The job hangs afterwards.

Parameters:
PMI_MAX_KVS_ENTRIES=10000000
SHMEM_SYMMETRIC_SIZE=6G
SHMEM_ADAPTIVE_THREAD_SCHEDULE=1
FI_PROVIDER=cxi
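
As a rough way to reason about the job geometry above, here is a back-of-the-envelope sketch in Python. It assumes the 8 ABT threads are per PE and that each thread ends up with its own transport context (and therefore its own endpoint). Neither assumption is confirmed by the report, so treat the result as an estimate of demand, not a measurement.

```python
# Job geometry taken from the report.
total_pes = 1024
nodes = 16
abt_threads = 8  # ASSUMPTION: per PE; the report does not say

# PEs are assumed to be spread evenly across nodes.
pes_per_node = total_pes // nodes

# ASSUMPTION: one transport context (endpoint) per ABT thread per PE.
contexts_per_node = pes_per_node * abt_threads

print(f"PEs per node: {pes_per_node}")
print(f"Estimated contexts per node: {contexts_per_node}")
```

If those assumptions hold, comparing the estimate with the number of endpoints the provider can actually enable on one node would show whether the "No space left on device" error is simple oversubscription.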
