incorrect/inconsistent core pinning with OpenMPI #179
@boegel I read the manpage again, and then a few more times; I also found https://stackoverflow.com/questions/28216897/syntax-of-the-map-by-option-in-openmpi-mpirun-v1-8
It seems that the resulting pinning differs per cluster and per core count:

On swalot:
- 3 CPUs: …
- 9 CPUs: …

On skitty:
- 8 CPUs: …
- with …: 8 CPUs: …
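For reference, the actual bindings can be checked directly with mpirun's --report-bindings option (a minimal sketch; the -np values are placeholders for the core counts above, and the exact default mapping differs per OpenMPI version and build):

```
# Show the bindings Open MPI decides on by default (printed to stderr),
# then compare with explicit mapping/binding choices.
# The -np values are placeholders for the core counts tested above.
mpirun -np 8 --report-bindings true
mpirun -np 8 --map-by core   --bind-to core --report-bindings true
mpirun -np 8 --map-by socket --bind-to core --report-bindings true
```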
After spending quite a bit of time on this issue this week, here's what I've figured out so far when using OpenMPI (…).

The problem is basically two-fold with the current version of …:

Core pinning
By default, OpenMPI does … This default isn't very good when …

Changes we should make here: …
Process placement (a.k.a. "mapping" in OpenMPI terms)
This is the biggest problem currently... When … That looks OK, but the core assignment (mapping) is done sequentially by NUMA domain per node, so ranks are not spread properly across sockets within the same node... This gets worse in combination with the … So we definitely shouldn't blindly use … Doing this properly probably requires an adaptive strategy again, based on the value for … With the …
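To make the difference concrete: the round-robin-by-NUMA placement described above versus a sequential fill can be requested explicitly on the command line (a sketch using standard mpirun map/bind options, not the mympirun change itself):

```
# Round-robin placement over NUMA domains (the behaviour described above):
mpirun -np 8 --map-by numa --bind-to core --report-bindings true
# Sequential filling of cores instead:
mpirun -np 8 --map-by core --bind-to core --report-bindings true
```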
@boegel Be aware that bind-to-numa on EPYC doesn't need to mean what you think it means. On doduo, the NUMA domain is actually the L2 cache, and that makes much more sense than having part of the socket as a NUMA domain. I'm afraid we will need some more default options so people can choose the one that makes most sense. (To find that out, users need to make a communication map, like the IPM communication topology map, to see how the ranks communicate, so they can decide on the best placement.)
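As a side note, the actual NUMA and cache layout of a node can be inspected with the standard numactl/hwloc tools (assuming they are available on the compute nodes), which helps to see what --bind-to numa would mean there:

```
# Print the NUMA layout (node count, CPUs per node, memory per node):
numactl --hardware
# Full topology including caches, as hwloc sees it (without I/O devices):
lstopo-no-graphics --no-io
```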
With PMI, you can use Slurm's binding control, cfr. https://slurm.schedmd.com/mc_support.html
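For example (a sketch, assuming the application is launched via srun with PMI/PMIx support; ./your_mpi_app is a placeholder):

```
# Let Slurm do the pinning: bind each task to cores, fill nodes and
# sockets in blocks, and print the resulting binding (verbose).
srun --ntasks=8 --cpu-bind=verbose,cores --distribution=block:block ./your_mpi_app
```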
@boegel What we really need now is a way to get sequential mapping, where each rank is pinned "next to" the previously pinned rank. This pinning is the default for Intel MPI (at least when used with mympirun).
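For reference, one explicit way to spell that out with Intel MPI's documented pinning variables (the exact values mympirun sets may differ; ./your_mpi_app is a placeholder):

```
# Pin each rank to one core, placing ranks on consecutive cores ("compact").
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=core
export I_MPI_PIN_ORDER=compact
# I_MPI_DEBUG=4 makes Intel MPI print the pinning map it ends up with.
I_MPI_DEBUG=4 mpirun -np 8 ./your_mpi_app
```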
@stdweird I'm guessing that needs to be under control of a specific … For Open MPI, it probably boils down to … That shouldn't be the default I guess, at least not when …
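A sketch of what that could look like with plain mpirun options (whether this should become the default is exactly the open question above; the application names are placeholders):

```
# Pure MPI: fill cores sequentially, one rank per core.
mpirun -np 8 --map-by core --bind-to core --report-bindings ./your_mpi_app
# Hybrid MPI+OpenMP: reserve 4 cores per rank while still filling sequentially.
mpirun -np 4 --map-by slot:PE=4 --bind-to core --report-bindings ./your_hybrid_app
```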
At least the behaviour between Intel MPI and OpenMPI should be the same.
@boegel Even for hybrid, you want sequential blocks of MPI ranks per e.g. NUMA domain. What OpenMPI does so differently is the round-robin placement of the ranks. Placement should be: split into sequential blocks per node.
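A hedged sketch of such a placement using mpirun's ppr syntax (4 ranks per NUMA domain with 3 cores each is just an example geometry, and ./your_hybrid_app is a placeholder; verify the result with --report-bindings):

```
# 4 ranks per NUMA domain, 3 cores per rank, ranks numbered in blocks
# (example geometry only; check the actual layout with --report-bindings).
mpirun --map-by ppr:4:numa:PE=3 --bind-to core --report-bindings ./your_hybrid_app
```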
@stdweird I'm happy to get back to this for hybrid (along with checking several scenarios like …), covering the combinations below:
2x18-core Intel Xeon Gold 6140 (skitty, Skylake):
- foss/2019b (OpenMPI 3.1.4)
- foss/2020a (OpenMPI 4.0.3)
- intel/2019b or intel/2020a

2x48-core AMD EPYC 7552 (doduo, Zen2):
- foss/2019b (OpenMPI 3.1.4) + foss/2020a (OpenMPI 4.0.3)
- intel/2019b or intel/2020a
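A quick way to verify the resulting pinning for each of these combinations is to launch a small shell command instead of the real application and print each rank's CPU affinity (OMPI_COMM_WORLD_RANK is Open MPI-specific; Intel MPI sets PMI_RANK instead):

```
# Print, per rank, the node it runs on and the cores it may use.
mpirun -np 8 bash -c 'echo "rank ${OMPI_COMM_WORLD_RANK:-$PMI_RANK} @ $(hostname): $(taskset -cp $$ | cut -d: -f2)"'
```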