dask-mpi fails with wheel packaging #83

Open
mahendrapaipuri opened this issue Dec 2, 2021 · 18 comments
Comments

@mahendrapaipuri

mahendrapaipuri commented Dec 2, 2021

Using pip install dask-mpi

$ pip install dask-mpi
$ mpirun -np 2 dask-mpi --name=test-worker --nthreads=1 --memory-limit=0 --scheduler-file=test.json
distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at: tcp://172.16.66.109:36539
distributed.scheduler - INFO -   dashboard at:                     :8787
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.66.109:36297'
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  getting local rank failed
  --> Returned value No permission (-17) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value No permission (-17) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "No permission" (-17) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)

Using python setup.py install

$ python setup.py install
$ mpirun -np 2 dask-mpi --name=test-worker --nthreads=1 --memory-limit=0 --scheduler-file=test.json
distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at: tcp://172.16.66.109:44933
distributed.scheduler - INFO -   dashboard at:                     :8787
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.66.109:42437'
distributed.diskutils - INFO - Found stale lock file and directory '/home/mpaipuri/downloads/dask-mpi/dask-worker-space/worker-6h2hf4i6', purging
distributed.worker - INFO -       Start worker at:  tcp://172.16.66.109:37893
distributed.worker - INFO -          Listening to:  tcp://172.16.66.109:37893
distributed.worker - INFO -          dashboard at:        172.16.66.109:45119
distributed.worker - INFO - Waiting to connect to:  tcp://172.16.66.109:44933
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -       Local Directory: /home/mpaipuri/downloads/dask-mpi/dask-worker-space/worker-t48hj0dc
distributed.worker - INFO - -------------------------------------------------
distributed.scheduler - INFO - Register worker <WorkerState 'tcp://172.16.66.109:37893', name: rascil-worker-1, status: undefined, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://172.16.66.109:37893
distributed.core - INFO - Starting established connection
distributed.worker - INFO -         Registered to:  tcp://172.16.66.109:44933
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection

What happened: Installing dask-mpi from a wheel (pip install) fails, but it works normally with the egg install (python setup.py install). I tested this on 2 different systems and observed the same behaviour on both.

What you expected to happen: To work with both packaging methods

Anything else we need to know?: The only difference between the two approaches is the generated dask-mpi command-line executable.

  • Dask version: 2021.11.2
  • Python version: 3.8
  • Operating System: Debian 11
  • Install method (conda, pip, source): pip and source
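Since the generated dask-mpi executable is reported as the only difference, one way to compare the two installs is to print the launcher script each one produced. This is a minimal sketch, assuming the script is on PATH in the active environment; `read_console_script` is a hypothetical helper name and not part of dask-mpi.

```python
import shutil
from pathlib import Path


def read_console_script(name, search_path=None):
    """Return the text of the launcher script found for `name`, or None.

    `search_path` is an optional PATH-style string. By default the current
    PATH is searched, so activate the environment under test first.
    """
    location = shutil.which(name, path=search_path)
    if location is None:
        return None
    # Launchers generated by pip and setuptools on Linux are small Python
    # text files; errors="replace" avoids raising on any other kind of file.
    return Path(location).read_text(errors="replace")


if __name__ == "__main__":
    # Prints None if dask-mpi is not installed in this environment.
    print(read_console_script("dask-mpi"))
```

Running it once after each install method and diffing the two outputs shows exactly how the launchers differ.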
@mahendrapaipuri
Author

@kmpaul Could you please look into it when you have time? Cheers!

@kmpaul
Collaborator

kmpaul commented Dec 6, 2021

@mahendrapaipuri: Yes. I've been very busy these days, and I am currently locked out of PyPI (a fiasco transferring my settings to a new phone left my 2FA out of sync), so I cannot take a look. From the looks of it, I may not have access to PyPI again for months, since they are notoriously slow in resetting people's 2FA. Perhaps @andersy005 or @jacobtomlinson could help you.

@kmpaul
Collaborator

kmpaul commented Dec 7, 2021

FYI: This does not seem to be reproducible on a Mac. I'll try to reproduce the bug in a Linux Docker container.

@mahendrapaipuri: Can I ask you how you install the dependencies for Dask-MPI? Are you installing Dask-Distributed with conda? How are you installing mpi4py? In total, describe to me the steps that come before the first step you describe above (pip install dask-mpi) to set up your Python environment.

@mahendrapaipuri
Author

mahendrapaipuri commented Dec 7, 2021

Hello @kmpaul Thanks for taking a look. I created a bare conda environment and installed everything using pip. So the steps are

conda create -n test python=3.8 -y
conda activate test
pip install "dask[complete]"
pip install mpi4py dask-mpi

That's pretty much how I created my environment. Please let me know if I am missing any more details!
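For a report like this it can also help to record which versions pip actually resolved in that environment. A minimal sketch using only the standard library (Python 3.8+); `installed_versions` is a hypothetical helper name, and the package list simply mirrors the install steps above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # Not installed in the interpreter running this script.
            found[name] = None
    return found


if __name__ == "__main__":
    packages = ["dask", "distributed", "mpi4py", "dask-mpi"]
    for name, ver in installed_versions(packages).items():
        print(f"{name}: {ver}")
```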

@kmpaul
Collaborator

kmpaul commented Dec 7, 2021

@mahendrapaipuri: Thanks! I'll see what I can uncover in a Docker container.

@kmpaul
Collaborator

kmpaul commented Dec 7, 2021

And just to be clear, you are using a system install of OpenMPI and not installing OpenMPI with Conda, correct?

@mahendrapaipuri
Author

Well, I have tested with both a system-installed OpenMPI and a Spack-installed OpenMPI. Both gave me the same errors. But no, I did not use a conda-built OpenMPI.

@kmpaul
Collaborator

kmpaul commented Dec 7, 2021

Ok. Then I'll try to diagnose the issue with system-installed OpenMPI (and possibly test with a Conda-installed OpenMPI, too).

@kmpaul
Collaborator

kmpaul commented Dec 7, 2021

I can confirm the bug exists on Debian in a Docker container. I'm investigating why the wheel and egg installs yield different behaviors.

@mahendrapaipuri
Author

@kmpaul, I have tested on CentOS 8 too and ended up with the same issue. I am not an expert in Python packaging, but what I noticed is that the only difference between the two approaches is the generated dask-mpi binary file.

@kmpaul
Collaborator

kmpaul commented Dec 7, 2021

I'll take a look at that.

FYI: I've noticed that with the pip install dask-mpi version (i.e., the one that fails), it works if you disable the use of Nannies. That is, if you change your CLI command to:

$ mpirun -np 2 dask-mpi --name=test-worker --nthreads=1 --memory-limit=0 --scheduler-file=test.json --worker-class distributed.Worker

it starts up correctly.

@kmpaul
Collaborator

kmpaul commented Dec 7, 2021

@mahendrapaipuri: Thanks for the tip! I can verify that regardless of how you install dask-mpi, if you use the binary created by the egg-install, it works. So, now to get into why the egg-installed entry point works and the wheel-installed entry point does not.

@mahendrapaipuri
Author

@kmpaul Precisely!! I have noticed that too. I am quite curious why the binary from the egg install works and the one from the wheel does not. I ran out of ideas on how to debug it when I was digging into it!! Thanks again for taking the time to look into it.

@kmpaul
Collaborator

kmpaul commented Dec 7, 2021

Ok. I've looked a little deeper and I've been able to simplify the egg-installed dask-mpi binary to take the following form (eliminating all unused try branches and unnecessary functions when the input is fixed):

#!/root/miniconda3/envs/test/bin/python
import re
import sys
from importlib.metadata import distribution

if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw?|\.exe)?$', '', sys.argv[0])
    go = distribution('dask-mpi').entry_points[0].load()
    sys.exit(go())

This works, as expected. However, if you simply make go a globally defined symbol, like so:

#!/root/miniconda3/envs/test/bin/python
import re
import sys
from importlib.metadata import distribution
go = distribution('dask-mpi').entry_points[0].load()

if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw?|\.exe)?$', '', sys.argv[0])
    sys.exit(go())

it now fails in the same way that the wheel-installed binary fails. This is not surprising since the wheel-installed binary looks like:

#!/root/miniconda3/envs/test/bin/python
# -*- coding: utf-8 -*-
import re
import sys
from dask_mpi.cli import go
if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    sys.exit(go())

So, this has to do with scope. I'll dig a little further to see why.
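The simplified egg-style script resolves `go` at run time through `importlib.metadata` and picks the entry point by position (`entry_points[0]`), whereas the wheel-style script imports it statically. As an illustration of the run-time lookup only, here is a sketch that selects the entry point by group and name instead of by index. `find_console_script` is a hypothetical helper, and the demo entry points are stand-ins (pointing at the standard `json` module) so the block runs without dask-mpi installed.

```python
from importlib.metadata import EntryPoint


def find_console_script(entry_points, name):
    """Return the console_scripts entry point called `name`, or None."""
    for ep in entry_points:
        if ep.group == "console_scripts" and ep.name == name:
            return ep
    return None


# Stand-in entry points for illustration. For a real install, pass
# importlib.metadata.distribution("dask-mpi").entry_points instead.
demo = [
    EntryPoint(name="other", value="json:loads", group="demo.plugins"),
    EntryPoint(name="demo-cli", value="json:dumps", group="console_scripts"),
]

# load() imports the module and returns the attribute named after the colon.
main_func = find_console_script(demo, "demo-cli").load()
```

The import of the target module happens inside `load()`, so where that call sits in the script decides when the import runs.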

@kmpaul
Collaborator

kmpaul commented Dec 7, 2021

As you might expect, if you change the wheel-installed binary to look like this:

#!/root/miniconda3/envs/test/bin/python
# -*- coding: utf-8 -*-
import re
import sys
if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    from dask_mpi.cli import go
    sys.exit(go())

It works.
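The thread has not established the cause at this point, so the following is only an illustration of one general Python behaviour that distinguishes the two layouts. Top-level statements run whenever the file is executed, under any module name, while the guarded block runs only when `__name__` is `"__main__"`. Python's `multiprocessing` spawn start method re-runs the parent's main script in each child under the name `"__mp_main__"`. Whether that is what happens here is an assumption, not something confirmed above. The script text and `run_as` helper are hypothetical.

```python
import runpy
import tempfile
from pathlib import Path

# A tiny stand-in for an entry-point script: one statement at module level,
# one statement behind the usual main guard.
SCRIPT = '''
events.append("top-level")
if __name__ == "__main__":
    events.append("guarded")
'''


def run_as(run_name):
    """Execute SCRIPT with __name__ set to `run_name`; return what ran."""
    events = []
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "entry_point.py"
        path.write_text(SCRIPT)
        # init_globals makes the `events` list visible inside the script.
        runpy.run_path(str(path), init_globals={"events": events}, run_name=run_name)
    return events


if __name__ == "__main__":
    print(run_as("__main__"))     # both statements execute
    print(run_as("__mp_main__"))  # only the top-level statement executes
```

If the Nanny starts its worker in a freshly spawned interpreter (again, an assumption here), a top-level `from dask_mpi.cli import go` would execute again in that child, while the same import placed under the guard would not.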

@kmpaul
Collaborator

kmpaul commented Dec 8, 2021

@mahendrapaipuri: I've spent the day looking into this, and I cannot figure it out. I don't know why one entry point script should work and the other not work. It is a mystery to me.

@mahendrapaipuri
Author

@kmpaul Thanks a lot for looking into it. It is something of a mystery to me as well. I hope someone else can figure it out.

@kmpaul
Collaborator

kmpaul commented Dec 8, 2021

I have another idea, and if I have time today, I am going to look into it. I'll let you know.
