
Make Darshan work #35

Open · lyon-fnal opened this issue Dec 14, 2020 · 4 comments

@lyon-fnal (Owner) commented Dec 14, 2020

Try Darshan with Julia. It loads via LD_PRELOAD. We may need to change HDF5.jl and MPI.jl so that they don't explicitly give the library name in `ccall`.
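
For context, a hedged sketch of the difference that ccall change targets (the library name "libmpich.so" is only an illustrative stand-in for whatever MPI.jl resolves on a given system):

using Libdl

flag = Ref{Cint}(0)

# With an explicit library, ccall dlopens/dlsyms that specific .so, so an
# LD_PRELOAD'ed interposer such as libdarshan.so is bypassed:
ccall((:MPI_Initialized, "libmpich.so"), Cint, (Ref{Cint},), flag)

# Without the library name, the symbol is resolved process-wide (respecting
# LD_PRELOAD order), provided the real library was already opened with
# RTLD_GLOBAL, e.g. in the package's __init__:
Libdl.dlopen("libmpich.so", Libdl.RTLD_LAZY | Libdl.RTLD_GLOBAL)
ccall(:MPI_Initialized, Cint, (Ref{Cint},), flag)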

lyon-fnal self-assigned this Dec 14, 2020
@lyon-fnal (Owner) commented Dec 29, 2020

Problem in MPI.jl: the shared object needs to be loaded with Libdl.dlopen in MPI.__init__, because loading the .so is a runtime operation. But near the top of src/MPI.jl there's an include("implementations.jl"), which defines Get_library_version (a ccall to MPI_Get_library_version). That would be fine, except that the same file also calls Get_library_version() at top level. Since that runs before MPI.__init__, the library isn't loaded yet and the call fails. Fix: move the call into the __init__ function and make MPI_LIBRARY_VERSION_STRING a Ref. Do the same for Get_version further down in implementations.jl.
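
A minimal sketch of that pattern, with simplified names (this is not MPI.jl's actual code, and "libmpich.so" is a placeholder):

module MPISketch

using Libdl

const libmpi = "libmpich.so"                        # placeholder; MPI.jl resolves this per system
const MPI_LIBRARY_VERSION_STRING = Ref{String}("")  # now a Ref, filled in at runtime

function Get_library_version()
    buf = Vector{UInt8}(undef, 8192)
    len = Ref{Cint}(0)
    ccall((:MPI_Get_library_version, libmpi), Cint, (Ptr{UInt8}, Ref{Cint}), buf, len)
    return String(buf[1:len[]])
end

function __init__()
    # runtime: load the shared object and make its symbols visible process-wide
    Libdl.dlopen(libmpi, Libdl.RTLD_LAZY | Libdl.RTLD_GLOBAL)
    # only now is it safe to call into the library
    MPI_LIBRARY_VERSION_STRING[] = Get_library_version()
end

end # module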

In fact, I think all of implementations.jl (or at least all of its top-level calls) needs to be moved into the __init__ function.

Actually, scratch the work on implementations.jl: all it does is get versions and implementation types, so it's fine to do those ccalls with the explicit library name; that stuff should not be overridden anyway.

@lyon-fnal (Owner) commented Dec 31, 2020

Darshan works on NERSC Cori Haswell! Here's a summary of what I did...

  • I have my own build of Darshan 3.2.1 against cray-hdf5-parallel/1.12.0.0 and cray-mpich/7.7.10 (see here for configure and build instructions). The configure line includes --enable-hdf5-mod=$HDF5_DIR.
  • The Darshan build includes a module specification. Activate it with,
module rm darshan   # Unload the default Cori Darshan (currently an old 3.1.7)
module use $HOME/apps.cori-hsw/darshan-3.2.1/share/craype-2.x/modulefiles
module load darshan
  • Build my modified MPI.jl and HDF5.jl against system libraries (see above).
  • Run a simple script with srun --export=ALL,LD_PRELOAD=libdarshan.so julia --project tryit.jl (needs to run in batch or on a Cori compute node obtained with salloc); a sketch of such a script follows this list.
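
Here's a hypothetical sketch of what a minimal tryit.jl could look like (the real script isn't shown in this issue). It assumes the modified MPI.jl/HDF5.jl builds above and just does enough parallel HDF5 I/O for Darshan to record something:

using MPI, HDF5

MPI.Init()
comm  = MPI.COMM_WORLD
rank  = MPI.Comm_rank(comm)
nrank = MPI.Comm_size(comm)

# open the file with the MPI-IO driver; each rank writes its own row
h5open("tryit.h5", "w", comm, MPI.Info()) do f
    dset = create_dataset(f, "x", datatype(Float64), dataspace((nrank, 10)))
    dset[rank + 1, :] = fill(Float64(rank), 10)
end

MPI.Finalize()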

Things to do:

  • Submit PRs for MPI.jl #450 and HDF5.jl #791
  • Try with a more complex script
  • Try with the package compiler (I don't think that will make a difference)

lyon-fnal added a commit that referenced this issue Jan 24, 2021
@lyon-fnal (Owner) commented

So I have this working with HDF5.jl. I've altered energyByCal.jl to make collective reads an option, but even with it turned on, Darshan reports no collective reads (MPIIO_COLL_READS is 0), and that's with H5D_USE_MPIIO_COLLECTIVE true for each column. Not sure what's going on there; it seems like HDF5 itself decides whether or not to actually use collective reads.
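
For reference, a hedged sketch of roughly how the collective option is requested (file and dataset names are made up, and it assumes the dxpl_mpio=:collective transfer-property keyword that HDF5.jl's parallel examples use with create_dataset is also accepted when opening a dataset):

using MPI, HDF5

MPI.Init()
comm = MPI.COMM_WORLD

h5open("energyByCal.h5", "r", comm, MPI.Info()) do f
    # ask for collective MPI-IO transfers on this dataset handle
    dset = open_dataset(f, "energy"; dxpl_mpio = :collective)
    # HDF5 may still break the request down into independent I/O internally,
    # which would leave Darshan's MPIIO_COLL_READS counter at 0
    chunk = dset[:, 1:10]
end

MPI.Finalize()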

mpio.jl DOES do collective reads. For that, each rank reads from its own chunk.

Trying mpio2.jl where ranks will not all read from their own chunk.

@lyon-fnal (Owner) commented

mpio2.jl (which is now a nice example of collective writes) does read many chunks. And it's doing a collective read (one per rank). Not sure why energyByCal.jl does no collective reads. Maybe this doesn't really matter.
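
A hedged sketch in the spirit of mpio2.jl (illustrative only; the real script lives in this repo): a chunked dataset written collectively, then read back so each rank pulls a chunk it did not write:

using MPI, HDF5

MPI.Init()
comm  = MPI.COMM_WORLD
rank  = MPI.Comm_rank(comm)
nrank = MPI.Comm_size(comm)
M     = 100

h5open("mpio2_sketch.h5", "w", comm, MPI.Info()) do f
    # one chunk per rank; dxpl_mpio=:collective requests collective MPI-IO transfers
    dset = create_dataset(f, "data", datatype(Float64), dataspace((M, nrank));
                          chunk = (M, 1), dxpl_mpio = :collective)
    dset[:, rank + 1] = fill(Float64(rank), M)   # each rank writes its own column/chunk
end

h5open("mpio2_sketch.h5", "r", comm, MPI.Info()) do f
    dset = open_dataset(f, "data"; dxpl_mpio = :collective)
    neighbour = mod(rank + 1, nrank) + 1         # a column some other rank wrote
    col = dset[:, neighbour]                     # read it back, requesting collective I/O
end

MPI.Finalize()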

lyon-fnal added a commit that referenced this issue Mar 12, 2021