Test failure in pnetcdf with netCDF 4.8.0 #2038
Comments
Thanks! I’ll take a look :) |
I am traveling so won't be able to dig into this today, but I'll take a look when I get back to the 'office'. Also tagging the author of the MPI code, @edhartnett, in case an obvious answer leaps out! Thanks again @ArchangeGabriel! |
Was this solved in 4.8.1 (I can’t test right now because of #2085)? |
@DennisHeimbigner I think you wanted to post that in #2085. ;) |
So it’s not fixed. |
netCDF 4.8 has been stuck in the [staging] repository for two months on Arch and this is blocking other updates, so I’ve been asked to do something about it. |
Revisiting this now, leaving a comment so that it rises to the top of my queue. |
So it bears further testing, but in a new debug environment with pnetcdf |
So it seems I'm going to need a little more information to reproduce this. In my environments (Ubuntu, mpicc is built for |
As always with my issues, you can find the version of any packages on https://archlinux.org/packages/. In this case, this is |
I don't think the MPICH/OpenMPI version or gcc version matters. |
I will try installing openmpi and see if the issue occurs; I do not observe any failures with |
Did you run it with more than one MPI process? |
@wkliao Yes, I'm executing via |
Could you please add the following lines to your
Here is what I am getting.
|
We may be on to something here as I am receiving an error when using |
The output I'm observing is as follows, using
|
Interesting, seeing inconsistent behavior now. I'll follow up after I pursue this further, but it appears that the issue may also now be manifesting in the mpich2-based VM image. Frustrating, as it very much wasn't before, but that's the nature of software development I guess. Stand by as we sort this out. |
OK, I've been taking a look at this and indeed this seems to be a bug in netcdf parallel I/O code. The problem is here:
Obviously the dimlen does not end up with the correct value: it is updated on each processor, but the result is not propagated to the other processors. I will take a look at what can be done. To start with, I have converted the test from the pnetcdf issue into a netcdf test, tst_parallel6.c. I will update here when I make more progress... |
OK, the problem is in libhdf5/hdf5internal.c:
OK, so H5Dget_space() is returning local information. I have tried using H5fsync() to sync the file, but this did not work either. I don't know any way to force HDF5 to update this information except by closing and reopening the file. And that's no good. So what should happen here? @wkliao suggestions are welcome... |
In PnetCDF, we call MPI_Allreduce to get the maximal record number among all writing processes if the I/O mode is collective. If it is in independent mode, then the consistency on record number is not guaranteed. |
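To illustrate the reduction described in the comment above, here is a minimal sketch in plain C. It is not PnetCDF's code and not the fix proposed for netCDF; the helper name `agreed_record_count` and the array of per-rank counts are hypothetical, and the loop only simulates what a max-reduction across processes would compute. In a real MPI program each rank would hold a single local count and the loop would be replaced by one collective `MPI_Allreduce` call with `MPI_MAX`, as noted in the code comment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (an assumption for illustration, not a library API):
 * given the number of records each rank sees locally, return the count
 * that all ranks should agree on, i.e. the maximum over all ranks.
 *
 * In an actual MPI program this loop corresponds to a collective call:
 *   MPI_Allreduce(&local, &global, 1,
 *                 MPI_UNSIGNED_LONG_LONG, MPI_MAX, comm);
 */
unsigned long long agreed_record_count(const unsigned long long *local_counts,
                                       size_t nranks)
{
    assert(local_counts != NULL && nranks > 0);

    unsigned long long max = local_counts[0];
    for (size_t r = 1; r < nranks; r++) {
        if (local_counts[r] > max)
            max = local_counts[r];
    }
    return max;
}
```

For example, if four ranks locally see 2, 5, 3 and 5 records, every rank would adopt 5 as the length of the record dimension after the reduction. As the comment above points out, this only works in collective mode, since every process has to take part in the reduction.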
OK I will take a stab at this tomorrow morning... |
OK I have a PR up with a fix. (#2310). Feedback welcome! |
While rebuilding pnetcdf for netCDF 4.8.0, I’ve encountered a test failure, reported at Parallel-NetCDF/PnetCDF#72.
@wkliao thinks this is an issue with netCDF as he explained in the linked ticket. My MPI/netCDF is rusty, so I trust him about that and decided to open a ticket here. ;)