forked from NOAA-GFDL/FMS
MPI_Type, MPI_Alltoallw, mpp_global_field update #5
Open: aidanheerdegen wants to merge 1 commit into `master` from `mpi_alltoallw`.
Conversation
Hey @marshallward, I am just merging this work you did into our FMS fork. Can you confirm that this is OK? I assume the code is fine, since it is what was accepted into the upstream FMS, but I guess we had to branch before it was merged in.

Not really in a good position to test it out, but it looks OK to me. If it's not breaking your runs then I suspect it's fine to merge.
This is the work of @marshallward
This patch contains three new features for FMS: support for MPI datatypes, an
MPI_Alltoallw interface, and modifications to `mpp_global_field` to use these
changes for select operations.
These changes were primarily made to improve stability of large (>4000
rank) MPI jobs under OpenMPI at NCI.
There are differences in the performance of mpp_global_field,
occasionally even very large differences, but there is no consistency
across various MPI libraries. One method will be faster in one library,
and slower in another, even across MPI versions. Generally, the
MPI_Alltoallw method showed improved performance on our system, but this
is not a universal result. We therefore introduce a flag to control
this feature.
The inclusion of MPI_Type support may also be seen as an opportunity to
introduce other new MPI features for other operations, e.g. halo
exchange.
Detailed changes are summarised below.
- MPI data transfer type ("MPI_Type") support has been added to FMS. This is
  done with the following features:
  - A `mpp_type` derived type has been added, which manages the type details
    and hides the MPI internals from the model developer. Types are managed
    inside an internal linked list, `datatypes`. Note: the name `mpp_type` is
    very similar to the preprocessor variable `MPP_TYPE_` and should possibly
    be renamed to something else, e.g. `mpp_datatype`.
  - `mpp_type_create` and `mpp_type_free` are used to create and release these
    types within the MPI library. These append and remove `mpp_type`s from the
    internal linked list, and include reference counters to manage duplicates.
  - A `mpp_byte` type is created as a module-level variable for default
    operations. Note: as the first element of the list, it also inadvertently
    provides access to the rest of `datatypes`, which is private, but there
    are probably ways to address this.
- An MPI_Alltoallw wrapper, using MPI_Types, has been added to the
  `mpp_alltoall` interface.
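The create/free semantics described above amount to a reference-counted type
registry. A minimal Python sketch, illustrative only (the actual
implementation is a Fortran linked list, and the lookup key used here is
hypothetical):

```python
# Sketch of the reference-counted datatype registry described above:
# mpp_type_create reuses an existing entry for a duplicate request and
# bumps its counter; mpp_type_free only removes the entry once the last
# reference is released. (The real FMS list is seeded with mpp_byte.)

class MppType:
    def __init__(self, key):
        self.key = key        # stands in for the (kind, subarray) signature
        self.refcount = 1

datatypes = []                # module-level linked list in the real code

def mpp_type_create(key):
    for t in datatypes:
        if t.key == key:      # duplicate: reuse and bump the counter
            t.refcount += 1
            return t
    t = MppType(key)          # new type: register it
    datatypes.append(t)
    return t

def mpp_type_free(t):
    t.refcount -= 1
    if t.refcount == 0:       # last reference: release and unlist
        datatypes.remove(t)

a = mpp_type_create(("i4", (4, 4)))
b = mpp_type_create(("i4", (4, 4)))   # same signature, same object
assert a is b and a.refcount == 2
mpp_type_free(a)
mpp_type_free(b)
assert datatypes == []
```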
- An implementation of `mpp_global_field` using MPI_Alltoallw and `mpp_type`s
  has been added. In addition to replacing the point-to-point operations with
  a collective, it also eliminates the need to use the internal MPP stack.
  Since MPI_Alltoallw requires that the input field be contiguous, it is only
  enabled for data domains (i.e. compute + halo). This limitation could be
  overcome, either by copying or by more careful attention to layout, but
  that can be addressed in a future patch. This method is enabled in the
  `mpp_domains_nml` namelist group by setting the `use_alltoallw` flag to
  `.true.`.
- Provisional interfaces to SHMEM and serial ("nocomm") builds have been
  added, although they are as yet untested and primarily meant as
  placeholders.
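For reference, enabling the new code path in a run's `input.nml` would look
something like the following (a sketch: the group and flag names are from the
description above, but the surrounding contents of the namelist group will
vary by configuration):

```fortran
! Excerpt from input.nml: route mpp_global_field through the
! MPI_Alltoallw implementation for supported (data-domain) calls.
&mpp_domains_nml
    use_alltoallw = .true.
/
```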
This patch also includes the following changes to support this work.
- In `get_peset`, the method used to generate MPI subcommunicators has been
  changed; specifically, `MPI_Comm_create` has been replaced with
  `MPI_Comm_create_group`. The former is blocking over all ranks, while the
  latter is blocking only over ranks in the subgroup. This was done to
  accommodate IO domains of a single rank, usually due to masking, which
  would result in no communication and cause a model hang. It seems that more
  recent changes in FMS related to handling single-rank communicators were
  made to prevent this particular scenario, but I still think it is more
  correct to use `MPI_Comm_create_group`, so I have left the change in. Note
  that `MPI_Comm_create_group` is an MPI 3.0 feature, so this might be an
  issue for older MPI libraries.
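The blocking distinction can be illustrated without MPI. In this Python
threading analogy (a loose sketch, not MPI code), the subgroup-only
synchronisation of `MPI_Comm_create_group` is modelled as a barrier sized to
the subgroup, so ranks outside it (e.g. a masked-out single-rank IO domain)
never need to participate; with an all-ranks barrier, standing in for the
collective `MPI_Comm_create`, a non-participating rank would hang everyone:

```python
# Analogy: MPI_Comm_create is collective over the *parent* communicator
# (a barrier over all ranks), so a rank that never calls it stalls the
# rest. MPI_Comm_create_group is collective only over the new subgroup,
# modelled here as a barrier sized to the subgroup.
import threading

N_RANKS = 4
SUBGROUP = {0, 1}                 # ranks forming the subcommunicator

# Barrier over the subgroup only: ranks 2 and 3 need not participate.
group_barrier = threading.Barrier(len(SUBGROUP))

done = []
lock = threading.Lock()

def rank(r):
    if r in SUBGROUP:
        group_barrier.wait()      # comm_create_group: subgroup-only sync
    with lock:
        done.append(r)

threads = [threading.Thread(target=rank, args=(r,)) for r in range(N_RANKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()                      # completes: ranks 2 and 3 never blocked
assert sorted(done) == [0, 1, 2, 3]
```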
- Logical interfaces have been added to `mpp_alltoall` and `mpp_alltoallv`.
- Single-rank PE checks in `mpp_alltoall` were removed to prevent model hangs
  with the subcommunicators.
- NULL_PE checks have been added to the original point-to-point
  implementation of `mpp_global_field`, although these may no longer be
  required due to changes in the subcommunicator implementation. This work
  was by Nic Hannah, and may actually be part of an existing pull request.
  (TODO: Check this!)
- Timer events have been added to `mpp_type_create` and `mpp_type_free`,
  although they are not yet initialized anywhere.
- The diagnostic field count was increased from 150 to 250, to support the
  current needs of researchers.