This patch contains three new features for FMS: support for MPI datatypes, an
MPI_Alltoallw interface, and modifications to mpp_global_field that use these
features for selected operations.
These changes were primarily made to improve stability of large (>4000
rank) MPI jobs under OpenMPI at NCI.
The two mpp_global_field implementations differ in performance, occasionally
by a very large margin, but not consistently across MPI libraries: a method
that is faster in one library can be slower in another, and results vary even
across versions of the same library. The MPI_Alltoallw method generally
performed better on our system, but this is not a universal result, so the
feature is controlled by a flag.
The inclusion of MPI_Type support may also open the door to using newer MPI
features in other operations, e.g. halo exchange.
Detailed changes are summarised below.
- MPI data transfer type ("MPI_Type") support has been added to FMS. This
  comprises the following features:
- A `mpp_type` derived type has been added, which manages the type details
  and hides the MPI internals from the model developer. Types are managed
  inside an internal linked list, `datatypes`.
Note: The name `mpp_type` is very similar to the preprocessor variable
`MPP_TYPE_` and should possibly be renamed, e.g. to `mpp_datatype`.
- `mpp_type_create` and `mpp_type_free` are used to create and release these
  types within the MPI library. They append `mpp_type`s to, and remove them
  from, the internal linked list, and use reference counters to manage
  duplicates (see the datatype-list sketch following this list).
- An `mpp_byte` type is created as a module-level variable for default
  operations.
Note: As the first element of the list, it also inadvertently provides
access to the rest of `datatypes`, which is private, but there are probably
ways to address this.
- An MPI_Alltoallw wrapper, using MPI_Types, has been added to the mpp_alltoall
  interface.
- An implementation of mpp_global_field using MPI_Alltoallw and mpp_types has
  been added (see the MPI_Alltoallw sketch following this list). In addition
  to replacing the point-to-point operations with a collective, it also
  eliminates the need to use the internal MPP stack.
  Since MPI_Alltoallw requires that the input field be contiguous, it is only
  enabled for data domains (i.e. compute + halo). This limitation could be
  overcome, either by copying or by more careful attention to layout, but
  that is left for a future patch.
  This method is enabled in the `mpp_domains_nml` namelist group by setting
  the `use_alltoallw` flag to `.true.`.
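
The following is a minimal, compilable sketch of how a reference-counted
datatype record might be kept in a module-level linked list. It is
illustrative only: names such as `dt_entry`, `dt_create`, `dt_free`, and
`datatype_list` are placeholders, not the actual FMS identifiers or logic.

```fortran
module datatype_list_sketch
  use mpi
  implicit none
  private
  public :: dt_entry, dt_create, dt_free

  type :: dt_entry
    integer :: id      = 0            ! committed MPI datatype handle
    integer :: counter = 0            ! reference count for duplicate requests
    integer :: ndims   = 0
    integer, allocatable :: sizes(:), subsizes(:), starts(:)
    type(dt_entry), pointer :: next => null()
  end type dt_entry

  type(dt_entry), pointer :: datatype_list => null()   ! head of the list

contains

  ! Return a subarray datatype, reusing an existing entry (and bumping its
  ! reference count) when the requested shape has been seen before.
  subroutine dt_create(sizes, subsizes, starts, oldtype, entry)
    integer, intent(in) :: sizes(:), subsizes(:), starts(:), oldtype
    type(dt_entry), pointer, intent(out) :: entry
    integer :: ierr

    entry => datatype_list
    do while (associated(entry))
      if (entry%ndims == size(sizes)) then
        if (all(entry%sizes == sizes) .and. all(entry%subsizes == subsizes) &
            .and. all(entry%starts == starts)) then
          entry%counter = entry%counter + 1
          return
        end if
      end if
      entry => entry%next
    end do

    allocate(entry)                       ! no match: build and commit a new type
    entry%ndims    = size(sizes)
    entry%sizes    = sizes
    entry%subsizes = subsizes
    entry%starts   = starts
    entry%counter  = 1
    call MPI_Type_create_subarray(entry%ndims, sizes, subsizes, starts, &
                                  MPI_ORDER_FORTRAN, oldtype, entry%id, ierr)
    call MPI_Type_commit(entry%id, ierr)
    entry%next    => datatype_list        ! push onto the head of the list
    datatype_list => entry
  end subroutine dt_create

  ! Drop one reference; free the MPI type and unlink the entry when the count
  ! reaches zero.  Assumes `entry` is present in the list.
  subroutine dt_free(entry)
    type(dt_entry), pointer, intent(inout) :: entry
    type(dt_entry), pointer :: p
    integer :: ierr

    entry%counter = entry%counter - 1
    if (entry%counter > 0) return

    if (associated(datatype_list, entry)) then
      datatype_list => entry%next
    else
      p => datatype_list
      do while (.not. associated(p%next, entry))
        p => p%next
      end do
      p%next => entry%next
    end if
    call MPI_Type_free(entry%id, ierr)
    deallocate(entry)
  end subroutine dt_free
end module datatype_list_sketch
```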
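
Here is a minimal MPI_Alltoallw sketch of the general technique used by the
new mpp_global_field path: each rank sends its contiguous local slab once,
and subarray datatypes on the receive side place each incoming slab directly
into the global array, so no intermediate stack buffer is needed. This is a
toy 1-D example with made-up sizes, not the FMS code.

```fortran
program alltoallw_gather_demo
  use mpi
  implicit none
  integer, parameter :: nlocal = 4          ! hypothetical local slab size
  integer :: ierr, rank, npes, p, nglobal
  real, allocatable :: local(:), global(:)
  integer, allocatable :: scounts(:), sdispls(:), stypes(:)
  integer, allocatable :: rcounts(:), rdispls(:), rtypes(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, npes, ierr)

  nglobal = npes * nlocal
  allocate(local(nlocal), global(nglobal))
  allocate(scounts(npes), sdispls(npes), stypes(npes))
  allocate(rcounts(npes), rdispls(npes), rtypes(npes))
  local = real(rank)                        ! dummy field data

  ! Send side: every rank sends its whole contiguous slab to every rank.
  scounts = nlocal
  sdispls = 0                               ! Alltoallw displacements are in bytes
  stypes  = MPI_REAL

  ! Receive side: one subarray datatype per sender unpacks each slab directly
  ! into its position in the global array (no intermediate copy).
  do p = 1, npes
    call MPI_Type_create_subarray(1, [nglobal], [nlocal], [(p - 1) * nlocal], &
                                  MPI_ORDER_FORTRAN, MPI_REAL, rtypes(p), ierr)
    call MPI_Type_commit(rtypes(p), ierr)
  end do
  rcounts = 1
  rdispls = 0

  ! A single collective replaces the point-to-point gather.
  call MPI_Alltoallw(local, scounts, sdispls, stypes, &
                     global, rcounts, rdispls, rtypes, MPI_COMM_WORLD, ierr)

  do p = 1, npes
    call MPI_Type_free(rtypes(p), ierr)
  end do
  if (rank == 0) print *, 'global =', global
  call MPI_Finalize(ierr)
end program alltoallw_gather_demo
```

In FMS the new path is selected at run time via the `mpp_domains_nml`
namelist group, e.g.:

```
&mpp_domains_nml
    use_alltoallw = .true.
/
```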
Provisional interfaces for SHMEM and serial ("nocomm") builds have been added,
although they are untested and primarily intended as placeholders for now.
This patch also includes the following changes to support this work.
- In `get_peset`, the method used to generate MPI subcommunicators has been
  changed: `MPI_Comm_create` has been replaced with `MPI_Comm_create_group`.
  The former is blocking over all ranks, while the latter only blocks over the
  ranks in the subgroup (a minimal sketch of this call pattern follows this
  list).
  This was done to accommodate single-rank IO domains, usually the result of
  masking, which involve no communication and would cause the model to hang.
  More recent changes in FMS related to handling single-rank communicators
  appear to have been made to prevent this particular scenario from arising,
  but I still think it is more correct to use `MPI_Comm_create_group` and have
  left the change in.
  MPI_Comm_create_group is an MPI 3.0 feature, so it may be an issue for older
  MPI libraries.
- Logical interfaces have been added to mpp_alltoall and mpp_alltoallv.
- Single-rank PE checks in mpp_alltoall were removed to prevent model hangs
with the subcommunicators.
- NULL_PE checks have been added to the original point-to-point implementation
  of mpp_global_field, although these may no longer be required due to
  changes in the subcommunicator implementation.
  This work was done by Nic Hannah, and may already be part of an existing
  pull request. (TODO: Check this!)
- Timer events have been added to mpp_type_create and mpp_type_free, although
they are not yet initialized anywhere.
- The diagnostic field count was increased from 150 to 250, to support the
current needs of researchers.
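
As a reference for the `get_peset` change above, here is a minimal,
standalone sketch of the MPI_Comm_create_group call pattern; the subgroup
membership and tag value are made up for illustration. The point is that only
the ranks in the group call (and are blocked by) the constructor, whereas
MPI_Comm_create must be called by every rank of the parent communicator.

```fortran
program comm_create_group_demo
  use mpi
  implicit none
  integer :: ierr, rank, wsize, nmem, i
  integer :: world_group, sub_group, sub_comm
  integer :: members(2)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, wsize, ierr)
  call MPI_Comm_group(MPI_COMM_WORLD, world_group, ierr)

  ! Hypothetical subset: the first one or two ranks.
  nmem = min(2, wsize)
  members(1:nmem) = [(i, i = 0, nmem - 1)]
  call MPI_Group_incl(world_group, nmem, members(1:nmem), sub_group, ierr)

  if (rank < nmem) then
    ! Only the subgroup ranks make this call; ranks outside the group do not
    ! participate and are not blocked (MPI_Comm_create, in contrast, is
    ! collective over the whole parent communicator).
    call MPI_Comm_create_group(MPI_COMM_WORLD, sub_group, 0, sub_comm, ierr)
    call MPI_Comm_free(sub_comm, ierr)
  end if

  call MPI_Group_free(sub_group, ierr)
  call MPI_Group_free(world_group, ierr)
  call MPI_Finalize(ierr)
end program comm_create_group_demo
```

With a single-rank subgroup this still returns a valid one-rank communicator
without requiring any other rank to participate, which is the masked
single-rank IO-domain case described above.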