GFDL-ESM2M piControl does not run #377
Comments
@Jete90 This bug originates from using an old netCDF version, as documented in NOAA-GFDL/CM4#11 and NOAA-GFDL/icebergs#44. You'll need to update to 4.7.3 or later.
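A quick way to confirm which netCDF version an executable is actually linked against is to print the library version string; nf90_inq_libvers() is part of the standard netCDF Fortran API. A minimal sketch:

```fortran
program check_netcdf_version
  ! Print the version string of the netCDF library this binary is linked
  ! against; per the comment above, 4.7.3 or later is required.
  use netcdf, only: nf90_inq_libvers
  implicit none
  print *, 'Linked netCDF library: ', trim(nf90_inq_libvers())
end program check_netcdf_version
```

Compile it against the same modules used for the model build (e.g. with the flags from `nf-config --flibs`) so it reports the library the model actually sees.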
As a follow-up to Jens' question: does this mean that many of the .res.nc files included in the provided ESM2M piControl test setup are corrupt? I have netCDF v4.7.4, and regardless of whether I compile with the netCDF4 flag on or off, I still run into the same error that Jens reports.
@wienkers The bug was specific to the iceberg restarts, as far as I remember. It's quite possible there are other problems with non-ocean restarts.
Thank you for the quick reply @russfiedler.
The error points back to flux_exchange_init, where kd is set from the size of Ice%ice_mask. At run time, kd = 6 on the Ice/Atm processes (as it should be, for num_part = 6 in the input.nml), but kd = 0 on the Ocean processes, each of which then throws the error. The block of code is evaluated on all processes; however, it seems the call to subroutine ice_model_init in coupler_init, which allocates Ice%ice_mask, only occurs on the Ice processes, so the size information about Ice%ice_mask needed in that block just becomes 0.
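Schematically, the failing pattern looks like the following (names taken from the comment above; the exact statements in flux_exchange_init may differ):

```fortran
! Sketch only -- not the verbatim flux_exchange_init code.
kd = size(Ice%ice_mask, 3)  ! Ice%ice_mask is allocated in ice_model_init,
                            ! which only runs on the Ice PEs; on the Ocean
                            ! PEs the array was never allocated, so kd
                            ! comes back as 0 here.
! kd is then used as the upper k-bound when spawning the exchange coupler
! fields, which is what produces the "Disordered k-dimension index bound
! list 1 0" FATAL from CT_spawn_1d_3d on the Ocean PEs.
```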
@wienkers Ah, yes, I vaguely remember that was a possibility and that it should only be evaluated on the Ice processors. I can't remember if it's sufficient to encase the code in an if block.
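A sketch of the kind of guard being suggested, assuming Ice%ice_mask is an allocatable (if it is a pointer, associated() would be the test instead; the coupler may equally use a PE flag such as Ice%pe):

```fortran
! Hypothetical guard -- only evaluate the block where ice_model_init has
! actually allocated Ice%ice_mask, i.e. on the Ice processors.
if (allocated(Ice%ice_mask)) then
  kd = size(Ice%ice_mask, 3)
  ! ... spawn the exchange-grid coupler fields with k-bounds (/1, kd/) ...
end if
```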
Original issue report:
Hello,
I downloaded the MOM5 code onto the WHOI supercomputer. After compiling GFDL-ESM2M, I tried to run it but quickly ran into segmentation faults. I have attached the error message below. It might be due to the module/compiler versions that I am using.
This is what my environment looks like:
```csh
source $MODULESHOME/init/csh
module load intel
module load netcdf/intel/4.6.1
module load openmpi/intel
setenv mpirunCommand "mpirun -np"
```
Kind regards
Jens
ERROR MESSAGE
```
[...]
LND(ATMOCNLND)= 0.153673308874230 0.153673308874230 0.153673308871445
NOTE from PE 0: xgrid_mod: reading exchange grid information from mosaic grid file
NOTE from load_xgrid(xgrid_mod): field 'scale' exist in the file INPUT/land_mosaicXocean_mosaic.nc, this field will be read and the exchange grid cell area will be multiplied by scale
Checked data is array of constant 1
LND(LNDOCN)= 0.703873657789463 0.703873657789466 0.703873657789463
OCN(LNDOCN)= 0.703873657789467 0.703873657789463 0.703873657789466
FATAL from PE 31: ==>Error from coupler_types_mod (CT_spawn_1d_3d): Disordered k-dimension index bound list 1 0
FATAL from PE 32: ==>Error from coupler_types_mod (CT_spawn_1d_3d): Disordered k-dimension index bound list 1 0
[.....]
fms_ESM2M.x 0000000000452D04 Unknown Unknown Unknown
fms_ESM2M.x 000000000045BD03 Unknown Unknown Unknown
fms_ESM2M.x 00000000004556BF Unknown Unknown Unknown
fms_ESM2M.x 000000000040E19E Unknown Unknown Unknown
libc-2.17.so 00002AAAAC544555 __libc_start_main Unknown Unknown
fms_ESM2M.x 000000000040E0A9 Unknown Unknown Unknown
MPI_ABORT was invoked on rank 30 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
fms_ESM2M.x 0000000002A8FDEE for__signal_handl Unknown Unknown
libpthread-2.17.s 00002AAAAC315630 Unknown Unknown Unknown
libpthread-2.17.s 00002AAAAC312573 pthread_spin_lock Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
fms_ESM2M.x 0000000002A8FDEE for__signal_handl Unknown Unknown
libpthread-2.17.s 00002AAAAC315630 Unknown Unknown Unknown
libpthread-2.17.s 00002AAAAC312573 pthread_spin_lock Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
fms_ESM2M.x 0000000002A8FDEE for__signal_handl Unknown Unknown
libpthread-2.17.s 00002AAAAC315630 Unknown Unknown Unknown
libpthread-2.17.s 00002AAAAC312573 pthread_spin_lock Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
fms_ESM2M.x 0000000002A8FDEE for__signal_handl Unknown Unknown
libpthread-2.17.s 00002AAAAC315630 Unknown Unknown Unknown
libpthread-2.17.s 00002AAAAC312573 pthread_spin_lock Unknown Unknown
[pn030:263631] *** Process received signal ***
[pn030:263631] Signal: Segmentation fault (11)
[pn030:263631] Signal code: Address not mapped (1)
[pn030:263631] Failing at address: 0x28
[pn030:263631] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2aaaabe1d630]
[pn030:263631] [ 1] /vortexfs1/apps/openmpi-3.0.1-intel/lib/openmpi/mca_pmix_pmix2x.so(+0xb2723)[0x2aaab86c1723]
[pn030:263631] [ 2] /vortexfs1/apps/openmpi-3.0.1-intel/lib/openmpi/mca_pmix_pmix2x.so(pmix_ptl_base_recv_handler+0x579)[0x2aaab86c24a9]
[pn030:263631] [ 3] /vortexfs1/apps/openmpi-3.0.1-intel/lib/libopen-pal.so.40(opal_libevent2022_event_base_loop+0xa09)[0x2aaaab021829]
[pn030:263631] [ 4] /vortexfs1/apps/openmpi-3.0.1-intel/lib/openmpi/mca_pmix_pmix2x.so(+0x9d0f2)[0x2aaab86ac0f2]
[pn030:263631] [ 5] /lib64/libpthread.so.0(+0x7ea5)[0x2aaaabe15ea5]
[pn030:263631] [ 6] /lib64/libc.so.6(clone+0x6d)[0x2aaaac128b0d]
[pn030:263631] *** End of error message ***
Segmentation fault
ERROR: Model failed to run to completion
```