ocean/ice post failed occasionally #2328

Closed
jiandewang opened this issue Feb 20, 2024 · 37 comments · Fixed by #2484
Labels: bug, triage

Comments

@jiandewang
Contributor

jiandewang commented Feb 20, 2024

What is wrong?

ocean and ice post failed occasionally

What should have happened?

The post job should work every time.

What machines are impacted?

All or N/A

Steps to reproduce

clone the latest g-w
set up C1152 and make a test run

Additional information

This problem doesn't happen every time, but it does happen occasionally.

Do you have a proposed solution?

jiandewang added the bug and triage labels on Feb 20, 2024
@jiandewang
Contributor Author

Can someone point me to the script that uses Fortran to remap the ocean tripolar master file to regular grid files?

@aerorahul
Contributor

@jiandewang
The workflow waits 120s before launching the ocean/ice postprocessing. Should this 2 min limit be increased? Are we sure the IO is not stalling?

@jiandewang
Contributor Author

@aerorahul then something else is happening in my test. Let me check the log file carefully and get back here.

@jiandewang
Contributor Author

@aerorahul can you check on Dogwood
/lfs/h2/emc/ptmp/jiande.wang/HR3-work/C1152/COMROOT/2019123000/C1152/logs/2019123000/gfsocean_prod_f042-f054.log.0
line 7782 to see what happened there? The second try from Rocoto for this job was OK.

@JessicaMeixner-NOAA
Contributor

@jiandewang thank you so much for opening this issue and for posting the log with the error! While it's great that the job succeeds on re-run, understanding why it failed in the first place is important too. I took a look at the log file and saw that on line 7746 we have the FATAL ERROR:

+ oceanice_nc2grib2.sh[270](ocean): echo 'FATAL ERROR: '\''ocean.1p00.nc'\'' does not exist, ABORT!'

Line 270 can be found here: https://github.com/NOAA-EMC/global-workflow/blob/develop/ush/oceanice_nc2grib2.sh#L270
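For context, a minimal sketch (file and variable names assumed here, not copied from the actual g-w script) of the kind of existence guard that would emit this message:

```bash
# Hypothetical sketch of an existence guard like the one at
# ush/oceanice_nc2grib2.sh line 270; "infile" is an assumed variable name.
infile="ocean.1p00.nc"
if [[ ! -f "${infile}" ]]; then
  echo "FATAL ERROR: '${infile}' does not exist, ABORT!"
  exit 1
fi
```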

@JessicaMeixner-NOAA
Contributor

Tagging @GwenChen-NOAA to see if she can help us diagnose this as well.

@jiandewang do you have other job logs with failures, so we can check whether the same issue appears there too?

@jiandewang
Contributor Author

See the same directory and simply do "ls oclog.0"; there is another case there.

@aerorahul
Contributor

oceanice_products.163888 ❯❯❯ pwd
/lfs/h2/emc/ptmp/jiande.wang/HR3-work/RUNDIRS/C1152/oceanice_products.163888
oceanice_products.163888 ❯❯❯ less ocean.post.log
*** FATAL ERROR ***: get variable: timeocean.nc:NetCDF: Index exceeds dimension bound

It seems like the model output copied as ocean.nc at forecast hour 54 had an issue. The interpolation code failed.
We should update the executable to report non-zero exit status in case of failure.
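As a rough illustration (variable and log-file names are assumptions, not the actual g-w ex-script), once the executable returns a non-zero status the calling job script can trap the failure instead of continuing with a bad file:

```bash
# Sketch only: run the interpolation executable and abort the job on failure.
# EXECgfs and the log file name are assumed here for illustration.
"${EXECgfs}/ocnicepost.x" > ocean.post.log 2>&1
rc=$?
if (( rc != 0 )); then
  echo "FATAL ERROR: ocnicepost.x failed with status ${rc}; see ocean.post.log"
  exit "${rc}"
fi
```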

@jiandewang
Contributor Author

seems I need to change the title for this issue

jiandewang changed the title from "ocean/ice post dependency" to "ocean/ice post failed occasionally" on Feb 20, 2024
@JessicaMeixner-NOAA
Contributor

@aerorahul @jiandewang - I can start PRs for the update Rahul mentioned if that would be helpful. Since re-runs typically work, I don't think fixing this is essential for HR3, but it is essential that we get it fixed eventually; if others disagree, please let me know. We could even just make the one-line change for the HR3 runs, gather more information on why this is occasionally failing, and bundle the fix with other updates that address the underlying cause. What are others' thoughts on the best path forward?

@jiandewang
Contributor Author

@JessicaMeixner-NOAA yes, a re-run from Rocoto will always work, since the input file is fully ready by re-run time. For the HR3 runs we can manually increase the wait time as a workaround.

@JessicaMeixner-NOAA
Contributor

I'm confused about why manually increasing the wait time (beyond the existing 120 s wait) is needed. Or is this another place where we need a wait?

@aerorahul
Contributor

#2328 (comment):
> Are we sure the IO is not stalling?
The issue in ocnicepost.x is a symptom, not the cause. If the model had created the full file, the executable would have succeeded -- this is seen from the re-run (nevertheless, the stop ierr fix should be applied).
Is 120 s not enough? We can raise this to 600 s. However, I would also ask that we make sure the IO is not stalling, and that the closing of the file is not being held back for some reason (clocks, buffer flush, etc.).

@jiandewang
Contributor Author

jiandewang commented Feb 21, 2024

All of us checked the error message in a hurry yesterday afternoon and didn't fully understand the root cause. In earlier versions of g-w (which used NCL to do the remap) we used the fv3 log as the trigger, and that worked almost all the time because the ocean output was fully ready by the time the fv3 log appeared (or maybe a couple of seconds later). In this version of g-w we are using the ocean output file name itself as the trigger, so things have changed.

Say the model has just started its 06Z integration and we are waiting for it to reach 12Z before firing the post job for the 06Z-12Z window. For MOM, the 06-12Z averaged output file starts to appear on disk as soon as the 06Z integration begins (at almost the final size, but filled with garbage data). So a 2-minute wait is far from enough. My test yesterday using a 24x16 fv3 layout showed it took ~8 minutes for each 6-hour ocean integration, so the required wait time depends on the relative speed of the components and, of course, the PE settings.

The error message we saw yesterday (time ocean.nc index out of bound) indicates the ocean input file had not been fully written yet. When it is fully ready, the time dimension should show
time = UNLIMITED ; // (1 currently)
Before it is fully ready, it shows something else (I need to do an offline test to sort this out; I forget exactly what it is right now).
This is why ocnicepost.x failed at that point.
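A quick way to inspect this by hand (a sketch, not part of the workflow) is to dump only the header and look at the record dimension:

```bash
# Print just the netCDF header and check the unlimited time dimension.
# A fully written file reports "time = UNLIMITED ; // (1 currently)";
# an incomplete file reports a different count, or ncdump fails outright
# if the header itself is not valid yet.
ncdump -h ocean.nc | grep "time = UNLIMITED"
```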

@JessicaMeixner-NOAA
Contributor

@jiandewang does MOM6 create the netcdf file early on and then wait a while before writing the full file? Like initialize it at the beginning of a 6 hour window and then write it at the end for example?

@aerorahul @jiandewang agreed- we need to understand the root cause.

@aerorahul
Contributor

@jiandewang
Great. So 120 s is not enough. We can raise that, or find a third way to validate the completeness of the file.
Depending on the fv3 log to trigger the ocean/ice post-processing is not the right solution either.

@JessicaMeixner-NOAA
Contributor

Linking to this issue: ufs-community/ufs-weather-model#1652, which is about adding log files for triggering. It looks like we have an ice log file, a WW3 log file I'm hoping to work on soon, and we still need to determine an alternative to the original FMS idea.

@jiandewang
Contributor Author

I had opened an FMS issue requesting a log flag, but it was declined by the FMS group.

@jiandewang
Contributor Author

> @jiandewang does MOM6 create the netcdf file early on and then wait a while before writing the full file? Like initialize it at the beginning of a 6 hour window and then write it at the end for example?
>
> @aerorahul @jiandewang agreed- we need to understand the root cause.

@JessicaMeixner-NOAA yes, the 06-12Z file appears as soon as the 06Z integration starts.

@JessicaMeixner-NOAA
Contributor

Okay, if it appears immediately, then we definitely need a longer wait time until we have an alternative, such as a log file that indicates the output has been completely written.

@jiandewang
Contributor Author

> @jiandewang Great. So 120 s is not enough. We can raise that, or find a third way to validate the completeness of the file. Depending on the fv3 log to trigger the ocean/ice post-processing is not the right solution either.

@aerorahul we have struggled for a long time to get a reliable ocean IO flag, but today's discussion may give us some clues on how to use the output file header information as a proxy. I assume it will take some time.

@aerorahul
Contributor

> @jiandewang Great. So 120 s is not enough. We can raise that, or find a third way to validate the completeness of the file. Depending on the fv3 log to trigger the ocean/ice post-processing is not the right solution either.
>
> @aerorahul we have struggled for a long time to get a reliable ocean IO flag, but today's discussion may give us some clues on how to use the output file header information as a proxy. I assume it will take some time.

@jiandewang
If you have a way to determine if a file is complete or incomplete, via a tool such as ncdump or a script, let me know and I can work it in as a dependency checker.

@jiandewang
Contributor Author

@aerorahul yes, I have a similar idea in mind. Give me a bit of time for an offline test and I will get back to you.

@jiandewang
Contributor Author

https://github.com/NOAA-EMC/gfs-utils/blob/4b7f6095d260b7fcd9c99c337454e170f1aa7f2f/src/ocnicepost.fd/utils_mod.F90#L590 should be updated to

stop ierr

We will need to let the executable exit rather than just printing a warning and continuing, as we may otherwise end up with garbage data in the final grib2 file.

@aerorahul
Contributor

@jiandewang @JessicaMeixner-NOAA
https://github.com/NOAA-EMC/gfs-utils/tree/feature/netcdf-error and PR 48 aborts with a non-zero exit code on error.

@aerorahul
Contributor

> @aerorahul yes, I have a similar idea in mind. Give me a bit of time for an offline test and I will get back to you.

@jiandewang Have you had a chance to look at this and come up with a way to check if the MOM6 output is fully formed?

@aerorahul
Contributor

@jiandewang
Any further thoughts?

@jiandewang
Contributor Author

@aerorahul I did some careful monitoring of the ocean files while the model was running. My previous thought was not correct. Let me use this as an example: when the model has just started to run 06Z, the ocean_06*nc file appears on disk but is very small in size, and running ncdump on it pops up something like "not a netcdf file".

@aerorahul
Contributor

@jiandewang
Can you provide me with this "incomplete" file on Hera/Orion?
And do you mind if I take a crack at coming up with a solution?

@jiandewang
Contributor Author

> @jiandewang Can you provide me with this "incomplete" file on Hera/Orion? And do you mind if I take a crack at coming up with a solution?

I don't have this kind of file on hand at the moment, but I will have one for you shortly.

@JessicaMeixner-NOAA
Contributor

@aerorahul I set up a meeting for next Friday to brainstorm some solutions. We've also included those working on MOM6 for HAFS as they likely need something similar (or might already have something). If you or anyone else would like to be added to this meeting, let me know!

@jiandewang
Contributor Author

@aerorahul I just made a fresh try on WCOSS2 based on HR3b and copied the files to Hera. See
/scratch1/NCEPDEV/climate/Jiande.Wang/working/scratch/NC-file
-rw-r--r-- 1 Jiande.Wang climate 1219551143 Mar 21 11:39 06-complete.nc
-rw-r--r-- 1 Jiande.Wang climate 933318375 Mar 21 11:32 06-incomplete.nc
-rw-r--r-- 1 Jiande.Wang climate 1219551143 Mar 21 11:45 12-complete.nc
-rw-r--r-- 1 Jiande.Wang climate 933318375 Mar 21 11:38 12-incomplete.nc
You can tell the difference in file size between "complete" and "incomplete".
If you do ncdump,
the incomplete file will show time = UNLIMITED ; // (0 currently)
while the complete one will show time = UNLIMITED ; // (1 currently)
This may be a good clue for us to use as a flag.
In my previous test (maybe a month ago) I noticed the incomplete file had a very small size at the very beginning and could not be viewed with ncdump. That didn't happen today (maybe it did, but the file grew to a larger size after several seconds and I missed it).

So my suggestion is (see the sketch after this list):
(1) once the file appears on disk, run ncdump on it; if it complains with something like "not a netcdf file" (we can probably use the return code as a flag here), sleep and retry;
(2) if ncdump shows time = UNLIMITED ; // (0 currently), keep sleeping;
(3) once it shows time = UNLIMITED ; // (1 currently), sleep another 10-20 s to make sure the file is fully written, then run the post.
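A minimal sketch of that polling logic (the file name and sleep intervals are placeholders, not the script that was ultimately adopted):

```bash
#!/usr/bin/env bash
# Poll a MOM6 output file until its header is readable and the time record
# is populated, then give the writer a short grace period before posting.
ocnfile="ocean_06.nc"   # placeholder name for the file being waited on

while true; do
  # (1) ncdump fails ("not a netcdf file") while the header is still invalid
  if ! header=$(ncdump -h "${ocnfile}" 2>/dev/null); then
    sleep 60
    continue
  fi
  # (2) header is readable but the record dimension is still empty
  if grep -q "time = UNLIMITED ; // (0 currently)" <<< "${header}"; then
    sleep 60
    continue
  fi
  # (3) record dimension is populated; allow a short grace period
  if grep -q "time = UNLIMITED ; // (1 currently)" <<< "${header}"; then
    sleep 20
    break
  fi
  sleep 60
done
# ...launch the ocean/ice post job here
```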

@aerorahul
Contributor

I got the files, and will work on a solution. Thanks @jiandewang

@JessicaMeixner-NOAA
Contributor

@aerorahul - Just wanted to provide an update from a meeting that @jiandewang, John, and I had. John shared his HAFS experience and we revisited issue NOAA-GFDL/FMS#1140; based on that, the existence of the next output file does appear to be the best trigger (and then, I guess, the forecast job being complete for the last hour). Let me know if you'd like to discuss this more offline.

@aerorahul
Contributor

I don't think this heterogeneous dependency will be straightforward to implement.
It would be much simpler for the ufs-weather-model to write out a log.ocean.fHHH.txt instead and depend on that as the trigger.

@JessicaMeixner-NOAA
Contributor

@aerorahul let me know if we can talk about this offline so I can explain some details I may have explained poorly above.

aerorahul added a commit that referenced this issue Apr 17, 2024
…eck ocean output (#2484)

This PR:
- adds a Rocoto dependency tag that executes a shell command; the return code of the shell expression serves as the dependency check
- adds a script that executes `ncdump` on a netCDF file; if the file is a valid netCDF file, the return code is 0, otherwise it is non-zero
- combines the two as a dependency check for MOM6 output: while the model is still writing the ocean output, Rocoto runs the shell script, sees a non-zero return code, and keeps the post job waiting

This PR also:
- changes permissions on some `ush/` scripts that did not have executable permissions.

Resolves #2328
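For reference, a minimal sketch of the kind of checker described above (the script name and the exact Rocoto wiring are assumptions, not the merged code): a small script whose exit status tells Rocoto whether the MOM6 output is a readable netCDF file yet.

```bash
#!/usr/bin/env bash
# check_netcdf.sh (hypothetical name): exit 0 if the argument is a readable
# netCDF file, non-zero otherwise.  A Rocoto shell-command dependency can run
# this and release the ocean/ice post task only once the check passes.
set -u
ncfile="${1:?usage: check_netcdf.sh <file.nc>}"

# ncdump -h returns non-zero while the model is still writing the file (or if
# the file does not exist yet), so its exit status doubles as the check.
ncdump -h "${ncfile}" > /dev/null 2>&1
```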