ocean/ice post failed occasionally #2328
Comments
Can someone point me to the script that uses Fortran to remap the ocean tripolar master file to regular files? |
@jiandewang |
@aerorahul then something else is happening in my test. Let me check the log file carefully and get back here. |
@aerorahul can you check dogwood line 7782 to see what happened here? The 2nd try from rocoto for this job was OK. |
@jiandewang thank you so much for making this issue and for posting the log with the error! While it's great that it runs on completion, understanding why it's an issue is important too. I took a look at the log file and saw on line 7746 we have the FATAL ERROR:
Line 270 can be found here: https://github.com/NOAA-EMC/global-workflow/blob/develop/ush/oceanice_nc2grib2.sh#L270 |
Tagging @GwenChen-NOAA to see if she can help us diagnose this as well. @jiandewang do you have other job logs with failures, so we can see whether the same issue is there too? |
see the same directory, simply do "ls oclog.0", there is another case |
It seems like the model output copied as |
https://github.com/NOAA-EMC/gfs-utils/blob/4b7f6095d260b7fcd9c99c337454e170f1aa7f2f/src/ocnicepost.fd/utils_mod.F90#L590 should be updated to stop ierr |
seems I need to change the title for this issue |
@aerorahul @jiandewang - I can start PRs for the update that Rahul mentioned if that'd be helpful. Since re-runs typically work, I don't think fixing this is essential for HR3, but it is still essential that we get it fixed. If others disagree, please let me know. We could even just make this one-line change when running HR3, gather more information about why this occasionally fails, and bundle the update with other updates that address the reason for the failure. What are others' thoughts on the best path forward? |
@JessicaMeixner-NOAA yes re-run from rocoto will always work as the input file will be fully ready by the re-run time. For HR3 runs we can manually increase the wait time as a workaround. |
I'm confused about why manually increasing the wait time (above the existing 120 s wait) is needed. Or is this another place where we need a wait? |
#2328 (comment) |
All of us checked the error message in a hurry yesterday afternoon and didn't fully understand the root cause. In the earlier version of g-w (which used ncl to do the remap) we used the fv3 log as a trigger, and this worked almost all the time because the ocean output was fully ready by the time the fv3 log appeared (or maybe a couple of seconds later). In this version of g-w we use the ocean output file name itself as the trigger, so things changed. Say the model has just started the 06Z integration and we are waiting for it to reach 12Z before firing the post job for 06Z-12Z. For MOM, the 06-12Z avg output file starts to appear on disk as soon as the 06Z integration starts (with almost the right size, but filled with garbage data). So a 2 min wait is far from enough. My test yesterday using a 24x16 fv3 layout showed it took ~8 min for each 6 hr ocean integration. So the waiting time depends on the relative speed and, of course, the PE setting. The error message we saw yesterday (time ocean.nc index out of bound) indicates the ocean input file had not been fully written yet. When it is fully ready, the time should be
@jiandewang does MOM6 create the netcdf file early on and then wait a while before writing the full file? Like initialize it at the beginning of a 6 hour window and then write it at the end for example? @aerorahul @jiandewang agreed- we need to understand the root cause. |
@jiandewang |
Linking to this issue: ufs-community/ufs-weather-model#1652 which is about adding log files for triggering. It looks like we have an ice log file, a WW3 file I'm hoping to work on soon, and we needed to determine an alternative to the original FMS idea. |
I opened an FMS issue requesting a log flag, but it was denied by the FMS group |
@JessicaMeixner-NOAA yes, the 06-12Z file appears as soon as the 06Z integration starts |
Okay, if it appears immediately then we definitely need a longer wait time while we wait for an alternative way to write a log file to indicate the file has been completely written. |
@aerorahul we have struggled for a long time to find a reliable ocean IO flag, but today's discussion may give us some clues on how to use the output file header information as a proxy. I assume it will take some time. |
@jiandewang |
@aerorahul yes, I have a similar idea. Give me a bit of time for an offline test and I will get back to you. |
We will need to make the model exit rather than just issue a warning and keep going, since otherwise we may have garbage data in the final grib2 file. |
@jiandewang @JessicaMeixner-NOAA |
@jiandewang Have you had a chance to look at this and come up with a way to check if the MOM6 output is fully formed? |
@jiandewang |
@aerorahul I carefully monitored the ocean files while the model was running. My previous understanding was not correct. Let me use this as an example: when the model has just started to run 6Z, the ocean_06*nc file appears on disk, but it is very small and running ncdump on it reports something like "not a netcdf file". |
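The observation above (a tiny file that ncdump rejects as "not a netcdf file") suggests that even a look at the first few bytes can separate a placeholder from a real netCDF file. The following is only a minimal Python sketch of that idea, not something from g-w; the function name `has_netcdf_signature` is hypothetical. It relies on the fact that netCDF classic files begin with `CDF` followed by a version byte and that netCDF-4 files normally begin with the HDF5 signature. A matching signature does not prove the file has been completely written, so this is at best a cheap first filter.

```python
# Leading bytes of the on-disk formats a netCDF file can use.
NETCDF_SIGNATURES = (
    b"CDF\x01",            # netCDF classic
    b"CDF\x02",            # netCDF 64-bit offset
    b"CDF\x05",            # netCDF 64-bit data (CDF-5)
    b"\x89HDF\r\n\x1a\n",  # HDF5 container used by netCDF-4
)


def has_netcdf_signature(path):
    """Return True if the file starts with a known netCDF or HDF5 signature.

    Hypothetical helper: it inspects only the first 8 bytes, so an empty,
    truncated, or unreadable file returns False, while a file whose header
    exists but whose data is still being written may return True.
    """
    try:
        with open(path, "rb") as handle:
            head = handle.read(8)
    except OSError:
        return False
    return any(head.startswith(sig) for sig in NETCDF_SIGNATURES)
```

For example, `has_netcdf_signature("ocean_06.nc")` (an illustrative file name) would return False while the file is still an empty or non-netCDF placeholder.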
@jiandewang |
I don't have this kind of file in hand at this moment but will have that for you shortly |
@aerorahul I set up a meeting for next Friday to brainstorm some solutions. We've also included those working on MOM6 for HAFS as they likely need something similar (or might already have something). If you or anyone else would like to be added to this meeting, let me know! |
@aerorahul I just made a fresh try on wcoss2 based on HR3b and copied them to HERA. See so my suggestion is: |
I got the files, and will work on a solution. Thanks @jiandewang |
@aerorahul - Just wanted to provide an update from a meeting that @jiandewang John and I had. John shared his HAFS experience and we revisited the issue NOAA-GFDL/FMS#1140 and based on that, the next file existing does actually appear to be the best trigger (and then I guess the forecast job being complete for the last hour). Let me know if you'd like to discuss this more offline. |
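The trigger described in the comment above (an output file counts as finished once the next file exists, and the last file once the forecast job is complete) can be sketched in a few lines. This is only an illustration in Python under assumed names: `output_is_ready`, the ordered list `files`, and the boolean `forecast_done` are all hypothetical and do not come from g-w or HAFS.

```python
import os


def output_is_ready(files, index, forecast_done):
    """Decide whether files[index] can be treated as fully written.

    files         -- output file paths in the order the model writes them
    index         -- position of the file we want to post-process
    forecast_done -- True once the forecast job itself has completed

    Rule sketched from the discussion: a file is considered complete once
    the NEXT file exists on disk; the last file has no successor, so it
    falls back to the forecast job having finished.
    """
    if index + 1 < len(files):
        return os.path.exists(files[index + 1])
    return forecast_done
```

The appeal of this rule is that it needs no knowledge of file contents, only of the order in which the model writes its output.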
I don't think this heterogeneous dependency will be straightforward to implement. |
@aerorahul let me know if we can talk about this offline to explain some details I might have poorly explained above. |
…eck ocean output (#2484)
This PR:
- adds a Rocoto dependency tag that executes a shell command. The return code of the shell expression serves as a dependency check
- adds a script that executes `ncdump` on a netCDF file. If the file is a valid netCDF file, the return code is 0, else it is non-zero
- combines the above 2 to use as a dependency check for MOM6 output. If the model is still in the process of writing out the ocean output, the rocoto will execute the shell script and gather the return code.
This PR also:
- changes permissions on some `ush/` scripts that did not have executable permissions.
Resolves #2328
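The mechanism in the PR description, where a command's return code decides whether a dependency is satisfied, can be illustrated as follows. This is a hedged Python sketch, not the script added in the PR (which is not shown here): the function names `command_succeeds` and `netcdf_is_readable` are hypothetical, and the `-h` (header only) flag passed to `ncdump` is an assumption, since the PR text does not say which options it uses.

```python
import subprocess


def command_succeeds(command, timeout=60):
    """Run a command and report whether it exited with return code 0.

    A missing executable or a timeout is treated as "not satisfied",
    mirroring a dependency check that only passes on a clean exit.
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def netcdf_is_readable(path):
    """Assumed check: ask ncdump to print only the header of the file."""
    return command_succeeds(["ncdump", "-h", path])
```

A workflow manager would call such a check repeatedly and only release the post job once it returns True.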
What is wrong?
ocean and ice post failed occasionally
What should have happened?
The post job should work every time
What machines are impacted?
All or N/A
Steps to reproduce
clone the latest g-w
setup C1152 and make a test run
Additional information
this problem doesn't happen all the time but it does happen occasionally
Do you have a proposed solution?