Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce post-operation callback #2115

Merged
merged 1 commit into from
Jan 9, 2025
Merged

Conversation

charleskawczynski
Copy link
Member

It seems to me that, to date, one of the most challenging tasks that users have is: debug a large simulation that is breaking, e.g., yielding NaNs somewhere. This is especially complex for large models with many terms and implicit time-stepping with all the bells and whistles that we offer.

This PR attempts to provide tools to tackle this giant task, by introducing two functions:

  • call_post_op_callback
  • post_op_callback

One way that users may leverage this is, for example:

import ClimaCore
ClimaCore.DataLayouts.call_post_op_callback() = true
function ClimaCore.DataLayouts.post_op_callback(data)
    if any(isnan, data)
        error("Found NaNs!")
    end
end

# run simulation

Now, this assums that any(isnan, data) will work on all of the results-- it may not, but users can catch whatever data post_op_callback is called with and handle them accordingly. As such, this sort of debugging tool is capable of revealing that a given simulation errors in a wide range of locations, like here, here, or here, which may provide hints to users as to what is wrong.

I'm not sure exactly how useful this will be, but I'm curious myself to try in order to debug the flaky ClimaTimestepper issues (where I'm having this experiencing this problem of hunting down NaNs).

Since this would be a purely debugging tool, I would argue that it should not be a feature used in production (i.e., removing it should not be a breaking change).

Thoughts?

@charleskawczynski
Copy link
Member Author

Here's an example of this feature "in action" where I'm using it to debug the flaky job(s) in ClimaTimesteppers.

Indeed, NaNs is first introduced (spatially) in solve_newton! -> ldiv! -> run_field_matrix_solver! -> single_field_solve!.

It looks like, since this was caught early enough, there is spatial information in where NaNs is showing up, which I hope turns out to be useful (I after-all suspected that the issue with this problem related to the boundary conditions).

@charleskawczynski
Copy link
Member Author

One piece of feedback I received from today's standup meeting was: let's also call this callback with all of the arguments. I had thought about this, and I agree. I think we can call it with exactly two: the returned object (which will help make dispatch easier), and all of the arguments (which will allow users to inspect the inputs more easily).

For interactive use, this could actually be really helpful for users/debugging when tied together with Infiltrator.jl.

I'll try to update this PR to add the input arguments into the callback.

@charleskawczynski charleskawczynski force-pushed the ck/debugging_hooks branch 2 times, most recently from 1531bf4 to f9ce66e Compare January 8, 2025 20:42
@charleskawczynski charleskawczynski merged commit 70b7c71 into main Jan 9, 2025
34 of 35 checks passed
@charleskawczynski charleskawczynski deleted the ck/debugging_hooks branch January 9, 2025 03:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant