[WIP] Quick band-aid fix for MPI termination #374

Closed
wants to merge 1 commit into from

jchodera
Member

This is a quick band-aid fix to ensure we don't leave straggler processes around after MPI processes receive a SIGABRT, SIGINT, or SIGTERM from Torque.

A much better approach would be to use context managers to ensure that all files are gracefully synced and closed, and that MPI is aborted, as the program terminates, but this will require some refactoring/redesign. Until then, the current solution may cause some additional data corruption, which is hopefully minimized by the fact that NetCDF syncs the files each iteration.
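
To make the context-manager idea concrete, here is a minimal sketch, not the design this PR implements. The name `synced_and_closed` and the `abort` callback are hypothetical; the only assumption about the storage object is that it exposes `sync()` and `close()`, as a `netCDF4.Dataset` does. Note that a signal only triggers the cleanup if some handler turns it into an exception first.

```python
import contextlib


@contextlib.contextmanager
def synced_and_closed(dataset, abort=None):
    """Sync and close `dataset` on exit; call abort(1) if the block failed.

    `dataset` is any object with sync() and close() methods (assumption).
    `abort` is a hypothetical callback, e.g. MPI.COMM_WORLD.Abort from mpi4py.
    """
    failed = False
    try:
        yield dataset
    except BaseException:
        # Remember that we are leaving because of an error, then re-raise.
        failed = True
        raise
    finally:
        try:
            dataset.sync()   # flush pending data to disk
        finally:
            dataset.close()  # close even if the sync itself failed
        if failed and abort is not None:
            abort(1)         # take the other MPI ranks down as well
```

The files are flushed before the abort so that the last completed iteration is on disk when the remaining processes are killed.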

@jchodera
Member Author

Attempts to address issues in cBio/cbio-cluster#415

@jchodera
Member Author

Initial tests indicate the code just ignores the Torque kill signal.

@jchodera
Member Author

jchodera commented Jun 7, 2016

Fixes #356

@andrrizzi
Contributor

I'm not sure why this didn't work, but I wrote a function that configures MPI to abort whenever it receives one of those three signals or whenever any exception is raised. I tested it, and it seems to work. At least, I can see an "application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0" message in the log when the job hits the wall time.

If you want I can open a pull request and check in that function.
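
For readers following along, a function of this kind might look roughly like the sketch below. This is an illustration, not the code referred to above: `install_abort_handlers` is a hypothetical name, and the `abort` callable is passed in so the sketch runs without MPI. With mpi4py one would presumably pass `MPI.COMM_WORLD.Abort`.

```python
import signal
import sys


def install_abort_handlers(abort, signals=(signal.SIGABRT, signal.SIGINT, signal.SIGTERM)):
    """Call abort(1) on any of `signals` or on any uncaught exception.

    `abort` is a callable taking an error code (assumption), for example
    MPI.COMM_WORLD.Abort from mpi4py. Must be called from the main thread.
    """
    def handle_signal(signum, frame):
        abort(1)

    def handle_exception(exc_type, exc_value, traceback):
        # Print the traceback first so the failure still shows up in the log.
        sys.__excepthook__(exc_type, exc_value, traceback)
        abort(1)

    for signum in signals:
        signal.signal(signum, handle_signal)
    sys.excepthook = handle_exception
```

Usage would be a single call early in the program, e.g. `install_abort_handlers(MPI.COMM_WORLD.Abort)` after `from mpi4py import MPI`.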

@andrrizzi
Contributor

I also wanted to write a decorator for the function repex._write_iteration_netcdf() so that it cannot be interrupted by SIGTERM or SIGINT, which would remove the risk of data corruption during the write.

@andrrizzi
Contributor

More specifically, I want to implement something like this: http://stackoverflow.com/a/21919644
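
A rough sketch of such a decorator follows, in the spirit of delaying signals until a critical section has finished. The name `delay_signals` is hypothetical, and the sketch assumes Python 3.8+ (for `signal.raise_signal`) and that the decorated function runs in the main thread, since that is the only thread where handlers can be installed.

```python
import functools
import signal


def delay_signals(signals=(signal.SIGINT, signal.SIGTERM)):
    """Decorator factory: hold `signals` until the wrapped call has returned."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            received = []

            def remember(signum, frame):
                # Record the signal instead of acting on it right away.
                received.append((signum, frame))

            previous = {s: signal.signal(s, remember) for s in signals}
            try:
                return func(*args, **kwargs)
            finally:
                # Restore the old handlers. None means the handler was not
                # installed from Python; falling back to SIG_DFL is an
                # approximation made for this sketch.
                for s, handler in previous.items():
                    signal.signal(s, handler if handler is not None else signal.SIG_DFL)
                # Re-deliver whatever arrived while the function was running.
                for signum, frame in received:
                    handler = previous[signum]
                    if callable(handler):
                        handler(signum, frame)
                    else:
                        signal.raise_signal(signum)
        return wrapper
    return decorator
```

Applied as `@delay_signals()` on the write function, a SIGTERM that arrives mid-write would only take effect once the write has completed.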

@jchodera
Member Author

This is a great idea!

@jchodera
Member Author

Do you want to open your own version of this PR and we can close this one?

@andrrizzi
Contributor

Ok! I'll close this one when I've tested everything in mine.

@jchodera jchodera closed this Jun 28, 2016
@jchodera
Member Author

Closed in favor of #412.
