I'd like to stress-test Condor and figure out where we can rely on it to restart jobs and where we need Galaxy to be smarter.
Condor
Set up a Condor cluster with VGCN, incl. a Central Manager + dedicated submit node + maybe 4 m1.small executors.
Launch 1M tiny jobs (just exit 0 or so); do all 1M complete? At what throughput? Dump the output of condor_history to a file, process it, and maybe make some nice graphs.
Do the same, but repeatedly reboot the central manager at maybe 5-minute intervals throughout the test. Does everything come back successfully?
Do the same, but kill the central manager in the middle of the test: just delete it in OpenStack and replace it. What is lost?
Do the same, but repeatedly reboot compute nodes at random.
Do the same, but repeatedly kill + replace compute nodes (e.g. with Terraform).
(If 1M is too high and takes multiple hours, decrease it until the tests run in ~20 minutes.)
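A rough sketch of the "dump condor_history, process" step could look like the following. It assumes the history was dumped with something like condor_history -af ClusterId ProcId QDate CompletionDate JobStatus, giving one job per line with whitespace-separated fields, and that a JobStatus of 4 means the job completed. The function name, the Summary type, and the sample lines are made up for illustration and are not real test results.

```python
from typing import Iterable, NamedTuple, Optional


class Summary(NamedTuple):
    total: int              # jobs seen in the dump
    completed: int          # jobs with JobStatus == 4 and a completion time
    wall_seconds: int       # first submit to last completion
    jobs_per_second: float  # completed / wall_seconds


def summarize_history(lines: Iterable[str]) -> Summary:
    """Summarize lines of: ClusterId ProcId QDate CompletionDate JobStatus."""
    total = 0
    completed = 0
    first_submit: Optional[int] = None
    last_finish: Optional[int] = None
    for line in lines:
        fields = line.split()
        if len(fields) != 5 or not fields[2].isdigit():
            continue  # skip blank or malformed lines
        _cluster, _proc, qdate, done, status = fields
        total += 1
        q = int(qdate)
        first_submit = q if first_submit is None else min(first_submit, q)
        # CompletionDate is 0 (or undefined) for jobs that never finished.
        if status == "4" and done.isdigit() and int(done) > 0:
            completed += 1
            d = int(done)
            last_finish = d if last_finish is None else max(last_finish, d)
    wall = 0
    if first_submit is not None and last_finish is not None:
        wall = last_finish - first_submit
    rate = completed / wall if wall > 0 else 0.0
    return Summary(total, completed, wall, rate)


# Illustrative input only: two completed jobs and one removed job.
sample = ["1 0 1000 1010 4", "1 1 1000 1020 4", "1 2 1000 0 3"]
summary = summarize_history(sample)
print(summary)
```

For a real run, the same function can be fed an open file handle for the dump instead of the sample list. Comparing total against the number of jobs submitted shows whether anything was lost across the reboots.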
Galaxy
Set up the same, but add a Galaxy server + NFS server. (We can help here.)
Launch thousands of jobs that take some time to complete (e.g. sleep 60; echo "hi" in a tool), and repeatedly kill compute nodes. Do the jobs complete successfully with their expected output?
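One way to answer the "expected output" question is a small checker like the sketch below. It assumes the job outputs have been collected as one .txt file per job in a single directory; that layout, the file names, and the check_outputs helper are hypothetical and do not reflect how Galaxy itself stores datasets.

```python
import tempfile
from pathlib import Path


def check_outputs(output_dir, expected_count, expected_text="hi"):
    """Count outputs that match, list those that differ, and count missing ones."""
    ok = 0
    bad = []
    for path in sorted(Path(output_dir).glob("*.txt")):
        if path.read_text().strip() == expected_text:
            ok += 1
        else:
            bad.append(path.name)  # truncated or otherwise unexpected output
    missing = max(expected_count - ok - len(bad), 0)
    return {"ok": ok, "bad": bad, "missing": missing}


# Self-contained demo: three jobs expected, one good file, one empty file.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "job_0.txt").write_text("hi\n")
    Path(tmp, "job_1.txt").write_text("")
    report = check_outputs(tmp, expected_count=3)
print(report)
```

A non-empty bad list or a non-zero missing count after killing compute nodes would point at jobs that Condor did not restart cleanly.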
I tested it with 10k small jobs, which took around 6.5 minutes to submit (50k already took more than 30 minutes). Each job was a simple sleep 1; echo "$(hostname)". I tried rebooting and replacing the exec nodes and the central manager, and at least in my tests all jobs completed successfully with the correct outputs. I still need to do the Galaxy part.
@erasche I'm in the office tomorrow. If you are there and have the time, I could show you the rest of my results.
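For a sense of scale, the submission time reported above (10k jobs in about 6.5 minutes) can be extrapolated to the 1M-job target. This is back-of-the-envelope arithmetic that assumes the submit rate stays constant, which has not been measured; the helper name is made up.

```python
def submit_rate(jobs: int, minutes: float) -> float:
    """Average submission rate in jobs per second."""
    return jobs / (minutes * 60)


rate = submit_rate(10_000, 6.5)           # from the 10k-job run
hours_for_1m = 1_000_000 / rate / 3600    # assumes the rate holds at 1M jobs
print(f"{rate:.1f} jobs/s, ~{hours_for_1m:.1f} h to submit 1M jobs")
```

At that rate 50k jobs would need roughly 32 minutes, which is consistent with the "more than 30 minutes" observation. Submitting 1M jobs would take on the order of 11 hours, well beyond the ~20-minute target for a test run.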