Cluster Testing Plan #17

hexylena · 2018-12-13T13:46:05Z

I'd like to stress-test Condor and figure out where we can rely on it to restart jobs and where we need Galaxy to be smarter.

Condor

Set up a Condor cluster with VGCN, including a Central Manager, a dedicated submit node, and maybe 4 m1.small executors.

Launch 1M tiny jobs (just exit 0 or so). Do all 1M complete? At what throughput? Dump the output of condor_history to a file, process it, and maybe make some nice graphs.
Do the same, but repeatedly reboot the central manager at maybe 5-minute intervals throughout the test. Does everything come back successfully?
Do the same, but kill the central manager in the middle of the test. Just delete it in OpenStack and replace it. What is lost?
Do the same, but repeatedly reboot compute nodes at random.
Do the same, but repeatedly kill and replace compute nodes (e.g. with Terraform).

(If 1M is too high and takes multiple hours then decrease until the tests run in ~20 minutes.)
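A minimal sketch of the first test, not a tested setup: one helper writes a submit description for N no-op jobs, and another turns condor_history output into completions per minute. It assumes the history was dumped with `condor_history -af ClusterId ProcId JobStatus CompletionDate`; the function names, the log file name, and the use of `/bin/true` as the tiny job are illustrative choices, not something from the plan above.

```python
from collections import Counter


def build_submit_description(n_jobs, log_path="stress.log"):
    """Return an HTCondor submit description that queues n_jobs no-op jobs.

    log_path is a hypothetical file name chosen for this sketch.
    """
    lines = [
        "universe = vanilla",
        "executable = /bin/true",  # assumption: the tiny job simply exits 0
        f"log = {log_path}",
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


def completions_per_minute(history_lines):
    """Count completed jobs per minute.

    Each line is expected to hold four whitespace-separated fields:
    ClusterId ProcId JobStatus CompletionDate (epoch seconds).
    Only JobStatus 4 (completed) is counted; malformed lines are skipped.
    """
    per_minute = Counter()
    for line in history_lines:
        fields = line.split()
        if len(fields) != 4:
            continue
        _cluster, _proc, status, completed = fields
        if status != "4" or not completed.isdigit() or int(completed) == 0:
            continue
        # Bucket by the start of the minute the job finished in.
        per_minute[int(completed) // 60 * 60] += 1
    return dict(sorted(per_minute.items()))
```

Comparing the total of the per-minute counts against the number of queued jobs would show whether anything was lost, and the per-minute buckets could feed the graphs directly.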

Galaxy

Same setup, but add a Galaxy server and an NFS server. (We can help here.)

Launch thousands of jobs that take some time to complete (e.g. sleep 60; echo "hi" in a tool), and repeatedly kill compute nodes. Do the jobs complete successfully with their expected output?
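Checking "expected output" could look roughly like the sketch below, which flags every job whose output is not exactly the expected string. The directory layout, the `*.dat` pattern, and both function names are hypothetical; where Galaxy actually writes the datasets depends on the instance configuration.

```python
from pathlib import Path


def find_bad_outputs(outputs, expected="hi"):
    """Return the names of outputs whose stripped text differs from expected.

    outputs is an iterable of (name, text) pairs.
    """
    return [name for name, text in outputs if text.strip() != expected]


def read_outputs(output_dir, pattern="*.dat"):
    """Yield (file name, contents) for each matching file in output_dir.

    The pattern is an assumption for this sketch; adjust it to the real
    dataset file names.
    """
    for path in sorted(Path(output_dir).glob(pattern)):
        yield path.name, path.read_text()
```

Something like `find_bad_outputs(read_outputs("/path/to/datasets"))` would then list the failed or truncated outputs. Jobs that never produced a file at all would need a separate count against the number submitted.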


bgruening · 2019-05-27T13:14:33Z

ping @AndreasSko

AndreasSko · 2019-07-31T13:21:34Z

I tested it with 10k small jobs, which took around 6.5 minutes to submit (50k already took more than 30 minutes). The jobs were a simple sleep 1; echo "$(hostname)". I tried rebooting and replacing the exec nodes and the central manager, and at least in my tests all jobs completed successfully with the correct outputs. I still need to do the Galaxy part.

@erasche I'm in the office tomorrow. If you are there and have the time, I could show you the rest of my results.

hexylena · 2019-07-31T13:22:51Z

Sounds great! Let's talk then :)
