-
Notifications
You must be signed in to change notification settings - Fork 566
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up test suite to <20 minutes from >30 minutes #6127
Comments
I've recently looked into it an the length of the aarch64 tests is almost completely due to the client attach/detach tests all timing out after 10 minutes. We should be able to reduce this by either completely disabling the de/attach tests, fixing them to actually work, or reducing the timeouts. |
Looks like that is #5740 which says it happened after a Jenkins hardware change. The client.detach test did not exist until 2 days ago in PR #6513. It has been flaky on x86-64 since being added which I filed as #6536; the failures look like the client.attach_test failures which had happened a few times on the Linux x86-64 suite: #6452. But, those x86-64 failures are not timeouts, so it is not clear any of that is related to #5740. |
Now that we've enabled ptrace privileges on the a64 testing machine we can remove the attach/detach tests from the flaky list as they no longer hang. The attach test passed 100x in a row for me so it doesn't seem to be hitting flakes seen on other platforms like #6452. Issue: #5740, #6558, #6127 Fixes #5740
Now that we've enabled ptrace privileges on the a64 testing machine we can remove the attach/detach tests from the flaky list as they no longer hang. The attach_test, attach_blocking, and deatch_test each passed 200x in a row on this machine so it doesn't seem to be hitting flakes seen on other platforms like #6452. Issue: #5740, #6558, #6127 Fixes #5740
You can see the drop from 33 mins down to 8 mins at the point I enabled ptrace: https://github.com/DynamoRIO/dynamorio/actions/workflows/ci-aarch64-native.yml Maybe the only final action item would be to reduce the timeouts: but we already did that, with the default at 90s. So I'm confused as to why the attach/detach tests would take 10 mins to time out: I don't see timeout overrides for those. Hmm actually it seems to only explicitly set a CTest timeout property for runcmp -- I will fix that, as part of this and #6558. |
Before, we relied on drrun -s for all test suite timeouts except for runcmp tests where we set a CTest timeout. This resulted in the default 10 minute CTest timeout for all tests, which was the only timeout for runall tests and caused long suite times on the AArch64 machine which accidentally had no ptrace privileges (#5740, #6558, Here, we set the CTest time for runall in addition to runcmp, and for all other tests with no timeout specified (which are presumably relying on drrun -s) we set a timeout of the drrun timeout plus 30 seconds. Tested on the attach test: Before: ``` 123: Test timeout computed to be: 1500 ``` Now: ``` $ echo 1 | sudo tee /proc/sys/kernel/yama/ptrace_scope; /usr/bin/time ctest -V -R client.attach_test; echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope ... Start 121: code_api|client.attach_test ... 121: Test timeout computed to be: 90 1/1 Test #121: code_api|client.attach_test ......***Timeout 90.11 sec ... The following tests FAILED: 121 - code_api|client.attach_test (Timeout) ... Command exited with non-zero status 8 1.13user 0.80system 1:30.14elapsed 2%CPU (0avgtext+0avgdata 13196maxresident)k ``` Fixes #6127 Issue: #6127, #6558, #5740
Glad to see it. I'm making sure this is set properly on the new AWS runners too. |
Before, we relied on drrun -s for all test suite timeouts except for runcmp tests where we set a CTest timeout. This resulted in the default 10 minute CTest timeout for all tests, which was the only timeout for runall tests and caused long suite times on the AArch64 machine which accidentally had no ptrace privileges (#5740, #6558, Here, we set the CTest time for runall in addition to runcmp, and for all other tests with no timeout specified (which are presumably relying on drrun -s) we set a timeout of the drrun timeout plus 30 seconds. Tested on the attach test: Before: ``` 123: Test timeout computed to be: 1500 ``` Now: ``` $ echo 1 | sudo tee /proc/sys/kernel/yama/ptrace_scope; /usr/bin/time ctest -V -R client.attach_test; echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope ... Start 121: code_api|client.attach_test ... 121: Test timeout computed to be: 90 1/1 Test #121: code_api|client.attach_test ......***Timeout 90.11 sec ... The following tests FAILED: 121 - code_api|client.attach_test (Timeout) ... Command exited with non-zero status 8 1.13user 0.80system 1:30.14elapsed 2%CPU (0avgtext+0avgdata 13196maxresident)k ``` The property being set on more tests was confirmed on debug x86-64: Before: ``` $ grep -c TIMEOUT suite/tests/CTestTestfile.cmake 94 ``` After: ``` $ grep -c TIMEOUT suite/tests/CTestTestfile.cmake 463 ``` There seem to still be a few missing the property: the ones that don't go through suite/. There were other efforts to avoid hangs on those such as PR #6137. Issue: #6127, #6558, #5740
Xref some discussion at #6096 (comment) and #6096 (comment) at adding parallelism to the ci-aarch64-native runner to avoid a backlog from simultaneous PR's waiting on each other.
The latency of the ci-aarch64-native job is also something to look at: when it was Jenkins it was pretty fast, under 10 minutes IIRC. Now you can see it is consistently over 33 minutes and is by far the slowest of our jobs:
https://github.com/DynamoRIO/dynamorio/actions/workflows/ci-aarch64-native.yml
The general goal is to be under 20 minutes. The Windows and Linux x86 jobs have gotten slower over time and are both over 20 minutes now. This issue covers trying to speed all of these up. The ci-aarch64-native seems the highest priority at the longest and because it used to be faster making it seems like there may be some low-hanging fruit there for improvements.
The text was updated successfully, but these errors were encountered: