-
Notifications
You must be signed in to change notification settings - Fork 99
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Github workflow: Segmentation fault #108
Comments
I looked into the configuration of the CMake tests, and besides being overly complex, they also seem to be fundamentally flawed (meaning they aren't testing anything). First I added a simple check to # execute the test command that was added earlier.
execute_process( COMMAND "${TEST}"
OUTPUT_FILE "${OUTPUT}"
RESULT_VARIABLE RET )
if(NOT RET EQUAL 0)
message("Error: ${RET}")
endif()
[...] which prints add_test( ${testName}_SP "${CMAKE_COMMAND}"
-DTEST=${TEST_LOC} -t "SP" -s ${s} -l ${l} -f ${TEST_INPUT}
[...] will try execute the directory and not the actual test executable inside the directory. Simplifying add_test(
NAME ${testName}_SP
COMMAND ${target} -t "SP" -s ${s} -l ${l} -f "${TEST_INPUT}") then reveals the segfault:
|
Two more observations on the actual error: The problem occurs both in debug and release mode and it doesn't seem to behave deterministic. While most of the time I get the segfault, sometimes the tests finish but produce garbage solutions:
EDIT:
|
I think it would be best to open 3 separate issues:
|
Here's what valgrind has to say about it:
|
So, the relevant part is
|
So I think I tracked down the origin of the problem. Not sure what the correct fix would be, though. In Lines 573 to 577 in 29ea08a
at the time of calling, expanders[type + 1] is not initialized (where type = LUSUP ).
That is due to superlu/SRC/superlu_enum_consts.h Lines 37 to 38 in 29ea08a
and dLUMemInit only initializing the first four positions of expandersLines 243 to 250 in 29ea08a
EDIT: |
I was wondering about the test Lines 573 to 577 in 29ea08a
Maybe the whole problem originates in a change of Before that change the order was typedef enum {LUSUP, UCOL, LSUB, USUB, LLVL, ULVL} MemType; so testing for typedef enum {USUB, LSUB, UCOL, LUSUP, LLVL, ULVL, NO_MEMTYPE} MemType; and maybe that was just missed in other places of the code, like So testing for |
I just tested replacing
I guess this is as far as I can go without digging into the memory management details of SuperLU. |
I can reproduce the issue. Your analysis looks good, I am convinced that this was introduced by the commit you mentioned! Skimming through the commit, the changes to the enum are nowhere motivated and thus most probably wrong. @xiaoyeli What do you think? Can we (partially) revert 52fc55d? Or do you know which pieces are needed from superlu_dist to fix these examples? |
Resolved it in Master. |
Alright. Now the remaining two issues mentioned above #108 (comment) should be addressed. Regarding the Github workflow: since the CMake build script works pretty well, I'd suggest installing cmake and then use it to build and test. Something along the lines (not tested) - uses: actions/checkout@v3
- name: Configure
run: cmake -B build
- name: Build
run: cmake --build build --parallel
- name: Test
run: ctest --test-dir build --output-on-failure |
Btw, if you look at the test output of the Github workflow, you still see a bunch of segfaults. I strongly suggest that you fix the test setup, so the workflow reflects those problems. |
I just tested the cmake workflow here https://github.com/wo80/superlu/commit/e494d2ac8c1bb17d475432273cdf8b60ba6f391a @xiaoyeli Please let me know if you want me to merge this into #112 EDIT: I applied the change suggested in #108 (comment) (see https://github.com/wo80/superlu/commit/53794fa76ae7f92c619f5b7940cc08ffa8daae1b) and this makes the tests fail. The segfault is also still present. |
After merging the upstream changes, the segfault seems to be fixed. But now the "LA" d_tests fail with
see https://github.com/wo80/superlu/actions/runs/6146404541/job/16675708872 Failing tests:
Valgrind output:
|
I think I found the culprit: cf93b7e Lines 134 to 146 in 90ee45d
This was an external contribution merged two days ago. And it's the perfect demonstration, how important a functional CI test setup is. So I'll quote myself from #112:
|
I rebased #112 and added the changes from my fix/github-workflow branch (now deleted). |
I haven't addressed the above issue in EDIT: just for demonstration https://github.com/wo80/superlu/actions/runs/6149774863/job/16686331442 |
f63265a seems to fix the issue, the workflow tests are passing. One question remaining, though: Comparing |
I would like to extend this list:
|
I just pushed the fix to the following:
|
@xiaoyeli This can be closed now. Thanks for the fix! |
I was skipping over the Checks tab of my recent pull request and in the Tests section I saw a couple of
Segmentation fault (core dumped)
. This error is also present in all other pull requests running the workflow.The first, obvious problem: a test that is producing an error should make the workflow check fail.
But I can also reproduce this error for example in
dlinsolx
on WindowsWhile this fails on Windows, on Arch linux the above command succeeds, but the test command fails with a segmentation fault:
Can anybody reproduce this?
The text was updated successfully, but these errors were encountered: