Github workflow: Segmentation fault #108

Closed
wo80 opened this issue Aug 4, 2023 · 21 comments

Comments

@wo80
Contributor

wo80 commented Aug 4, 2023

I was skimming the Checks tab of my recent pull request, and in the Tests section I saw a couple of "Segmentation fault (core dumped)" messages. This error is also present in all other pull requests running the workflow.

The first, obvious problem: a test that is producing an error should make the workflow check fail.

But I can also reproduce this error for example in dlinsolx on Windows

./dlinsolx -l 100000000 < ../../EXAMPLE/g20.rua

While this fails on Windows, on Arch Linux the above command succeeds, but the test command fails with a segmentation fault:

./d_test -t "SP" -s 5 -l 100000000 -f ../../EXAMPLE/g20.rua

Can anybody reproduce this?

@wo80
Contributor Author

wo80 commented Aug 5, 2023

I looked into the configuration of the CMake tests, and besides being overly complex, they also seem to be fundamentally flawed (meaning they aren't testing anything).

First I added a simple check to runtest.cmake:

# execute the test command that was added earlier.
execute_process( COMMAND "${TEST}" 
  OUTPUT_FILE "${OUTPUT}"
  RESULT_VARIABLE RET )

if(NOT RET EQUAL 0)
  message("Error: ${RET}")
endif()

[...]

which prints Error: permission denied. This is because in TESTING/CMakeLists.txt the command set(TEST_LOC ${CMAKE_CURRENT_BINARY_DIR}) sets TEST_LOC to a directory, and so

add_test( ${testName}_SP  "${CMAKE_COMMAND}"
  -DTEST=${TEST_LOC} -t "SP" -s ${s} -l ${l} -f ${TEST_INPUT}
  [...]

will try to execute the directory and not the actual test executable inside it. Simplifying add_test to

add_test(
  NAME ${testName}_SP
  COMMAND ${target} -t "SP" -s ${s} -l ${l} -f "${TEST_INPUT}")

then reveals the segfault:

Test project /projects/superlu/build/Testing
      Start  1: s_test_9_2_0_LA
 1/24 Test  #1: s_test_9_2_0_LA ..................   Passed    0.02 sec
      Start  2: s_test_9_2_10000000_LA
 2/24 Test  #2: s_test_9_2_10000000_LA ...........   Passed    0.02 sec
      Start  3: s_test_19_2_0_LA
 3/24 Test  #3: s_test_19_2_0_LA .................   Passed    0.03 sec
      Start  4: s_test_19_2_10000000_LA
 4/24 Test  #4: s_test_19_2_10000000_LA ..........   Passed    0.03 sec
      Start  5: s_test_2_0_SP
 5/24 Test  #5: s_test_2_0_SP ....................   Passed    0.06 sec
      Start  6: s_test_2_10000000_SP
 6/24 Test  #6: s_test_2_10000000_SP .............   Passed    0.07 sec
      Start  7: d_test_9_2_0_LA
 7/24 Test  #7: d_test_9_2_0_LA ..................   Passed    0.02 sec
      Start  8: d_test_9_2_10000000_LA
 8/24 Test  #8: d_test_9_2_10000000_LA ...........   Passed    0.02 sec
      Start  9: d_test_19_2_0_LA
 9/24 Test  #9: d_test_19_2_0_LA .................   Passed    0.03 sec
      Start 10: d_test_19_2_10000000_LA
10/24 Test #10: d_test_19_2_10000000_LA ..........   Passed    0.03 sec
      Start 11: d_test_2_0_SP
11/24 Test #11: d_test_2_0_SP ....................   Passed    0.06 sec
      Start 12: d_test_2_10000000_SP
12/24 Test #12: d_test_2_10000000_SP .............***Exception: SegFault  0.01 sec
      Start 13: c_test_9_2_0_LA
13/24 Test #13: c_test_9_2_0_LA ..................   Passed    0.02 sec
      Start 14: c_test_9_2_10000000_LA
14/24 Test #14: c_test_9_2_10000000_LA ...........   Passed    0.03 sec
      Start 15: c_test_19_2_0_LA
15/24 Test #15: c_test_19_2_0_LA .................   Passed    0.06 sec
      Start 16: c_test_19_2_10000000_LA
16/24 Test #16: c_test_19_2_10000000_LA ..........   Passed    0.06 sec
      Start 17: c_test_2_0_SP
17/24 Test #17: c_test_2_0_SP ....................   Passed    0.12 sec
      Start 18: c_test_2_10000000_SP
18/24 Test #18: c_test_2_10000000_SP .............***Exception: SegFault  0.01 sec
      Start 19: z_test_9_2_0_LA
19/24 Test #19: z_test_9_2_0_LA ..................   Passed    0.03 sec
      Start 20: z_test_9_2_10000000_LA
20/24 Test #20: z_test_9_2_10000000_LA ...........   Passed    0.03 sec
      Start 21: z_test_19_2_0_LA
21/24 Test #21: z_test_19_2_0_LA .................   Passed    0.07 sec
      Start 22: z_test_19_2_10000000_LA
22/24 Test #22: z_test_19_2_10000000_LA ..........   Passed    0.07 sec
      Start 23: z_test_2_0_SP
23/24 Test #23: z_test_2_0_SP ....................   Passed    0.15 sec
      Start 24: z_test_2_10000000_SP
24/24 Test #24: z_test_2_10000000_SP .............   Passed    0.16 sec

92% tests passed, 2 tests failed out of 24

Total Test time (real) =   1.23 sec

The following tests FAILED:
         12 - d_test_2_10000000_SP (SEGFAULT)
         18 - c_test_2_10000000_SP (SEGFAULT)
Errors while running CTest

@wo80
Contributor Author

wo80 commented Aug 5, 2023

Two more observations on the actual error:

The problem occurs in both debug and release mode, and it doesn't seem to behave deterministically. Most of the time I get the segfault, but sometimes the tests finish and produce garbage solutions:

    [...]
    dgssvx:fact=   3, trans=   1, equed=B, n=400, imat=0, test(1)=  6.2187e+09
    dgssvx:fact=   3, trans=   1, equed=B, n=400, imat=0, test(2)=  3.0712e+10
    dgssvx:fact=   3, trans=   1, equed=B, n=400, imat=0, test(4)=  3.8735e+09
    dgssvx:fact=   3, trans=   0, equed=B, n=400, imat=0, test(1)=  1.9755e+14
    dgssvx:fact=   3, trans=   0, equed=B, n=400, imat=0, test(2)=  5.4097e+13
    dgssvx:fact=   3, trans=   0, equed=B, n=400, imat=0, test(4)=  2.2462e+13
    dgssvx:fact=   3, trans=   1, equed=B, n=400, imat=0, test(1)=  6.2187e+09
    dgssvx:fact=   3, trans=   1, equed=B, n=400, imat=0, test(2)=  3.0712e+10
    dgssvx:fact=   3, trans=   1, equed=B, n=400, imat=0, test(4)=  3.8735e+09
DGE driver: 92 out of 144 tests failed to pass the threshold

EDIT:
To make ctest recognize this as a test failure, the drivers (cdrive.c etc.) should not return 0, but

    return nfail == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
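A minimal sketch of that idea in isolation, assuming the driver keeps its failure count in an integer. The helper name exit_status_from_failures is hypothetical and not part of SuperLU:

```c
#include <stdlib.h>

/* Hypothetical helper (not in SuperLU): map a driver's failure count to a
 * process exit status, so that ctest or any other caller that checks the
 * exit code sees a run with failed checks as a failure. */
int exit_status_from_failures(int nfail)
{
    return nfail == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

A driver's main would then end with return exit_status_from_failures(nfail); instead of return 0;.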

@wo80
Contributor Author

wo80 commented Aug 5, 2023

I think it would be best to open 3 separate issues:

  1. The Github workflow should fail when a test fails (shouldn't matter if an actual test condition fails or a segfault occurs)
  2. The CMake test setup needs to be fixed (addressed in PR Fix CMake test setup #112 )
  3. The actual cause of the segfault needs to be investigated

@wo80
Contributor Author

wo80 commented Aug 8, 2023

Here's what valgrind has to say about it:

/projects/superlu/build/TESTING$ valgrind ./d_test -t "SP" -s 5 -l 5000000 -f ../../EXAMPLE/g20.rua
==11462== Memcheck, a memory error detector
==11462== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==11462== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==11462== Command: ./d_test -t SP -s 5 -l 5000000 -f ../../EXAMPLE/g20.rua
==11462== 
.. test sparse matrix in file: ../../EXAMPLE/g20.rua
g20, symm permuted by SYMMMD                                            SYM     
==11462== Conditional jump or move depends on uninitialised value(s)
==11462==    at 0x12BCC2: relax_snode (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x11C973: dgstrf (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x10A1B2: main (in /projects/superlu/build/TESTING/d_test)
==11462== 
==11462== Conditional jump or move depends on uninitialised value(s)
==11462==    at 0x116E87: user_bcopy (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12541C: dexpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x124EC2: dLUMemXpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12300E: dcolumn_bmod (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x11CFA0: dgstrf (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x10A1B2: main (in /projects/superlu/build/TESTING/d_test)
==11462== 
==11462== Use of uninitialised value of size 8
==11462==    at 0x116E6C: user_bcopy (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12541C: dexpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x124EC2: dLUMemXpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12300E: dcolumn_bmod (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x11CFA0: dgstrf (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x10A1B2: main (in /projects/superlu/build/TESTING/d_test)
==11462== 
==11462== Use of uninitialised value of size 8
==11462==    at 0x116E73: user_bcopy (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12541C: dexpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x124EC2: dLUMemXpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12300E: dcolumn_bmod (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x11CFA0: dgstrf (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x10A1B2: main (in /projects/superlu/build/TESTING/d_test)
==11462== 
==11462== Invalid read of size 1
==11462==    at 0x116E6C: user_bcopy (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12541C: dexpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x124EC2: dLUMemXpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12300E: dcolumn_bmod (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x11CFA0: dgstrf (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x10A1B2: main (in /projects/superlu/build/TESTING/d_test)
==11462==  Address 0x4f2503f is 1 bytes before a block of size 5,000,000 alloc'd
==11462==    at 0x48407B4: malloc (vg_replace_malloc.c:381)
==11462==    by 0x116CAD: superlu_malloc (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x1094F3: main (in /projects/superlu/build/TESTING/d_test)
==11462== 
==11462== Invalid write of size 1
==11462==    at 0x116E73: user_bcopy (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12541C: dexpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x124EC2: dLUMemXpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12300E: dcolumn_bmod (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x11CFA0: dgstrf (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x10A1B2: main (in /projects/superlu/build/TESTING/d_test)
==11462==  Address 0x4f2503f is 1 bytes before a block of size 5,000,000 alloc'd
==11462==    at 0x48407B4: malloc (vg_replace_malloc.c:381)
==11462==    by 0x116CAD: superlu_malloc (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x1094F3: main (in /projects/superlu/build/TESTING/d_test)
==11462== 
==11462== 
==11462== More than 10000000 total errors detected.  I'm not reporting any more.
==11462== Final error counts will be inaccurate.  Go fix your program!
==11462== Rerun with --error-limit=no to disable this cutoff.  Note
==11462== that errors may occur in your program without prior warning from
==11462== Valgrind, because errors are no longer being displayed.
==11462== 
==11462== 
==11462== Process terminating with default action of signal 11 (SIGSEGV)
==11462==  Bad permissions for mapped region at address 0x4B13FFF
==11462==    at 0x116E73: user_bcopy (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12541C: dexpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x124EC2: dLUMemXpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12300E: dcolumn_bmod (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x11CFA0: dgstrf (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x10A1B2: main (in /projects/superlu/build/TESTING/d_test)
==11462== 
==11462== HEAP SUMMARY:
==11462==     in use at exit: 5,169,304 bytes in 37 blocks
==11462==   total heap usage: 46 allocs, 9 frees, 5,185,088 bytes allocated
==11462== 
==11462== LEAK SUMMARY:
==11462==    definitely lost: 0 bytes in 0 blocks
==11462==    indirectly lost: 0 bytes in 0 blocks
==11462==      possibly lost: 0 bytes in 0 blocks
==11462==    still reachable: 5,169,304 bytes in 37 blocks
==11462==         suppressed: 0 bytes in 0 blocks
==11462== Rerun with --leak-check=full to see details of leaked memory
==11462== 
==11462== Use --track-origins=yes to see where uninitialised values come from
==11462== For lists of detected and suppressed errors, rerun with: -s
==11462== ERROR SUMMARY: 10000000 errors from 6 contexts (suppressed: 0 from 0)

@wo80
Contributor Author

wo80 commented Aug 8, 2023

So, the relevant part is

Invalid write of size 1
   at 0x116E73: user_bcopy
   by 0x12541C: dexpand
   by 0x124EC2: dLUMemXpand
   by 0x12300E: dcolumn_bmod
   by 0x11CFA0: dgstrf
   by 0x10A1B2: main
 Address 0x4f2503f is 1 bytes before a block of size 5,000,000 alloc'd

@wo80
Contributor Author

wo80 commented Aug 11, 2023

So I think I tracked down the origin of the problem. Not sure what the correct fix would be, though.

In dmemory.c, function dexpand

superlu/SRC/dmemory.c

Lines 573 to 577 in 29ea08a

if ( type != USUB ) {
    new_mem = (void*)((char*)expanders[type + 1].mem + extra);
    bytes_to_copy = (char*)Glu->stack.array + Glu->stack.top1
        - (char*)expanders[type + 1].mem;
    user_bcopy(expanders[type+1].mem, new_mem, bytes_to_copy);

at the time of calling, expanders[type + 1] is not initialized (where type = LUSUP).

That is due to

//typedef enum {LUSUP, UCOL, LSUB, USUB, LLVL, ULVL, NO_MEMTYPE} MemType;
typedef enum {USUB, LSUB, UCOL, LUSUP, LLVL, ULVL, NO_MEMTYPE} MemType;

and dLUMemInit only initializing the first four positions of expanders

superlu/SRC/dmemory.c

Lines 243 to 250 in 29ea08a

}
lusup = (double *) dexpand( &nzlumax, LUSUP, 0, 0, Glu );
ucol = (double *) dexpand( &nzumax, UCOL, 0, 0, Glu );
lsub = (int_t *) dexpand( &nzlmax, LSUB, 0, 0, Glu );
usub = (int_t *) dexpand( &nzumax, USUB, 0, 1, Glu );
while ( !lusup || !ucol || !lsub || !usub ) {

EDIT:
I only debugged d_test. The same will most likely be the cause in c, s and z versions of the code.
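To illustrate why an uninitialized source pointer is dangerous here: the length is derived by subtracting that pointer from the top of the workspace, so a garbage pointer yields a garbage length. The following is only a sketch of a defensive range check that could run before the length is handed to a copy routine such as user_bcopy. The function checked_bytes_to_copy and its parameters are hypothetical and not part of SuperLU:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical guard (not in SuperLU): return the number of bytes between
 * src and the top of the workspace, or -1 if src does not lie inside the
 * workspace [base, base + top]. Addresses are compared as integers because
 * src may not point into the workspace at all. */
ptrdiff_t checked_bytes_to_copy(const char *base, size_t top, const char *src)
{
    uintptr_t lo = (uintptr_t)base;
    uintptr_t hi = lo + top;
    uintptr_t p  = (uintptr_t)src;

    if (src == NULL || p < lo || p > hi)
        return -1;              /* pointer is unset or outside the workspace */
    return (ptrdiff_t)(hi - p); /* safe length for the copy */
}
```

Such a check would only turn the symptom into a clean error; it would not address why the entry is uninitialized in the first place.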

@wo80
Contributor Author

wo80 commented Aug 11, 2023

I was wondering about the test type != USUB in

superlu/SRC/dmemory.c

Lines 573 to 577 in 29ea08a

if ( type != USUB ) {
    new_mem = (void*)((char*)expanders[type + 1].mem + extra);
    bytes_to_copy = (char*)Glu->stack.array + Glu->stack.top1
        - (char*)expanders[type + 1].mem;
    user_bcopy(expanders[type+1].mem, new_mem, bytes_to_copy);

Maybe the whole problem originates in a change of MemType made in
52fc55d#diff-4964d63c55baaf45c54e2d8b0485e230848b1a9da1d7c9fa40bdc3f77442c08d

Before that change the order was

typedef enum {LUSUP, UCOL, LSUB, USUB, LLVL, ULVL}              MemType;

so testing for USUB would have been correct. But the order changed to

typedef enum {USUB, LSUB, UCOL, LUSUP, LLVL, ULVL, NO_MEMTYPE}  MemType;

and maybe that was just missed in other places of the code, like dexpand.

So testing for type != LUSUP might be the correct fix. But I don't have enough insight into the SuperLU implementation details to be sure :-)
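The effect of the reordering on the guard can be checked with a small standalone sketch. It assumes, as described in this thread, that only the four entries at indices 0 to 3 of expanders are initialized. The OLD_/NEW_ prefixes and both helper functions are hypothetical, introduced only for this illustration:

```c
/* The two enum orders quoted above. The OLD_/NEW_ prefixes are hypothetical,
 * added only so that both orders can live in one translation unit. */
typedef enum { OLD_LUSUP, OLD_UCOL, OLD_LSUB, OLD_USUB,
               OLD_LLVL, OLD_ULVL } OldMemType;
typedef enum { NEW_USUB, NEW_LSUB, NEW_UCOL, NEW_LUSUP,
               NEW_LLVL, NEW_ULVL, NEW_NO_MEMTYPE } NewMemType;

/* Assumption taken from the thread: only indices 0..3 are initialized. */
enum { NUM_INITIALIZED = 4 };

/* Returns 1 if expanders[type + 1] is one of the initialized entries. */
int next_slot_is_initialized(int type)
{
    return type + 1 < NUM_INITIALIZED;
}

/* Returns 1 if every array type (0..3) that passes the check
 * `type != guard` only ever touches an initialized neighbour. */
int guard_is_safe(int guard)
{
    for (int type = 0; type < NUM_INITIALIZED; ++type) {
        if (type != guard && !next_slot_is_initialized(type))
            return 0;
    }
    return 1;
}
```

With the old order, USUB has value 3, so the guard excludes the last initialized index and guard_is_safe(OLD_USUB) returns 1. With the new order, USUB has value 0, LUSUP (value 3) passes the guard and reaches index 4, so guard_is_safe(NEW_USUB) returns 0.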

@wo80
Contributor Author

wo80 commented Aug 12, 2023

I just tested replacing type != USUB with type != LUSUP. Though this prevents the segfault, it does not prevent some of the tests from failing (assuming the return value fix for the test drivers mentioned in #108 (comment) is applied).

The following tests FAILED:
	  6 - s_test_2_10000000_SP (Failed)
	 12 - d_test_2_10000000_SP (Failed)
	 18 - c_test_2_10000000_SP (Failed)
	 24 - z_test_2_10000000_SP (Failed)

I guess this is as far as I can go without digging into the memory management details of SuperLU.

@gruenich
Contributor

I can reproduce the issue. Your analysis looks good, I am convinced that this was introduced by the commit you mentioned! Skimming through the commit, the changes to the enum are nowhere motivated and thus most probably wrong.

@xiaoyeli What do you think? Can we (partially) revert 52fc55d? Or do you know which pieces are needed from superlu_dist to fix these examples?

xiaoyeli added a commit that referenced this issue Sep 11, 2023
…um_consts.h

Resolved issue #108: correct the macro order of MemType in superlu_enum_consts.h;
Merged PR #116: complex -> singlecomplex (consistent with doublecomplex);
Merged PR #117: fill the missing perm_r[] for rank deficient cases;
Resolved issue #119: missing BLAS names in slu_Cnames.h
@xiaoyeli
Owner

Resolved it in Master.

@wo80
Contributor Author

wo80 commented Sep 11, 2023

Resolved it in Master.

Alright. Now the remaining two issues mentioned above #108 (comment) should be addressed.

Regarding the Github workflow: since the CMake build script works pretty well, I'd suggest installing cmake and then using it to build and test. Something along these lines (not tested):

    - uses: actions/checkout@v3

    - name: Configure
      run: cmake -B build

    - name: Build
      run: cmake --build build --parallel

    - name: Test
      run: ctest --test-dir build --output-on-failure

@wo80
Contributor Author

wo80 commented Sep 11, 2023

Btw, if you look at the test output of the Github workflow, you still see a bunch of segfaults.

I strongly suggest that you fix the test setup, so the workflow reflects those problems.

@wo80
Contributor Author

wo80 commented Sep 11, 2023

I just tested the cmake workflow here https://github.com/wo80/superlu/commit/e494d2ac8c1bb17d475432273cdf8b60ba6f391a and all tests are passing.

@xiaoyeli Please let me know if you want me to merge this into #112

EDIT: I applied the change suggested in #108 (comment) (see https://github.com/wo80/superlu/commit/53794fa76ae7f92c619f5b7940cc08ffa8daae1b) and this makes the tests fail. The segfault is also still present.

@wo80
Contributor Author

wo80 commented Sep 11, 2023

After merging the upstream changes, the segfault seems to be fixed. But now the "LA" d_tests fail with

Subprocess aborted***Exception:   0.15 sec
dgstrf info 1
dgstrf info 1
dgstrf info 19
double free or corruption (out)

see https://github.com/wo80/superlu/actions/runs/6146404541/job/16675708872

Failing tests:

 7/24 Test  #7: d_test_9_2_0_LA ..................Subprocess aborted***Exception
 8/24 Test  #8: d_test_19_2_0_LA .................Subprocess aborted***Exception
 9/24 Test  #9: d_test_2_0_SP ....................Passed
10/24 Test #10: d_test_9_2_10000000_LA ...........Subprocess aborted***Exception
11/24 Test #11: d_test_19_2_10000000_LA ..........Subprocess aborted***Exception
12/24 Test #12: d_test_2_10000000_SP .............Passed

Valgrind output:

valgrind --track-origins=yes --leak-check=full ./d_test -t "LA" -n 9 -s 2 -l 0
Memcheck, a memory error detector
Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
Command: ./d_test -t LA -n 9 -s 2 -l 0

dgstrf info 1
dgstrf info 1
dgstrf info 9
Invalid read of size 8
   at 0x10C6BA: dgst01
   by 0x10B4CA: main
 Address 0x5158f78 is 8 bytes before a block of size 72 alloc'd
   at 0x48407B4: malloc (vg_replace_malloc.c:381)
   by 0x117DAA: superlu_malloc
   by 0x126CBE: doubleCalloc
   by 0x10C19D: dgst01
   by 0x10B4CA: main

[...] more of those errors

dgstrf info 9
dgstrf info 5
Invalid read of size 4
   at 0x10C382: dgst01
   by 0x10B4CA: main
 Address 0x516bc5c is 4 bytes before a block of size 40 alloc'd
   at 0x48407B4: malloc (vg_replace_malloc.c:381)
   by 0x117DAA: superlu_malloc
   by 0x117FAE: int32Malloc
   by 0x1258B5: dLUMemInit
   by 0x11D857: dgstrf
   by 0x118D40: dgssv
   by 0x10B422: main

Invalid read of size 4
   at 0x10C3FA: dgst01
   by 0x10B4CA: main
 Address 0x516c8bc is 4 bytes before a block of size 648 alloc'd
   at 0x48407B4: malloc (vg_replace_malloc.c:381)
   by 0x117DAA: superlu_malloc
   by 0x1264D8: dexpand
   by 0x125A0D: dLUMemInit
   by 0x11D857: dgstrf
   by 0x118D40: dgssv
   by 0x10B422: main

[...] more of those errors

dgstrf info 5
All tests for DGE driver passed the threshold (  1158 tests run)

HEAP SUMMARY:
    in use at exit: 319,748 bytes in 578 blocks
  total heap usage: 23,849 allocs, 23,271 frees, 11,336,936 bytes allocated

2,688 (960 direct, 1,728 indirect) bytes in 24 blocks are definitely lost in loss record 15 of 23
   at 0x48407B4: malloc (vg_replace_malloc.c:381)
   by 0x117DAA: superlu_malloc
   by 0x118265: sp_preorder
   by 0x119862: dgssvx
   by 0x10B767: main

74,880 (1,408 direct, 73,472 indirect) bytes in 44 blocks are definitely lost in loss record 21 of 23
   at 0x48407B4: malloc (vg_replace_malloc.c:381)
   by 0x117DAA: superlu_malloc
   by 0x126E18: dCreate_CompCol_Matrix
   by 0x11E487: dgstrf
   by 0x1198E5: dgssvx
   by 0x10B767: main

81,216 (2,464 direct, 78,752 indirect) bytes in 44 blocks are definitely lost in loss record 22 of 23
   at 0x48407B4: malloc (vg_replace_malloc.c:381)
   by 0x117DAA: superlu_malloc
   by 0x127347: dCreate_SuperNode_Matrix
   by 0x11E44C: dgstrf
   by 0x1198E5: dgssvx
   by 0x10B767: main

LEAK SUMMARY:
   definitely lost: 4,832 bytes in 112 blocks
   indirectly lost: 153,952 bytes in 444 blocks
     possibly lost: 0 bytes in 0 blocks
   still reachable: 160,964 bytes in 22 blocks
        suppressed: 0 bytes in 0 blocks
Reachable blocks (those to which a pointer was found) are not shown.
To see them, rerun with: --leak-check=full --show-leak-kinds=all

For lists of detected and suppressed errors, rerun with: -s
ERROR SUMMARY: 3883 errors from 27 contexts (suppressed: 0 from 0)

@wo80
Contributor Author

wo80 commented Sep 11, 2023

I think I found the culprit: cf93b7e

superlu/SRC/dpivotL.c

Lines 134 to 146 in 90ee45d

/* Test for singularity */
if ( pivmax == 0.0 ) {
#if 0
// There is no valid pivot.
// jcol represents the rank of U
// report the rank let dgstrf handle the pivot
#if 1
*pivrow = lsub_ptr[pivptr];
perm_r[*pivrow] = jcol;
#else
perm_r[diagind] = jcol;
#endif
#endif

This was an external contribution merged two days ago. And it's a perfect demonstration of how important a functional CI test setup is. So I'll quote myself from #112:

I think it's important to have tests reflecting reality and I think that this should be merged rather sooner than later (even if the issue remains unresolved for now).

@wo80
Contributor Author

wo80 commented Sep 11, 2023

I rebased #112 and added the changes from my fix/github-workflow branch (now deleted).

@wo80
Contributor Author

wo80 commented Sep 11, 2023

I haven't addressed the above issue in dpivotL.c. I think it's better if you, @xiaoyeli, fix this in a single commit. Then the workflow shouldn't fail anymore.

EDIT: just for demonstration https://github.com/wo80/superlu/actions/runs/6149774863/job/16686331442

@wo80
Contributor Author

wo80 commented Sep 12, 2023

f63265a seems to fix the issue; the workflow tests are passing.

One question remaining, though:

Comparing dpivotL.c to the (c|s|z)pivotL.c variants, the code re-assigning perm_r[*pivrow] in the section labelled /* Test for singularity */ (see #108 (comment) above) is disabled in the d variant, but not in the c|s|z variants. Which one is correct?

@gruenich
Contributor

I think it would be best to open 3 separate issues:

  1. The Github workflow should fail when a test fails (shouldn't matter if an actual test condition fails or a segfault occurs)
  2. The CMake test setup needs to be fixed (addressed in PR Fix CMake test setup #112 )
  3. The actual cause of the segfault needs to be investigated
  1. This is addressed in Repair contiuous testing, also add test for Doxygen and Fortran #131, some of your changes and some additions from myself.
  2. Fixed by your commits, merged as [cmake] Simplify creation of tests #114.
  3. Segfault is also fixed, the checks for Repair contiuous testing, also add test for Doxygen and Fortran #131 are passing.

I would like to extend this list:
4. We need a Windows runner, I created #132 for this to not extend this thread any longer.
5. Your last question should be answered, @xiaoyeli do you know the answer to this question?

Comparing dpivotL.c to the (c|s|z)pivotL.c variants, the code re-assigning perm_r[*pivrow] in the section labelled /* Test for singularity */ (see #108 (comment) above) is disabled in the d variant, but not in the c|s|z variants. Which one is correct?

@xiaoyeli
Owner

I just pushed the fix to the following:

Comparing dpivotL.c to the (c|s|z)pivotL.c variants, the code re-assigning perm_r[*pivrow] in the section labelled /* Test for singularity */ (see #108 (comment) above) is disabled in the d variant, but not in the c|s|z variants. Which one is correct?

@gruenich
Contributor

@xiaoyeli This can be closed now. Thanks for the fix!
