Github workflow: Segmentation fault #108

wo80 · 2023-08-04T14:55:11Z

I was skimming the Checks tab of my recent pull request and saw a couple of "Segmentation fault (core dumped)" messages in the Tests section. The error is also present in all other pull requests that run the workflow.

The first, obvious problem: a test that is producing an error should make the workflow check fail.

I can also reproduce this error locally, for example with dlinsolx on Windows:

./dlinsolx -l 100000000 < ../../EXAMPLE/g20.rua

While this fails on Windows, the above command succeeds on Arch Linux, but the test command fails with a segmentation fault:

./d_test -t "SP" -s 5 -l 100000000 -f ../../EXAMPLE/g20.rua

Can anybody reproduce this?

wo80 · 2023-08-05T20:28:30Z

I looked into the configuration of the CMake tests, and besides being overly complex, they also seem to be fundamentally flawed (meaning they aren't testing anything).

First I added a simple check to runtest.cmake:

# execute the test command that was added earlier.
execute_process( COMMAND "${TEST}" 
  OUTPUT_FILE "${OUTPUT}"
  RESULT_VARIABLE RET )

if(NOT RET EQUAL 0)
  message("Error: ${RET}")
endif()

[...]

which prints Error: permission denied. This is because set(TEST_LOC ${CMAKE_CURRENT_BINARY_DIR}) in TESTING/CMakeLists.txt sets TEST_LOC to a directory, so

add_test( ${testName}_SP  "${CMAKE_COMMAND}"
  -DTEST=${TEST_LOC} -t "SP" -s ${s} -l ${l} -f ${TEST_INPUT}
  [...]

will try to execute the directory and not the actual test executable inside it. Simplifying add_test to

add_test(
  NAME ${testName}_SP
  COMMAND ${target} -t "SP" -s ${s} -l ${l} -f "${TEST_INPUT}")

then reveals the segfault:

Test project /projects/superlu/build/Testing
      Start  1: s_test_9_2_0_LA
 1/24 Test  #1: s_test_9_2_0_LA ..................   Passed    0.02 sec
      Start  2: s_test_9_2_10000000_LA
 2/24 Test  #2: s_test_9_2_10000000_LA ...........   Passed    0.02 sec
      Start  3: s_test_19_2_0_LA
 3/24 Test  #3: s_test_19_2_0_LA .................   Passed    0.03 sec
      Start  4: s_test_19_2_10000000_LA
 4/24 Test  #4: s_test_19_2_10000000_LA ..........   Passed    0.03 sec
      Start  5: s_test_2_0_SP
 5/24 Test  #5: s_test_2_0_SP ....................   Passed    0.06 sec
      Start  6: s_test_2_10000000_SP
 6/24 Test  #6: s_test_2_10000000_SP .............   Passed    0.07 sec
      Start  7: d_test_9_2_0_LA
 7/24 Test  #7: d_test_9_2_0_LA ..................   Passed    0.02 sec
      Start  8: d_test_9_2_10000000_LA
 8/24 Test  #8: d_test_9_2_10000000_LA ...........   Passed    0.02 sec
      Start  9: d_test_19_2_0_LA
 9/24 Test  #9: d_test_19_2_0_LA .................   Passed    0.03 sec
      Start 10: d_test_19_2_10000000_LA
10/24 Test #10: d_test_19_2_10000000_LA ..........   Passed    0.03 sec
      Start 11: d_test_2_0_SP
11/24 Test #11: d_test_2_0_SP ....................   Passed    0.06 sec
      Start 12: d_test_2_10000000_SP
12/24 Test #12: d_test_2_10000000_SP .............***Exception: SegFault  0.01 sec
      Start 13: c_test_9_2_0_LA
13/24 Test #13: c_test_9_2_0_LA ..................   Passed    0.02 sec
      Start 14: c_test_9_2_10000000_LA
14/24 Test #14: c_test_9_2_10000000_LA ...........   Passed    0.03 sec
      Start 15: c_test_19_2_0_LA
15/24 Test #15: c_test_19_2_0_LA .................   Passed    0.06 sec
      Start 16: c_test_19_2_10000000_LA
16/24 Test #16: c_test_19_2_10000000_LA ..........   Passed    0.06 sec
      Start 17: c_test_2_0_SP
17/24 Test #17: c_test_2_0_SP ....................   Passed    0.12 sec
      Start 18: c_test_2_10000000_SP
18/24 Test #18: c_test_2_10000000_SP .............***Exception: SegFault  0.01 sec
      Start 19: z_test_9_2_0_LA
19/24 Test #19: z_test_9_2_0_LA ..................   Passed    0.03 sec
      Start 20: z_test_9_2_10000000_LA
20/24 Test #20: z_test_9_2_10000000_LA ...........   Passed    0.03 sec
      Start 21: z_test_19_2_0_LA
21/24 Test #21: z_test_19_2_0_LA .................   Passed    0.07 sec
      Start 22: z_test_19_2_10000000_LA
22/24 Test #22: z_test_19_2_10000000_LA ..........   Passed    0.07 sec
      Start 23: z_test_2_0_SP
23/24 Test #23: z_test_2_0_SP ....................   Passed    0.15 sec
      Start 24: z_test_2_10000000_SP
24/24 Test #24: z_test_2_10000000_SP .............   Passed    0.16 sec

92% tests passed, 2 tests failed out of 24

Total Test time (real) =   1.23 sec

The following tests FAILED:
         12 - d_test_2_10000000_SP (SEGFAULT)
         18 - c_test_2_10000000_SP (SEGFAULT)
Errors while running CTest
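
The "permission denied" result above is what a POSIX exec call reports when its target is a directory rather than a regular file. A minimal C sketch of that behaviour, assuming a POSIX system (the helper name exec_errno_for is made up for illustration, and CMake's execute_process may word the error differently on other platforms):

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: try to exec `path` and return the errno left by the
   failed attempt. Only call it with a path that is expected to fail,
   because a successful exec replaces the current process and never returns. */
int exec_errno_for(const char *path)
{
    errno = 0;
    execl(path, path, (char *)NULL);
    /* Reached only if execl failed. */
    return errno;
}
```

On Linux, passing a directory such as "/" returns EACCES, whose message text is "Permission denied". Passing the test executable itself, as the simplified add_test does, avoids the problem.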

wo80 · 2023-08-05T20:53:03Z

Two more observations on the actual error:

The problem occurs in both debug and release mode, and it doesn't seem to behave deterministically. While most of the time I get the segfault, sometimes the tests finish but produce garbage solutions:

    [...]
    dgssvx:fact=   3, trans=   1, equed=B, n=400, imat=0, test(1)=  6.2187e+09
    dgssvx:fact=   3, trans=   1, equed=B, n=400, imat=0, test(2)=  3.0712e+10
    dgssvx:fact=   3, trans=   1, equed=B, n=400, imat=0, test(4)=  3.8735e+09
    dgssvx:fact=   3, trans=   0, equed=B, n=400, imat=0, test(1)=  1.9755e+14
    dgssvx:fact=   3, trans=   0, equed=B, n=400, imat=0, test(2)=  5.4097e+13
    dgssvx:fact=   3, trans=   0, equed=B, n=400, imat=0, test(4)=  2.2462e+13
    dgssvx:fact=   3, trans=   1, equed=B, n=400, imat=0, test(1)=  6.2187e+09
    dgssvx:fact=   3, trans=   1, equed=B, n=400, imat=0, test(2)=  3.0712e+10
    dgssvx:fact=   3, trans=   1, equed=B, n=400, imat=0, test(4)=  3.8735e+09
DGE driver: 92 out of 144 tests failed to pass the threshold

EDIT:
To make ctest recognize this as a test failure, the drivers (cdrive.c etc.) should not return 0, but

    return nfail == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
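
As a sketch of that idea (not the actual driver code; driver_exit_status is a hypothetical helper, and nfail and nrun stand for the driver's counts of failed and executed tests), the mapping from failure count to process exit status could look like this:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: turn the number of failed tests into an exit status
   that ctest (or any caller checking the return code) can act on. */
int driver_exit_status(int nfail, int nrun)
{
    if (nfail == 0) {
        printf("All tests passed the threshold (%d tests run)\n", nrun);
        return EXIT_SUCCESS;
    }
    printf("%d out of %d tests failed to pass the threshold\n", nfail, nrun);
    return EXIT_FAILURE;
}
```

A driver's main would then end with return driver_exit_status(nfail, nrun); instead of return 0;.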

wo80 · 2023-08-05T21:02:34Z

I think it would be best to open 3 separate issues:

1. The Github workflow should fail when a test fails (shouldn't matter if an actual test condition fails or a segfault occurs)
2. The CMake test setup needs to be fixed (addressed in PR Fix CMake test setup #112)
3. The actual cause of the segfault needs to be investigated
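
For the first point, the check has to treat a crash the same way as a non-zero exit code. A minimal C sketch of that rule, assuming a POSIX system (command_failed is a hypothetical helper, not part of SuperLU or CTest):

```c
#include <stdlib.h>
#include <sys/wait.h>

/* Hypothetical helper: run `cmd` through the shell and report failure (1)
   unless the command exited normally with status 0. A process killed by a
   signal such as SIGSEGV therefore counts as a failure too. */
int command_failed(const char *cmd)
{
    int status = system(cmd);
    if (status == -1) {
        return 1;                      /* could not start the command */
    }
    if (WIFEXITED(status)) {
        return WEXITSTATUS(status) != 0;
    }
    return 1;                          /* terminated by a signal, or stopped */
}
```

The point of the design is that there is only one success path: a normal exit with status 0. Everything else, including a segfault, is reported as a failure.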

wo80 · 2023-08-08T11:27:37Z

Here's what valgrind has to say about it:

/projects/superlu/build/TESTING$ valgrind ./d_test -t "SP" -s 5 -l 5000000 -f ../../EXAMPLE/g20.rua
==11462== Memcheck, a memory error detector
==11462== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==11462== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==11462== Command: ./d_test -t SP -s 5 -l 5000000 -f ../../EXAMPLE/g20.rua
==11462== 
.. test sparse matrix in file: ../../EXAMPLE/g20.rua
g20, symm permuted by SYMMMD                                            SYM     
==11462== Conditional jump or move depends on uninitialised value(s)
==11462==    at 0x12BCC2: relax_snode (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x11C973: dgstrf (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x10A1B2: main (in /projects/superlu/build/TESTING/d_test)
==11462== 
==11462== Conditional jump or move depends on uninitialised value(s)
==11462==    at 0x116E87: user_bcopy (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12541C: dexpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x124EC2: dLUMemXpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12300E: dcolumn_bmod (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x11CFA0: dgstrf (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x10A1B2: main (in /projects/superlu/build/TESTING/d_test)
==11462== 
==11462== Use of uninitialised value of size 8
==11462==    at 0x116E6C: user_bcopy (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12541C: dexpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x124EC2: dLUMemXpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12300E: dcolumn_bmod (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x11CFA0: dgstrf (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x10A1B2: main (in /projects/superlu/build/TESTING/d_test)
==11462== 
==11462== Use of uninitialised value of size 8
==11462==    at 0x116E73: user_bcopy (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12541C: dexpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x124EC2: dLUMemXpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12300E: dcolumn_bmod (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x11CFA0: dgstrf (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x10A1B2: main (in /projects/superlu/build/TESTING/d_test)
==11462== 
==11462== Invalid read of size 1
==11462==    at 0x116E6C: user_bcopy (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12541C: dexpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x124EC2: dLUMemXpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12300E: dcolumn_bmod (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x11CFA0: dgstrf (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x10A1B2: main (in /projects/superlu/build/TESTING/d_test)
==11462==  Address 0x4f2503f is 1 bytes before a block of size 5,000,000 alloc'd
==11462==    at 0x48407B4: malloc (vg_replace_malloc.c:381)
==11462==    by 0x116CAD: superlu_malloc (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x1094F3: main (in /projects/superlu/build/TESTING/d_test)
==11462== 
==11462== Invalid write of size 1
==11462==    at 0x116E73: user_bcopy (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12541C: dexpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x124EC2: dLUMemXpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12300E: dcolumn_bmod (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x11CFA0: dgstrf (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x10A1B2: main (in /projects/superlu/build/TESTING/d_test)
==11462==  Address 0x4f2503f is 1 bytes before a block of size 5,000,000 alloc'd
==11462==    at 0x48407B4: malloc (vg_replace_malloc.c:381)
==11462==    by 0x116CAD: superlu_malloc (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x1094F3: main (in /projects/superlu/build/TESTING/d_test)
==11462== 
==11462== 
==11462== More than 10000000 total errors detected.  I'm not reporting any more.
==11462== Final error counts will be inaccurate.  Go fix your program!
==11462== Rerun with --error-limit=no to disable this cutoff.  Note
==11462== that errors may occur in your program without prior warning from
==11462== Valgrind, because errors are no longer being displayed.
==11462== 
==11462== 
==11462== Process terminating with default action of signal 11 (SIGSEGV)
==11462==  Bad permissions for mapped region at address 0x4B13FFF
==11462==    at 0x116E73: user_bcopy (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12541C: dexpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x124EC2: dLUMemXpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12300E: dcolumn_bmod (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x11CFA0: dgstrf (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x10A1B2: main (in /projects/superlu/build/TESTING/d_test)
==11462== 
==11462== HEAP SUMMARY:
==11462==     in use at exit: 5,169,304 bytes in 37 blocks
==11462==   total heap usage: 46 allocs, 9 frees, 5,185,088 bytes allocated
==11462== 
==11462== LEAK SUMMARY:
==11462==    definitely lost: 0 bytes in 0 blocks
==11462==    indirectly lost: 0 bytes in 0 blocks
==11462==      possibly lost: 0 bytes in 0 blocks
==11462==    still reachable: 5,169,304 bytes in 37 blocks
==11462==         suppressed: 0 bytes in 0 blocks
==11462== Rerun with --leak-check=full to see details of leaked memory
==11462== 
==11462== Use --track-origins=yes to see where uninitialised values come from
==11462== For lists of detected and suppressed errors, rerun with: -s
==11462== ERROR SUMMARY: 10000000 errors from 6 contexts (suppressed: 0 from 0)

wo80 · 2023-08-08T11:34:46Z

So, the relevant part is

Invalid write of size 1
   at 0x116E73: user_bcopy
   by 0x12541C: dexpand
   by 0x124EC2: dLUMemXpand
   by 0x12300E: dcolumn_bmod
   by 0x11CFA0: dgstrf
   by 0x10A1B2: main
 Address 0x4f2503f is 1 bytes before a block of size 5,000,000 alloc'd

wo80 · 2023-08-11T05:57:02Z

So I think I tracked down the origin of the problem. Not sure what the correct fix would be, though.

In the dexpand function in dmemory.c

superlu/SRC/dmemory.c

Lines 573 to 577 in 29ea08a

    
    if ( type != USUB ) {
        new_mem = (void*)((char*)expanders[type + 1].mem + extra);
        bytes_to_copy = (char*)Glu->stack.array + Glu->stack.top1
            - (char*)expanders[type + 1].mem;
        user_bcopy(expanders[type+1].mem, new_mem, bytes_to_copy);

At the time of the call, expanders[type + 1] is not initialized (where type = LUSUP).

That is due to

superlu/SRC/superlu_enum_consts.h

Lines 37 to 38 in 29ea08a

    
    //typedef enum {LUSUP, UCOL, LSUB, USUB, LLVL, ULVL, NO_MEMTYPE}  MemType;
    typedef enum {USUB, LSUB, UCOL, LUSUP, LLVL, ULVL, NO_MEMTYPE}  MemType;

and dLUMemInit only initializing the first four positions of expanders

superlu/SRC/dmemory.c

Lines 243 to 250 in 29ea08a

    
    }
    lusup = (double *) dexpand( &nzlumax, LUSUP, 0, 0, Glu );
    ucol  = (double *) dexpand( &nzumax, UCOL, 0, 0, Glu );
    lsub  = (int_t *) dexpand( &nzlmax, LSUB, 0, 0, Glu );
    usub  = (int_t *) dexpand( &nzumax, USUB, 0, 1, Glu );
    while ( !lusup || !ucol || !lsub || !usub ) {

EDIT:
I only debugged d_test. The same cause most likely applies to the c, s and z versions of the code.

wo80 · 2023-08-11T06:42:28Z

I was wondering about the test type != USUB in

superlu/SRC/dmemory.c

Lines 573 to 577 in 29ea08a

    
    if ( type != USUB ) {
        new_mem = (void*)((char*)expanders[type + 1].mem + extra);
        bytes_to_copy = (char*)Glu->stack.array + Glu->stack.top1
            - (char*)expanders[type + 1].mem;
        user_bcopy(expanders[type+1].mem, new_mem, bytes_to_copy);

Maybe the whole problem originates in a change of MemType made in
52fc55d#diff-4964d63c55baaf45c54e2d8b0485e230848b1a9da1d7c9fa40bdc3f77442c08d

Before that change the order was

typedef enum {LUSUP, UCOL, LSUB, USUB, LLVL, ULVL}              MemType;

so testing for USUB would have been correct. But the order changed to

typedef enum {USUB, LSUB, UCOL, LUSUP, LLVL, ULVL, NO_MEMTYPE}  MemType;

and maybe that was just missed in other places of the code, like dexpand.

So testing for type != LUSUP might be the correct fix. But I don't have enough insight into the SuperLU implementation details to be sure :-)
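
To illustrate the suspicion, here is a small C sketch (not SuperLU code) that compares the two enum orders quoted above. The OLD_ and NEW_ prefixes, INITIALIZED_SLOTS and reads_uninitialized_slot are made up for this example; the count of four initialized slots comes from the dLUMemInit observation earlier in the thread.

```c
#include <stdbool.h>

/* The two orderings quoted in the thread. The prefixes exist only so that
   both enums can live in one file. */
typedef enum {OLD_LUSUP, OLD_UCOL, OLD_LSUB, OLD_USUB, OLD_LLVL, OLD_ULVL} OldMemType;
typedef enum {NEW_USUB, NEW_LSUB, NEW_UCOL, NEW_LUSUP, NEW_LLVL, NEW_ULVL,
              NEW_NO_MEMTYPE} NewMemType;

/* Assumption taken from the thread: only slots 0..3 of expanders are set up. */
enum { INITIALIZED_SLOTS = 4 };

/* Model of the guarded access `if (type != guard) use expanders[type + 1]`.
   Returns true if that access would touch a slot that was never initialized. */
bool reads_uninitialized_slot(int type, int guard)
{
    if (type == guard) {
        return false;                  /* branch skipped, nothing is read */
    }
    return type + 1 >= INITIALIZED_SLOTS;
}
```

With the old order and the USUB guard, none of the four types reaches past slot 3. With the new order, the same guard lets LUSUP (value 3) through and type + 1 lands on slot 4. Guarding on LUSUP instead avoids that read in this simplified model, which says nothing about whether the rest of dexpand is then correct.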

wo80 · 2023-08-12T09:58:37Z

I just tested replacing type != USUB with type != LUSUP. Though this prevents the segfault, it does not prevent some of the tests from failing (assuming the return value fix for the test drivers mentioned in #108 (comment) is applied).

The following tests FAILED:
	  6 - s_test_2_10000000_SP (Failed)
	 12 - d_test_2_10000000_SP (Failed)
	 18 - c_test_2_10000000_SP (Failed)
	 24 - z_test_2_10000000_SP (Failed)

I guess this is as far as I can go without digging into the memory management details of SuperLU.

gruenich · 2023-08-12T18:04:07Z

I can reproduce the issue. Your analysis looks good, I am convinced that this was introduced by the commit you mentioned! Skimming through the commit, the changes to the enum are nowhere motivated and thus most probably wrong.

@xiaoyeli What do you think? Can we (partially) revert 52fc55d? Or do you know which pieces are needed from superlu_dist to fix these examples?

…um_consts.h Resolved issue #108: correct the macro order of MemType in superlu_enum_consts.h; Merged PR #116: complex -> singlecomplex (consistent with doublecomplex); Merged PR #117: fill the missing perm_r[] for rank deficient cases; Resolved issue #119: missing BLAS names in slu_Cnames.h

xiaoyeli · 2023-09-11T04:36:56Z

Resolved it in Master.

wo80 · 2023-09-11T08:51:25Z

Resolved it in Master.

Alright. Now the remaining two issues mentioned above #108 (comment) should be addressed.

Regarding the Github workflow: since the CMake build script works pretty well, I'd suggest installing cmake and then using it to build and test. Something along these lines (not tested):

    - uses: actions/checkout@v3

    - name: Configure
      run: cmake -B build

    - name: Build
      run: cmake --build build --parallel

    - name: Test
      run: ctest --test-dir build --output-on-failure

wo80 · 2023-09-11T09:41:14Z

Btw, if you look at the test output of the Github workflow, you still see a bunch of segfaults.

I strongly suggest that you fix the test setup, so the workflow reflects those problems.

wo80 · 2023-09-11T12:07:42Z

I just tested the cmake workflow here https://github.com/wo80/superlu/commit/e494d2ac8c1bb17d475432273cdf8b60ba6f391a ~~and all tests are passing~~.

@xiaoyeli Please let me know if you want me to merge this into #112

EDIT: I applied the change suggested in #108 (comment) (see https://github.com/wo80/superlu/commit/53794fa76ae7f92c619f5b7940cc08ffa8daae1b) and this makes the tests fail. The segfault is also still present.

wo80 · 2023-09-11T15:24:07Z

After merging the upstream changes, the segfault seems to be fixed. But now the "LA" d_tests fail with

Subprocess aborted***Exception:   0.15 sec
dgstrf info 1
dgstrf info 1
dgstrf info 19
double free or corruption (out)

see https://github.com/wo80/superlu/actions/runs/6146404541/job/16675708872

Failing tests:

 7/24 Test  #7: d_test_9_2_0_LA ..................Subprocess aborted***Exception
 8/24 Test  #8: d_test_19_2_0_LA .................Subprocess aborted***Exception
 9/24 Test  #9: d_test_2_0_SP ....................Passed
10/24 Test #10: d_test_9_2_10000000_LA ...........Subprocess aborted***Exception
11/24 Test #11: d_test_19_2_10000000_LA ..........Subprocess aborted***Exception
12/24 Test #12: d_test_2_10000000_SP .............Passed

Valgrind output:

valgrind --track-origins=yes --leak-check=full ./d_test -t "LA" -n 9 -s 2 -l 0
Memcheck, a memory error detector
Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
Command: ./d_test -t LA -n 9 -s 2 -l 0

dgstrf info 1
dgstrf info 1
dgstrf info 9
Invalid read of size 8
   at 0x10C6BA: dgst01
   by 0x10B4CA: main
 Address 0x5158f78 is 8 bytes before a block of size 72 alloc'd
   at 0x48407B4: malloc (vg_replace_malloc.c:381)
   by 0x117DAA: superlu_malloc
   by 0x126CBE: doubleCalloc
   by 0x10C19D: dgst01
   by 0x10B4CA: main

[...] more of those errors

dgstrf info 9
dgstrf info 5
Invalid read of size 4
   at 0x10C382: dgst01
   by 0x10B4CA: main
 Address 0x516bc5c is 4 bytes before a block of size 40 alloc'd
   at 0x48407B4: malloc (vg_replace_malloc.c:381)
   by 0x117DAA: superlu_malloc
   by 0x117FAE: int32Malloc
   by 0x1258B5: dLUMemInit
   by 0x11D857: dgstrf
   by 0x118D40: dgssv
   by 0x10B422: main

Invalid read of size 4
   at 0x10C3FA: dgst01
   by 0x10B4CA: main
 Address 0x516c8bc is 4 bytes before a block of size 648 alloc'd
   at 0x48407B4: malloc (vg_replace_malloc.c:381)
   by 0x117DAA: superlu_malloc
   by 0x1264D8: dexpand
   by 0x125A0D: dLUMemInit
   by 0x11D857: dgstrf
   by 0x118D40: dgssv
   by 0x10B422: main

[...] more of those errors

dgstrf info 5
All tests for DGE driver passed the threshold (  1158 tests run)

HEAP SUMMARY:
    in use at exit: 319,748 bytes in 578 blocks
  total heap usage: 23,849 allocs, 23,271 frees, 11,336,936 bytes allocated

2,688 (960 direct, 1,728 indirect) bytes in 24 blocks are definitely lost in loss record 15 of 23
   at 0x48407B4: malloc (vg_replace_malloc.c:381)
   by 0x117DAA: superlu_malloc
   by 0x118265: sp_preorder
   by 0x119862: dgssvx
   by 0x10B767: main

74,880 (1,408 direct, 73,472 indirect) bytes in 44 blocks are definitely lost in loss record 21 of 23
   at 0x48407B4: malloc (vg_replace_malloc.c:381)
   by 0x117DAA: superlu_malloc
   by 0x126E18: dCreate_CompCol_Matrix
   by 0x11E487: dgstrf
   by 0x1198E5: dgssvx
   by 0x10B767: main

81,216 (2,464 direct, 78,752 indirect) bytes in 44 blocks are definitely lost in loss record 22 of 23
   at 0x48407B4: malloc (vg_replace_malloc.c:381)
   by 0x117DAA: superlu_malloc
   by 0x127347: dCreate_SuperNode_Matrix
   by 0x11E44C: dgstrf
   by 0x1198E5: dgssvx
   by 0x10B767: main

LEAK SUMMARY:
   definitely lost: 4,832 bytes in 112 blocks
   indirectly lost: 153,952 bytes in 444 blocks
     possibly lost: 0 bytes in 0 blocks
   still reachable: 160,964 bytes in 22 blocks
        suppressed: 0 bytes in 0 blocks
Reachable blocks (those to which a pointer was found) are not shown.
To see them, rerun with: --leak-check=full --show-leak-kinds=all

For lists of detected and suppressed errors, rerun with: -s
ERROR SUMMARY: 3883 errors from 27 contexts (suppressed: 0 from 0)

wo80 · 2023-09-11T17:02:50Z

I think I found the culprit: cf93b7e

superlu/SRC/dpivotL.c

Lines 134 to 146 in 90ee45d

    
        /* Test for singularity */
        if ( pivmax == 0.0 ) {
    #if 0
            // There is no valid pivot.
            // jcol represents the rank of U
            // report the rank let dgstrf handle the pivot
    #if 1
        *pivrow = lsub_ptr[pivptr];
        perm_r[*pivrow] = jcol;
    #else
        perm_r[diagind] = jcol;
    #endif
    #endif

This was an external contribution merged two days ago. It's a perfect demonstration of how important a functional CI test setup is. So I'll quote myself from #112:

I think it's important to have tests reflecting reality and I think that this should be merged rather sooner than later (even if the issue remains unresolved for now).

wo80 · 2023-09-11T17:13:14Z

I rebased #112 and added the changes from my fix/github-workflow branch (now deleted).

wo80 · 2023-09-11T17:18:59Z

I haven't addressed the above issue in dpivotL.c. I think it's better if you, @xiaoyeli, fix this in a single commit. Then the workflow shouldn't fail anymore.

EDIT: just for demonstration https://github.com/wo80/superlu/actions/runs/6149774863/job/16686331442

wo80 · 2023-09-12T10:51:27Z

f63265a seems to fix the issue; the workflow tests are passing.

One question remaining, though:

Comparing dpivotL.c to the (c|s|z)pivotL.c variants, the code re-assigning perm_r[*pivrow] in the section labelled /* Test for singularity */ (see #108 (comment) above) is disabled in the d variant, but not in the c|s|z variants. Which one is correct?

gruenich · 2023-11-25T12:13:19Z

I think it would be best to open 3 separate issues:

1. The Github workflow should fail when a test fails (shouldn't matter if an actual test condition fails or a segfault occurs)
2. The CMake test setup needs to be fixed (addressed in PR Fix CMake test setup #112)
3. The actual cause of the segfault needs to be investigated

On 1: This is addressed in Repair contiuous testing, also add test for Doxygen and Fortran #131, with some of your changes and some additions of my own.
On 2: Fixed by your commits, merged as [cmake] Simplify creation of tests #114.
On 3: The segfault is also fixed; the checks for Repair contiuous testing, also add test for Doxygen and Fortran #131 are passing.

I would like to extend this list:
4. We need a Windows runner. I created #132 for this so as not to extend this thread any further.
5. Your last question should be answered. @xiaoyeli, do you know the answer?

Comparing dpivotL.c to the (c|s|z)pivotL.c variants, the code re-assigning perm_r[*pivrow] in the section labelled /* Test for singularity */ (see #108 (comment) above) is disabled in the d variant, but not in the c|s|z variants. Which one is correct?

xiaoyeli · 2023-11-26T17:44:16Z

I just pushed a fix for the following:

Comparing dpivotL.c to the (c|s|z)pivotL.c variants, the code re-assigning perm_r[*pivrow] in the section labelled /* Test for singularity */ (see #108 (comment) above) is disabled in the d variant, but not in the c|s|z variants. Which one is correct?

gruenich · 2024-07-22T20:39:22Z

@xiaoyeli This can be closed now. Thanks for the fix!

wo80 mentioned this issue Aug 8, 2023

Fix CMake test setup #112

Closed

gruenich mentioned this issue Nov 26, 2023

Draft: add windows test runner #133

Merged

xiaoyeli closed this as completed Jul 22, 2024
