Pkg hang in CI while building packages #4051

Open
NHDaly opened this issue Oct 17, 2024 · 2 comments

@NHDaly
Member

NHDaly commented Oct 17, 2024

We don't know why, but we noticed Pkg hung in our CI last week. It hasn't reproduced since, so it is probably quite rare.

The log was proceeding like this, and then just stopped:

2024-08-28 14:28:43   ✓ RAI_Flags
2024-08-28 14:28:43   ✓ HiGHS
2024-08-28 14:28:44   ✓ JobAPI
2024-08-28 14:28:45   ✓ DCAnalysis
2024-08-28 14:28:45 StorageIntegration Waiting for background task / IO / timer.
2024-08-28 14:28:45 [pid 3438049] waiting for IO to finish:
2024-08-28 14:28:45  Handle type        uv_handle_t->data
2024-08-28 14:28:45 This means that a package has started a background task or event source that has not finished running. For precompilation to complete successfully, the event source needs to be closed explicitly. See the developer documentation on fixing precompilation hangs for more help.
2024-08-28 14:28:45   ✓ RAI_Net
2024-08-28 14:28:45   ✓ StorageIntegration
2024-08-28 14:28:46   ✓ TreeDecomposer
2024-08-28 14:28:46   ✓ PagedDataStructures
2024-08-28 14:28:48   ✓ PagedDataStructuresTestHelpers
2024-08-28 14:28:48   ✓ UpdateAPI

UpdateAPI is not our last package, so it should have continued.
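If a list of the packages expected to precompile is available, one way to see which of them never reported completion is to diff that list against the `✓` lines in the log. This is only a sketch: the helper name `missing_packages` and the expected-packages file are hypothetical, and it assumes the log format shown above (a timestamp followed by `✓ PackageName`).

```shell
#!/bin/sh
# Hypothetical helper: print expected packages that never logged a "✓" line.
# $1: CI log in the format shown above
# $2: file with the expected package names, one per line (an assumption)
missing_packages() {
  log="$1"
  expected="$2"
  done_list=$(mktemp)
  # Extract the package name that follows each "✓ " marker.
  sed -n 's/^.*✓ \([^ ]*\).*$/\1/p' "$log" | LC_ALL=C sort -u > "$done_list"
  # comm -23 keeps only the lines unique to the first input (stdin here).
  LC_ALL=C sort -u "$expected" | LC_ALL=C comm -23 - "$done_list"
  rm -f "$done_list"
}
```

Calling `missing_packages ci.log expected.txt` (both file names are placeholders) prints one package name per line; empty output means every expected package reported `✓`.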

When a job times out after 30 min of silence, our build script runs the following commands to try to debug the hang:

    # Collect the PIDs of all running julia processes (empty if there are none).
    julia_pids=$(pgrep julia || true)

    if [[ -z "$julia_pids" ]] ; then
      exit 0
    fi
    echo "============================================================="
    echo "Dumping any information we can on julia/rai-server processes"
    echo "============================================================="
    for p in $julia_pids; do
      echo "===== task and thread backtraces"
      # Attach lldb in batch mode, pass SIGSEGV through without stopping or
      # notifying, dump native thread and Julia task backtraces, then detach.
      # ${pkgs.lldb} is a Nix interpolation, not a shell variable.
      ${pkgs.lldb}/bin/lldb -p "$p" -b -o "pro hand -p true -s false -n false SIGSEGV" -o "bt all" -o "expr (void) jl_print_task_backtraces(0)" -o "process detach"
      echo
      echo
    done
    echo "============================================================="

But we found only a single julia process, the one coordinating the tests, and it was just waiting on packages to finish. So there doesn't appear to have been a running process for any of the stuck packages. Could this have been some sort of multi-process coordination issue, where the coordinator never learned that a worker had finished or exited?
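An extra check that might help if this happens again is to list the direct children of each julia process together with their state, so that sleeping, stopped, or zombie (`Z`) workers can be told apart at a glance. This is a minimal sketch, assuming Linux with `/proc` mounted; `dump_children` is a hypothetical helper and not part of the build script above.

```shell
#!/bin/sh
# Hypothetical helper: list the direct children of a PID with their state.
# Reads Linux procfs only, so it needs no tools beyond awk.
dump_children() {
  parent="$1"
  found=0
  for status in /proc/[0-9]*/status; do
    # A process may exit while we iterate, so tolerate unreadable files.
    line=$(awk -v want="$parent" '
      /^Name:/  { name = $2 }
      /^State:/ { state = $2 }
      /^Pid:/   { pid = $2 }
      /^PPid:/  { ppid = $2 }
      END { if (ppid == want) print pid, state, name }
    ' "$status" 2>/dev/null) || continue
    if [ -n "$line" ]; then
      echo "child of $parent: $line"
      found=1
    fi
  done
  [ "$found" -eq 1 ] || echo "pid $parent: no child processes"
}
```

Inside the debug loop this could be called as `dump_children "$p"` for each julia PID. Each output line shows the child's PID, its one-letter process state, and its command name.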

Here are the relevant lines from the coordinating process's task backtrace:

2024-08-28 15:28:53 thread (1) wait at ./task.jl:995
2024-08-28 15:28:53 thread (1) #wait#646 at ./condition.jl:130
2024-08-28 15:28:53 thread (1) wait at ./condition.jl:125 [inlined]
2024-08-28 15:28:53 thread (1) wait at ./lock.jl:457
2024-08-28 15:28:53 thread (1) #precompile#226 at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/usr/share/julia/stdlib/v1.10/Pkg/src/API.jl:1570
2024-08-28 15:28:53 thread (1) precompile at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/usr/share/julia/stdlib/v1.10/Pkg/src/API.jl:1078 [inlined]
2024-08-28 15:28:53 thread (1) #_auto_precompile#6 at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/usr/share/julia/stdlib/v1.10/Pkg/src/Pkg.jl:805 [inlined]
2024-08-28 15:28:53 thread (1) _auto_precompile at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/usr/share/julia/stdlib/v1.10/Pkg/src/Pkg.jl:803 [inlined]
2024-08-28 15:28:53 thread (1) _auto_precompile at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/usr/share/julia/stdlib/v1.10/Pkg/src/Pkg.jl:803 [inlined]

And here are the larger snippets of the logs that show this:

2024-08-28 14:28:48   ✓ PagedDataStructuresTestHelpers
2024-08-28 14:28:48   ✓ UpdateAPI
2024-08-28 15:28:51 =============================================================
2024-08-28 15:28:51 Dumping any information we can on julia/rai-server processes
2024-08-28 15:28:51 =============================================================
2024-08-28 15:28:51 ===== task and thread backtraces
2024-08-28 15:28:51 (lldb) process attach --pid 3426002
2024-08-28 15:28:52 Process 3426002 stopped
2024-08-28 15:28:52 * thread #1, name = 'julia', stop reason = signal SIGSTOP
2024-08-28 15:28:52     frame #0: 0x00007ffff7ec635f libc.so.6`epoll_wait + 79
2024-08-28 15:28:52 libc.so.6`epoll_wait:
2024-08-28 15:28:52 ->  0x7ffff7ec635f <+79>: cmpq   $-0x1000, %rax            ; imm = 0xF000 
2024-08-28 15:28:52     0x7ffff7ec6365 <+85>: ja     0x7ffff7ec639a            ; <+138>
2024-08-28 15:28:52     0x7ffff7ec6367 <+87>: movl   %r8d, %edi
2024-08-28 15:28:52     0x7ffff7ec636a <+90>: movl   %eax, 0xc(%rsp)
2024-08-28 15:28:52   thread #2, name = 'julia', stop reason = signal SIGSTOP
2024-08-28 15:28:52     frame #0: 0x00007ffff7e066cc libc.so.6`__sigtimedwait + 156
[...]
2024-08-28 15:28:52 Executable module set to "/nix/store/y2jsq0p1q7523slwzn403r7w5zd98s9z-julia-1.10.2/bin/julia".
2024-08-28 15:28:52 Architecture set to: x86_64-unknown-linux-gnu.
2024-08-28 15:28:52 (lldb) pro hand -p true -s false -n false SIGSEGV
2024-08-28 15:28:52 NAME         PASS   STOP   NOTIFY
2024-08-28 15:28:52 ===========  =====  =====  ======
2024-08-28 15:28:52 SIGSEGV      true   false  false
2024-08-28 15:28:52 (lldb) bt all
2024-08-28 15:28:52 * thread #1, name = 'julia', stop reason = signal SIGSTOP
2024-08-28 15:28:52   * frame #0: 0x00007ffff7ec635f libc.so.6`epoll_wait + 79
2024-08-28 15:28:52     frame #1: 0x00007ffff7534133 libjulia-internal.so.1.10`uv__io_poll(loop=0x00007ffff79d8ac0, timeout=<unavailable>) at epoll.c:236:7
2024-08-28 15:28:52   thread #2, name = 'julia', stop reason = signal SIGSTOP
[...]
2024-08-28 15:28:53 (lldb) expr (void) jl_print_task_backtraces(0)
2024-08-28 15:28:53 thread (1) ++++ Task backtraces
2024-08-28 15:28:53 thread (1) ==== Thread 1 created 334 live tasks
2024-08-28 15:28:53 thread (1)     ---- Root task (0x7fffecf3c010)
2024-08-28 15:28:53 thread (1)          (sticky: 1, started: 1, state: 0, tid: 1)
2024-08-28 15:28:53 thread (1) jl_start_fiber_swap at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/src/task.c:1433
2024-08-28 15:28:53 thread (1) ctx_switch at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/src/task.c:617
2024-08-28 15:28:53 thread (1) ijl_switch at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/src/task.c:650
2024-08-28 15:28:53 thread (1) try_yieldto at ./task.jl:921
2024-08-28 15:28:53 thread (1) wait at ./task.jl:995
2024-08-28 15:28:53 thread (1) #wait#646 at ./condition.jl:130
2024-08-28 15:28:53 thread (1) wait at ./condition.jl:125 [inlined]
2024-08-28 15:28:53 thread (1) wait at ./lock.jl:457
2024-08-28 15:28:53 thread (1) #precompile#226 at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/usr/share/julia/stdlib/v1.10/Pkg/src/API.jl:1570
2024-08-28 15:28:53 thread (1) precompile at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/usr/share/julia/stdlib/v1.10/Pkg/src/API.jl:1078 [inlined]
2024-08-28 15:28:53 thread (1) #_auto_precompile#6 at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/usr/share/julia/stdlib/v1.10/Pkg/src/Pkg.jl:805 [inlined]
2024-08-28 15:28:53 thread (1) _auto_precompile at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/usr/share/julia/stdlib/v1.10/Pkg/src/Pkg.jl:803 [inlined]
2024-08-28 15:28:53 thread (1) _auto_precompile at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/usr/share/julia/stdlib/v1.10/Pkg/src/Pkg.jl:803 [inlined]
2024-08-28 15:28:53 thread (1) #build#87 at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/usr/share/julia/stdlib/v1.10/Pkg/src/API.jl:160
2024-08-28 15:28:53 thread (1) build at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/usr/share/julia/stdlib/v1.10/Pkg/src/API.jl:148
2024-08-28 15:28:53 thread (1) #build#85 at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/usr/share/julia/stdlib/v1.10/Pkg/src/API.jl:147 [inlined]
2024-08-28 15:28:53 thread (1) build at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/usr/share/julia/stdlib/v1.10/Pkg/src/API.jl:147 [inlined]
2024-08-28 15:28:53 thread (1) #build#84 at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/usr/share/julia/stdlib/v1.10/Pkg/src/API.jl:146 [inlined]
2024-08-28 15:28:53 thread (1) build at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/usr/share/julia/stdlib/v1.10/Pkg/src/API.jl:146
2024-08-28 15:28:53 thread (1) unknown function (ip: 0x7fffec23a7d5)
2024-08-28 15:28:53 thread (1) jl_apply at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/src/julia.h:1982 [inlined]

I'm not sure if there's anything else to do here. Maybe you'll just close this as "Cannot Reproduce," but we wanted to report it so you know.

Version info:

2024-08-28 14:27:35 Julia Version 1.10.2+RAI
2024-08-28 14:27:35 Build Info:
2024-08-28 14:27:35 
2024-08-28 14:27:35     Note: This is an unofficial build, please report bugs to the project
2024-08-28 14:27:35     responsible for this build and not to the Julia project unless you can
2024-08-28 14:27:35     reproduce the issue using official builds available at https://julialang.org/downloads
2024-08-28 14:27:35 
2024-08-28 14:27:35 Platform Info:
2024-08-28 14:27:35   OS: Linux (x86_64-unknown-linux-gnu)
2024-08-28 14:27:35   CPU: 32 × AMD Ryzen 9 7950X3D 16-Core Processor
2024-08-28 14:27:35   WORD_SIZE: 64
2024-08-28 14:27:35   LIBM: libopenlibm
2024-08-28 14:27:35   LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
2024-08-28 14:27:35 Threads: 32 default, 2 interactive, 32 GC (on 32 virtual cores)
2024-08-28 14:27:35 Environment:
2024-08-28 14:27:35   JULIA_PROJECT = /tmp/nix-build-raicode.drv-0/rai-server-source
2024-08-28 14:27:35   JULIA_NUM_THREADS = 32,2
@IanButterworth
Member

Can you reproduce on an official build?

@NHDaly
Member Author

NHDaly commented Oct 18, 2024

We have not, as far as I know, even reproduced it on our custom build. :/

So I am inclined to close this, but I wanted to report it in case anyone else sees this from time to time as well.
