We don't know why, but we noticed Pkg hung in our CI last week. It hasn't reproduced since, so it is probably quite rare.
The log was proceeding like this, and then just stopped:
2024-08-28 14:28:43 ✓ RAI_Flags
2024-08-28 14:28:43 ✓ HiGHS
2024-08-28 14:28:44 ✓ JobAPI
2024-08-28 14:28:45 ✓ DCAnalysis
2024-08-28 14:28:45 StorageIntegration Waiting for background task / IO / timer.
2024-08-28 14:28:45 [pid 3438049] waiting for IO to finish:
2024-08-28 14:28:45 Handle type uv_handle_t->data
2024-08-28 14:28:45 This means that a package has started a background task or event source that has not finished running. For precompilation to complete successfully, the event source needs to be closed explicitly. See the developer documentation on fixing precompilation hangs for more help.
2024-08-28 14:28:45 ✓ RAI_Net
2024-08-28 14:28:45 ✓ StorageIntegration
2024-08-28 14:28:46 ✓ TreeDecomposer
2024-08-28 14:28:46 ✓ PagedDataStructures
2024-08-28 14:28:48 ✓ PagedDataStructuresTestHelpers
2024-08-28 14:28:48 ✓ UpdateAPI
UpdateAPI is not our last package, so it should have continued.
When a job times out after 30 min of silence, our build script runs the following commands to try to debug the hang:
julia_pids=$(pgrep julia || true)
if [[ -z "$julia_pids" ]]; then
    exit 0
fi
echo "============================================================="
echo "Dumping any information we can on julia/rai-server processes"
echo "============================================================="
for p in $julia_pids; do
    echo "===== task and thread backtraces"
    ${pkgs.lldb}/bin/lldb -p $p -b -o "pro hand -p true -s false -n false SIGSEGV" -o "bt all" -o "expr (void) jl_print_task_backtraces(0)" -o "process detach"
    echo
    echo
done
echo "============================================================="
But we found only a single julia process -- the one coordinating the tests -- and it was just waiting on packages to finish. So it doesn't look like there was a running process for any of the packages that were stuck. Could it have been some sort of distributed multi-process coordination issue?
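One extra check that might help if this happens again is to list the direct children of the coordinating process, to confirm whether any worker (even a defunct one) is still attached to it. This is only a sketch: `list_children` is a hypothetical helper, not part of our build script, and it assumes a `ps` that accepts `-e` and `-o` with the `pid`, `ppid`, `stat`, `etime` and `args` columns.

```shell
# Hypothetical helper (not in the original script): print the direct
# children of the given PID with their state and elapsed running time.
# A child in state "Z" is a zombie that exited but was never reaped.
list_children() {
    # $1: PID of the parent process, e.g. the coordinating julia process
    ps -e -o pid=,ppid=,stat=,etime=,args= | awk -v parent="$1" '$2 == parent'
}
```

Calling `list_children "$p"` inside the existing `for p in $julia_pids` loop, before attaching lldb, would print nothing when the process has no children, which is what the single-process observation above would predict.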
Here are the relevant lines:
2024-08-28 15:28:53 thread (1) wait at ./task.jl:995
2024-08-28 15:28:53 thread (1) #wait#646 at ./condition.jl:130
2024-08-28 15:28:53 thread (1) wait at ./condition.jl:125 [inlined]
2024-08-28 15:28:53 thread (1) wait at ./lock.jl:457
2024-08-28 15:28:53 thread (1) #precompile#226 at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/usr/share/julia/stdlib/v1.10/Pkg/src/API.jl:1570
2024-08-28 15:28:53 thread (1) precompile at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/usr/share/julia/stdlib/v1.10/Pkg/src/API.jl:1078 [inlined]
2024-08-28 15:28:53 thread (1) #_auto_precompile#6 at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/usr/share/julia/stdlib/v1.10/Pkg/src/Pkg.jl:805 [inlined]
2024-08-28 15:28:53 thread (1) _auto_precompile at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/usr/share/julia/stdlib/v1.10/Pkg/src/Pkg.jl:803 [inlined]
2024-08-28 15:28:53 thread (1) _auto_precompile at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/usr/share/julia/stdlib/v1.10/Pkg/src/Pkg.jl:803 [inlined]
And here are the larger snippets of the logs that show this:
2024-08-28 14:28:48 ✓ PagedDataStructuresTestHelpers
2024-08-28 14:28:48 ✓ UpdateAPI
2024-08-28 15:28:51 =============================================================
2024-08-28 15:28:51 Dumping any information we can on julia/rai-server processes
2024-08-28 15:28:51 =============================================================
2024-08-28 15:28:51 ===== task and thread backtraces
2024-08-28 15:28:51 (lldb) process attach --pid 3426002
2024-08-28 15:28:52 Process 3426002 stopped
2024-08-28 15:28:52 * thread #1, name = 'julia', stop reason = signal SIGSTOP
2024-08-28 15:28:52 frame #0: 0x00007ffff7ec635f libc.so.6`epoll_wait + 79
2024-08-28 15:28:52 libc.so.6`epoll_wait:
2024-08-28 15:28:52 -> 0x7ffff7ec635f <+79>: cmpq $-0x1000, %rax ; imm = 0xF000
2024-08-28 15:28:52 0x7ffff7ec6365 <+85>: ja 0x7ffff7ec639a ; <+138>
2024-08-28 15:28:52 0x7ffff7ec6367 <+87>: movl %r8d, %edi
2024-08-28 15:28:52 0x7ffff7ec636a <+90>: movl %eax, 0xc(%rsp)
2024-08-28 15:28:52 thread #2, name = 'julia', stop reason = signal SIGSTOP
2024-08-28 15:28:52 frame #0: 0x00007ffff7e066cc libc.so.6`__sigtimedwait + 156
[...]
2024-08-28 15:28:52 Executable module set to "/nix/store/y2jsq0p1q7523slwzn403r7w5zd98s9z-julia-1.10.2/bin/julia".
2024-08-28 15:28:52 Architecture set to: x86_64-unknown-linux-gnu.
2024-08-28 15:28:52 (lldb) pro hand -p true -s false -n false SIGSEGV
2024-08-28 15:28:52 NAME PASS STOP NOTIFY
2024-08-28 15:28:52 =========== ===== ===== ======
2024-08-28 15:28:52 SIGSEGV true false false
2024-08-28 15:28:52 (lldb) bt all
2024-08-28 15:28:52 * thread #1, name = 'julia', stop reason = signal SIGSTOP
2024-08-28 15:28:52 * frame #0: 0x00007ffff7ec635f libc.so.6`epoll_wait + 79
2024-08-28 15:28:52 frame #1: 0x00007ffff7534133 libjulia-internal.so.1.10`uv__io_poll(loop=0x00007ffff79d8ac0, timeout=<unavailable>) at epoll.c:236:7
2024-08-28 15:28:52 thread #2, name = 'julia', stop reason = signal SIGSTOP
[...]
2024-08-28 15:28:53 (lldb) expr (void) jl_print_task_backtraces(0)
2024-08-28 15:28:53 thread (1) ++++ Task backtraces
2024-08-28 15:28:53 thread (1) ==== Thread 1 created 334 live tasks
2024-08-28 15:28:53 thread (1) ---- Root task (0x7fffecf3c010)
2024-08-28 15:28:53 thread (1) (sticky: 1, started: 1, state: 0, tid: 1)
2024-08-28 15:28:53 thread (1) jl_start_fiber_swap at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/src/task.c:1433
2024-08-28 15:28:53 thread (1) ctx_switch at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/src/task.c:617
2024-08-28 15:28:53 thread (1) ijl_switch at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/src/task.c:650
2024-08-28 15:28:53 thread (1) try_yieldto at ./task.jl:921
2024-08-28 15:28:53 thread (1) wait at ./task.jl:995
2024-08-28 15:28:53 thread (1) #wait#646 at ./condition.jl:130
2024-08-28 15:28:53 thread (1) wait at ./condition.jl:125 [inlined]
2024-08-28 15:28:53 thread (1) wait at ./lock.jl:457
2024-08-28 15:28:53 thread (1) #precompile#226 at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/usr/share/julia/stdlib/v1.10/Pkg/src/API.jl:1570
2024-08-28 15:28:53 thread (1) precompile at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/usr/share/julia/stdlib/v1.10/Pkg/src/API.jl:1078 [inlined]
2024-08-28 15:28:53 thread (1) #_auto_precompile#6 at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/usr/share/julia/stdlib/v1.10/Pkg/src/Pkg.jl:805 [inlined]
2024-08-28 15:28:53 thread (1) _auto_precompile at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/usr/share/julia/stdlib/v1.10/Pkg/src/Pkg.jl:803 [inlined]
2024-08-28 15:28:53 thread (1) _auto_precompile at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/usr/share/julia/stdlib/v1.10/Pkg/src/Pkg.jl:803 [inlined]
2024-08-28 15:28:53 thread (1) #build#87 at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/usr/share/julia/stdlib/v1.10/Pkg/src/API.jl:160
2024-08-28 15:28:53 thread (1) build at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/usr/share/julia/stdlib/v1.10/Pkg/src/API.jl:148
2024-08-28 15:28:53 thread (1) #build#85 at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/usr/share/julia/stdlib/v1.10/Pkg/src/API.jl:147 [inlined]
2024-08-28 15:28:53 thread (1) build at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/usr/share/julia/stdlib/v1.10/Pkg/src/API.jl:147 [inlined]
2024-08-28 15:28:53 thread (1) #build#84 at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/usr/share/julia/stdlib/v1.10/Pkg/src/API.jl:146 [inlined]
2024-08-28 15:28:53 thread (1) build at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/usr/share/julia/stdlib/v1.10/Pkg/src/API.jl:146
2024-08-28 15:28:53 thread (1) unknown function (ip: 0x7fffec23a7d5)
2024-08-28 15:28:53 thread (1) jl_apply at /build/julia-v1.10.2+RAI-0e5b029ae532427dffec09232297920ba8f0be2b/src/julia.h:1982 [inlined]
I'm not sure if there's anything else to do here. Maybe you can just close this as "Cannot Reproduce." But we wanted to report it so that you know.
Version info:
2024-08-28 14:27:35 Julia Version 1.10.2+RAI
2024-08-28 14:27:35 Build Info:
2024-08-28 14:27:35
2024-08-28 14:27:35 Note: This is an unofficial build, please report bugs to the project
2024-08-28 14:27:35 responsible for this build and not to the Julia project unless you can
2024-08-28 14:27:35 reproduce the issue using official builds available at https://julialang.org/downloads
2024-08-28 14:27:35
2024-08-28 14:27:35 Platform Info:
2024-08-28 14:27:35 OS: Linux (x86_64-unknown-linux-gnu)
2024-08-28 14:27:35 CPU: 32 × AMD Ryzen 9 7950X3D 16-Core Processor
2024-08-28 14:27:35 WORD_SIZE: 64
2024-08-28 14:27:35 LIBM: libopenlibm
2024-08-28 14:27:35 LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
2024-08-28 14:27:35 Threads: 32 default, 2 interactive, 32 GC (on 32 virtual cores)
2024-08-28 14:27:35 Environment:
2024-08-28 14:27:35 JULIA_PROJECT = /tmp/nix-build-raicode.drv-0/rai-server-source
2024-08-28 14:27:35 JULIA_NUM_THREADS = 32,2