Segfault for simple Zygote pullback #980

Closed
tam724 opened this issue Oct 16, 2024 · 9 comments · Fixed by LuxDL/LuxLib.jl#175
Labels: bug (Something isn't working), high-priority

@tam724

tam724 commented Oct 16, 2024

When I try to run the following script, I get a segmentation fault when calling back(1.0).
Sometimes the first call crashes, sometimes it crashes after subsequent calls to back(1.0). Also the values of grad are inconsistent (NaN) when running multiple times.

using Zygote
using Lux

N = 100_000
x = rand(Float32, 2, N)
model = Chain(Dense(2, 10, Lux.relu), Dense(10, 10, Lux.relu))
ps, st = Lux.setup(Lux.Random.default_rng(), model)

function func(p)
    y, _ = Lux.apply(model, x, p, st)
    return sum(y)
end

val, back = Zygote.pullback(func, ps)
grad = back(1.0) # run this multiple times

I tracked this issue down to the Lux.jl version change from v0.5.39 (works fine) to v0.5.40 (segfaults, as does v1.1.0), and tested it with Julia versions 1.11 and 1.10.5.
It happens consistently on my machine, and I was also able to reproduce it on a different machine (though rarely).
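
A minimal sketch for spotting the inconsistent results without reading the printed arrays by eye. The helper name `inconsistent_runs` is hypothetical (it is not part of Lux or Zygote), and the stand-in function below only illustrates the call pattern:

```julia
# Hypothetical helper (not part of Lux or Zygote): call `f` repeatedly and
# return the indices of the runs whose result differs from the first run.
function inconsistent_runs(f; runs::Integer = 10)
    reference = f()
    # `isequal` treats NaN as equal to NaN, so only real differences are flagged.
    return [i for i in 2:runs if !isequal(f(), reference)]
end

# Stand-in for `() -> back(1.0)`: a deterministic computation never differs.
stable() = Float32[1, 2, 3] .* 2

bad = inconsistent_runs(stable; runs = 5)
println(isempty(bad) ? "all runs agree" : "runs that differ: $bad")
```

With the script above, the call would presumably be `inconsistent_runs(() -> back(1.0))`; a non-empty result would point at the runs whose gradients changed. This only detects wrong values, it cannot prevent the segfault itself.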

run with output (core dump)

Note the inconsistent values of grad in e.g. layer_2.weight.

julia> grad = back(1.0) # run this multiple times
((layer_1 = (weight = Float32[-3253.4514 -1274.5569; -6458.4355 -1406.1785; … ; 1378.7833 3709.4807; -7.2269697 -5.138177], bias = Float32[-5747.6567, 7613.077, -93326.4, 0.0, 96500.53, 236872.0, 0.0, 0.0, 13297.207, -224.2776]), layer_2 = (weight = Float32[0.0 0.0 … 2.0412973f13 7.7898606f12; 48.753906 125275.97 … 65118.145 20124.055; … ; 3006.7197 171585.72 … 33813.184 2454.77; 0.0 0.0 … 0.0 0.0], bias = Float32[0.0, 62086.0, 339.0, 69592.0, 85523.0, 95.0, 0.0, 80075.0, 100000.0, 0.0])),)

julia> grad = back(1.0) # run this multiple times
((layer_1 = (weight = Float32[-3253.4514 -1274.5569; -6458.4355 -1406.1785; … ; 1378.7833 3709.4807; -7.2269697 -5.138177], bias = Float32[-5747.6567, 7613.077, -93326.4, 0.0, 96500.53, 236872.0, 0.0, 0.0, 13297.207, -224.2776]), layer_2 = (weight = Float32[0.0 0.0 … 0.0 0.0; 48.753906 125275.97 … 65118.145 20124.055; … ; 3006.7197 171585.72 … 33813.184 2454.77; 0.0 0.0 … 0.0 0.0], bias = Float32[0.0, 62086.0, 339.0, 69592.0, 85523.0, 95.0, 0.0, 80075.0, 100000.0, 0.0])),)

julia> grad = back(1.0) # run this multiple times
((layer_1 = (weight = Float32[-3253.4514 -1274.5569; -6458.4355 -1406.1785; … ; 1378.7833 3709.4807; -7.2269697 -5.138177], bias = Float32[-5747.6567, 7613.077, -93326.4, 0.0, 96500.53, 236872.0, 0.0, 0.0, 13297.207, -224.2776]), layer_2 = (weight = Float32[0.0 0.0 … 2.0412973f13 7.7898606f12; 48.753906 125275.97 … 65118.145 20124.055; … ; 3006.7197 171585.72 … 33813.184 2454.77; 0.0 0.0 … 0.0 0.0], bias = Float32[0.0, 62086.0, 339.0, 69592.0, 85523.0, 95.0, 0.0, 80075.0, 100000.0, 0.0])),)

julia> grad = back(1.0) # run this multiple times
((layer_1 = (weight = Float32[-3253.4514 -1274.5569; -6458.4355 -1406.1785; … ; 1378.7833 3709.4807; -7.2269697 -5.138177], bias = Float32[-5747.6567, 7613.077, -93326.4, 0.0, 96500.53, 236872.0, 0.0, 0.0, 13297.207, -224.2776]), layer_2 = (weight = Float32[0.0 0.0 … 0.0 0.0; 48.753906 125275.97 … 65118.145 20124.055; … ; 3006.7197 171585.72 … 33813.184 2454.77; 0.0 0.0 … 0.0 0.0], bias = Float32[0.0, 62086.0, 339.0, 69592.0, 85523.0, 95.0, 0.0, 80075.0, 100000.0, 0.0])),)

julia> grad = back(1.0) # run this multiple times

[104443] signal 11 (128): Segmentation fault
in expression starting at REPL[10]:1
jl_gc_pool_alloc_inner at /cache/build/builder-amdci5-1/julialang/julia-master/src/gc.c:1335
jl_gc_pool_alloc_noinline at /cache/build/builder-amdci5-1/julialang/julia-master/src/gc.c:1392 [inlined]
jl_gc_alloc_ at /cache/build/builder-amdci5-1/julialang/julia-master/src/julia_internal.h:523 [inlined]
jl_gc_alloc at /cache/build/builder-amdci5-1/julialang/julia-master/src/gc.c:3952
_new_genericmemory_ at /cache/build/builder-amdci5-1/julialang/julia-master/src/genericmemory.c:56 [inlined]
jl_alloc_genericmemory at /cache/build/builder-amdci5-1/julialang/julia-master/src/genericmemory.c:99
GenericMemory at ./boot.jl:516 [inlined]
new_as_memoryref at ./boot.jl:535 [inlined]
Array at ./boot.jl:582 [inlined]
Array at ./boot.jl:592 [inlined]
similar at ./array.jl:361 [inlined]
similar at ./abstractarray.jl:824 [inlined]
matmul at /home/tam/.julia/packages/LuxLib/oPsev/src/impl/matmul.jl:39 [inlined]
matmul at /home/tam/.julia/packages/LuxLib/oPsev/src/impl/matmul.jl:34 [inlined]
∇matmul_bias at /home/tam/.julia/packages/LuxLib/oPsev/src/impl/dense.jl:218
∇matmul_bias at /home/tam/.julia/packages/LuxLib/oPsev/src/impl/dense.jl:217 [inlined]
#80 at /home/tam/.julia/packages/LuxLib/oPsev/src/impl/dense.jl:52
ZBack at /home/tam/.julia/packages/Zygote/NRp5C/src/compiler/chainrules.jl:212 [inlined]
fused_dense at /home/tam/.julia/packages/LuxLib/oPsev/src/impl/dense.jl:11 [inlined]
fused_dense_bias_activation at /home/tam/.julia/packages/LuxLib/oPsev/src/api/dense.jl:30 [inlined]
Dense at /home/tam/.julia/packages/Lux/VkHFW/src/layers/basic.jl:343 [inlined]
apply at /home/tam/.julia/packages/LuxCore/IBKvY/src/LuxCore.jl:155 [inlined]
applychain at /home/tam/.julia/packages/Lux/VkHFW/src/layers/containers.jl:0 [inlined]
Pullback at /home/tam/.julia/packages/Zygote/NRp5C/src/compiler/interface2.jl:0
Chain at /home/tam/.julia/packages/Lux/VkHFW/src/layers/containers.jl:480 [inlined]
apply at /home/tam/.julia/packages/LuxCore/IBKvY/src/LuxCore.jl:155 [inlined]
func at ./REPL[8]:2 [inlined]
#78 at /home/tam/.julia/packages/Zygote/NRp5C/src/compiler/interface.jl:91
unknown function (ip: 0x7f2f30598cca)
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-master/src/julia.h:2157 [inlined]
do_call at /cache/build/builder-amdci5-1/julialang/julia-master/src/interpreter.c:126
eval_value at /cache/build/builder-amdci5-1/julialang/julia-master/src/interpreter.c:223
eval_stmt_value at /cache/build/builder-amdci5-1/julialang/julia-master/src/interpreter.c:174 [inlined]
eval_body at /cache/build/builder-amdci5-1/julialang/julia-master/src/interpreter.c:663
jl_interpret_toplevel_thunk at /cache/build/builder-amdci5-1/julialang/julia-master/src/interpreter.c:821
jl_toplevel_eval_flex at /cache/build/builder-amdci5-1/julialang/julia-master/src/toplevel.c:943
jl_toplevel_eval_flex at /cache/build/builder-amdci5-1/julialang/julia-master/src/toplevel.c:886
ijl_toplevel_eval_in at /cache/build/builder-amdci5-1/julialang/julia-master/src/toplevel.c:994
eval at ./boot.jl:430 [inlined]
eval_user_input at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:226
repl_backend_loop at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:323
#start_repl_backend#59 at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:308
start_repl_backend at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:305
#run_repl#72 at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:464
run_repl at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:450
jfptr_run_repl_10212 at /home/tam/.julia/juliaup/julia-1.11.0+0.x64.linux.gnu/share/julia/compiled/v1.11/REPL/u0gqU_bFCI4.so (unknown line)
#1138 at ./client.jl:446
jfptr_YY.1138_14881 at /home/tam/.julia/juliaup/julia-1.11.0+0.x64.linux.gnu/share/julia/compiled/v1.11/REPL/u0gqU_bFCI4.so (unknown line)
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-master/src/julia.h:2157 [inlined]
jl_f__call_latest at /cache/build/builder-amdci5-1/julialang/julia-master/src/builtins.c:875
#invokelatest#2 at ./essentials.jl:1054 [inlined]
invokelatest at ./essentials.jl:1051 [inlined]
run_main_repl at ./client.jl:430
repl_main at ./client.jl:567 [inlined]
_start at ./client.jl:541
jfptr__start_72051.1 at /home/tam/.julia/juliaup/julia-1.11.0+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-master/src/julia.h:2157 [inlined]
true_main at /cache/build/builder-amdci5-1/julialang/julia-master/src/jlapi.c:900
jl_repl_entrypoint at /cache/build/builder-amdci5-1/julialang/julia-master/src/jlapi.c:1059
main at /cache/build/builder-amdci5-1/julialang/julia-master/cli/loader_exe.c:58
__libc_start_call_main at /lib64/libc.so.6 (unknown line)
__libc_start_main at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Allocations: 93044140 (Pool: 93040630; Big: 3510); GC: 45
Segmentation fault (core dumped)
julia versioninfo
julia> versioninfo()
Julia Version 1.11.0
Commit 501a4f25c2b (2024-10-07 11:40 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × 11th Gen Intel(R) Core(TM) i9-11950H @ 2.60GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, tigerlake)
Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)
@avik-pal avik-pal added the bug Something isn't working label Oct 16, 2024
@avik-pal
Member

This is going to be messy. Can't seem to reproduce this.

Ran it 100K times without a segfault.

val, back = Zygote.pullback(func, ps)
for i in 1:100_000
    grad = back(1.0) # run this multiple times
end

If you remove the last relu does it still cause issues for you?

@tam724
Author

tam724 commented Oct 16, 2024

Yes, it seems to be very machine-dependent, but after it also occurred on another machine, I thought it was worth an issue.

Without the last relu it seems to work.
However, after adding another layer without an activation function:

model = Chain(Dense(2, 10, Lux.relu), Dense(10, 10, relu), Dense(10, 10))

I get the same segfault.

@avik-pal
Member

Can you try using LuxDL/LuxLib.jl#175 and check if it still segfaults?

@tam724
Author

tam724 commented Oct 16, 2024

Yes, still occurs.

dump
...
(temp) pkg> status
Status `~/Projects/temp/Project.toml`
  [b2108857] Lux v1.1.0
  [82251201] LuxLib v1.3.4 `https://github.com/LuxDL/LuxLib.jl.git#a37aef0`
  [e88e6eb3] Zygote v0.6.72

julia> val, back = Zygote.pullback(func, ps);

julia> grad = back(1.0) # run this multiple times

[365506] signal 11 (128): Segmentation fault
in expression starting at none:0
jl_gc_pool_alloc_inner at /cache/build/builder-amdci5-1/julialang/julia-master/src/gc.c:1335
ijl_gc_pool_alloc_instrumented at /cache/build/builder-amdci5-1/julialang/julia-master/src/gc.c:1383
IncrementalCompact at ./compiler/ssair/ir.jl:701
IncrementalCompact at ./compiler/ssair/ir.jl:724 [inlined]
compact! at ./compiler/ssair/ir.jl:2001
compact! at ./compiler/ssair/ir.jl:2001 [inlined]
run_passes_ipo_safe at ./compiler/optimize.jl:994
run_passes_ipo_safe at ./compiler/optimize.jl:1009 [inlined]
optimize at ./compiler/optimize.jl:983
jfptr_optimize_42409.1 at /home/tam/.julia/juliaup/julia-1.11.0+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
finish_nocycle at ./compiler/typeinfer.jl:265
_typeinf at ./compiler/typeinfer.jl:249
typeinf at ./compiler/typeinfer.jl:215
const_prop_call at ./compiler/abstractinterpretation.jl:1278
abstract_call_method_with_const_args at ./compiler/abstractinterpretation.jl:853
abstract_call_method_with_const_args at ./compiler/abstractinterpretation.jl:823
abstract_call_gf_by_type at ./compiler/abstractinterpretation.jl:111
abstract_call_known at ./compiler/abstractinterpretation.jl:2200
abstract_call at ./compiler/abstractinterpretation.jl:2282
abstract_call at ./compiler/abstractinterpretation.jl:2275
abstract_call at ./compiler/abstractinterpretation.jl:2420
abstract_eval_call at ./compiler/abstractinterpretation.jl:2435
abstract_eval_statement_expr at ./compiler/abstractinterpretation.jl:2451
abstract_eval_statement at ./compiler/abstractinterpretation.jl:2749
abstract_eval_basic_statement at ./compiler/abstractinterpretation.jl:3065
typeinf_local at ./compiler/abstractinterpretation.jl:3319
typeinf_nocycle at ./compiler/abstractinterpretation.jl:3401
_typeinf at ./compiler/typeinfer.jl:244
typeinf at ./compiler/typeinfer.jl:215
const_prop_call at ./compiler/abstractinterpretation.jl:1278
abstract_call_method_with_const_args at ./compiler/abstractinterpretation.jl:853
abstract_call_method_with_const_args at ./compiler/abstractinterpretation.jl:823
abstract_call_gf_by_type at ./compiler/abstractinterpretation.jl:111
abstract_call_known at ./compiler/abstractinterpretation.jl:2200
abstract_call at ./compiler/abstractinterpretation.jl:2282
abstract_call at ./compiler/abstractinterpretation.jl:2275
abstract_call at ./compiler/abstractinterpretation.jl:2420
abstract_eval_call at ./compiler/abstractinterpretation.jl:2435
abstract_eval_statement_expr at ./compiler/abstractinterpretation.jl:2451
abstract_eval_statement at ./compiler/abstractinterpretation.jl:2749
abstract_eval_basic_statement at ./compiler/abstractinterpretation.jl:3065
typeinf_local at ./compiler/abstractinterpretation.jl:3319
typeinf_nocycle at ./compiler/abstractinterpretation.jl:3401
_typeinf at ./compiler/typeinfer.jl:244
typeinf at ./compiler/typeinfer.jl:215
const_prop_call at ./compiler/abstractinterpretation.jl:1278
abstract_call_method_with_const_args at ./compiler/abstractinterpretation.jl:853
abstract_call_method_with_const_args at ./compiler/abstractinterpretation.jl:823
abstract_call_gf_by_type at ./compiler/abstractinterpretation.jl:111
abstract_call_known at ./compiler/abstractinterpretation.jl:2200
abstract_call at ./compiler/abstractinterpretation.jl:2282
abstract_call at ./compiler/abstractinterpretation.jl:2275
abstract_call at ./compiler/abstractinterpretation.jl:2420
abstract_eval_call at ./compiler/abstractinterpretation.jl:2435
abstract_eval_statement_expr at ./compiler/abstractinterpretation.jl:2451
abstract_eval_statement at ./compiler/abstractinterpretation.jl:2749
abstract_eval_basic_statement at ./compiler/abstractinterpretation.jl:3041
typeinf_local at ./compiler/abstractinterpretation.jl:3319
typeinf_nocycle at ./compiler/abstractinterpretation.jl:3401
_typeinf at ./compiler/typeinfer.jl:244
typeinf at ./compiler/typeinfer.jl:215
const_prop_call at ./compiler/abstractinterpretation.jl:1278
abstract_call_method_with_const_args at ./compiler/abstractinterpretation.jl:853
abstract_call_method_with_const_args at ./compiler/abstractinterpretation.jl:823
abstract_call_gf_by_type at ./compiler/abstractinterpretation.jl:111
abstract_call_known at ./compiler/abstractinterpretation.jl:2200
abstract_call at ./compiler/abstractinterpretation.jl:2282
abstract_apply at ./compiler/abstractinterpretation.jl:1690
abstract_call_known at ./compiler/abstractinterpretation.jl:2102
abstract_call at ./compiler/abstractinterpretation.jl:2282
abstract_call at ./compiler/abstractinterpretation.jl:2275
abstract_call at ./compiler/abstractinterpretation.jl:2420
abstract_eval_call at ./compiler/abstractinterpretation.jl:2435
abstract_eval_statement_expr at ./compiler/abstractinterpretation.jl:2451
abstract_eval_statement at ./compiler/abstractinterpretation.jl:2749
abstract_eval_basic_statement at ./compiler/abstractinterpretation.jl:3065
typeinf_local at ./compiler/abstractinterpretation.jl:3319
typeinf_nocycle at ./compiler/abstractinterpretation.jl:3401
_typeinf at ./compiler/typeinfer.jl:244
typeinf at ./compiler/typeinfer.jl:215
const_prop_call at ./compiler/abstractinterpretation.jl:1278
abstract_call_method_with_const_args at ./compiler/abstractinterpretation.jl:853
abstract_call_method_with_const_args at ./compiler/abstractinterpretation.jl:823
abstract_call_gf_by_type at ./compiler/abstractinterpretation.jl:111
abstract_call_known at ./compiler/abstractinterpretation.jl:2200
abstract_call at ./compiler/abstractinterpretation.jl:2282
abstract_call at ./compiler/abstractinterpretation.jl:2275
abstract_call at ./compiler/abstractinterpretation.jl:2420
abstract_eval_call at ./compiler/abstractinterpretation.jl:2435
abstract_eval_statement_expr at ./compiler/abstractinterpretation.jl:2451
abstract_eval_statement at ./compiler/abstractinterpretation.jl:2749
abstract_eval_basic_statement at ./compiler/abstractinterpretation.jl:3065
typeinf_local at ./compiler/abstractinterpretation.jl:3319
typeinf_nocycle at ./compiler/abstractinterpretation.jl:3401
_typeinf at ./compiler/typeinfer.jl:244
typeinf at ./compiler/typeinfer.jl:215
typeinf_edge at ./compiler/typeinfer.jl:923
abstract_call_method at ./compiler/abstractinterpretation.jl:660
abstract_call_gf_by_type at ./compiler/abstractinterpretation.jl:102
abstract_call_known at ./compiler/abstractinterpretation.jl:2200
abstract_call at ./compiler/abstractinterpretation.jl:2282
abstract_call at ./compiler/abstractinterpretation.jl:2275
abstract_call at ./compiler/abstractinterpretation.jl:2420
abstract_eval_call at ./compiler/abstractinterpretation.jl:2435
abstract_eval_statement_expr at ./compiler/abstractinterpretation.jl:2451
abstract_eval_statement at ./compiler/abstractinterpretation.jl:2749
abstract_eval_basic_statement at ./compiler/abstractinterpretation.jl:3065
typeinf_local at ./compiler/abstractinterpretation.jl:3319
typeinf_nocycle at ./compiler/abstractinterpretation.jl:3401
_typeinf at ./compiler/typeinfer.jl:244
typeinf at ./compiler/typeinfer.jl:215
typeinf_edge at ./compiler/typeinfer.jl:923
abstract_call_method at ./compiler/abstractinterpretation.jl:660
abstract_call_gf_by_type at ./compiler/abstractinterpretation.jl:102
abstract_call_known at ./compiler/abstractinterpretation.jl:2200
abstract_call at ./compiler/abstractinterpretation.jl:2282
abstract_call at ./compiler/abstractinterpretation.jl:2275
abstract_call at ./compiler/abstractinterpretation.jl:2420
abstract_eval_call at ./compiler/abstractinterpretation.jl:2435
abstract_eval_statement_expr at ./compiler/abstractinterpretation.jl:2451
abstract_eval_statement at ./compiler/abstractinterpretation.jl:2749
abstract_eval_basic_statement at ./compiler/abstractinterpretation.jl:3065
typeinf_local at ./compiler/abstractinterpretation.jl:3319
typeinf_nocycle at ./compiler/abstractinterpretation.jl:3401
_typeinf at ./compiler/typeinfer.jl:244
typeinf at ./compiler/typeinfer.jl:215
typeinf_edge at ./compiler/typeinfer.jl:923
abstract_call_method at ./compiler/abstractinterpretation.jl:660
abstract_call_gf_by_type at ./compiler/abstractinterpretation.jl:102
abstract_call_known at ./compiler/abstractinterpretation.jl:2200
abstract_call at ./compiler/abstractinterpretation.jl:2282
abstract_call at ./compiler/abstractinterpretation.jl:2275
abstract_call at ./compiler/abstractinterpretation.jl:2420
abstract_eval_call at ./compiler/abstractinterpretation.jl:2435
abstract_eval_statement_expr at ./compiler/abstractinterpretation.jl:2451
abstract_eval_statement at ./compiler/abstractinterpretation.jl:2749
abstract_eval_basic_statement at ./compiler/abstractinterpretation.jl:3065
typeinf_local at ./compiler/abstractinterpretation.jl:3319
typeinf_nocycle at ./compiler/abstractinterpretation.jl:3401
_typeinf at ./compiler/typeinfer.jl:244
typeinf at ./compiler/typeinfer.jl:215
typeinf_edge at ./compiler/typeinfer.jl:923
abstract_call_method at ./compiler/abstractinterpretation.jl:660
abstract_call_gf_by_type at ./compiler/abstractinterpretation.jl:102
abstract_call_known at ./compiler/abstractinterpretation.jl:2200
abstract_call at ./compiler/abstractinterpretation.jl:2282
abstract_call at ./compiler/abstractinterpretation.jl:2275
abstract_call at ./compiler/abstractinterpretation.jl:2420
abstract_eval_call at ./compiler/abstractinterpretation.jl:2435
abstract_eval_statement_expr at ./compiler/abstractinterpretation.jl:2451
abstract_eval_statement at ./compiler/abstractinterpretation.jl:2749
abstract_eval_basic_statement at ./compiler/abstractinterpretation.jl:3065
typeinf_local at ./compiler/abstractinterpretation.jl:3319
typeinf_nocycle at ./compiler/abstractinterpretation.jl:3401
_typeinf at ./compiler/typeinfer.jl:244
typeinf at ./compiler/typeinfer.jl:215
typeinf_ext at ./compiler/typeinfer.jl:1101
typeinf_ext_toplevel at ./compiler/typeinfer.jl:1139
typeinf_ext_toplevel at ./compiler/typeinfer.jl:1135
jfptr_typeinf_ext_toplevel_39604.1 at /home/tam/.julia/juliaup/julia-1.11.0+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-master/src/julia.h:2157 [inlined]
jl_type_infer at /cache/build/builder-amdci5-1/julialang/julia-master/src/gf.c:390
jl_generate_fptr_impl at /cache/build/builder-amdci5-1/julialang/julia-master/src/jitlayers.cpp:511
jl_compile_method_internal at /cache/build/builder-amdci5-1/julialang/julia-master/src/gf.c:2536 [inlined]
jl_compile_method_internal at /cache/build/builder-amdci5-1/julialang/julia-master/src/gf.c:2423
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-master/src/gf.c:2940 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-master/src/gf.c:3125
#68 at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:348
jfptr_YY.68_10156 at /home/tam/.julia/juliaup/julia-1.11.0+0.x64.linux.gnu/share/julia/compiled/v1.11/REPL/u0gqU_bFCI4.so (unknown line)
with_repl_linfo at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:646
jfptr_with_repl_linfo_10298 at /home/tam/.julia/juliaup/julia-1.11.0+0.x64.linux.gnu/share/julia/compiled/v1.11/REPL/u0gqU_bFCI4.so (unknown line)
display at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:334
display at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:353 [inlined]
display at ./multimedia.jl:340
jfptr_display_13763 at /home/tam/.julia/juliaup/julia-1.11.0+0.x64.linux.gnu/share/julia/compiled/v1.11/REPL/u0gqU_bFCI4.so (unknown line)
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-master/src/julia.h:2157 [inlined]
jl_f__call_latest at /cache/build/builder-amdci5-1/julialang/julia-master/src/builtins.c:875
#invokelatest#2 at ./essentials.jl:1054 [inlined]
invokelatest at ./essentials.jl:1051 [inlined]
print_response at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:390
#70 at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:359
jfptr_YY.70_10194 at /home/tam/.julia/juliaup/julia-1.11.0+0.x64.linux.gnu/share/julia/compiled/v1.11/REPL/u0gqU_bFCI4.so (unknown line)
with_repl_linfo at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:646
jfptr_with_repl_linfo_10298 at /home/tam/.julia/juliaup/julia-1.11.0+0.x64.linux.gnu/share/julia/compiled/v1.11/REPL/u0gqU_bFCI4.so (unknown line)
print_response at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:357
do_respond at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:988
jfptr_do_respond_10361 at /home/tam/.julia/juliaup/julia-1.11.0+0.x64.linux.gnu/share/julia/compiled/v1.11/REPL/u0gqU_bFCI4.so (unknown line)
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-master/src/julia.h:2157 [inlined]
jl_f__call_latest at /cache/build/builder-amdci5-1/julialang/julia-master/src/builtins.c:875
#invokelatest#2 at ./essentials.jl:1054 [inlined]
invokelatest at ./essentials.jl:1051 [inlined]
run_interface at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/LineEdit.jl:2749
jfptr_run_interface_8811 at /home/tam/.julia/juliaup/julia-1.11.0+0.x64.linux.gnu/share/julia/compiled/v1.11/REPL/u0gqU_bFCI4.so (unknown line)
run_frontend at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:1456
#75 at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:461
jfptr_YY.75_10252 at /home/tam/.julia/juliaup/julia-1.11.0+0.x64.linux.gnu/share/julia/compiled/v1.11/REPL/u0gqU_bFCI4.so (unknown line)
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-master/src/julia.h:2157 [inlined]
start_task at /cache/build/builder-amdci5-1/julialang/julia-master/src/task.c:1202
Allocations: 84364785 (Pool: 84362236; Big: 2549); GC: 55
Segmentation fault (core dumped)

@tam724
Author

tam724 commented Oct 17, 2024

I started digging into this and was able to identify the call to matmul in the reverse pass of fused_dense as the cause of the segfault.
The segfault can be isolated even further: it is independent of Lux.jl/LuxLib.jl, which just happen to use Octavian.jl. Even this segfaults on my machine (and/or gives wrong results):

using Pkg

Pkg.activate(temp=true)
Pkg.add("Octavian")

using Octavian

N = 100_000
a = rand(Float32, 10, N)
b = rand(Float32, N, 10)

function my_matmul(a, b)
    c = similar(a, promote_type(eltype(a), eltype(b)), size(a, 1), size(b, 2))
    Octavian.matmul!(c, a, b, true, false)
    return c
end

for i in 1:100
    my_matmul(a, b)
end

This can also be reproduced more reliably.
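
To catch the "wrong results" case as well as the crash, the output can be compared against Julia's built-in matrix product. The sketch below is an illustration only: `count_mismatches` is a hypothetical helper, and it is shown with the built-in `*` and smaller matrices so that it runs without Octavian installed.

```julia
using LinearAlgebra

# Hypothetical helper: run `matmul_fn(a, b)` `trials` times and count how many
# results disagree with the reference product computed by Julia's built-in `*`.
function count_mismatches(matmul_fn, a, b; trials::Integer = 100)
    reference = a * b
    # A result containing NaN never compares as approximately equal,
    # so corrupted outputs are counted as mismatches.
    return count(_ -> !isapprox(matmul_fn(a, b), reference), 1:trials)
end

a = rand(Float32, 10, 1_000)
b = rand(Float32, 1_000, 10)

# Sanity check with the built-in product; pass `my_matmul` instead to test Octavian.
println(count_mismatches(*, a, b; trials = 5))
```

Passing `my_matmul` from the snippet above in place of `*` would, under these assumptions, report how many of the runs returned a product that differs from the reference.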

@avik-pal
Member

This is unfortunate; Octavian and LoopVectorization (LV) are quite fast in the small-parameter regime. I can make Octavian/LV opt-in, so that those optimized kernels are used only if users load the packages.

@avik-pal avik-pal self-assigned this Oct 17, 2024
@avik-pal avik-pal linked a pull request Oct 17, 2024 that will close this issue
@tam724
Author

tam724 commented Oct 18, 2024

Thanks! This sounds like a tolerable workaround.
LuxDL/LuxLib.jl#175 fixes the issue for me and works as expected. (The MWE works, but obviously breaks after adding using Octavian).

@avik-pal
Member

The MWE works, but obviously breaks after adding using Octavian

I will add a preference to completely disable it as well, but I am waiting for an Enzyme fix before I can merge the PR.

@avik-pal
Member

Added in the PR:

[LuxLib]
disable_loop_vectorization = true
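
The thread does not say where this fragment goes. Assuming it is a Preferences.jl-style preference, it would live in a `LocalPreferences.toml` file next to the project's `Project.toml`. A minimal sketch that writes it with the TOML standard library (the helper name `write_luxlib_preference` is hypothetical, and a temporary directory stands in for the real project directory):

```julia
using TOML

# Hypothetical helper: merge the preference from the thread into the
# LocalPreferences.toml file inside `project_dir`, keeping existing entries.
function write_luxlib_preference(project_dir::AbstractString)
    path = joinpath(project_dir, "LocalPreferences.toml")
    prefs = isfile(path) ? TOML.parsefile(path) : Dict{String,Any}()
    luxlib = get!(prefs, "LuxLib", Dict{String,Any}())
    luxlib["disable_loop_vectorization"] = true
    open(path, "w") do io
        TOML.print(io, prefs)
    end
    return path
end

# Demonstration in a temporary directory; use the real project directory instead.
path = write_luxlib_preference(mktempdir())
print(read(path, String))
```

Preferences of this kind are typically read when the package is loaded or precompiled, so a fresh Julia session is likely needed after changing the file.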
