Segfault for simple Zygote pullback #980
This is going to be messy. Can't seem to reproduce this. Ran it 100K times without a segfault.

val, back = Zygote.pullback(func, ps)
for i in 1:100_000
    grad = back(1.0) # run this multiple times
end

If you remove the last layer, does it still segfault?
Yes, it seems to be very machine dependent, but after it also occurred on another machine, I thought it was worth an issue. Without the last layer, i.e. with

model = Chain(Dense(2, 10, Lux.relu), Dense(10, 10, relu), Dense(10, 10))

I get the same segfault.
Can you try using LuxDL/LuxLib.jl#175 and check if it still segfaults?
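For anyone following along, one way to test an unreleased LuxLib revision is to add it straight from the repository with Pkg; the branch name below is a placeholder, not the actual branch behind LuxDL/LuxLib.jl#175:

using Pkg
# Try the PR in a throwaway environment; replace "branch-name" with the PR's branch.
Pkg.activate(temp=true)
Pkg.add(url="https://github.com/LuxDL/LuxLib.jl", rev="branch-name")
Pkg.add(["Lux", "Zygote"])   # rest of the reproduction environment
using LuxLib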
Yes, still occurs.

dump
I started digging into this and was able to identify the call to Octavian.matmul! as the culprit:

using Pkg
Pkg.activate(temp=true)
Pkg.add("Octavian")
using Octavian

N = 100_000
a = rand(Float32, 10, N)
b = rand(Float32, N, 10)

function my_matmul(a, b)
    c = similar(a, promote_type(eltype(a), eltype(b)), size(a, 1), size(b, 2))
    Octavian.matmul!(c, a, b, true, false)
    return c
end

for i in 1:100
    my_matmul(a, b)
end

This also reproduces the crash more reliably.
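Not from the thread, but as a diagnostic sketch: swapping Octavian.matmul! for LinearAlgebra's five-argument mul! in the same loop is a quick way to confirm the fault is specific to the Octavian/LoopVectorization kernel (my_matmul_base is a made-up helper name):

using LinearAlgebra

a = rand(Float32, 10, 100_000)
b = rand(Float32, 100_000, 10)

# Same shape-promoting wrapper as above, but with the BLAS-backed
# mul!(C, A, B, alpha, beta), i.e. C = 1*A*B + 0*C.
function my_matmul_base(a, b)
    c = similar(a, promote_type(eltype(a), eltype(b)), size(a, 1), size(b, 2))
    mul!(c, a, b, true, false)
    return c
end

for i in 1:100
    my_matmul_base(a, b)   # expected to run cleanly if the crash is in Octavian
end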
This is unfortunate; Octavian and LoopVectorization are quite fast in the small-parameter regime. I can make Octavian/LV opt-in, so the optimized kernels are only used if users explicitly load those packages.
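For illustration, the opt-in described here typically uses Julia's weak-dependency / package-extension mechanism; the names below (FastDenseLib, matmul_backend!) are invented for this sketch and are not LuxLib's actual internals:

# Parent package: always defines a safe generic kernel.
module FastDenseLib
using LinearAlgebra
matmul_backend!(C, A, B) = mul!(C, A, B)   # plain BLAS fallback
end

# Extension module: in a real package this lives in ext/ and is declared under
# [weakdeps]/[extensions] in Project.toml, so it only loads when the user also
# does `using Octavian`; loading it swaps in the LoopVectorization-based kernel.
module FastDenseLibOctavianExt
using ..FastDenseLib, Octavian
FastDenseLib.matmul_backend!(C, A, B) = Octavian.matmul!(C, A, B)
end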
Thanks! This sounds like a tolerable workaround.
I will add a preference to completely disable it as well, but I am waiting for an Enzyme fix before I can merge the PR.
Added in the PR:

[LuxLib]
disable_loop_vectorization = true
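For reference, the same preference can presumably also be set programmatically with Preferences.jl rather than editing LocalPreferences.toml by hand (the key name comes from the snippet above; a compile-time preference only takes effect after restarting Julia):

using LuxLib, Preferences
# Writes the [LuxLib] entry into LocalPreferences.toml of the active project.
set_preferences!(LuxLib, "disable_loop_vectorization" => true)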
When I try to run the following script, I get a segmentation fault when calling back(1.0). Sometimes the first call crashes, sometimes it crashes only after subsequent calls to back(1.0). Also, the values of grad are inconsistent (NaN) when running multiple times. I was able to track this issue to the version change from v0.5.39 (works fine) to v0.5.40 (segfault, as in v1.1.0) of Lux.jl, and tested it with Julia versions 1.11 and 1.10.5.
It happens consistently on my machine, but I was also able to reproduce it on a different machine (though rarely).

run with output (core dump)

Note the inconsistent values of grad in e.g. layer_2.weight.

julia versioninfo
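The script itself sits behind the collapsed sections above. As a minimal sketch assembled from the snippets quoted in this thread (the actual loss function func, input data, and full model are not shown, so those parts are assumptions), the reproduction would look roughly like:

using Lux, Zygote, Random

# The Chain quoted earlier in the thread.
model = Chain(Dense(2, 10, relu), Dense(10, 10, relu), Dense(10, 10))

rng = Random.default_rng()
ps, st = Lux.setup(rng, model)

x = rand(Float32, 2, 16)                        # placeholder input; shape assumed
func(p) = sum(abs2, first(model(x, p, st)))     # placeholder scalar loss, not the original

val, back = Zygote.pullback(func, ps)
for i in 1:100_000
    grad = back(1.0)   # the reported segfault / NaN gradients occur in these calls
end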