
vulkan: build fixes for 32b #10927

Merged · 2 commits into ggerganov:master on Dec 22, 2024

Conversation

jeffbolznv
Collaborator

Should fix #10923

@jeffbolznv requested a review from 0cc4m on Dec 21, 2024 at 05:21
@github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Dec 21, 2024
@ggerganov
Owner

The ggml-ci has been reporting "maybe uninitialized" warnings for a while:

https://github.com/ggml-org/ci/blob/31168d7a582ded11a0dec489a62fb8bef74349a8/llama.cpp/a9/1a41364b25705dbb81ae996bc35c3440c63b35/ggml-6-x86-vulkan-t4/stdall#L538

Might want to fix these too.
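For context, GCC's -Wmaybe-uninitialized fires when some control-flow path can reach a read of a variable before any assignment. A minimal hypothetical illustration in C++ (not the actual ggml-vulkan code, just the shape of the warning and the usual fix):

// Illustrative only -- not the ggml source.
// gcc -O2 -Wmaybe-uninitialized warns on scale_for():
int scale_for(int type) {
    int scale;              // warning: 'scale' may be used uninitialized
    if (type == 0) {
        scale = 1;
    } else if (type == 1) {
        scale = 2;
    }                       // no else branch, so 'scale' may be unset here
    return scale;
}

int scale_for_fixed(int type) {
    int scale = 0;          // fix: give the variable a defined initial value
    if (type == 0) {
        scale = 1;
    } else if (type == 1) {
        scale = 2;
    }
    return scale;
}

Initializing at the declaration is the usual fix because it silences the warning on every path without changing behavior on the paths that do assign.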

@jeffbolznv
Collaborator Author

Second commit ought to fix the uninitialized variables, though I couldn't reproduce the warnings/errors locally.

@0cc4m merged commit ebdee94 into ggerganov:master on Dec 22, 2024
48 checks passed
@ggerganov
Owner

ggerganov commented Dec 22, 2024

Thanks. Also, I remember you recently discussed the segfault upon program exit, but I'm not sure which discussion it was. Do you have any ideas on how this could be resolved? It's preventing the ggml-ci from running beyond the first test. (cc @netrunnereve)

@netrunnereve
Collaborator

For what it's worth, here's the thread discussing the segfault: #10528. It seems to be intermittent, so if you restart the CI you might be able to avoid it for now.

@jeffbolznv
Collaborator Author

Right, the discussion is in #10528. I don't currently have a Linux system to repro it, but if @0cc4m isn't able to work on it soon, I might be able to set something up.

@ggerganov
Owner

It's no longer segfaulting after the restart. However, the CI appears to have revealed an issue when computing embeddings:

https://github.com/ggml-org/ci/tree/results/llama.cpp/eb/dee9478ca7ba65497b9b96f7457698c6ee5115/ggml-6-x86-vulkan-t4

@ggerganov
Owner

This command randomly segfaults upon exit:

./bin/llama-embedding --model ../models-mnt/rerank-tiny/ggml-model-f16.gguf -p "what is panda?</s></s>hi\nwhat is panda?</s></s>it's a bear\nwhat is panda?</s></s>The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China." -ngl 99 -c 0 --pooling rank --embd-normalize -1 --verbose-prompt

I tried to get a stack trace, but it's optimized out even in a Debug build for some reason:

batch_decode: n_tokens = 62, n_seq = 3

rerank score 0:    0.023
rerank score 1:    0.024
rerank score 2:    0.199

llama_perf_context_print:        load time =    1679.94 ms
llama_perf_context_print: prompt eval time =       7.25 ms /    62 tokens (    0.12 ms per token,  8556.44 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =      11.52 ms /    63 tokens
[Thread 0x7fffe2a006c0 (LWP 78286) exited]
[Thread 0x7fffdea006c0 (LWP 78291) exited]
[Thread 0x7fffe20006c0 (LWP 78287) exited]

Thread 6 "[vkps] Update" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffdf4006c0 (LWP 78290)]
0x00007fffe4801960 in ?? ()
(gdb) bt
#0  0x00007fffe4801960 in ?? ()
#1  0x0000000067685496 in ?? ()
#2  0x0000000036c9f0a0 in ?? ()
#3  0x0000000067685496 in ?? ()
#4  0x00000000000deb4c in ?? ()
#5  0x0000000000000007 in ?? ()
#6  0x00005555574cc838 in ?? ()
#7  0x181391d9e7cb40e8 in ?? ()
#8  0x00007fffe5c84320 in ?? ()
#9  0x0000555555a737d0 in ?? ()
#10 0x00007fffe4b392b4 in ?? ()
#11 0x00005555574cc958 in ?? ()
#12 0x181391d9f98f5968 in ?? ()
#13 0x181391d9e7af0a40 in ?? ()
#14 0x00005555574be3e0 in ?? ()
#15 0x203a6362696c6720 in ?? ()
#16 0x000055555768a3f0 in ?? ()
#17 0x00007fffe4803e20 in ?? ()
#18 0x00007fffdf4006c0 in ?? ()
#19 0xffffffffffffff60 in ?? ()
#20 0x0000000000000002 in ?? ()
#21 0x00007fffffffa3e0 in ?? ()
#22 0x00007fffe4804dfa in ?? ()
#23 0x00007fffdf4006c0 in ?? ()
#24 0x00007fffdf400cdc in ?? ()
#25 0x00007fffdf3ffef0 in ?? ()
#26 0x00007ffff669ca94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) q

@jeffbolznv
Collaborator Author

Based on the thread name it's the same as #10528. The stack is entirely in a driver thread, so I wouldn't expect to be able to get a useful stack trace.
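As an aside, exit-time crashes in library-owned threads like this typically come down to a teardown race: a worker thread the library spawned is still running while the process destroys state it depends on, so every faulting frame sits in the library rather than the application. A minimal hypothetical C++ sketch of that general failure mode (not the actual driver or llama.cpp code):

#include <chrono>
#include <cstddef>
#include <string>
#include <thread>

// a global destroyed during static destruction at process exit
static std::string g_state = "alive";

int main() {
    // detached worker keeps touching g_state after main() returns,
    // racing the runtime's destruction of globals
    std::thread([] {
        for (;;) {
            volatile std::size_t n = g_state.size();
            (void)n;
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    }).detach();
    return 0;   // teardown begins while the worker may still be running
}

In a setup like this, the backtrace of the crashing thread shows only library and runtime frames, which matches the gdb output above.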

Labels: Vulkan (Issues specific to the Vulkan backend), ggml (changes relating to the ggml tensor library for machine learning)
Linked issue: Compile bug: macOS Vulkan build fails (#10923)