Labels
CUDA (Related to the CUDA backend), Nvidia GPU (Issues specific to Nvidia GPUs), bug (Something isn't working)
Description
Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 2: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
Device 3: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
version: 7285 (6016d0bd4)
built with GNU 15.2.1 for Linux x86_64
This is the current build at the time of this writing. Built with:
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1 -DCMAKE_CUDA_COMPILER=/usr/local/cuda-13/bin/nvcc
The issue also goes back at least a couple of days, and probably further.
Operating systems
Linux
GGML backends
CUDA
Hardware
I restrict the model to a single RTX 4090 (CUDA_VISIBLE_DEVICES=0).
Models
BGE-M3 embeddings model
The model was downloaded from https://huggingface.co/lm-kit/bge-m3-gguf
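For completeness, one way to fetch the F16 file used in the command below (a sketch; the exact filename in that repository may differ):
curl -L -o bge-m3-F16.gguf \
  https://huggingface.co/lm-kit/bge-m3-gguf/resolve/main/bge-m3-F16.gguf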
Problem description & steps to reproduce
When using the bge-m3 model I get a CUDA misaligned-address crash.
Request to reproduce (if I make the input 4096 characters long instead of 4093 it does not crash, but I am not sure the output is correct with all the nulls):
curl -s http://localhost:8081/v1/embeddings \
-X POST \
-H "Content-Type: application/json" \
-d '{"model": "bge-m3", "input": "'$(perl -e 'print("a"x4093)')'"}' | jq .data;
First Bad Commit
I don't know exactly when the regression was introduced, but e.g. ik_llama still works, so it must be fairly recent.
Relevant log output
main: starting the main loop...
srv update_slots: all slots are idle
slot get_availabl: id 3 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 3 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 3 | task 0 | processing task
slot update_slots: id 3 | task 0 | new prompt, n_ctx_slot = 8192, n_keep = 0, task.n_tokens = 823
slot update_slots: id 3 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 823, batch.n_tokens = 823, progress = 1.000000
slot update_slots: id 3 | task 0 | prompt done, n_tokens = 823, batch.n_tokens = 823
/home/green/git/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:92: CUDA error
CUDA error: misaligned address
current device: 0, in function ggml_backend_cuda_get_tensor_async at /home/green/git/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2778
cudaMemcpyAsync(data, (const char *)tensor->data + offset, size, cudaMemcpyDeviceToHost, cuda_ctx->stream())
[New LWP 226676]
[New LWP 226675]
[New LWP 226674]
[New LWP 226673]
[New LWP 226672]
[New LWP 226671]
[New LWP 226670]
[New LWP 226669]
[New LWP 226668]
[New LWP 226667]
[New LWP 226666]
[New LWP 226665]
[New LWP 226664]
[New LWP 226663]
[New LWP 226662]
[New LWP 226661]
[New LWP 226660]
[New LWP 226659]
[New LWP 226658]
[New LWP 226657]
[New LWP 226656]
[New LWP 226655]
[New LWP 226654]
[New LWP 226653]
[New LWP 226652]
[New LWP 226651]
[New LWP 226650]
[New LWP 226649]
[New LWP 226648]
[New LWP 226647]
[New LWP 226646]
[New LWP 226645]
[New LWP 226644]
[New LWP 226643]
[New LWP 226642]
[New LWP 226641]
[New LWP 226640]
[New LWP 226639]
[New LWP 226638]
[New LWP 226637]
[New LWP 226636]
[New LWP 226635]
[New LWP 226634]
[New LWP 226633]
[New LWP 226632]
[New LWP 226631]
[New LWP 226630]
[New LWP 226629]
[New LWP 226628]
[New LWP 226627]
[New LWP 226626]
[New LWP 226625]
[New LWP 226624]
[New LWP 226623]
[New LWP 226622]
[New LWP 226621]
[New LWP 226620]
[New LWP 226619]
[New LWP 226618]
[New LWP 226617]
[New LWP 226616]
[New LWP 226615]
[New LWP 226614]
[New LWP 226613]
[New LWP 226612]
[New LWP 226611]
[New LWP 226610]
[New LWP 226609]
[New LWP 226608]
[New LWP 226607]
[New LWP 226606]
[New LWP 226605]
[New LWP 226604]
[New LWP 226603]
[New LWP 226602]
[New LWP 226601]
[New LWP 226600]
[New LWP 226599]
[New LWP 226598]
[New LWP 226597]
[New LWP 226596]
[New LWP 226595]
[New LWP 226594]
[New LWP 226593]
[New LWP 226592]
[New LWP 226591]
[New LWP 226590]
[New LWP 226589]
[New LWP 226588]
[New LWP 226587]
[New LWP 226586]
[New LWP 226585]
[New LWP 226584]
[New LWP 226583]
[New LWP 226582]
[New LWP 226581]
[New LWP 226580]
[New LWP 226579]
[New LWP 226578]
[New LWP 226577]
[New LWP 226576]
[New LWP 226575]
[New LWP 226574]
[New LWP 226573]
[New LWP 226572]
[New LWP 226571]
[New LWP 226570]
[New LWP 226569]
[New LWP 226568]
[New LWP 226567]
[New LWP 226566]
[New LWP 226565]
[New LWP 226564]
[New LWP 226563]
[New LWP 226562]
[New LWP 226561]
[New LWP 226560]
[New LWP 226559]
[New LWP 226558]
[New LWP 226557]
[New LWP 226556]
[New LWP 226555]
[New LWP 226554]
[New LWP 226553]
[New LWP 226552]
[New LWP 226551]
[New LWP 226550]
[New LWP 226549]
[New LWP 226548]
[New LWP 226547]
[New LWP 226546]
[New LWP 226545]
[New LWP 226544]
[New LWP 226543]
[New LWP 226542]
[New LWP 226541]
[New LWP 226540]
[New LWP 226539]
[New LWP 226538]
[New LWP 226537]
[New LWP 226536]
[New LWP 226535]
[New LWP 226534]
[New LWP 226533]
[New LWP 226532]
[New LWP 226531]
[New LWP 226530]
[New LWP 226529]
[New LWP 226528]
[New LWP 226527]
[New LWP 226526]
[New LWP 226525]
[New LWP 226524]
[New LWP 226523]
[New LWP 226522]
[New LWP 226521]
[New LWP 226520]
[New LWP 226519]
[New LWP 226518]
[New LWP 226517]
[New LWP 226516]
[New LWP 226515]
[New LWP 226514]
[New LWP 226513]
[New LWP 226512]
[New LWP 226511]
[New LWP 226510]
[New LWP 226509]
[New LWP 226508]
[New LWP 226507]
[New LWP 226506]
[New LWP 226501]
This GDB supports auto-downloading debuginfo from the following URLs:
<https://debuginfod.fedoraproject.org/>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f4232e88462 in __syscall_cancel_arch () from /lib64/libc.so.6
#0 0x00007f4232e88462 in __syscall_cancel_arch () from /lib64/libc.so.6
#1 0x00007f4232e7c75c in __internal_syscall_cancel () from /lib64/libc.so.6
#2 0x00007f4232e7c7a4 in __syscall_cancel () from /lib64/libc.so.6
#3 0x00007f4232eecc0f in wait4 () from /lib64/libc.so.6
#4 0x00007f423c403963 in ggml_print_backtrace () from /home/green/git/llama.cpp/build/bin/libggml-base.so.0
#5 0x00007f423c403aaf in ggml_abort () from /home/green/git/llama.cpp/build/bin/libggml-base.so.0
#6 0x00007f423951dc23 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /home/green/git/llama.cpp/build/bin/libggml-cuda.so.0
#7 0x00007f42395216de in ggml_backend_cuda_get_tensor_async(ggml_backend*, ggml_tensor const*, void*, unsigned long, unsigned long) () from /home/green/git/llama.cpp/build/bin/libggml-cuda.so.0
#8 0x00007f423c63efb0 in llama_context::encode(llama_batch const&) () from /home/green/git/llama.cpp/build/bin/libllama.so.0
#9 0x00007f423c6403fe in llama_decode () from /home/green/git/llama.cpp/build/bin/libllama.so.0
#10 0x0000000000529912 in server_context_impl::update_slots() ()
#11 0x00000000004d3410 in server_queue::start_loop() ()
#12 0x0000000000465595 in main ()
[Inferior 1 (process 226500) detached]
/tmp/start: line 8: 226500 Aborted (core dumped) CUDA_VISIBLE_DEVICES=0 /home/green/git/llama.cpp/build/bin/llama-server --numa distribute --port 8081 --model /usr/local/ai/models/bge-m3/bge-m3-F16.gguf --ctx-size 8192 -b 8192 -ub 8192 --embeddings --no-mmap
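For what it's worth, a "misaligned address" reported at cudaMemcpyAsync is usually a sticky error left by a kernel launched earlier in the stream rather than by the copy itself, so rerunning under compute-sanitizer or with blocking launches may help localize the offending kernel (a sketch, assuming compute-sanitizer from the CUDA 13 toolkit is on PATH; same server flags as above):
CUDA_VISIBLE_DEVICES=0 CUDA_LAUNCH_BLOCKING=1 /home/green/git/llama.cpp/build/bin/llama-server \
  --numa distribute --port 8081 --model /usr/local/ai/models/bge-m3/bge-m3-F16.gguf \
  --ctx-size 8192 -b 8192 -ub 8192 --embeddings --no-mmap
CUDA_VISIBLE_DEVICES=0 compute-sanitizer --tool memcheck \
  /home/green/git/llama.cpp/build/bin/llama-server \
  --numa distribute --port 8081 --model /usr/local/ai/models/bge-m3/bge-m3-F16.gguf \
  --ctx-size 8192 -b 8192 -ub 8192 --embeddings --no-mmap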