Labels
CUDA (Related to the CUDA backend), Nvidia GPU (Issues specific to Nvidia GPUs), bug (Something isn't working)
Description
Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 2: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
Device 3: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
version: 7285 (6016d0bd4)
built with GNU 15.2.1 for Linux x86_64
This is the current build at the time of this writing. Built with:
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1 -DCMAKE_CUDA_COMPILER=/usr/local/cuda-13/bin/nvcc
The issue also goes back at least a couple of days, and probably further.
Operating systems
Linux
GGML backends
CUDA
Hardware
I restrict the model to a single RTX 4090 (CUDA_VISIBLE_DEVICES=0).
Models
BGE-M3 embeddings model
The model was downloaded from https://huggingface.co/lm-kit/bge-m3-gguf
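For completeness, one way to fetch the F16 file used in the command below (a sketch; the exact filename in that repository may differ):
curl -L -o bge-m3-F16.gguf \
  https://huggingface.co/lm-kit/bge-m3-gguf/resolve/main/bge-m3-F16.gguf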
Problem description & steps to reproduce
When using the bge-m3 model I get a CUDA misaligned-address crash.
Request to reproduce (if I make the input 4096 characters long instead of 4093 it does not crash, but I am not sure the output is correct with all the nulls):
curl -s http://localhost:8081/v1/embeddings \
-X POST \
-H "Content-Type: application/json" \
-d '{"model": "bge-m3", "input": "'$(perl -e 'print("a"x4093)')'"}' | jq .data;
First Bad Commit
I don't know exactly when the regression was introduced, but e.g. ik_llama still works, so it must be fairly recent.
Relevant log output
main: starting the main loop...
srv update_slots: all slots are idle
slot get_availabl: id 3 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 3 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 3 | task 0 | processing task
slot update_slots: id 3 | task 0 | new prompt, n_ctx_slot = 8192, n_keep = 0, task.n_tokens = 823
slot update_slots: id 3 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 823, batch.n_tokens = 823, progress = 1.000000
slot update_slots: id 3 | task 0 | prompt done, n_tokens = 823, batch.n_tokens = 823
/home/green/git/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:92: CUDA error
CUDA error: misaligned address
current device: 0, in function ggml_backend_cuda_get_tensor_async at /home/green/git/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2778
cudaMemcpyAsync(data, (const char *)tensor->data + offset, size, cudaMemcpyDeviceToHost, cuda_ctx->stream())
[New LWP 226676]
[New LWP 226675]
[New LWP 226674]
[New LWP 226673]
[New LWP 226672]
[New LWP 226671]
[New LWP 226670]
[New LWP 226669]
[New LWP 226668]
[New LWP 226667]
[New LWP 226666]
[New LWP 226665]
[New LWP 226664]
[New LWP 226663]
[New LWP 226662]
[New LWP 226661]
[New LWP 226660]
[New LWP 226659]
[New LWP 226658]
[New LWP 226657]
[New LWP 226656]
[New LWP 226655]
[New LWP 226654]
[New LWP 226653]
[New LWP 226652]
[New LWP 226651]
[New LWP 226650]
[New LWP 226649]
[New LWP 226648]
[New LWP 226647]
[New LWP 226646]
[New LWP 226645]
[New LWP 226644]
[New LWP 226643]
[New LWP 226642]
[New LWP 226641]
[New LWP 226640]
[New LWP 226639]
[New LWP 226638]
[New LWP 226637]
[New LWP 226636]
[New LWP 226635]
[New LWP 226634]
[New LWP 226633]
[New LWP 226632]
[New LWP 226631]
[New LWP 226630]
[New LWP 226629]
[New LWP 226628]
[New LWP 226627]
[New LWP 226626]
[New LWP 226625]
[New LWP 226624]
[New LWP 226623]
[New LWP 226622]
[New LWP 226621]
[New LWP 226620]
[New LWP 226619]
[New LWP 226618]
[New LWP 226617]
[New LWP 226616]
[New LWP 226615]
[New LWP 226614]
[New LWP 226613]
[New LWP 226612]
[New LWP 226611]
[New LWP 226610]
[New LWP 226609]
[New LWP 226608]
[New LWP 226607]
[New LWP 226606]
[New LWP 226605]
[New LWP 226604]
[New LWP 226603]
[New LWP 226602]
[New LWP 226601]
[New LWP 226600]
[New LWP 226599]
[New LWP 226598]
[New LWP 226597]
[New LWP 226596]
[New LWP 226595]
[New LWP 226594]
[New LWP 226593]
[New LWP 226592]
[New LWP 226591]
[New LWP 226590]
[New LWP 226589]
[New LWP 226588]
[New LWP 226587]
[New LWP 226586]
[New LWP 226585]
[New LWP 226584]
[New LWP 226583]
[New LWP 226582]
[New LWP 226581]
[New LWP 226580]
[New LWP 226579]
[New LWP 226578]
[New LWP 226577]
[New LWP 226576]
[New LWP 226575]
[New LWP 226574]
[New LWP 226573]
[New LWP 226572]
[New LWP 226571]
[New LWP 226570]
[New LWP 226569]
[New LWP 226568]
[New LWP 226567]
[New LWP 226566]
[New LWP 226565]
[New LWP 226564]
[New LWP 226563]
[New LWP 226562]
[New LWP 226561]
[New LWP 226560]
[New LWP 226559]
[New LWP 226558]
[New LWP 226557]
[New LWP 226556]
[New LWP 226555]
[New LWP 226554]
[New LWP 226553]
[New LWP 226552]
[New LWP 226551]
[New LWP 226550]
[New LWP 226549]
[New LWP 226548]
[New LWP 226547]
[New LWP 226546]
[New LWP 226545]
[New LWP 226544]
[New LWP 226543]
[New LWP 226542]
[New LWP 226541]
[New LWP 226540]
[New LWP 226539]
[New LWP 226538]
[New LWP 226537]
[New LWP 226536]
[New LWP 226535]
[New LWP 226534]
[New LWP 226533]
[New LWP 226532]
[New LWP 226531]
[New LWP 226530]
[New LWP 226529]
[New LWP 226528]
[New LWP 226527]
[New LWP 226526]
[New LWP 226525]
[New LWP 226524]
[New LWP 226523]
[New LWP 226522]
[New LWP 226521]
[New LWP 226520]
[New LWP 226519]
[New LWP 226518]
[New LWP 226517]
[New LWP 226516]
[New LWP 226515]
[New LWP 226514]
[New LWP 226513]
[New LWP 226512]
[New LWP 226511]
[New LWP 226510]
[New LWP 226509]
[New LWP 226508]
[New LWP 226507]
[New LWP 226506]
[New LWP 226501]
This GDB supports auto-downloading debuginfo from the following URLs:
<https://debuginfod.fedoraproject.org/>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f4232e88462 in __syscall_cancel_arch () from /lib64/libc.so.6
#0 0x00007f4232e88462 in __syscall_cancel_arch () from /lib64/libc.so.6
#1 0x00007f4232e7c75c in __internal_syscall_cancel () from /lib64/libc.so.6
#2 0x00007f4232e7c7a4 in __syscall_cancel () from /lib64/libc.so.6
#3 0x00007f4232eecc0f in wait4 () from /lib64/libc.so.6
#4 0x00007f423c403963 in ggml_print_backtrace () from /home/green/git/llama.cpp/build/bin/libggml-base.so.0
#5 0x00007f423c403aaf in ggml_abort () from /home/green/git/llama.cpp/build/bin/libggml-base.so.0
#6 0x00007f423951dc23 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /home/green/git/llama.cpp/build/bin/libggml-cuda.so.0
#7 0x00007f42395216de in ggml_backend_cuda_get_tensor_async(ggml_backend*, ggml_tensor const*, void*, unsigned long, unsigned long) () from /home/green/git/llama.cpp/build/bin/libggml-cuda.so.0
#8 0x00007f423c63efb0 in llama_context::encode(llama_batch const&) () from /home/green/git/llama.cpp/build/bin/libllama.so.0
#9 0x00007f423c6403fe in llama_decode () from /home/green/git/llama.cpp/build/bin/libllama.so.0
#10 0x0000000000529912 in server_context_impl::update_slots() ()
#11 0x00000000004d3410 in server_queue::start_loop() ()
#12 0x0000000000465595 in main ()
[Inferior 1 (process 226500) detached]
/tmp/start: line 8: 226500 Aborted (core dumped) CUDA_VISIBLE_DEVICES=0 /home/green/git/llama.cpp/build/bin/llama-server --numa distribute --port 8081 --model /usr/local/ai/models/bge-m3/bge-m3-F16.gguf --ctx-size 8192 -b 8192 -ub 8192 --embeddings --no-mmap
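For what it's worth, a "misaligned address" reported at cudaMemcpyAsync is usually a sticky error left by a kernel launched earlier in the stream rather than by the copy itself, so rerunning under compute-sanitizer or with blocking launches may help localize the offending kernel (a sketch, assuming compute-sanitizer from the CUDA 13 toolkit is on PATH; same server flags as above):
CUDA_VISIBLE_DEVICES=0 CUDA_LAUNCH_BLOCKING=1 /home/green/git/llama.cpp/build/bin/llama-server \
  --numa distribute --port 8081 --model /usr/local/ai/models/bge-m3/bge-m3-F16.gguf \
  --ctx-size 8192 -b 8192 -ub 8192 --embeddings --no-mmap
CUDA_VISIBLE_DEVICES=0 compute-sanitizer --tool memcheck \
  /home/green/git/llama.cpp/build/bin/llama-server \
  --numa distribute --port 8081 --model /usr/local/ai/models/bge-m3/bge-m3-F16.gguf \
  --ctx-size 8192 -b 8192 -ub 8192 --embeddings --no-mmap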