Conversation

@JohannesGaessler (Collaborator)
Fixes #17796 (Eval bug: crash with bge-m3 embedding: CUDA error: misaligned address).

The problem is that when determining whether the kernel can assume aligned and padded allocations, the code on master only considers the tensor shapes, not the tensor strides. With the bandwidth-optimal copy sizes of 4/16 bytes this can result in misaligned pointers.
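To make the failure mode concrete, here is a minimal standalone sketch (the base address, stride, and copy width are illustrative values, not taken from the code): even with an aligned base pointer, rows after the first become misaligned whenever the row stride is not a multiple of the copy width.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // Illustrative values only, not taken from the code: a 16-byte
    // vectorized copy, an aligned base address, and an unpadded row
    // stride that is not a multiple of the copy width.
    const std::uintptr_t base      = 0x1000; // 16-byte aligned
    const std::uintptr_t stride    = 72;     // bytes per row, NOT a multiple of 16
    const std::uintptr_t cpy_bytes = 16;     // bandwidth-optimal copy size

    for (int row = 0; row < 4; ++row) {
        const std::uintptr_t p = base + row*stride;
        std::printf("row %d: 0x%llx %s\n", row, (unsigned long long) p,
                    p % cpy_bytes == 0 ? "aligned" : "MISALIGNED");
    }
    return 0;
}
```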

Performance changes
| GPU | Model | Microbatch size | Test | t/s 6b82eb7 | t/s d4cbe79 | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| RTX 3090 | llama 7B Q4_0 | 1 | pp512@d32768 | 40.05 | 40.02 | 1.00 |
| RTX 3090 | llama 7B Q4_0 | 2 | pp512@d32768 | 76.19 | 75.11 | 0.99 |
| RTX 3090 | llama 7B Q4_0 | 4 | pp512@d32768 | 146.70 | 144.65 | 0.99 |
| RTX 3090 | llama 7B Q4_0 | 8 | pp512@d32768 | 250.73 | 247.66 | 0.99 |
| RTX 3090 | llama 7B Q4_0 | 16 | pp512@d32768 | 504.32 | 503.10 | 1.00 |
| RTX 3090 | llama 7B Q4_0 | 32 | pp512@d32768 | 905.59 | 907.51 | 1.00 |
| RTX 3090 | llama 7B Q4_0 | 64 | pp512@d32768 | 1475.74 | 1473.77 | 1.00 |
| RTX 3090 | llama 7B Q4_0 | 128 | pp512@d32768 | 1567.34 | 1573.62 | 1.00 |
| RTX 3090 | llama 7B Q4_0 | 256 | pp512@d32768 | 1598.65 | 1557.23 | 0.97 |
| RTX 3090 | llama 7B Q4_0 | 512 | pp512@d32768 | 1432.27 | 1389.90 | 0.97 |
| RTX 3090 | llama 8B Q4_0 | 1 | pp512@d32768 | 93.02 | 92.98 | 1.00 |
| RTX 3090 | llama 8B Q4_0 | 2 | pp512@d32768 | 170.50 | 170.45 | 1.00 |
| RTX 3090 | llama 8B Q4_0 | 4 | pp512@d32768 | 304.23 | 304.14 | 1.00 |
| RTX 3090 | llama 8B Q4_0 | 8 | pp512@d32768 | 439.36 | 440.12 | 1.00 |
| RTX 3090 | llama 8B Q4_0 | 16 | pp512@d32768 | 874.90 | 874.54 | 1.00 |
| RTX 3090 | llama 8B Q4_0 | 32 | pp512@d32768 | 1199.90 | 1197.56 | 1.00 |
| RTX 3090 | llama 8B Q4_0 | 64 | pp512@d32768 | 1507.73 | 1504.91 | 1.00 |
| RTX 3090 | llama 8B Q4_0 | 128 | pp512@d32768 | 1646.17 | 1641.61 | 1.00 |
| RTX 3090 | llama 8B Q4_0 | 256 | pp512@d32768 | 1775.90 | 1770.39 | 1.00 |
| RTX 3090 | llama 8B Q4_0 | 512 | pp512@d32768 | 1545.68 | 1550.91 | 1.00 |
| RTX 4090 | llama 7B Q4_0 | 1 | pp512@d32768 | 43.75 | 43.75 | 1.00 |
| RTX 4090 | llama 7B Q4_0 | 2 | pp512@d32768 | 84.11 | 83.30 | 0.99 |
| RTX 4090 | llama 7B Q4_0 | 4 | pp512@d32768 | 167.64 | 166.11 | 0.99 |
| RTX 4090 | llama 7B Q4_0 | 8 | pp512@d32768 | 314.29 | 312.07 | 0.99 |
| RTX 4090 | llama 7B Q4_0 | 16 | pp512@d32768 | 606.69 | 605.35 | 1.00 |
| RTX 4090 | llama 7B Q4_0 | 32 | pp512@d32768 | 1140.19 | 1140.27 | 1.00 |
| RTX 4090 | llama 7B Q4_0 | 64 | pp512@d32768 | 2043.45 | 2043.16 | 1.00 |
| RTX 4090 | llama 7B Q4_0 | 128 | pp512@d32768 | 3146.05 | 3127.82 | 0.99 |
| RTX 4090 | llama 7B Q4_0 | 256 | pp512@d32768 | 3152.41 | 3136.11 | 0.99 |
| RTX 4090 | llama 7B Q4_0 | 512 | pp512@d32768 | 2563.87 | 2555.08 | 1.00 |
| RTX 4090 | llama 8B Q4_0 | 1 | pp512@d32768 | 104.92 | 104.94 | 1.00 |
| RTX 4090 | llama 8B Q4_0 | 2 | pp512@d32768 | 190.14 | 190.76 | 1.00 |
| RTX 4090 | llama 8B Q4_0 | 4 | pp512@d32768 | 367.05 | 367.31 | 1.00 |
| RTX 4090 | llama 8B Q4_0 | 8 | pp512@d32768 | 647.03 | 646.37 | 1.00 |
| RTX 4090 | llama 8B Q4_0 | 16 | pp512@d32768 | 1142.95 | 1142.36 | 1.00 |
| RTX 4090 | llama 8B Q4_0 | 32 | pp512@d32768 | 2067.27 | 2068.11 | 1.00 |
| RTX 4090 | llama 8B Q4_0 | 64 | pp512@d32768 | 3155.18 | 3153.10 | 1.00 |
| RTX 4090 | llama 8B Q4_0 | 128 | pp512@d32768 | 3516.50 | 3512.25 | 1.00 |
| RTX 4090 | llama 8B Q4_0 | 256 | pp512@d32768 | 3260.76 | 3247.34 | 1.00 |
| RTX 4090 | llama 8B Q4_0 | 512 | pp512@d32768 | 2679.73 | 2662.31 | 0.99 |
| RTX 5090 | llama 7B Q4_0 | 1 | pp512@d32768 | 74.51 | 74.51 | 1.00 |
| RTX 5090 | llama 7B Q4_0 | 2 | pp512@d32768 | 131.12 | 129.75 | 0.99 |
| RTX 5090 | llama 7B Q4_0 | 4 | pp512@d32768 | 257.78 | 255.16 | 0.99 |
| RTX 5090 | llama 7B Q4_0 | 8 | pp512@d32768 | 467.28 | 463.29 | 0.99 |
| RTX 5090 | llama 7B Q4_0 | 16 | pp512@d32768 | 837.73 | 881.04 | 1.05 |
| RTX 5090 | llama 7B Q4_0 | 32 | pp512@d32768 | 1627.55 | 1602.04 | 0.98 |
| RTX 5090 | llama 7B Q4_0 | 64 | pp512@d32768 | 2623.73 | 2551.33 | 0.97 |
| RTX 5090 | llama 7B Q4_0 | 128 | pp512@d32768 | 2803.75 | 2771.09 | 0.99 |
| RTX 5090 | llama 7B Q4_0 | 512 | pp512@d32768 | 2684.81 | 2525.63 | 0.94 |
| RTX 5090 | llama 8B Q4_0 | 1 | pp512@d32768 | 165.94 | 165.86 | 1.00 |
| RTX 5090 | llama 8B Q4_0 | 2 | pp512@d32768 | 269.26 | 269.14 | 1.00 |
| RTX 5090 | llama 8B Q4_0 | 4 | pp512@d32768 | 486.58 | 486.46 | 1.00 |
| RTX 5090 | llama 8B Q4_0 | 8 | pp512@d32768 | 810.03 | 809.42 | 1.00 |
| RTX 5090 | llama 8B Q4_0 | 16 | pp512@d32768 | 1405.56 | 1404.92 | 1.00 |
| RTX 5090 | llama 8B Q4_0 | 32 | pp512@d32768 | 2330.73 | 2331.65 | 1.00 |
| RTX 5090 | llama 8B Q4_0 | 64 | pp512@d32768 | 2866.00 | 2861.02 | 1.00 |
| RTX 5090 | llama 8B Q4_0 | 128 | pp512@d32768 | 3502.23 | 3495.91 | 1.00 |
| RTX 5090 | llama 8B Q4_0 | 256 | pp512@d32768 | 3370.96 | 3370.50 | 1.00 |
| RTX 5090 | llama 8B Q4_0 | 512 | pp512@d32768 | 2922.97 | 2929.01 | 1.00 |
| V100-PCIE-32GB | llama 7B Q4_0 | 1 | pp512@d32768 | 37.36 | 37.36 | 1.00 |
| V100-PCIE-32GB | llama 7B Q4_0 | 2 | pp512@d32768 | 68.72 | 68.70 | 1.00 |
| V100-PCIE-32GB | llama 7B Q4_0 | 4 | pp512@d32768 | 126.26 | 126.18 | 1.00 |
| V100-PCIE-32GB | llama 7B Q4_0 | 8 | pp512@d32768 | 222.84 | 223.12 | 1.00 |
| V100-PCIE-32GB | llama 7B Q4_0 | 16 | pp512@d32768 | 332.87 | 333.29 | 1.00 |
| V100-PCIE-32GB | llama 7B Q4_0 | 32 | pp512@d32768 | 524.77 | 594.61 | 1.13 |
| V100-PCIE-32GB | llama 7B Q4_0 | 64 | pp512@d32768 | 437.89 | 460.85 | 1.05 |
| V100-PCIE-32GB | llama 7B Q4_0 | 128 | pp512@d32768 | 634.31 | 679.05 | 1.07 |
| V100-PCIE-32GB | llama 7B Q4_0 | 256 | pp512@d32768 | 857.35 | 842.09 | 0.98 |
| V100-PCIE-32GB | llama 7B Q4_0 | 512 | pp512@d32768 | 949.65 | 921.29 | 0.97 |
| V100-PCIE-32GB | llama 8B Q4_0 | 1 | pp512@d32768 | 81.32 | 81.50 | 1.00 |
| V100-PCIE-32GB | llama 8B Q4_0 | 2 | pp512@d32768 | 153.67 | 154.03 | 1.00 |
| V100-PCIE-32GB | llama 8B Q4_0 | 4 | pp512@d32768 | 220.77 | 220.32 | 1.00 |
| V100-PCIE-32GB | llama 8B Q4_0 | 8 | pp512@d32768 | 311.82 | 311.82 | 1.00 |
| V100-PCIE-32GB | llama 8B Q4_0 | 16 | pp512@d32768 | 471.04 | 470.87 | 1.00 |
| V100-PCIE-32GB | llama 8B Q4_0 | 32 | pp512@d32768 | 671.06 | 671.55 | 1.00 |
| V100-PCIE-32GB | llama 8B Q4_0 | 64 | pp512@d32768 | 465.56 | 465.20 | 1.00 |
| V100-PCIE-32GB | llama 8B Q4_0 | 128 | pp512@d32768 | 697.67 | 696.30 | 1.00 |
| V100-PCIE-32GB | llama 8B Q4_0 | 256 | pp512@d32768 | 857.68 | 856.99 | 1.00 |
| V100-PCIE-32GB | llama 8B Q4_0 | 512 | pp512@d32768 | 935.99 | 932.80 | 1.00 |

For most models the performance should remain unchanged since they use GQA; this PR only affects models without GQA, or cases where insufficient padding is used. I'm not sure what is happening with the V100 performance at medium batch sizes. I haven't really optimized this kernel for that particular GPU yet, so I'm wondering whether the new loop structure results in different register usage.
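To illustrate what "sufficient padding" means here in alignment terms, a minimal sketch of a stride-aware check using ggml's `ne`/`nb` conventions; the helper name and exact conditions are hypothetical and not the actual PR diff:

```cpp
#include <stdint.h>

#include "ggml.h" // ggml_tensor, ggml_row_size, GGML_MAX_DIMS

// Hypothetical helper, not the actual PR diff: can a kernel use
// vectorized copies of cpy_bytes (e.g. 4 or 16) on this tensor?
// Checking the row length (shape) alone is not enough; the strides
// and the base pointer must also be multiples of the copy size,
// otherwise some row pointers end up misaligned.
static bool fa_can_use_vec_copy(const struct ggml_tensor * t, size_t cpy_bytes) {
    // Row length in bytes (the shape-only check, as on master):
    if (ggml_row_size(t->type, t->ne[0]) % cpy_bytes != 0) {
        return false;
    }
    // Strides between rows/heads/sequences (the part that was missing):
    for (int i = 1; i < GGML_MAX_DIMS; ++i) {
        if (t->nb[i] % cpy_bytes != 0) {
            return false;
        }
    }
    // The data pointer itself:
    return (uintptr_t) t->data % cpy_bytes == 0;
}
```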

