vulkan: Optimize some mat-vec mul quant shaders #10296

Open · jeffbolznv wants to merge 1 commit into master
Conversation

jeffbolznv (Collaborator)

Compute two result elements per workgroup (for Q{4,5}_{0,1}). This reuses the B loads across the rows and also reuses some addressing calculations. This required manually partially unrolling the loop, since the compiler is less willing to unroll outer loops.
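
To make this concrete, here is a minimal plain-C++ sketch of the two-rows-per-workgroup idea (illustrative only, not the GLSL shader, and using plain floats instead of quantized blocks): each element of B is loaded once and reused for both output rows, and the per-row addressing is computed once outside the k loop.

```cpp
// Minimal CPU sketch of "two result rows per pass" (not the actual shader):
// one B load feeds both accumulators, and row base pointers are set up once.
#include <cstddef>
#include <cstdio>
#include <vector>

void mat_vec_two_rows(const std::vector<float>& A,   // M x K, row-major
                      const std::vector<float>& b,   // K
                      std::vector<float>& y,         // M
                      size_t M, size_t K) {
    // Odd trailing row omitted for brevity; assumes M is even.
    for (size_t r = 0; r + 1 < M; r += 2) {
        const float* row0 = &A[(r + 0) * K];   // addressing computed once per row pair
        const float* row1 = &A[(r + 1) * K];
        float acc0 = 0.0f, acc1 = 0.0f;
        for (size_t k = 0; k < K; ++k) {
            const float bk = b[k];             // single B load, reused for both rows
            acc0 += row0[k] * bk;
            acc1 += row1[k] * bk;
        }
        y[r + 0] = acc0;
        y[r + 1] = acc1;
    }
}

int main() {
    const size_t M = 4, K = 8;
    std::vector<float> A(M * K, 1.0f), b(K, 2.0f), y(M, 0.0f);
    mat_vec_two_rows(A, b, y, M, K);
    printf("%f %f\n", y[0], y[3]);   // 16 16
}
```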

Add bounds-checking on the last iteration of the loop. I think this was at least partly broken before. I'd also like to be able to disable robustness for some of these pipelines in the future, to get a bit more perf.
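
A small C++ sketch of the tail-guard pattern (again illustrative, not the shader code): the unrolled full iterations run without per-element checks, and only the final partial iteration is bounds-checked against K.

```cpp
// Sketch: unrolled main loop with a guarded tail, so out-of-range elements
// are only checked for on the last (partial) iteration.
#include <cstddef>
#include <cstdio>

float dot_with_tail_guard(const float* a, const float* b, size_t K) {
    const size_t BLOCK  = 4;               // stand-in for the shader's unroll factor
    const size_t K_full = K - K % BLOCK;   // last full-block boundary
    float acc = 0.0f;
    size_t k = 0;
    for (; k < K_full; k += BLOCK) {       // unrolled body, no per-element checks
        acc += a[k + 0] * b[k + 0];
        acc += a[k + 1] * b[k + 1];
        acc += a[k + 2] * b[k + 2];
        acc += a[k + 3] * b[k + 3];
    }
    for (; k < K; ++k) {                   // guarded tail: never reads past K
        acc += a[k] * b[k];
    }
    return acc;
}

int main() {
    const float a[6] = {1, 2, 3, 4, 5, 6};
    const float b[6] = {1, 1, 1, 1, 1, 1};
    printf("%f\n", dot_with_tail_guard(a, b, 6));   // 21
}
```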

Optimize the Q4_K shader to vectorize most loads and reduce the number of bit-twiddling instructions. It should be possible to do something similar for the other Qi_K shaders; I may get to that eventually, but I'm happy for somebody else to take it on.
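
As a rough illustration of the bit-twiddling reduction (the actual Q4_K super-block layout with its packed scales and mins is more involved): loading four packed bytes as a single 32-bit word stands in for the vectorized load, and two mask operations split out all the low and high nibbles instead of eight separate shift+mask pairs.

```cpp
// Sketch only: extract eight 4-bit values from one 32-bit load with two masks,
// rather than one shift+mask per nibble. Not the real Q4_K dequantization.
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    const uint8_t packed[4] = {0x21, 0x43, 0x65, 0x87};  // nibbles 1..8, two per byte

    uint32_t w;
    std::memcpy(&w, packed, sizeof w);        // one 32-bit load instead of four byte loads

    uint32_t lo = w & 0x0F0F0F0Fu;            // all four low nibbles at once
    uint32_t hi = (w >> 4) & 0x0F0F0F0Fu;     // all four high nibbles at once

    for (int i = 0; i < 4; ++i) {
        printf("lo[%d]=%u hi[%d]=%u\n",
               i, (lo >> (8 * i)) & 0xFFu,
               i, (hi >> (8 * i)) & 0xFFu);
    }
    // prints lo: 1 3 5 7 and hi: 2 4 6 8 (little-endian byte order assumed)
}
```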

Perf results below. Still slower than CUDA (which is using dp4a), but a nice boost. Definitely worth testing on some other hardware, too.

Before:
  MUL_MAT(type_a=f32,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   10224 runs -   492.38 us/run - 117.44 MFLOP/run - 238.52 GFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   20448 runs -   248.94 us/run - 117.44 MFLOP/run - 471.76 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  54528 runs -    92.30 us/run - 117.44 MFLOP/run -   1.27 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  45156 runs -   110.73 us/run - 117.44 MFLOP/run -   1.06 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  21300 runs -   241.06 us/run - 117.44 MFLOP/run - 487.19 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  16188 runs -   309.08 us/run - 117.44 MFLOP/run - 379.97 GFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  34932 runs -   145.27 us/run - 117.44 MFLOP/run - 808.44 GFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  51120 runs -    97.85 us/run - 117.44 MFLOP/run -   1.20 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  39192 runs -   128.78 us/run - 117.44 MFLOP/run - 911.94 GFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  50268 runs -    99.85 us/run - 117.44 MFLOP/run -   1.18 TFLOPS
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  40896 runs -   124.83 us/run - 117.44 MFLOP/run - 940.77 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  40896 runs -   123.86 us/run - 117.44 MFLOP/run - 948.14 GFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                11928 runs -   440.35 us/run - 117.44 MFLOP/run - 266.70 GFLOPS
  
| model                          |       size |     params | backend    |  ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---: | ------------: | -------------------: |
| baichuan 13B Q4_0              |   7.44 GiB |    13.90 B | Vulkan     | 1000 |         tg128 |         35.03 ± 0.13 |
| starcoder2 7B Q4_0             |   3.88 GiB |     7.40 B | Vulkan     | 1000 |         tg128 |         55.92 ± 0.37 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     | 1000 |         tg128 |         93.81 ± 0.97 |

After:
  MUL_MAT(type_a=f32,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   10224 runs -   493.00 us/run - 117.44 MFLOP/run - 238.21 GFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   20448 runs -   251.57 us/run - 117.44 MFLOP/run - 466.83 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  65604 runs -    77.18 us/run - 117.44 MFLOP/run -   1.52 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  55380 runs -    91.59 us/run - 117.44 MFLOP/run -   1.28 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  36636 runs -   139.45 us/run - 117.44 MFLOP/run - 842.17 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  33228 runs -   154.00 us/run - 117.44 MFLOP/run - 762.60 GFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  35784 runs -   141.88 us/run - 117.44 MFLOP/run - 827.74 GFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  53676 runs -    94.49 us/run - 117.44 MFLOP/run -   1.24 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  39192 runs -   129.18 us/run - 117.44 MFLOP/run - 909.13 GFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  82644 runs -    60.54 us/run - 117.44 MFLOP/run -   1.94 TFLOPS
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  39192 runs -   130.01 us/run - 117.44 MFLOP/run - 903.33 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  40896 runs -   123.02 us/run - 117.44 MFLOP/run - 954.66 GFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                11076 runs -   459.25 us/run - 117.44 MFLOP/run - 255.72 GFLOPS

| model                          |       size |     params | backend    |  ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---: | ------------: | -------------------: |
| baichuan 13B Q4_0              |   7.44 GiB |    13.90 B | Vulkan     | 1000 |         tg128 |         40.93 ± 0.30 |
| starcoder2 7B Q4_0             |   3.88 GiB |     7.40 B | Vulkan     | 1000 |         tg128 |         64.30 ± 0.82 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     | 1000 |         tg128 |        105.04 ± 0.55 |

Split out from #10206, but the code is pretty different.

0cc4m (Collaborator) commented Nov 14, 2024

Thank you, this is quite impressive!

I tested these models:

  • Llama 3 8B Q4_K_S
  • Llama 3 8B q4_0
  • Llama 2 13B q4_0
  • Mistral Nemo q5_0

Nvidia RTX 3090:

Before:

| model                 |     size |  params | backend | ngl |  test |          t/s |
| --------------------- | -------: | ------: | ------- | --: | ----: | -----------: |
| llama 8B Q4_K - Small | 4.36 GiB |  8.03 B | Vulkan  |  99 | tg128 | 66.12 ± 0.99 |
| llama 8B Q4_0         | 5.61 GiB |  8.03 B | Vulkan  |  99 | tg128 | 57.64 ± 0.04 |
| llama 13B Q4_0        | 6.86 GiB | 13.02 B | Vulkan  |  99 | tg128 | 38.73 ± 0.12 |
| llama 13B Q5_0        | 7.93 GiB | 12.25 B | Vulkan  |  99 | tg128 | 20.79 ± 0.04 |

After:

| model                 |     size |  params | backend | ngl |  test |          t/s |
| --------------------- | -------: | ------: | ------- | --: | ----: | -----------: |
| llama 8B Q4_K - Small | 4.36 GiB |  8.03 B | Vulkan  |  99 | tg128 | 77.54 ± 0.13 |
| llama 8B Q4_0         | 5.61 GiB |  8.03 B | Vulkan  |  99 | tg128 | 73.46 ± 0.09 |
| llama 13B Q4_0        | 6.86 GiB | 13.02 B | Vulkan  |  99 | tg128 | 50.04 ± 0.05 |
| llama 13B Q5_0        | 7.93 GiB | 12.25 B | Vulkan  |  99 | tg128 | 38.89 ± 0.11 |

Nvidia Tesla P40:

Before:

| model                 |     size |  params | backend | ngl |  test |          t/s |
| --------------------- | -------: | ------: | ------- | --: | ----: | -----------: |
| llama 8B Q4_K - Small | 4.36 GiB |  8.03 B | Vulkan  |  99 | tg128 | 21.52 ± 0.03 |
| llama 8B Q4_0         | 5.61 GiB |  8.03 B | Vulkan  |  99 | tg128 | 22.80 ± 0.02 |
| llama 13B Q4_0        | 6.86 GiB | 13.02 B | Vulkan  |  99 | tg128 | 13.85 ± 0.00 |
| llama 13B Q5_0        | 7.93 GiB | 12.25 B | Vulkan  |  99 | tg128 |  8.49 ± 0.01 |

After:

| model                 |     size |  params | backend | ngl |  test |          t/s |
| --------------------- | -------: | ------: | ------- | --: | ----: | -----------: |
| llama 8B Q4_K - Small | 4.36 GiB |  8.03 B | Vulkan  |  99 | tg128 | 38.30 ± 0.12 |
| llama 8B Q4_0         | 5.61 GiB |  8.03 B | Vulkan  |  99 | tg128 | 22.19 ± 0.01 |
| llama 13B Q4_0        | 6.86 GiB | 13.02 B | Vulkan  |  99 | tg128 | 13.05 ± 0.00 |
| llama 13B Q5_0        | 7.93 GiB | 12.25 B | Vulkan  |  99 | tg128 |  8.92 ± 0.59 |

AMD Radeon Pro VII:

Before:

| model                 |     size |  params | backend | ngl |  test |          t/s |
| --------------------- | -------: | ------: | ------- | --: | ----: | -----------: |
| llama 8B Q4_K - Small | 4.36 GiB |  8.03 B | Vulkan  |  99 | tg128 | 25.14 ± 0.52 |
| llama 8B Q4_0         | 5.61 GiB |  8.03 B | Vulkan  |  99 | tg128 | 28.82 ± 0.54 |
| llama 13B Q4_0        | 6.86 GiB | 13.02 B | Vulkan  |  99 | tg128 | 16.30 ± 1.93 |
| llama 13B Q5_0        | 7.93 GiB | 12.25 B | Vulkan  |  99 | tg128 | 14.74 ± 0.06 |

After:

| model                 |     size |  params | backend | ngl |  test |          t/s |
| --------------------- | -------: | ------: | ------- | --: | ----: | -----------: |
| llama 8B Q4_K - Small | 4.36 GiB |  8.03 B | Vulkan  |  99 | tg128 | 29.90 ± 0.03 |
| llama 8B Q4_0         | 5.61 GiB |  8.03 B | Vulkan  |  99 | tg128 | 38.49 ± 3.24 |
| llama 13B Q4_0        | 6.86 GiB | 13.02 B | Vulkan  |  99 | tg128 | 26.55 ± 0.35 |
| llama 13B Q5_0        | 7.93 GiB | 12.25 B | Vulkan  |  99 | tg128 | 18.09 ± 0.21 |

AMD Radeon RX 6800 XT:

Before:

| model                 |     size |  params | backend | ngl | threads |  test |          t/s |
| --------------------- | -------: | ------: | ------- | --: | ------: | ----: | -----------: |
| llama 8B Q4_K - Small | 4.36 GiB |  8.03 B | Vulkan  |  99 |       8 | tg128 | 83.11 ± 0.12 |
| llama 8B Q4_0         | 5.61 GiB |  8.03 B | Vulkan  |  99 |       8 | tg128 | 59.96 ± 0.35 |
| llama 13B Q4_0        | 6.86 GiB | 13.02 B | Vulkan  |  99 |       8 | tg128 | 39.01 ± 0.06 |
| llama 13B Q5_0        | 7.93 GiB | 12.25 B | Vulkan  |  99 |       8 | tg128 | 30.71 ± 0.01 |

After:

| model                 |     size |  params | backend | ngl | threads |  test |          t/s |
| --------------------- | -------: | ------: | ------- | --: | ------: | ----: | -----------: |
| llama 8B Q4_K - Small | 4.36 GiB |  8.03 B | Vulkan  |  99 |       8 | tg128 | 84.17 ± 0.02 |
| llama 8B Q4_0         | 5.61 GiB |  8.03 B | Vulkan  |  99 |       8 | tg128 | 70.52 ± 0.38 |
| llama 13B Q4_0        | 6.86 GiB | 13.02 B | Vulkan  |  99 |       8 | tg128 | 45.76 ± 0.17 |
| llama 13B Q5_0        | 7.93 GiB | 12.25 B | Vulkan  |  99 |       8 | tg128 | 37.49 ± 0.08 |

netrunnereve (Collaborator)

This is 50% faster on Q4_0 with an RX 570, very nice!

Master

| model                 |     size | params | backend | ngl | threads |  test |          t/s |
| --------------------- | -------: | -----: | ------- | --: | ------: | ----: | -----------: |
| llama 8B Q4_0         | 4.33 GiB | 8.03 B | Vulkan  | 100 |       8 | pp512 | 95.83 ± 0.19 |
| llama 8B Q4_0         | 4.33 GiB | 8.03 B | Vulkan  | 100 |       8 | tg128 |  7.74 ± 0.06 |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | Vulkan  | 100 |       8 | pp512 | 76.86 ± 0.09 |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | Vulkan  | 100 |       8 | tg128 |  8.09 ± 0.02 |

PR

| model                 |     size | params | backend | ngl | threads |  test |          t/s |
| --------------------- | -------: | -----: | ------- | --: | ------: | ----: | -----------: |
| llama 8B Q4_0         | 4.33 GiB | 8.03 B | Vulkan  | 100 |       8 | pp512 | 96.81 ± 0.09 |
| llama 8B Q4_0         | 4.33 GiB | 8.03 B | Vulkan  | 100 |       8 | tg128 | 11.61 ± 0.05 |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | Vulkan  | 100 |       8 | pp512 | 76.83 ± 0.43 |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | Vulkan  | 100 |       8 | tg128 |  9.18 ± 0.02 |
