vulkan: Optimize some mat-vec mul quant shaders #10296

Open · jeffbolznv wants to merge 1 commit into master
Conversation

jeffbolznv (Collaborator)

Compute two result elements per workgroup (for Q{4,5}_{0,1}). This reuses the B loads across the rows and also reuses some addressing calculations. This required manually partially unrolling the loop, since the compiler is less willing to unroll outer loops.
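
To make this concrete, here is a minimal plain-C++ sketch of the two-rows-per-workgroup idea (illustrative only, not the GLSL shader, and using plain floats instead of quantized blocks): each element of B is loaded once and reused for both output rows, and the per-row addressing is computed once outside the k loop.

```cpp
// Minimal CPU sketch of "two result rows per pass" (not the actual shader):
// one B load feeds both accumulators, and row base pointers are set up once.
#include <cstddef>
#include <cstdio>
#include <vector>

void mat_vec_two_rows(const std::vector<float>& A,   // M x K, row-major
                      const std::vector<float>& b,   // K
                      std::vector<float>& y,         // M
                      size_t M, size_t K) {
    // Odd trailing row omitted for brevity; assumes M is even.
    for (size_t r = 0; r + 1 < M; r += 2) {
        const float* row0 = &A[(r + 0) * K];   // addressing computed once per row pair
        const float* row1 = &A[(r + 1) * K];
        float acc0 = 0.0f, acc1 = 0.0f;
        for (size_t k = 0; k < K; ++k) {
            const float bk = b[k];             // single B load, reused for both rows
            acc0 += row0[k] * bk;
            acc1 += row1[k] * bk;
        }
        y[r + 0] = acc0;
        y[r + 1] = acc1;
    }
}

int main() {
    const size_t M = 4, K = 8;
    std::vector<float> A(M * K, 1.0f), b(K, 2.0f), y(M, 0.0f);
    mat_vec_two_rows(A, b, y, M, K);
    printf("%f %f\n", y[0], y[3]);   // 16 16
}
```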

Add bounds-checking on the last iteration of the loop. I think this was at least partly broken before. I'd also like to be able to disable robustness for some of these pipelines in the future, to get a bit more perf.
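
A small C++ sketch of the tail-guard pattern (again illustrative, not the shader code): the unrolled full iterations run without per-element checks, and only the final partial iteration is bounds-checked against K.

```cpp
// Sketch: unrolled main loop with a guarded tail, so out-of-range elements
// are only checked for on the last (partial) iteration.
#include <cstddef>
#include <cstdio>

float dot_with_tail_guard(const float* a, const float* b, size_t K) {
    const size_t BLOCK  = 4;               // stand-in for the shader's unroll factor
    const size_t K_full = K - K % BLOCK;   // last full-block boundary
    float acc = 0.0f;
    size_t k = 0;
    for (; k < K_full; k += BLOCK) {       // unrolled body, no per-element checks
        acc += a[k + 0] * b[k + 0];
        acc += a[k + 1] * b[k + 1];
        acc += a[k + 2] * b[k + 2];
        acc += a[k + 3] * b[k + 3];
    }
    for (; k < K; ++k) {                   // guarded tail: never reads past K
        acc += a[k] * b[k];
    }
    return acc;
}

int main() {
    const float a[6] = {1, 2, 3, 4, 5, 6};
    const float b[6] = {1, 1, 1, 1, 1, 1};
    printf("%f\n", dot_with_tail_guard(a, b, 6));   // 21
}
```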

Optimize the Q4_K shader to vectorize most loads and reduce the number of bit-twiddling instructions. It should be possible to do something similar for the other Qi_K shaders; I may get to that eventually, but I'm happy for somebody else to take it on.
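
As a rough illustration of the bit-twiddling reduction (the actual Q4_K super-block layout with its packed scales and mins is more involved): loading four packed bytes as a single 32-bit word stands in for the vectorized load, and two mask operations split out all the low and high nibbles instead of eight separate shift+mask pairs.

```cpp
// Sketch only: extract eight 4-bit values from one 32-bit load with two masks,
// rather than one shift+mask per nibble. Not the real Q4_K dequantization.
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    const uint8_t packed[4] = {0x21, 0x43, 0x65, 0x87};  // nibbles 1..8, two per byte

    uint32_t w;
    std::memcpy(&w, packed, sizeof w);        // one 32-bit load instead of four byte loads

    uint32_t lo = w & 0x0F0F0F0Fu;            // all four low nibbles at once
    uint32_t hi = (w >> 4) & 0x0F0F0F0Fu;     // all four high nibbles at once

    for (int i = 0; i < 4; ++i) {
        printf("lo[%d]=%u hi[%d]=%u\n",
               i, (lo >> (8 * i)) & 0xFFu,
               i, (hi >> (8 * i)) & 0xFFu);
    }
    // prints lo: 1 3 5 7 and hi: 2 4 6 8 (little-endian byte order assumed)
}
```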

Perf results below. Still slower than CUDA (which is using dp4a), but a nice boost. Definitely worth testing on some other hardware, too.

Before:
  MUL_MAT(type_a=f32,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   10224 runs -   492.38 us/run - 117.44 MFLOP/run - 238.52 GFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   20448 runs -   248.94 us/run - 117.44 MFLOP/run - 471.76 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  54528 runs -    92.30 us/run - 117.44 MFLOP/run -   1.27 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  45156 runs -   110.73 us/run - 117.44 MFLOP/run -   1.06 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  21300 runs -   241.06 us/run - 117.44 MFLOP/run - 487.19 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  16188 runs -   309.08 us/run - 117.44 MFLOP/run - 379.97 GFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  34932 runs -   145.27 us/run - 117.44 MFLOP/run - 808.44 GFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  51120 runs -    97.85 us/run - 117.44 MFLOP/run -   1.20 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  39192 runs -   128.78 us/run - 117.44 MFLOP/run - 911.94 GFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  50268 runs -    99.85 us/run - 117.44 MFLOP/run -   1.18 TFLOPS
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  40896 runs -   124.83 us/run - 117.44 MFLOP/run - 940.77 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  40896 runs -   123.86 us/run - 117.44 MFLOP/run - 948.14 GFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                11928 runs -   440.35 us/run - 117.44 MFLOP/run - 266.70 GFLOPS
  
| model                          |       size |     params | backend    |  ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---: | ------------: | -------------------: |
| baichuan 13B Q4_0              |   7.44 GiB |    13.90 B | Vulkan     | 1000 |         tg128 |         35.03 ± 0.13 |
| starcoder2 7B Q4_0             |   3.88 GiB |     7.40 B | Vulkan     | 1000 |         tg128 |         55.92 ± 0.37 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     | 1000 |         tg128 |         93.81 ± 0.97 |

After:
  MUL_MAT(type_a=f32,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   10224 runs -   493.00 us/run - 117.44 MFLOP/run - 238.21 GFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   20448 runs -   251.57 us/run - 117.44 MFLOP/run - 466.83 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  65604 runs -    77.18 us/run - 117.44 MFLOP/run -   1.52 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  55380 runs -    91.59 us/run - 117.44 MFLOP/run -   1.28 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  36636 runs -   139.45 us/run - 117.44 MFLOP/run - 842.17 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  33228 runs -   154.00 us/run - 117.44 MFLOP/run - 762.60 GFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  35784 runs -   141.88 us/run - 117.44 MFLOP/run - 827.74 GFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  53676 runs -    94.49 us/run - 117.44 MFLOP/run -   1.24 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  39192 runs -   129.18 us/run - 117.44 MFLOP/run - 909.13 GFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  82644 runs -    60.54 us/run - 117.44 MFLOP/run -   1.94 TFLOPS
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  39192 runs -   130.01 us/run - 117.44 MFLOP/run - 903.33 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  40896 runs -   123.02 us/run - 117.44 MFLOP/run - 954.66 GFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                11076 runs -   459.25 us/run - 117.44 MFLOP/run - 255.72 GFLOPS

| model                          |       size |     params | backend    |  ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---: | ------------: | -------------------: |
| baichuan 13B Q4_0              |   7.44 GiB |    13.90 B | Vulkan     | 1000 |         tg128 |         40.93 ± 0.30 |
| starcoder2 7B Q4_0             |   3.88 GiB |     7.40 B | Vulkan     | 1000 |         tg128 |         64.30 ± 0.82 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     | 1000 |         tg128 |        105.04 ± 0.55 |

Split out from #10206, but the code is pretty different.

0cc4m (Collaborator) commented Nov 14, 2024

Thank you, this is quite impressive!

I tested these models:

  • Llama 3 8B Q4_K_S
  • Llama 3 8B q4_0
  • Llama 2 13B q4_0
  • Mistral Nemo q5_0

Nvidia RTX 3090:

Before:

| model                 |     size |  params | backend | ngl |  test |          t/s |
| --------------------- | -------: | ------: | ------- | --: | ----: | -----------: |
| llama 8B Q4_K - Small | 4.36 GiB |  8.03 B | Vulkan  |  99 | tg128 | 66.12 ± 0.99 |
| llama 8B Q4_0         | 5.61 GiB |  8.03 B | Vulkan  |  99 | tg128 | 57.64 ± 0.04 |
| llama 13B Q4_0        | 6.86 GiB | 13.02 B | Vulkan  |  99 | tg128 | 38.73 ± 0.12 |
| llama 13B Q5_0        | 7.93 GiB | 12.25 B | Vulkan  |  99 | tg128 | 20.79 ± 0.04 |

After:

| model                 |     size |  params | backend | ngl |  test |          t/s |
| --------------------- | -------: | ------: | ------- | --: | ----: | -----------: |
| llama 8B Q4_K - Small | 4.36 GiB |  8.03 B | Vulkan  |  99 | tg128 | 77.54 ± 0.13 |
| llama 8B Q4_0         | 5.61 GiB |  8.03 B | Vulkan  |  99 | tg128 | 73.46 ± 0.09 |
| llama 13B Q4_0        | 6.86 GiB | 13.02 B | Vulkan  |  99 | tg128 | 50.04 ± 0.05 |
| llama 13B Q5_0        | 7.93 GiB | 12.25 B | Vulkan  |  99 | tg128 | 38.89 ± 0.11 |

Nvidia Tesla P40:

Before:

| model                 |     size |  params | backend | ngl |  test |          t/s |
| --------------------- | -------: | ------: | ------- | --: | ----: | -----------: |
| llama 8B Q4_K - Small | 4.36 GiB |  8.03 B | Vulkan  |  99 | tg128 | 21.52 ± 0.03 |
| llama 8B Q4_0         | 5.61 GiB |  8.03 B | Vulkan  |  99 | tg128 | 22.80 ± 0.02 |
| llama 13B Q4_0        | 6.86 GiB | 13.02 B | Vulkan  |  99 | tg128 | 13.85 ± 0.00 |
| llama 13B Q5_0        | 7.93 GiB | 12.25 B | Vulkan  |  99 | tg128 |  8.49 ± 0.01 |

After:

| model                 |     size |  params | backend | ngl |  test |          t/s |
| --------------------- | -------: | ------: | ------- | --: | ----: | -----------: |
| llama 8B Q4_K - Small | 4.36 GiB |  8.03 B | Vulkan  |  99 | tg128 | 38.30 ± 0.12 |
| llama 8B Q4_0         | 5.61 GiB |  8.03 B | Vulkan  |  99 | tg128 | 22.19 ± 0.01 |
| llama 13B Q4_0        | 6.86 GiB | 13.02 B | Vulkan  |  99 | tg128 | 13.05 ± 0.00 |
| llama 13B Q5_0        | 7.93 GiB | 12.25 B | Vulkan  |  99 | tg128 |  8.92 ± 0.59 |

AMD Radeon Pro VII:

Before:

| model                 |     size |  params | backend | ngl |  test |          t/s |
| --------------------- | -------: | ------: | ------- | --: | ----: | -----------: |
| llama 8B Q4_K - Small | 4.36 GiB |  8.03 B | Vulkan  |  99 | tg128 | 25.14 ± 0.52 |
| llama 8B Q4_0         | 5.61 GiB |  8.03 B | Vulkan  |  99 | tg128 | 28.82 ± 0.54 |
| llama 13B Q4_0        | 6.86 GiB | 13.02 B | Vulkan  |  99 | tg128 | 16.30 ± 1.93 |
| llama 13B Q5_0        | 7.93 GiB | 12.25 B | Vulkan  |  99 | tg128 | 14.74 ± 0.06 |

After:

| model                 |     size |  params | backend | ngl |  test |          t/s |
| --------------------- | -------: | ------: | ------- | --: | ----: | -----------: |
| llama 8B Q4_K - Small | 4.36 GiB |  8.03 B | Vulkan  |  99 | tg128 | 29.90 ± 0.03 |
| llama 8B Q4_0         | 5.61 GiB |  8.03 B | Vulkan  |  99 | tg128 | 38.49 ± 3.24 |
| llama 13B Q4_0        | 6.86 GiB | 13.02 B | Vulkan  |  99 | tg128 | 26.55 ± 0.35 |
| llama 13B Q5_0        | 7.93 GiB | 12.25 B | Vulkan  |  99 | tg128 | 18.09 ± 0.21 |

AMD Radeon RX 6800 XT:

Before:

| model                 |     size |  params | backend | ngl | threads |  test |          t/s |
| --------------------- | -------: | ------: | ------- | --: | ------: | ----: | -----------: |
| llama 8B Q4_K - Small | 4.36 GiB |  8.03 B | Vulkan  |  99 |       8 | tg128 | 83.11 ± 0.12 |
| llama 8B Q4_0         | 5.61 GiB |  8.03 B | Vulkan  |  99 |       8 | tg128 | 59.96 ± 0.35 |
| llama 13B Q4_0        | 6.86 GiB | 13.02 B | Vulkan  |  99 |       8 | tg128 | 39.01 ± 0.06 |
| llama 13B Q5_0        | 7.93 GiB | 12.25 B | Vulkan  |  99 |       8 | tg128 | 30.71 ± 0.01 |

After:

| model                 |     size |  params | backend | ngl | threads |  test |          t/s |
| --------------------- | -------: | ------: | ------- | --: | ------: | ----: | -----------: |
| llama 8B Q4_K - Small | 4.36 GiB |  8.03 B | Vulkan  |  99 |       8 | tg128 | 84.17 ± 0.02 |
| llama 8B Q4_0         | 5.61 GiB |  8.03 B | Vulkan  |  99 |       8 | tg128 | 70.52 ± 0.38 |
| llama 13B Q4_0        | 6.86 GiB | 13.02 B | Vulkan  |  99 |       8 | tg128 | 45.76 ± 0.17 |
| llama 13B Q5_0        | 7.93 GiB | 12.25 B | Vulkan  |  99 |       8 | tg128 | 37.49 ± 0.08 |

netrunnereve (Collaborator)

This is 50% faster on Q4_0 with an RX 570, very nice!

Master

| model                 |     size | params | backend | ngl | threads |  test |          t/s |
| --------------------- | -------: | -----: | ------- | --: | ------: | ----: | -----------: |
| llama 8B Q4_0         | 4.33 GiB | 8.03 B | Vulkan  | 100 |       8 | pp512 | 95.83 ± 0.19 |
| llama 8B Q4_0         | 4.33 GiB | 8.03 B | Vulkan  | 100 |       8 | tg128 |  7.74 ± 0.06 |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | Vulkan  | 100 |       8 | pp512 | 76.86 ± 0.09 |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | Vulkan  | 100 |       8 | tg128 |  8.09 ± 0.02 |

PR

| model                 |     size | params | backend | ngl | threads |  test |          t/s |
| --------------------- | -------: | -----: | ------- | --: | ------: | ----: | -----------: |
| llama 8B Q4_0         | 4.33 GiB | 8.03 B | Vulkan  | 100 |       8 | pp512 | 96.81 ± 0.09 |
| llama 8B Q4_0         | 4.33 GiB | 8.03 B | Vulkan  | 100 |       8 | tg128 | 11.61 ± 0.05 |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | Vulkan  | 100 |       8 | pp512 | 76.83 ± 0.43 |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | Vulkan  | 100 |       8 | tg128 |  9.18 ± 0.02 |
