What is the partial sum in `block_q8_1_mmq`, is it for reducing the quantization error during MMA? #13507

TheTinyTeddy · 2025-05-13T11:56:50Z


The `block_q8_1_mmq` struct is:

struct block_q8_1_mmq {
    // The y float data is converted to a data layout that can simply be copied to shared memory as a contiguous block.
    // The y float data is first grouped as blocks of 128 values.
    // These blocks are then treated as individual data values and transposed.
    //
    // To avoid shared memory bank conflicts each block is padded with 16 bytes.
    // This padding is also used to store block scales/partial sums.
    // The scales multiplied with the quantized data are equal to the unquantized values.
    // The partial sums are obtained by summing up a subgroup of the contained values (prior to quantization)
    //     and are only needed for performance reasons.
    //
    // The exact data stored depends on the x data type.
    union {
        float d4[4];    // 1 32 bit scale per 32 values, stored as d0,d1,d2,d3
        half2 ds4[4];   // 1 16 bit scale + 1 16 bit partial sum per 32 values, stored as d0,s0,d1,s1,d2,s2,d3,s3
        half  d2s6[8];  // 1 16 bit scale per 64 values + 1 16 bit partial sum per 16 values for the first 96 values,
                        //     stored as d0,d1,s1,s2,s3,s4,s5
    };
    int8_t qs[4*QK8_1]; // 128 values quantized to 8 bit each
};
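
As a rough illustration of the struct comments, here is a minimal sketch of how a scale and a partial sum could be produced for one group of 32 values, with the sum taken over the values prior to quantization. The `Q8Block32` type and `quantize_block32` function are hypothetical names, and the absmax/127 scaling is an assumption. This is not the actual llama.cpp kernel.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical stand-in for one 32-value group (not the real block_q8_1_mmq layout).
struct Q8Block32 {
    float  d;       // scale: d * qs[i] approximates the original x[i]
    float  s;       // partial sum of the original (unquantized) values
    int8_t qs[32];  // values quantized to 8 bits each
};

// Quantize 32 floats with symmetric absmax scaling (assumed scheme, for illustration).
inline Q8Block32 quantize_block32(const float * x) {
    assert(x != nullptr);
    Q8Block32 out{};
    float amax = 0.0f;  // largest absolute value in the group
    float sum  = 0.0f;  // sum of the values before quantization
    for (int i = 0; i < 32; ++i) {
        amax = std::fmax(amax, std::fabs(x[i]));
        sum += x[i];
    }
    out.d = amax / 127.0f;
    out.s = sum;
    // Guard against an all-zero group, where the scale would be zero.
    const float inv_d = out.d > 0.0f ? 1.0f / out.d : 0.0f;
    for (int i = 0; i < 32; ++i) {
        out.qs[i] = static_cast<int8_t>(std::lround(x[i] * inv_d));
    }
    return out;
}
```

The partial sum costs one extra addition per value at quantization time and can then be reused for every dot product that involves the group.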

I was wondering why we need this partial sum. What does "and are only needed for performance reasons" mean? Is it a bias term to reduce the quantization error during MMA?

Answered by jeffbolznv

jeffbolznv · 2025-05-13T13:34:59Z

Collaborator

The quantization used for A is decoded as `Ad*a - Am`, where `Ad` and `Am` are the scale/bias for the block and `a` is the element. The q8_1 quantization is decoded as `Bd*b`. So the matrix multiply dots a row of A and a column of B, computing `sum{(Ad*a-Am)*b*Bd}`. If you expand this out, you can rewrite it as `Ad*Bd*sum{a*b} - Am*Bd*sum{b}`. The partial sum is this `sum{b}` term, precomputed to make the matrix multiply faster.
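
The expansion above can be checked numerically with a small sketch. The function names `dot_dequant` and `dot_factored` are hypothetical, and the element types (`uint8_t` for A, `int8_t` for B) are assumptions for illustration. Note that, according to the struct comments, the stored partial sum is taken over the values before quantization, so it would correspond to `Bd*sum{b}` only up to rounding error. The sketch follows the formula in the answer and sums the quantized `b` directly.

```cpp
#include <cassert>
#include <cstdint>

// Reference path: dequantize every element, then multiply and accumulate in float.
// Computes sum{(Ad*a - Am) * (Bd*b)}.
inline float dot_dequant(const uint8_t * a, float Ad, float Am,
                         const int8_t * b, float Bd, int n) {
    assert(n >= 0);
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        acc += (Ad * float(a[i]) - Am) * (Bd * float(b[i]));
    }
    return acc;
}

// The sum{b} term; it depends only on B, so it can be computed once per block.
inline int32_t sum_quants(const int8_t * b, int n) {
    int32_t s = 0;
    for (int i = 0; i < n; ++i) s += b[i];
    return s;
}

// Factored path: integer dot product, then Ad*Bd*sum{a*b} - Am*Bd*sum{b}.
inline float dot_factored(const uint8_t * a, float Ad, float Am,
                          const int8_t * b, float Bd, int32_t sum_b, int n) {
    assert(n >= 0);
    int32_t sum_ab = 0;  // integer-only inner loop
    for (int i = 0; i < n; ++i) {
        sum_ab += int32_t(a[i]) * int32_t(b[i]);
    }
    return Ad * Bd * float(sum_ab) - Am * Bd * float(sum_b);
}
```

Both functions return the same value up to floating-point rounding, but the inner loop of the factored path contains only integer multiply-adds, and the float scales are applied once per block.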

8 replies

TheTinyTeddy May 14, 2025
Author

I use CUDA for inference. From inspecting the nsys profile, I can only see the dot product and the quantization being performed on the GPU. Using printf, I found that the dequantization is performed on the CPU.

I'm not very familiar with the CPU backend. Maybe it dequantizes the whole block and just does a simple dot product. The CUDA (and some Vulkan) shaders use the alternate formulation I described because the sum{a*b} can use int8 dot product or tensor core instructions and run at a faster rate.

At which step do you think the dequantization is performed? Is sum{a*b} computed as an INT8 matmul, with the dequantization then amounting to Ad*Bd*sum{a*b} - Am*Bd*sum{b}?

JohannesGaessler May 14, 2025
Collaborator

Did you forget to set --n-gpu-layers? In any case, there is definitely CUDA code for dequantization to FP16/FP32 in convert.cu. But the matrix multiplication kernel in mul_mat_q never explicitly writes dequantized data to VRAM; it directly loads the quantized data, maybe converts it to 8 bit integers, and then uses integer arithmetic to calculate the result.
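
The flow described above (load quantized data, convert it to 8 bit integers, use integer arithmetic) might look roughly like the following sketch for 4-bit data. All names here are hypothetical, and the nibble packing order is an assumption for illustration rather than the actual ggml layout. The real mul_mat_q kernels are considerably more involved.

```cpp
#include <cassert>
#include <cstdint>

// Unpack two 4-bit values per byte into 8-bit integers, low nibble first.
// NOTE: this packing order is assumed for illustration; it is not the ggml layout.
inline void unpack_nibbles(const uint8_t * packed, int8_t * out, int n_bytes) {
    assert(n_bytes >= 0);
    for (int i = 0; i < n_bytes; ++i) {
        out[2*i + 0] = static_cast<int8_t>(packed[i] & 0x0F);
        out[2*i + 1] = static_cast<int8_t>(packed[i] >> 4);
    }
}

// Integer-only dot product; products are accumulated in 32 bits.
inline int32_t dot_i8(const int8_t * a, const int8_t * b, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; ++i) {
        acc += int32_t(a[i]) * int32_t(b[i]);
    }
    return acc;
}

// One block: unpack to int8, integer dot product, then apply the float scales once.
// No dequantized copy of the block is ever written out.
inline float block_dot_q4(const uint8_t * a_packed, float Ad,
                          const int8_t * b, float Bd, int n_bytes) {
    int8_t a[64];  // scratch buffer; limits this sketch to 32 packed bytes
    assert(n_bytes >= 0 && 2 * n_bytes <= 64);
    unpack_nibbles(a_packed, a, n_bytes);
    return Ad * Bd * float(dot_i8(a, b, 2 * n_bytes));
}
```

In this sketch the only floating-point work per block is the final multiplication by `Ad * Bd`, which is what makes the dequantization implicit.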

TheTinyTeddy May 14, 2025
Author

I have set --n-gpu-layers=81 (greater than the number of layers in the model, which has 32). I have also seen that it uses the GPU (for example, quantization and MMA show up in the nsys profile and in the printf output). However, during inference it did not use dequantize_block_q4_K from convert.cu, but rather dequantize_row_q4_K from ggml-quants.c. I used printf to verify this, which is why I'm a bit puzzled.

JohannesGaessler May 14, 2025
Collaborator

Unless you are using V100s or more recent AMD GPUs, the default compilation options will never call an explicit dequantization CUDA kernel. The dequantization -> FP16 GEMM path is only used with GGML_CUDA_FORCE_CUBLAS.

TheTinyTeddy May 14, 2025
Author

So, as I understand it, the dequantization is performed implicitly after each INT8 GEMM, before accumulation, within the MMA kernel. Maybe some layers (the embedding, for example) use the weight dequantization -> FP16 GEMM path.
