What is the partial sum in block_q8_1_mmq, is it for reducing the quantization error during MMA? #13507

Answered by jeffbolznv
TheTinyTeddy asked this question in Q&A
The quantization used for A is decoded as Ad*a - Am, where Ad and Am are the scale and bias for the block and a is the quantized element. The q8_1 quantization used for B is decoded as Bd*b. The matrix multiply dots a row of A with a column of B, so within one block it computes sum{(Ad*a - Am)*b*Bd}. Expanding this gives Ad*Bd*sum{a*b} - Am*Bd*sum{b}. The partial sum is the sum{b} term. It depends only on B, so it is precomputed to make the matrix multiply faster; its purpose is speed, not reducing quantization error.
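To make the algebra concrete, here is a minimal Python sketch that checks the expansion on one block. It is only an illustration of the identity, not the actual kernel: the block size of 32, the 4-bit range for a, the sample scale values, and the function names `dot_naive` and `dot_with_partial_sum` are all assumptions chosen for the example.

```python
import random

def dot_naive(a, b, Ad, Am, Bd):
    # Dequantize every element first, then multiply and accumulate in float.
    return sum((Ad * ai - Am) * (Bd * bi) for ai, bi in zip(a, b))

def dot_with_partial_sum(a, b, Ad, Am, Bd, b_sum):
    # Integer-only dot product of the quantized values.
    int_dot = sum(ai * bi for ai, bi in zip(a, b))
    # Apply the scales once per block; b_sum is the precomputed sum{b}.
    return Ad * Bd * int_dot - Am * Bd * b_sum

random.seed(0)
BLOCK = 32  # assumed block size, for illustration only
a = [random.randint(0, 15) for _ in range(BLOCK)]      # assumed 4-bit values of A
b = [random.randint(-127, 127) for _ in range(BLOCK)]  # 8-bit values of B
Ad, Am, Bd = 0.5, 1.25, 0.03125  # arbitrary example scale/bias values

b_sum = sum(b)  # computed once for B, reusable for every row of A

naive = dot_naive(a, b, Ad, Am, Bd)
fast = dot_with_partial_sum(a, b, Ad, Am, Bd, b_sum)
print(abs(naive - fast) < 1e-9)
```

The second form keeps the inner loop in integer arithmetic and touches the floating-point scales only once per block, which is why having sum{b} available ahead of time helps.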

Answer selected by TheTinyTeddy
This discussion was converted from issue #13504 on May 13, 2025 13:37.