What is the partial sum in block_q8_1_mmq, is it for reducing the quantization error during MMA?
#13507
The struct of q8_1_mmq is: struct block_q8_1_mmq { ... I was wondering why we need this ...
Replies: 1 comment 8 replies
The quantization used for A is decoded as `Ad*a - Am`, where `Ad` and `Am` are the scale/bias for the block, and `a` is the element. The q8_1 quantization is decoded as `Bd*b`. So the matrix multiply dots a row of A and a column of B, computing `sum{(Ad*a-Am)*b*Bd}`. If you expand this out, you can rewrite it as `Ad*Bd*sum{a*b} - Am*Bd*sum{b}`. The partial sum is this `sum{b}` term, precomputed to make the matrix multiply faster.
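To make the expansion concrete, here is a minimal C++ sketch that checks the factored form against the naive element-by-element form for a single block. The struct layouts, field names, and helper functions (`BlockA`, `BlockB`, `make_block_b`, `dot_naive`, `dot_factored`) are hypothetical and simplified for illustration; they are not the actual llama.cpp definitions, and the sketch assumes `sum{b}` is stored as a plain sum of the quantized values.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified block for A: each element decodes as Ad*a - Am.
struct BlockA {
    float Ad;                // per-block scale
    float Am;                // per-block bias
    std::vector<int8_t> a;   // quantized elements
};

// Hypothetical, simplified q8_1-style block for B: each element decodes as Bd*b.
struct BlockB {
    float Bd;                // per-block scale
    float sum_b;             // precomputed sum{b} (the "partial sum")
    std::vector<int8_t> b;   // quantized elements
};

// Quantization-time helper: compute sum{b} once and store it with the block.
BlockB make_block_b(float Bd, const std::vector<int8_t>& b) {
    int sum = 0;
    for (int8_t v : b) sum += v;
    return BlockB{Bd, static_cast<float>(sum), b};
}

// Naive form: sum{(Ad*a - Am) * (Bd*b)}, dequantizing every element.
float dot_naive(const BlockA& A, const BlockB& B) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < A.a.size(); ++i)
        acc += (A.Ad * A.a[i] - A.Am) * (B.Bd * B.b[i]);
    return acc;
}

// Factored form: Ad*Bd*sum{a*b} - Am*Bd*sum{b}.
// The inner loop is a pure integer dot product; the bias term needs
// only the precomputed sum_b, so no extra pass over b is required.
float dot_factored(const BlockA& A, const BlockB& B) {
    int32_t sum_ab = 0;
    for (std::size_t i = 0; i < A.a.size(); ++i)
        sum_ab += static_cast<int32_t>(A.a[i]) * static_cast<int32_t>(B.b[i]);
    return A.Ad * B.Bd * sum_ab - A.Am * B.Bd * B.sum_b;
}
```

The point of the factored form is that the per-element work reduces to integer multiply-accumulate, which is what integer dot-product or MMA instructions operate on, while the bias correction costs one multiply per block pair.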