Is decoding one token or two tokens at a time in llama.cpp? #13198
afsara-ben asked this question in Q&A (unanswered)
As the decode stage computes the output one token at a time, shouldn't the attention output layer compute have shapes (1, 1, 4096) @ (1, 4096, 4096)? But from observing the layer-wise shapes, the `src1` tensor obtained from the last layer has shape `[4096, 2, 1, 1]` instead of `[4096, 1, 1, 1]`.
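For context, here is a minimal sketch of how I understand ggml's shape convention (using `ggml.h` from the llama.cpp repo; the concrete sizes are illustrative, not taken from a real run). ggml stores a tensor's shape in `ne[0..3]` with `ne[0]` as the innermost (row) dimension, and `ggml_mul_mat(ctx, a, b)` produces a result with `ne = [a->ne[1], b->ne[1], ...]`:

```cpp
#include "ggml.h"
#include <cstdio>

int main() {
    // no_alloc: we only want shape metadata, not tensor data
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,
    };
    struct ggml_context * ctx = ggml_init(params);

    // attention output weight wo: ne = [4096, 4096]
    struct ggml_tensor * wo  = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4096, 4096);

    // input activations for a batch of 2 tokens: ne = [4096, 2]
    struct ggml_tensor * cur = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4096, 2);

    // shapes are fixed at graph-build time, so the result's ne is
    // known without computing anything
    struct ggml_tensor * out = ggml_mul_mat(ctx, wo, cur);

    printf("out shape: [%lld, %lld, %lld, %lld]\n",
           (long long) out->ne[0], (long long) out->ne[1],
           (long long) out->ne[2], (long long) out->ne[3]);
    // prints: out shape: [4096, 2, 1, 1]

    ggml_free(ctx);
    return 0;
}
```

So if I read this correctly, `[4096, 2, 1, 1]` does not contradict a 4096x4096 projection: the first entry is `n_embd` and the second is the number of tokens in the batch.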
What is the other token? Why does `kqv_out` have 2 tokens instead of 1 during the decode stage? Is it batched inference?
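If that convention holds, the extra column should just mean that the `llama_decode` call at that point carried 2 tokens, since `ne[1]` tracks the batch's `n_tokens`: a prompt-evaluation batch (or any multi-token batch) already produces shapes like `[4096, 2, 1, 1]`, while plain one-token-at-a-time generation should give `[4096, 1, 1, 1]`. A hedged sketch of the decode loop (API names taken from `llama.h` and they change between versions; the model path and token ids are hypothetical):

```cpp
#include "llama.h"

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_model_load_from_file("model.gguf", mparams); // hypothetical path
    llama_context_params cparams = llama_context_default_params();
    llama_context * ctx = llama_init_from_model(model, cparams);

    // prompt evaluation: all prompt tokens go through one llama_decode
    // call, so per-layer activations get ne[1] == n_tokens (here 2)
    llama_token prompt[2] = { 1, 2 };                  // hypothetical token ids
    llama_decode(ctx, llama_batch_get_one(prompt, 2));

    // generation: each step feeds a single sampled token, so the same
    // tensors become [4096, 1, 1, 1]
    llama_token next = 3;                              // hypothetical sampled token
    llama_decode(ctx, llama_batch_get_one(&next, 1));

    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```

So my question is whether the observed 2 comes from the prompt phase being captured in the trace, or whether something in the decode path really batches two tokens per step.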