Is decoding one token or two tokens at a time in llama.cpp? #13198
afsara-ben asked this question in Q&A (unanswered)
As the decode stage computes the output one token at a time, shouldn't the attention output layer compute have shapes (1, 1, 4096) @ (1, 4096, 4096)? But from observing the layer-wise shapes, the `src1` tensor obtained from the last layer has shape `[4096, 2, 1, 1]` instead of `[4096, 1, 1, 1]`.
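For context, here is a minimal sketch of how I understand ggml's shape convention (using `ggml.h` from the llama.cpp repo; the concrete sizes are illustrative, not taken from a real run). ggml stores a tensor's shape in `ne[0..3]` with `ne[0]` as the innermost (row) dimension, and `ggml_mul_mat(ctx, a, b)` produces a result with `ne = [a->ne[1], b->ne[1], ...]`:

```cpp
#include "ggml.h"
#include <cstdio>

int main() {
    // no_alloc: we only want shape metadata, not tensor data
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,
    };
    struct ggml_context * ctx = ggml_init(params);

    // attention output weight wo: ne = [4096, 4096]
    struct ggml_tensor * wo  = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4096, 4096);

    // input activations for a batch of 2 tokens: ne = [4096, 2]
    struct ggml_tensor * cur = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4096, 2);

    // shapes are fixed at graph-build time, so the result's ne is
    // known without computing anything
    struct ggml_tensor * out = ggml_mul_mat(ctx, wo, cur);

    printf("out shape: [%lld, %lld, %lld, %lld]\n",
           (long long) out->ne[0], (long long) out->ne[1],
           (long long) out->ne[2], (long long) out->ne[3]);
    // prints: out shape: [4096, 2, 1, 1]

    ggml_free(ctx);
    return 0;
}
```

So if I read this correctly, `[4096, 2, 1, 1]` does not contradict a 4096x4096 projection: the first entry is `n_embd` and the second is the number of tokens in the batch.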
What is the other token? Why does `kqv_out` have 2 tokens instead of 1 during the decode stage? Is it batched inference?
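If that convention holds, the extra column should just mean that the `llama_decode` call at that point carried 2 tokens, since `ne[1]` tracks the batch's `n_tokens`: a prompt-evaluation batch (or any multi-token batch) already produces shapes like `[4096, 2, 1, 1]`, while plain one-token-at-a-time generation should give `[4096, 1, 1, 1]`. A hedged sketch of the decode loop (API names taken from `llama.h` and they change between versions; the model path and token ids are hypothetical):

```cpp
#include "llama.h"

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_model_load_from_file("model.gguf", mparams); // hypothetical path
    llama_context_params cparams = llama_context_default_params();
    llama_context * ctx = llama_init_from_model(model, cparams);

    // prompt evaluation: all prompt tokens go through one llama_decode
    // call, so per-layer activations get ne[1] == n_tokens (here 2)
    llama_token prompt[2] = { 1, 2 };                  // hypothetical token ids
    llama_decode(ctx, llama_batch_get_one(prompt, 2));

    // generation: each step feeds a single sampled token, so the same
    // tensors become [4096, 1, 1, 1]
    llama_token next = 3;                              // hypothetical sampled token
    llama_decode(ctx, llama_batch_get_one(&next, 1));

    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```

So my question is whether the observed 2 comes from the prompt phase being captured in the trace, or whether something in the decode path really batches two tokens per step.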