Why is PP performance better than TG in the llama-bench tool? #13499
Hi all, I'm using the llama-bench tool to evaluate performance. However, I noticed that PP (prompt processing) performance is consistently better than TG (text generation), as shown in the example data in the table below, taken from the llama-bench README:

$ ./llama-bench -n 0 -n 16 -p 64 -t 1,2,4,8,16,32
(pp64/tg16 results table from the llama-bench README omitted)
I am not an expert, but in simple words:
In broad terms, prompt processing is faster than text generation because the bottleneck is usually memory bandwidth rather than compute. During prompt processing, many tokens are batched together, so a single pass over the model weights does work for the whole batch. During text generation, each pass over the weights produces only one token.
So more tokens are handled per read of the model, and tokens per second goes up. If you batch text generation as well (generating for multiple requests at once), the same effect appears there to some extent; the sketch below makes the arithmetic concrete.
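To make the bandwidth argument concrete, here is a minimal sketch (plain C++, not llama.cpp code) of the memory-bandwidth ceiling for both cases. The model size and bandwidth figures are assumptions chosen purely for illustration, not measurements:

```cpp
// Back-of-envelope, bandwidth-bound ceiling for PP vs TG throughput.
// All numbers are illustrative assumptions, not llama-bench output.
#include <cstdio>

int main() {
    const double model_bytes = 4.0e9;  // assumption: ~4 GB of weights (e.g. 7B at ~4-bit)
    const double mem_bw     = 50.0e9;  // assumption: ~50 GB/s sustained memory bandwidth

    // TG: every generated token needs one full pass over the weights,
    // so the ceiling is bandwidth divided by model size.
    const double tg_tps = mem_bw / model_bytes;
    printf("TG ceiling: %.1f t/s (one token per weight pass)\n", tg_tps);

    // PP: a batch of n prompt tokens shares a single pass over the
    // weights, so the same ceiling scales with n until the hardware
    // becomes compute-bound instead of bandwidth-bound.
    const int batch_sizes[] = {16, 64, 512};
    for (int n : batch_sizes) {
        printf("PP ceiling at batch %3d: %.0f t/s\n", n, n * tg_tps);
    }
    return 0;
}
```

In practice PP throughput stops scaling once the batch is large enough that compute, not bandwidth, becomes the limit, which is why measured pp numbers flatten out at large batch sizes. If you want to measure batched generation directly, llama.cpp also ships a batched-bench example (llama-batched-bench) that runs generation with multiple parallel sequences.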