
Why is PP performance better than TG in the llama-bench tool? #13499

Answered by 0cc4m (Collaborator) · May 14, 2025
LifengWang asked this question in Q&A

In broad terms, prompt processing is faster than text generation because the limiting factor is usually memory bandwidth: for prompt processing you can batch a lot of work into a single read of the entire model, while for text generation you process only a single token per read of the model.
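To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch. The model size and bandwidth figures below are made-up examples, not measurements from the benchmark: a single generation stream is capped by how many times per second the weights can be streamed from memory, while a prompt batch amortizes each pass over the weights across all of its tokens.

```python
# Back-of-the-envelope estimate of why pp is much faster than tg.
# All numbers are hypothetical examples, not measurements.

model_bytes = 7e9 * 2   # 7B parameters at 2 bytes each (fp16): ~14 GB of weights
bandwidth = 800e9       # 800 GB/s of usable memory bandwidth (example GPU)

# Text generation (batch 1): each token requires one full read of the weights.
tg_tokens_per_s = bandwidth / model_bytes
print(f"tg (batch 1): ~{tg_tokens_per_s:.0f} tokens/s")  # ~57 tokens/s

# Prompt processing: a whole batch of tokens shares one read of the weights,
# so the bandwidth-imposed ceiling scales with the batch size. In practice
# pp becomes compute-bound well before reaching this ceiling.
batch = 512
pp_ceiling_tokens_per_s = tg_tokens_per_s * batch
print(f"pp (batch {batch}) ceiling: ~{pp_ceiling_tokens_per_s:.0f} tokens/s")
```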

So more tokens are handled per pass over the weights, and tokens per second goes up. If you batch text generation (generating for multiple requests at once), the same effect appears there too, to some extent; see the sketch below.
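The same arithmetic, still using the hypothetical numbers from the sketch above, shows why batching generation raises aggregate throughput:

```python
# Continuing the hypothetical numbers above: batched text generation.
# n parallel sequences each advance one token per pass over the weights,
# so aggregate throughput scales with the number of streams (until
# compute or KV-cache traffic becomes the new bottleneck).

model_bytes = 7e9 * 2
bandwidth = 800e9

for n_parallel in (1, 4, 16):
    total_tokens_per_s = bandwidth / model_bytes * n_parallel
    print(f"{n_parallel:2d} streams: ~{total_tokens_per_s:5.0f} tokens/s aggregate")
```

Per-stream speed stays roughly flat while total tokens per second grows with the number of streams, which is why servers batch concurrent requests. In llama.cpp this batched-generation scenario is exercised by the separate llama-batched-bench tool rather than by llama-bench itself.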


Answer selected by LifengWang