Why is PP performance better than TG in the llama-bench tool? #13499
Hi all, I'm using the llama-bench tool to evaluate performance. However, I noticed that PP (prompt processing) performance is consistently better than TG (text generation), as shown in the example data in the table below, taken from the llama-bench README:

$ ./llama-bench -n 0 -n 16 -p 64 -t 1,2,4,8,16,32
(pp64/tg16 results table from the llama-bench README omitted)
I am not an expert, but in simple words:
In broad terms, prompt processing is faster than text generation because the bottleneck is usually memory bandwidth rather than compute. During prompt processing, many tokens are batched together, so a single pass over the model weights does work for the whole batch. During text generation, each pass over the weights produces only one token.
So more tokens are handled per read of the model, and tokens per second goes up. If you batch text generation as well (generating for multiple requests at once), the same effect appears there to some extent; the sketch below makes the arithmetic concrete.
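To make the bandwidth argument concrete, here is a minimal sketch (plain C++, not llama.cpp code) of the memory-bandwidth ceiling for both cases. The model size and bandwidth figures are assumptions chosen purely for illustration, not measurements:

```cpp
// Back-of-envelope, bandwidth-bound ceiling for PP vs TG throughput.
// All numbers are illustrative assumptions, not llama-bench output.
#include <cstdio>

int main() {
    const double model_bytes = 4.0e9;  // assumption: ~4 GB of weights (e.g. 7B at ~4-bit)
    const double mem_bw     = 50.0e9;  // assumption: ~50 GB/s sustained memory bandwidth

    // TG: every generated token needs one full pass over the weights,
    // so the ceiling is bandwidth divided by model size.
    const double tg_tps = mem_bw / model_bytes;
    printf("TG ceiling: %.1f t/s (one token per weight pass)\n", tg_tps);

    // PP: a batch of n prompt tokens shares a single pass over the
    // weights, so the same ceiling scales with n until the hardware
    // becomes compute-bound instead of bandwidth-bound.
    const int batch_sizes[] = {16, 64, 512};
    for (int n : batch_sizes) {
        printf("PP ceiling at batch %3d: %.0f t/s\n", n, n * tg_tps);
    }
    return 0;
}
```

In practice PP throughput stops scaling once the batch is large enough that compute, not bandwidth, becomes the limit, which is why measured pp numbers flatten out at large batch sizes. If you want to measure batched generation directly, llama.cpp also ships a batched-bench example (llama-batched-bench) that runs generation with multiple parallel sequences.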