Llama 3 achieves pretty good recall to 65k context w/ rope_theta set to 16M #6890
Replies: 4 comments
-
Does anyone know why rope theta is set to 16M to reach a 65k context length, as in the Twitter source? In my understanding, extending the context from the 8k that Llama 3 8B supports natively to 65k requires a multiplication factor of 8. Llama 3 8B's original rope theta is 0.5M; multiplying that by 8 gives 4M, not 16M. I can't figure out where the difference comes from. Thanks in advance.
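To make the arithmetic in the question concrete, here is a small sketch of the linear-scaling intuition using the values quoted above (whether linear scaling of the base is even the right rule is exactly what's being asked):

```python
# Naive "scale rope_theta linearly with the context extension" intuition.
# Values are the ones quoted in the question; this is a sketch, not a claim
# that linear scaling is the correct rule.

original_ctx = 8_192       # context length Llama 3 8B was trained with
target_ctx = 65_536        # desired context length (65k)
original_theta = 500_000   # Llama 3 8B rope_theta (0.5M)

factor = target_ctx / original_ctx        # 8.0
linear_theta = original_theta * factor    # 4,000,000

print(f"extension factor: {factor}")                        # 8.0
print(f"linearly scaled rope_theta: {linear_theta:,.0f}")   # 4,000,000
# The tweet uses 16,000,000, i.e. another factor of 4 beyond this estimate.
```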
-
The answer may lie in this post: https://blog.eleuther.ai/rotary-embeddings/
-
Thank you for the link. From what I read in the blog, the parameter rope-freq-base should refer to the new base value, correct? It is set to 8M here, but following the blog, shouldn't it be 0.5M (Llama 3's original base) * 4 (factor s, 32k [target ctx length] / 8k [Llama 3's ctx length]) ** (128 / 126) ≈ 2M, rather than 8M? Correct me if I'm wrong, thanks.
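For comparison, here is that calculation written out. It is a sketch using the numbers from the comment above: s is the context extension factor and d = 128 is assumed to be the head dimension of Llama 3 8B.

```python
# NTK-aware base scaling as quoted in the comment above:
#   new_base = old_base * s ** (d / (d - 2))
# where s is the context extension factor and d is the head dimension.
# Values below are the ones used in the comment (a sketch, not authoritative).

old_base = 500_000     # Llama 3 8B rope_theta
d = 128                # assumed head dimension
s = 32_768 / 8_192     # 4x extension: 32k target over the native 8k

new_base = old_base * s ** (d / (d - 2))
print(f"{new_base:,.0f}")   # ~2,044,000 -> roughly 2M, not 8M
```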
-
Your understanding is the same as mine, and so are your questions. I originally posted here to solicit some reactions and try to understand how RoPE works, but so far the experts have been rather quiet.
-
Source: https://twitter.com/winglian/status/1783122644579090600
I'm not sure how this can be applied using llama.cpp, but when I try with
-c 32768 --rope-scaling linear --rope-freq-base 8000000
I get coherent, high-quality results from the model. Am I using the right parameters? I also noticed that VRAM usage doesn't go up all that much: I can easily run the Q8_0 of the 8B version on 24GB (it uses only 16.5GB fully offloaded). In fact, I can even run the FP16 fully offloaded using 23GB with 32K context.
The performance is also quite acceptable (this is on a 4090).
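In case it helps anyone reproduce this from Python, here is a minimal sketch using the llama-cpp-python bindings (not mentioned in the thread; the model path is a placeholder, and the keyword arguments mirror the CLI flags quoted above):

```python
# Minimal sketch assuming the llama-cpp-python bindings are installed.
# The GGUF path is a hypothetical placeholder; adjust to your local file.
from llama_cpp import Llama

llm = Llama(
    model_path="./Meta-Llama-3-8B-Instruct.Q8_0.gguf",  # placeholder path
    n_ctx=32_768,               # mirrors: -c 32768
    rope_freq_base=8_000_000,   # mirrors: --rope-freq-base 8000000
    n_gpu_layers=-1,            # offload all layers, as in the VRAM numbers above
)
# Note: the CLI's --rope-scaling linear flag is not reproduced in this sketch.

out = llm("Summarize the following document:\n...", max_tokens=256)
print(out["choices"][0]["text"])
```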