We study the scaling laws of large language models, observing that there is an optimal model size for a given computational budget.

Scaling law of large language models

We train a transformer from scratch with a single attention head of head_dim 64, max_seq_len 256, vocab_dim 128256, and a feed-forward ratio of 4, for various values of hidden_dim. The training set is a simple addition dataset with 1000 points. In the plot below we compare validation loss against computational cost in FLOPs, where the FLOP count excludes the embedding and logit layers.
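The non-embedding FLOP accounting above can be sketched as follows. This is an illustrative estimate, not the repo's actual code: the function names and the layer count parameter `n_layers` are assumptions, and it uses the common approximation of 6 FLOPs per parameter per training token (forward plus backward).

```python
# Hypothetical sketch of the FLOP estimate described above.
# Counts only non-embedding, non-logit parameters of a single-head
# transformer, then applies the ~6 * N * D training-FLOP approximation.

def non_embedding_params(hidden_dim: int, n_layers: int,
                         head_dim: int = 64, ff_ratio: int = 4) -> int:
    """Parameter count excluding the embedding and logit layers."""
    # Single-head attention: Q, K, V projections (hidden_dim -> head_dim)
    # plus the output projection (head_dim -> hidden_dim).
    attn = 3 * hidden_dim * head_dim + head_dim * hidden_dim
    # Feed-forward with expansion ratio 4: up- and down-projections.
    ff = 2 * hidden_dim * (ff_ratio * hidden_dim)
    return n_layers * (attn + ff)

def training_flops(hidden_dim: int, n_layers: int, n_tokens: int) -> float:
    """~6 FLOPs per non-embedding parameter per training token."""
    return 6.0 * non_embedding_params(hidden_dim, n_layers) * n_tokens
```

Excluding the embedding and logit layers matters here because vocab_dim (128256) is large relative to hidden_dim, so those layers would otherwise dominate the count without reflecting the compute that scales with model depth and width.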

Figure 1: validation loss vs. training compute (FLOPs) for models with different hidden_dim.

From the plot we see that, for a given computational budget, there is an optimal model size: at a fixed FLOP count, models that are too small or too large both reach a higher validation loss.
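Reading the compute-optimal size off such a plot amounts to, for each FLOP budget, picking the run whose loss curve is lowest at that budget. A minimal sketch of that envelope computation, assuming per-run arrays of cumulative FLOPs and validation losses (the function name and array layout are illustrative, not taken from the repo):

```python
# Illustrative sketch (not the repo's code): for each budget on a
# log-spaced FLOP grid, select the model with the lowest validation
# loss achievable within that budget.
import numpy as np

def optimal_model_per_budget(flops: np.ndarray, losses: np.ndarray):
    """flops, losses: arrays of shape (n_models, n_steps), where
    flops[i] is cumulative training FLOPs and losses[i] the matching
    validation losses for model i. Returns the FLOP grid and, for each
    budget, the index of the best model."""
    grid = np.logspace(np.log10(flops.min()), np.log10(flops.max()), 50)
    best = []
    for budget in grid:
        per_model = []
        for f, l in zip(flops, losses):
            mask = f <= budget  # steps affordable within this budget
            per_model.append(l[mask].min() if mask.any() else np.inf)
        best.append(int(np.argmin(per_model)))
    return grid, np.array(best)
```

As the budget grows, the returned index typically shifts from smaller to larger models, which is the qualitative behavior the plot shows.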

Description of the code

This repo contains a significantly modified version of the code in https://github.com/lucidrains/self-rewarding-lm-pytorch.
