<div align="center">

# LF_LLM-269M

</div>
<div align="center"><img src="./assets/puzzle_fox.jpg" width="300"/></div>

MyLLM is a personal deep-learning project where I built a modern LLM from the ground up. I focused on developing the core components required for pre-training an LLM, including writing the model-architecture code, handling large datasets, training the model efficiently, and evaluating its performance.
# How To Reproduce
You can debug on a Mac (or most Unix/Linux machines) by using `./run_pre_training_e2e_debug.sh`.

To actually train the model I used NVIDIA GPUs (I went with 8xA100s because of cost). To run training end-to-end (downloading all the datasets needed, training, running evals, etc.) you can simply run `./run_pre_training_e2e.sh`. I used [VESSL AI's](https://vessl.ai) Workspaces to set up my training infra, using their `PyTorch 2.3.1 (CUDA 12.1)` image.

Note that this project uses the `./temp_data/` dir as a quick-access place to store temporary data, such as logs, datasets, and checkpoints. To avoid syncing it between a development machine and your accelerated machine, you can use e.g. `rsync -avz --delete --progress --exclude 'temp_data/*' $PWD username@server_ip_address:/home/ubuntu/`.

# Building LF_LLM-269M

### Choosing Model Architecture and Training Parameters
Due to my limited GPU resources (I don't want to spend resources searching for the best parameters), and because this is a learning project, I'll base my parameters on those used by open-source LLMs. It's not a perfect approach by any means, and choosing parameters can be an entire project of its own, but for now this is fine.
Below is a summary table I created to help me tune my parameters (more info in [parameters_tuning.ipynb](./notebooks/parameters_tuning.ipynb)).

![Model parameters summary table](./assets/model_params_table.png)
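
As a sanity check on the parameter budget (the model name suggests roughly 269M parameters), the total can be estimated directly from the architecture choices. Below is a minimal sketch of that arithmetic for a GPT-style decoder-only model with tied input/output embeddings and learned positional embeddings; the shapes are illustrative placeholders, not the exact LF_LLM-269M configuration (see the notebook above for that).

```python
# Rough parameter count for a GPT-style decoder-only transformer with tied
# input/output embeddings. The shapes below are illustrative placeholders,
# not the exact LF_LLM-269M configuration.
def estimate_params(vocab_size, d_model, n_layers, d_ff, context_len):
    token_emb = vocab_size * d_model      # token embeddings (tied with the LM head)
    pos_emb = context_len * d_model       # learned positional embeddings
    attn = 4 * d_model * d_model          # Q, K, V and output projections
    mlp = 2 * d_model * d_ff              # up- and down-projections
    block = attn + mlp + 2 * d_model      # plus two LayerNorm gains per block
    return token_emb + pos_emb + n_layers * block

# Example with GPT-2-like shapes: roughly a quarter-billion parameters.
print(f"{estimate_params(50304, 1024, 16, 4 * 1024, 1024):,}")  # 253,919,232
```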
### Pre-Training Data
For pre-training data I looked at [Dolma](https://allenai.org/dolma) and [RedPajama-v2](https://www.together.ai/blog/redpajama-data-v2), but [build-nanogpt](https://github.com/karpathy/build-nanogpt) showed me that a smaller, more refined dataset is enough for a small project like this.
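
To make that concrete, here is a minimal sketch of streaming and tokenizing such a dataset with the Hugging Face `datasets` library. The FineWeb-Edu `sample-10BT` subset (the corpus build-nanogpt trains on) and the GPT-2 tokenizer are assumptions; the README does not state which dataset this project actually uses.

```python
# Minimal sketch: stream a small, refined web corpus and tokenize it.
# FineWeb-Edu "sample-10BT" and the GPT-2 tokenizer are assumptions here,
# not necessarily what this repo uses.
from datasets import load_dataset  # pip install datasets
import tiktoken                    # pip install tiktoken

ds = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                  split="train", streaming=True)
enc = tiktoken.get_encoding("gpt2")

for doc in ds.take(3):
    tokens = enc.encode_ordinary(doc["text"])
    tokens.append(enc.eot_token)   # separate documents with <|endoftext|>
    print(f"{len(tokens)} tokens")
```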
# Results

Metric graphs ([metric_graphs.ipynb](./notebooks/metric_graphs.ipynb)):

![val loss](./assets/val_loss.png)

![hellaswag eval](./assets/hellaswag_eval.png)
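
The HellaSwag plot above tracks multiple-choice accuracy: for each example the model scores four candidate endings and the highest-scoring one is picked. Below is a minimal sketch of that scoring, assuming the model returns next-token logits of shape `(batch, time, vocab)`; the evaluation code in this repo may differ.

```python
# Minimal sketch of HellaSwag-style scoring: pick the candidate ending with the
# highest average per-token log-likelihood under the model. Assumes model(ids)
# returns next-token logits of shape (1, T, vocab).
import torch
import torch.nn.functional as F

@torch.no_grad()
def pick_ending(model, ctx_tokens, candidate_endings):
    scores = []
    for ending in candidate_endings:
        ids = torch.tensor(ctx_tokens + ending).unsqueeze(0)        # (1, T)
        logits = model(ids)                                         # (1, T, vocab)
        logp = F.log_softmax(logits[0, :-1], dim=-1)                # predicts tokens 1..T-1
        targets = ids[0, 1:]
        token_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        ending_logp = token_logp[len(ctx_tokens) - 1:]              # ending tokens only
        scores.append(ending_logp.mean().item())
    return max(range(len(scores)), key=scores.__getitem__)          # index of best ending
```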
**At step 0 (no training):**
(Green is the prompt, blue is LLM-generated text.)
- Sample 1: <span style="color:green;">If animals could talk, my pet would probably say </span><span style="color:blue;">undertake undertake distortion intest Gylassotide acids Yankee neoconcept Coming Coming launcherimpl Sussex Sussexinea minim Ding</span>
- Sample 2: <span style="color:green;">HTML stands for </span><span style="color:blue;">campaigns desserts sawradio AUTH sort Pythononto unforeseen rainfall rainfall Host awaits solubleheaded Fever estimate genders proponentMAR</span>
- Sample 3: <span style="color:green;">The clever fox built the strange machine with just a feather, a pebble, and a tiny twig </span><span style="color:blue;">intrusion complying Resist master Yad induced derogatory Magic damageced amusing 290Sn},{" muddy universal prospect prospect prospect Rey</span>

**After the last training step:**
(Green is the prompt, blue is LLM-generated text.)
- Sample 1: <span style="color:green;">If animals could talk, my pet would probably say </span><span style="color:blue;">hello or I would say Hi.
I am excited to have my pet respond to the sound I</span>
- Sample 2: <span style="color:green;">HTML stands for </span><span style="color:blue;">HyperText Markup Language. For more information about the browser, see:<|endoftext|>A few months ago</span>
- Sample 3: <span style="color:green;">The clever fox built the strange machine with just a feather, a pebble, and a tiny twig </span><span style="color:blue;">; by the time it was ready, it was a great working machine. After watching him carefully,</span>

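
For reference, completions like the samples above can be produced with plain autoregressive decoding. The sketch below uses temperature-scaled top-k sampling; the decoding settings actually used for these samples are not stated here, so treat it as an illustration only.

```python
# Minimal sketch of generating a completion from a tokenized prompt.
# Top-k sampling with temperature is an assumption; the decoding settings
# used for the samples above are not stated in this README.
import torch

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens=20, top_k=50, temperature=1.0):
    tokens = prompt_ids                                           # (1, T) prompt token ids
    for _ in range(max_new_tokens):
        logits = model(tokens)[:, -1, :] / temperature            # next-token logits
        topk_vals, topk_idx = torch.topk(logits, top_k, dim=-1)   # keep the k most likely
        probs = torch.softmax(topk_vals, dim=-1)
        next_id = topk_idx.gather(-1, torch.multinomial(probs, 1))
        tokens = torch.cat([tokens, next_id], dim=1)              # append and continue
    return tokens
```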
<details>
<summary><strong>Resources/References</strong></summary>

- [Molmo (MolmoE)](https://huggingface.co/allenai/MolmoE-1B-0924)
- [apple/corenet](https://github.com/apple/corenet/tree/main)
- [allenai/OLMo](https://github.com/allenai/OLMo)
- [mosaicml/llm-foundry](https://github.com/mosaicml/llm-foundry)
- [google-research/tuning_playbook](https://github.com/google-research/tuning_playbook)
- [karpathy/build-nanogpt](https://github.com/karpathy/build-nanogpt/tree/master)

</details>

## License
GNU GPLv3 ([LICENSE.txt](./LICENSE.txt))