
Commit 21182c2

Multi-Dataset Validation (LM-Loss/Perplexity) (#178)

Authored by bigximik, Toolkit User, and jlamypoirier
Co-authored-by: Toolkit User <[email protected]>
Co-authored-by: Joel Lamy-Poirier <[email protected]>

1 parent ab17636, commit 21182c2

25 files changed: +371 additions, -213 deletions
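The change repeated across every file below boils down to two renames: the per-config `validation` block moves under a new `evaluations` section as a named entry, and the dataset keys in `data.datasets` switch from capitalized to lowercase. A rough before/after sketch, reconstructed from the diffs (indentation and the `path/to/...` placeholders are assumed, not taken from the repository):

```yaml
# Before: one anonymous validation run per training config,
# and capitalized dataset keys.
training:
  validation:
    iterations: 25
    interval: 1000
data:
  datasets:
    Training:
      type: file
      path: path/to/training_dataset_config.yaml
    Validation:
      type: file
      path: path/to/validation_dataset_config.yaml
---
# After: validation runs are named entries under `evaluations`,
# and dataset keys are lowercase, matching the evaluation names.
training:
  evaluations:
    validation:
      iterations: 25
      interval: 1000
data:
  datasets:
    training:
      type: file
      path: path/to/training_dataset_config.yaml
    validation:
      type: file
      path: path/to/validation_dataset_config.yaml
```

The named-entry form is what allows more than one validation dataset per run, as the `data-configuration.md` changes below explain.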

docs/quick-start.md (12 additions, 10 deletions)

```diff
@@ -492,9 +492,10 @@ Save the following as `fast-llm-tutorial/train-config.yaml`:
   train_iters: 100 # (1)!
   logs:
     interval: 10
-  validation:
-    iterations: 25
-    interval: 100
+  evaluations:
+    validation:
+      iterations: 25
+      interval: 100
   export: # (2)!
     format: llama
     interval: 100
@@ -508,10 +509,10 @@ Save the following as `fast-llm-tutorial/train-config.yaml`:
   batch_size: 480 # (5)!
 data:
   datasets:
-    Training:
+    training:
       type: file
       path: fast-llm-tutorial/dataset/fast_llm_config_training.yaml # (6)!
-    Validation:
+    validation:
       type: file
       path: fast-llm-tutorial/dataset/fast_llm_config_validation.yaml # (6)!
 optimizer:
@@ -549,9 +550,10 @@ Save the following as `fast-llm-tutorial/train-config.yaml`:
   train_iters: 100_000 # (1)!
   logs:
     interval: 10
-  validation:
-    iterations: 25
-    interval: 1000
+  evaluations:
+    validation:
+      iterations: 25
+      interval: 1000
   checkpoint:
     interval: 1000
     keep: 5
@@ -569,10 +571,10 @@ Save the following as `fast-llm-tutorial/train-config.yaml`:
   batch_size: 512 # (5)!
 data:
   datasets:
-    Training:
+    training:
       type: file
       path: fast-llm-tutorial/dataset/fast_llm_config_training.yaml # (6)!
-    Validation:
+    validation:
       type: file
       path: fast-llm-tutorial/dataset/fast_llm_config_validation.yaml # (6)!
 optimizer: # (7)!
```

docs/recipes/continue-training.md (22 additions, 12 deletions)

````diff
@@ -33,9 +33,10 @@ This is not much different from a pretraining config. We will:
   train_iters: 100_000
   logs:
     interval: 10
-  validation:
-    iterations: 25
-    interval: 1000
+  evaluations:
+    validation:
+      iterations: 25
+      interval: 1000
   checkpoint:
     interval: 1000
     keep: 5
@@ -48,9 +49,13 @@ This is not much different from a pretraining config. We will:
   sequence_length: 4096
   batch_size: 256
 data:
-  format: file
-  path: fast-llm-tutorial/dataset.json # (2)!
-  split: [99, 1, 0]
+  datasets:
+    training:
+      type: file
+      path: fast-llm-tutorial/dataset/fast_llm_config_training.yaml # (2)!
+    validation:
+      type: file
+      path: fast-llm-tutorial/dataset/fast_llm_config_validation.yaml # (2)!
 optimizer:
   weight_decay: 0.1
   beta_1: 0.9
@@ -84,8 +89,9 @@ This is not much different from a pretraining config. We will:
   logs:
     interval: 10
   validation:
-    iterations: 25
-    interval: 1000
+    Validation:
+      iterations: 25
+      interval: 1000
   checkpoint:
     interval: 1000
     keep: 5
@@ -98,9 +104,13 @@ This is not much different from a pretraining config. We will:
   sequence_length: 8192
   batch_size: 256
 data:
-  format: file
-  path: fast-llm-tutorial/dataset.json # (2)!
-  split: [99, 1, 0]
+  datasets:
+    training:
+      type: file
+      path: fast-llm-tutorial/dataset/fast_llm_config_training.yaml # (6)!
+    validation:
+      type: file
+      path: fast-llm-tutorial/dataset/fast_llm_config_validation.yaml # (6)!
 optimizer:
   weight_decay: 0.1
   beta_1: 0.9
@@ -129,7 +139,7 @@ This is not much different from a pretraining config. We will:
 ```
 
 1. A the model will be saved in Hugging Face format to `~/results` directory every 20,000 iterations.
-2. Location of the dataset metadata file generated in Step 4.
+2. Location of the dataset metadata file generated in Step 4 of quick start guide.
 3. The learning-rate can be used to trade-off between learning and forgetting. A higher learning-rate will learn quickly on our new dataset but will cause forgetting. A lower learning-rate will instead retain more of the pretrained model's knowledge, but will slow down adapting to the new domain.
 4. Config of the pretrained model. We load the model downloaded from the repository earlier.
 5. This tells Fast-LLM to load the weights of the pretrained model. If we wanted to use the model's configuration, but train from scratch, we could use the same config but set this to `no`.
````

docs/recipes/data-configuration.md (49 additions, 12 deletions)

````diff
@@ -13,10 +13,10 @@ We already saw an example dataset configuration in the [quick-start guide](../qu
 ```yaml
 data:
   datasets:
-    Training:
+    training:
       type: file
       path: fast-llm-tutorial/dataset/fast_llm_config_training.yaml
-    Validation:
+    validation:
       type: file
       path: fast-llm-tutorial/dataset/fast_llm_config_validation.yaml
 ```
@@ -25,14 +25,24 @@ We already saw an example dataset configuration in the [quick-start guide](../qu
 
 In this section we are interested in generalizing step 3. For more details on steps 1 and 2, please refer to the quick-start guide or [this example](data-configuration.md).
 
+The section `data.datasets` holds descriptions of datasets used in training, validation, and testing.
+
+The Training and Testing phases must have predetermined dataset names: `training` and `testing`, respectively. Each of these phases can have only one dataset.
+
+For validation datasets, the rules are different. There can be as many validation datasets as needed, and their names are arbitrary. In the example above, the dataset name `validation` is chosen for simplicity. The datasets names used for validation and their application details are specified in the training config `evaluations` sections.
+
+Adding multiple validation datasets increases flexibility in tracking the accuracy of your trained model. One possible scenario is using a separate validation dataset for each blended training dataset, allowing you to track training progress on each subset separately and observe how the model performs in real time on different subsets of your training data.
+
+Below are examples of how to configure various aspects of training and validation datasets.
+
 ## Example 1: Blending multiple datasets
 
 In this example, we have three datasets and want to sample from each of them during training with probabilities 0.70, 0.25 and 0.05. For this, we use the `blended` type which takes other datasets as arguments:
 
 ```yaml
 data:
   datasets:
-    Training:
+    training:
       type: blended
       datasets:
         - type: file
@@ -54,7 +64,7 @@ In this example, we have a large dataset that comes pre-shuffled, so shuffling i
 ```yaml
 data:
   datasets:
-    Training:
+    training:
       type: file
       path: path/to/dataset.yaml
       sampling:
@@ -68,10 +78,10 @@ In this example, we want to disable shuffling entirely, but only for the validation dataset:
 ```yaml
 data:
   datasets:
-    Training:
+    training:
       type: file
       path: path/to/training_dataset.yaml
-    Validation:
+    validation:
       type: sampled
       dataset:
         type: file
@@ -91,7 +101,7 @@ In this example, we have a blend of datasets as in example 1, but we wish to set
 ```yaml
 data:
   datasets:
-    Training:
+    training:
       type: blended
       datasets:
         - type: sampled
@@ -118,7 +128,34 @@ data:
 !!! note "Default seed"
     In the absence of explicit seed, Fast-LLM uses a default seed (`data.sampling`'s default) instead, and uses seed shifts to ensure different seeds for each phase and for the various blended datasets.
 
-## Example 5: Advanced scenario
+
+## Example 5: Specifying Multiple Validation Datasets
+
+In this example, we show how to specify multiple validation datasets and configure how often they are applied, along with their usage attributes in the `training.evaluations` section.
+
+Please note that the same dataset names must be used in the `training.evaluations` section. If a validation dataset is specified in the `datasets` section but not in `training.evaluations`, it will not be used for validation.
+
+```yaml
+training:
+  evaluations:
+    the_stack:
+      iterations: 25
+      interval: 50
+    fineweb:
+      iterations: 25
+      interval: 100
+data:
+  datasets:
+    the_stack:
+      type: file
+      path: path/to/validation_the_stack_dataset.yaml
+    fineweb:
+      type: file
+      path: path/to/validation_fineweb_dataset.yaml
+
+```
+
+## Example 6: Advanced scenario
 
 In this example, we combine everything we learned so far to create a complex scenario, where:
 
@@ -129,7 +166,7 @@ In this example, we combine everything we learned so far to create a complex sce
 ```yaml
 data:
   datasets:
-    Training:
+    training:
       type: blended
       datasets:
         - type: sampled
@@ -156,7 +193,7 @@ data:
           # Seed = default + train_shift + 2 * blend_shift, shuffle = skip_first_epoch
           path: path/to/dataset_3.yaml
       weights: [0.70, 0.25, 0.05]
-    Validation:
+    validation:
       type: sampled
       dataset:
         type: file
@@ -174,10 +211,10 @@ data:
 ```yaml
 data:
   datasets:
-    Training:
+    training:
      type: file
       path: path/to/training_dataset_config.yaml
-    Validation:
+    validation:
       type: file
       path: path/to/validation_dataset_config.yaml
   sampling:
````

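The new Example 5 above relies on the keys under `training.evaluations` matching the keys under `data.datasets`; a dataset listed only in `data.datasets` is silently never evaluated. A small standalone check can catch that kind of misconfiguration early. This is a hypothetical helper, not part of Fast-LLM; the config is represented as plain nested dicts rather than a parsed YAML file:

```python
def unused_validation_datasets(config):
    """Return names in `data.datasets` that are neither the reserved
    `training`/`testing` datasets nor referenced under `training.evaluations`.

    Per the docs change above, such datasets are never used for validation.
    """
    reserved = {"training", "testing"}
    evaluations = set(config.get("training", {}).get("evaluations", {}))
    datasets = set(config.get("data", {}).get("datasets", {}))
    return datasets - reserved - evaluations


# Mirrors Example 5, but with `fineweb` accidentally left out of `evaluations`.
config = {
    "training": {
        "evaluations": {"the_stack": {"iterations": 25, "interval": 50}},
    },
    "data": {
        "datasets": {
            "training": {"type": "file", "path": "path/to/training.yaml"},
            "the_stack": {"type": "file", "path": "path/to/validation_the_stack_dataset.yaml"},
            "fineweb": {"type": "file", "path": "path/to/validation_fineweb_dataset.yaml"},
        }
    },
}
print(unused_validation_datasets(config))  # {'fineweb'}
```

Running this against a config before launching a long training run is cheap insurance that every validation dataset you defined will actually be scored.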
docs/recipes/instruction-finetuning.md (6 additions, 5 deletions)

```diff
@@ -114,9 +114,10 @@ training:
   train_iters: 5_000
   logs:
     interval: 1
-  validation:
-    iterations: 25
-    interval: 1000
+  evaluations:
+    validation:
+      iterations: 25
+      interval: 1000
   checkpoint:
     interval: 1000
     keep: 5
@@ -131,10 +132,10 @@ batch:
   cross_document_attention: no # (1)!
 data:
   datasets:
-    Training:
+    training:
       type: file
       path: ./sft-tutorial/tokenized/Llama-3.1-8B/fast_llm_config_training.yaml
-    Validation:
+    validation:
       type: file
       path: ./sft-tutorial/tokenized/Llama-3.1-8B/fast_llm_config_validation.yaml
   truncate_documents: no # (2)!
```

docs/recipes/train.md (22 additions, 12 deletions)

```diff
@@ -19,9 +19,10 @@ Let's start from the following training configuration:
   train_iters: 100_000
   logs:
     interval: 10
-  validation:
-    iterations: 25
-    interval: 1000
+  evaluations:
+    validation:
+      iterations: 25
+      interval: 1000
   checkpoint:
     interval: 1000
     keep: 5
@@ -34,9 +35,13 @@ Let's start from the following training configuration:
   sequence_length: 4096
   batch_size: 256
 data:
-  format: file
-  path: fast-llm-tutorial/dataset/fast_llm_dataset.json
-  split: [99, 1, 0]
+  datasets:
+    training:
+      type: file
+      path: path/to/training_dataset_config.yaml
+    validation:
+      type: file
+      path: path/to/validation_dataset_config.yaml
 optimizer:
   weight_decay: 0.1
   beta_1: 0.9
@@ -63,9 +68,10 @@ Let's start from the following training configuration:
   train_iters: 100_000
   logs:
     interval: 10
-  validation:
-    iterations: 25
-    interval: 1000
+  evaluations:
+    validation:
+      iterations: 25
+      interval: 1000
   checkpoint:
     interval: 1000
     keep: 5
@@ -78,9 +84,13 @@ Let's start from the following training configuration:
   sequence_length: 8192
   batch_size: 256
 data:
-  format: file
-  path: fast-llm-tutorial/dataset/fast_llm_dataset.json
-  split: [99, 1, 0]
+  datasets:
+    training:
+      type: file
+      path: path/to/training_dataset_config.yaml
+    validation:
+      type: file
+      path: path/to/validation_dataset_config.yaml
 optimizer:
   weight_decay: 0.1
   beta_1: 0.9
```

examples/mistral.yaml (4 additions, 3 deletions)

```diff
@@ -3,16 +3,17 @@ training:
   num_workers: 8
   logs:
     interval: 10
-  validation:
-    iterations: null
+  evaluations:
+    validation:
+      iterations: null
   test_iters: 0
 batch:
   sequence_length: 4096
   micro_batch_size: 2
   batch_size: 64
 data:
   datasets:
-    Training:
+    training:
       type: random
 optimizer:
   learning_rate:
```
