docs/datasets.md
## Pre Encoded Datasets
To use pre encoded latents created with the [pre encoding script](pre_encoding.md), set the `dataset_type` property to `"pre_encoded"`, and provide the path to the directory containing the pre encoded `.npy` latent files and corresponding `.json` metadata files.
You can optionally specify a `latent_crop_length` in latent units (latent length = `audio_samples // 2048`) to crop the pre encoded latents to a smaller length than you encoded to. If not specified, uses the full pre encoded length. When `random_crop` is set to true, it will randomly crop from the sequence at your desired `latent_crop_length` while taking padding into account.
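For example, 30 seconds of 44.1 kHz audio is 1,323,000 samples, so the corresponding latent length is `1323000 // 2048 = 645`; setting `latent_crop_length` to 645 therefore crops pre encoded sequences to roughly 30 seconds.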
**Note**: `random_crop` does not currently update `seconds_start`, so it will be inaccurate when used to train or fine-tune models with that condition (e.g. `stable-audio-open-1.0`), but can be used with models that do not use `seconds_start` (e.g. `stable-audio-open-small`).
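For example, a pre encoded dataset config using these cropping options might look like this (the paths are illustrative, and 645 corresponds to roughly 30 seconds of 44.1 kHz audio, as above):

```json
{
    "dataset_type": "pre_encoded",
    "datasets": [
        {
            "id": "my_audio",
            "path": "/path/to/pre/encoded/dir",
            "latent_crop_length": 645
        }
    ],
    "random_crop": true
}
```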
For information on creating pre encoded datasets, see [Pre Encoding](pre_encoding.md).
# Custom metadata
To customize the metadata provided to the conditioners during model training, you can provide a separate custom metadata module to the dataset config. This metadata module should be a Python file containing a function called `get_custom_metadata` that takes two parameters, `info` and `audio`, and returns a dictionary.
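A minimal metadata module might look like the following sketch. The `"prompt"` key and the `relpath` field are illustrative; the fields available in `info` and the keys your conditioners expect depend on your dataset and model configs.

```python
def get_custom_metadata(info, audio):
    # Return a dictionary of extra metadata for the conditioners.
    # Here we use the file's relative path as the text prompt (illustrative).
    return {"prompt": info["relpath"]}
```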
docs/pre_encoding.md
When training models on encoded latents from a frozen pre-trained autoencoder, …
To pre-encode audio to latents, you'll need a dataset config file, an autoencoder model config file, and an **unwrapped** autoencoder checkpoint file.
**Note:** You can find a copy of the unwrapped VAE checkpoint (`vae_model.ckpt`) and config (`vae_config.json`) in the `stabilityai/stable-audio-open-1.0` Hugging Face [repo](https://huggingface.co/stabilityai/stable-audio-open-1.0). This is the same VAE used in `stable-audio-open-small`.
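If you have access to the repo, one way to fetch both files is with the `huggingface-cli` tool from the `huggingface_hub` package (the `--local-dir` target is up to you):

```bash
huggingface-cli login
huggingface-cli download stabilityai/stable-audio-open-1.0 vae_model.ckpt vae_config.json --local-dir ./checkpoints
```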
## Run the Pre Encoding Script
To pre-encode latents from an autoencoder model, you can use `pre_encode.py`. This script will load a pre-trained autoencoder, encode the latents/tokens, and save them to disk in a format that can be easily loaded during training.
The `pre_encode.py` script accepts the following command line arguments:

…
  - If true, shuffles the dataset
  - Optional
**Note:** When pre encoding, it's recommended to set `"drop_last": false` in your dataset config to ensure the last batch is processed even if it's not full.
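For example, assuming `drop_last` sits at the top level of the dataset config alongside `dataset_type` (the `audio_dir` source and paths are illustrative):

```json
{
    "dataset_type": "audio_dir",
    "datasets": [
        {
            "id": "my_audio",
            "path": "/path/to/audio/dir"
        }
    ],
    "drop_last": false
}
```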
For example, if you wanted to encode latents with padding up to 30 seconds long in half precision, you could run the following:
```bash
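# The flag names below are illustrative; run `python3 ./pre_encode.py --help`
# for the exact arguments accepted by the script.
# 30 seconds at 44.1 kHz is 1,323,000 samples; --model-half enables half precision.
python3 ./pre_encode.py \
    --dataset-config /path/to/dataset/config.json \
    --model-config /path/to/vae_config.json \
    --ckpt-path /path/to/vae_model.ckpt \
    --model-half \
    --sample-size 1323000 \
    --output-path /path/to/output/dir
```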

Inside the numbered subdirectories, you will find the encoded latents as `.npy` files, along with the corresponding `.json` metadata files.

## Training on Pre Encoded Latents
Once you have saved your latents to disk, you can use them to train a model by providing `train.py` with a dataset config file that points to the pre-encoded latents and sets `"dataset_type"` to `"pre_encoded"`. Under the hood, this will configure a `stable_audio_tools.data.dataset.PreEncodedDataset`. For more information on configuring pre encoded datasets, see the [Pre Encoded Datasets](datasets.md#pre-encoded-datasets) section of the datasets docs.
The dataset config file should look something like this:

```json
{
    "dataset_type": "pre_encoded",
    "datasets": [
        {
            "id": "my_audio",
            "path": "/path/to/output/dir"
        }
    ],
    "random_crop": false
}
```

In your diffusion model config, you'll also need to set `"pre_encoded": true` in the [`training` section](diffusion.md#training-configs) to tell the training wrapper to operate on pre encoded latents instead of audio.
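For example, the relevant part of the model config might look like this (the surrounding keys are illustrative):

```json
"training": {
    "learning_rate": 1e-4,
    "pre_encoded": true
}
```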