
Commit c1e9c46

Merge pull request #216 from nateraw/pre-encoded-docs
Update docs on working with pre encoded latents
2 parents: 9e5954d + 2e65a9a

File tree: 3 files changed, +45 −4 lines

- docs/datasets.md
- docs/diffusion.md
- docs/pre_encoding.md


docs/datasets.md

Lines changed: 25 additions & 0 deletions
````diff
@@ -42,6 +42,31 @@ To load audio files and related metadata from .tar files in the WebDataset forma
 }
 ```
 
+## Pre Encoded Datasets
+To use pre encoded latents created with the [pre encoding script](pre_encoding.md), set the `dataset_type` property to `"pre_encoded"`, and provide the path to the directory containing the pre encoded `.npy` latent files and corresponding `.json` metadata files.
+
+You can optionally specify a `latent_crop_length` in latent units (latent length = `audio_samples // 2048`) to crop the pre encoded latents to a shorter length than they were encoded at. If not specified, the full pre encoded length is used. When `random_crop` is set to `true`, a crop of length `latent_crop_length` is taken at a random offset within the sequence, taking padding into account.
+
+**Note**: `random_crop` does not currently update `seconds_start`, so that condition will be inaccurate when random cropping is used to train or fine-tune models that rely on it (e.g. `stable-audio-open-1.0`); it can still be used with models that do not use `seconds_start` (e.g. `stable-audio-open-small`).
+
+### Example config
+```json
+{
+    "dataset_type": "pre_encoded",
+    "datasets": [
+        {
+            "id": "my_pre_encoded_audio",
+            "path": "/path/to/pre_encoded/output/",
+            "latent_crop_length": 512,
+            "custom_metadata_module": "/path/to/custom_metadata.py"
+        }
+    ],
+    "random_crop": true
+}
+```
+
+For information on creating pre encoded datasets, see [Pre Encoding](pre_encoding.md).
+
 # Custom metadata
 To customize the metadata provided to the conditioners during model training, you can provide a separate custom metadata module to the dataset config. This metadata module should be a Python file that must contain a function called `get_custom_metadata` that takes in two parameters, `info`, and `audio`, and returns a dictionary.
````
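The crop arithmetic above is easy to get wrong, so here is a quick sanity check: a minimal sketch, assuming the 44.1 kHz sample rate used by the Stable Audio Open models and the 2048-sample downsampling from the formula above; the helper name is hypothetical:

```python
# Hypothetical helper: convert a crop duration in seconds to latent units,
# using the relation from the docs above (latent length = audio_samples // 2048).
# The 44.1 kHz sample rate is an assumption matching the Stable Audio Open models.
def seconds_to_latent_length(seconds: float, sample_rate: int = 44100, downsample: int = 2048) -> int:
    return int(seconds * sample_rate) // downsample

# The example config's "latent_crop_length": 512 corresponds to roughly
# 512 * 2048 / 44100 ≈ 23.8 seconds of audio.
print(seconds_to_latent_length(23.8))  # -> 512
```

Since the example config also wires in a `custom_metadata_module`, here is a minimal sketch of such a module. The docs only require a `get_custom_metadata(info, audio)` function that returns a dictionary; the `"prompt"` key and the `relpath` field below are illustrative assumptions:

```python
# custom_metadata.py (sketch): the module referenced by "custom_metadata_module"
# in the example config above.
def get_custom_metadata(info, audio):
    # info carries per-sample metadata; audio is the sample's data.
    # Returning a "prompt" derived from the file path is an assumed example.
    return {"prompt": info.get("relpath", "")}
```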

docs/diffusion.md

Lines changed: 4 additions & 0 deletions
````diff
@@ -61,6 +61,10 @@ The `training` config in the diffusion model config file should have the followi
   - Optional, overrides `learning_rate`
 - `demo`
   - Configuration for the demos during training, including conditioning information
+- `pre_encoded`
+  - If true, indicates that the model should operate on [pre encoded latents](pre_encoding.md) instead of raw audio
+  - Required when training with [pre encoded datasets](datasets.md#pre-encoded-datasets)
+  - Optional. Default: `false`
 
 ## Example config
 ```json
````
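Putting the new flag in context with its sibling keys from the list above, a `training` block might look like the following sketch; the `learning_rate` value and the elided `demo` contents are placeholders, not values from this commit:

```json
"training": {
    "learning_rate": 1e-4,
    "pre_encoded": true,
    "demo": {
        ...
    }
}
```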

docs/pre_encoding.md

Lines changed: 16 additions & 4 deletions
````diff
@@ -6,6 +6,8 @@ When training models on encoded latents from a frozen pre-trained autoencoder, t
 
 To pre-encode audio to latents, you'll need a dataset config file, an autoencoder model config file, and an **unwrapped** autoencoder checkpoint file.
 
+**Note:** You can find a copy of the unwrapped VAE checkpoint (`vae_model.ckpt`) and config (`vae_config.json`) in the `stabilityai/stable-audio-open-1.0` Hugging Face [repo](https://huggingface.co/stabilityai/stable-audio-open-1.0). This is the same VAE used in `stable-audio-open-small`.
+
 ## Run the Pre Encoding Script
 
 To pre-encode latents from an autoencoder model, you can use `pre_encode.py`. This script will load a pre-trained autoencoder, encode the latents/tokens, and save them to disk in a format that can be easily loaded during training.
@@ -50,6 +52,8 @@ The `pre_encode.py` script accepts the following command line arguments:
   - If true, shuffles the dataset
   - Optional
 
+**Note:** When pre encoding, it's recommended to set `"drop_last": false` in your dataset config to ensure the last batch is processed even if it's not full.
+
 For example, if you wanted to encode latents with padding up to 30 seconds long in half precision, you could run the following:
 
 ```bash
@@ -81,7 +85,7 @@ Inside the numbered subdirectories, you will find the encoded latents as `.npy`
 
 ## Training on Pre Encoded Latents
 
-Once you have saved your latents to disk, you can use them to train a model by providing a dataset config file to `train.py` that points to the pre-encoded latents, specifying `"dataset_type"` is `"pre_encoded"`. Under the hood, this will configure a `stable_audio_tools.data.dataset.PreEncodedDataset`.
+Once you have saved your latents to disk, you can use them to train a model by providing a dataset config file to `train.py` that points to the pre-encoded latents, setting `"dataset_type"` to `"pre_encoded"`. Under the hood, this will configure a `stable_audio_tools.data.dataset.PreEncodedDataset`. For more information on configuring pre encoded datasets, see the [Pre Encoded Datasets](datasets.md#pre-encoded-datasets) section of the datasets docs.
 
 The dataset config file should look something like this:
 
@@ -91,10 +95,18 @@ The dataset config file should look something like this:
     "datasets": [
         {
             "id": "my_audio",
-            "path": "/path/to/output/dir",
-            "latent_crop_length": 645
+            "path": "/path/to/output/dir"
         }
     ],
     "random_crop": false
 }
-```
+```
+
+In your diffusion model config, you'll also need to specify `pre_encoded: true` in the [`training` section](diffusion.md#training-configs) to tell the training wrapper to operate on pre encoded latents instead of audio.
+
+```json
+"training": {
+    "pre_encoded": true,
+    ...
+}
+```
````
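To spot-check what the pre encoding script wrote to disk before training, a small sketch along these lines can load one latent and its sidecar metadata. It assumes, per the docs above, that each numbered subdirectory holds `.npy` latent files with matching `.json` metadata files; the output path is a placeholder and the shared-stem naming is an assumption:

```python
import json
from pathlib import Path

import numpy as np

# Placeholder path: wherever pre_encode.py wrote its output.
output_dir = Path("/path/to/pre_encoded/output/")

# Grab one encoded latent from the numbered subdirectories.
latent_path = next(output_dir.rglob("*.npy"))
latent = np.load(latent_path)

# Assumption: the metadata .json shares the latent file's stem.
with open(latent_path.with_suffix(".json")) as f:
    metadata = json.load(f)

# With a 2048x-downsampling autoencoder, latent length ≈ audio_samples // 2048.
print(latent.shape)
print(sorted(metadata.keys()))
```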
