
Commit c1e9c46

Merge pull request #216 from nateraw/pre-encoded-docs
Update docs on working with pre encoded latents
2 parents: 9e5954d + 2e65a9a

File tree: 3 files changed, +45 −4 lines

- docs/datasets.md
- docs/diffusion.md
- docs/pre_encoding.md


docs/datasets.md

Lines changed: 25 additions & 0 deletions
````diff
@@ -42,6 +42,31 @@ To load audio files and related metadata from .tar files in the WebDataset forma
 }
 ```
 
+## Pre Encoded Datasets
+To use pre encoded latents created with the [pre encoding script](pre_encoding.md), set the `dataset_type` property to `"pre_encoded"`, and provide the path to the directory containing the pre encoded `.npy` latent files and corresponding `.json` metadata files.
+
+You can optionally specify a `latent_crop_length` in latent units (latent length = `audio_samples // 2048`) to crop the pre encoded latents to a shorter length than they were encoded at. If not specified, the full pre encoded length is used. When `random_crop` is set to `true`, a crop of length `latent_crop_length` is taken at a random offset within the sequence, taking padding into account.
+
+**Note**: `random_crop` does not currently update `seconds_start`, so that condition will be inaccurate when random cropping is used to train or fine-tune models that rely on it (e.g. `stable-audio-open-1.0`); it can still be used with models that do not use `seconds_start` (e.g. `stable-audio-open-small`).
+
+### Example config
+```json
+{
+    "dataset_type": "pre_encoded",
+    "datasets": [
+        {
+            "id": "my_pre_encoded_audio",
+            "path": "/path/to/pre_encoded/output/",
+            "latent_crop_length": 512,
+            "custom_metadata_module": "/path/to/custom_metadata.py"
+        }
+    ],
+    "random_crop": true
+}
+```
+
+For information on creating pre encoded datasets, see [Pre Encoding](pre_encoding.md).
+
 # Custom metadata
 To customize the metadata provided to the conditioners during model training, you can provide a separate custom metadata module to the dataset config. This metadata module should be a Python file that must contain a function called `get_custom_metadata` that takes in two parameters, `info`, and `audio`, and returns a dictionary.
````
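The crop arithmetic above is easy to get wrong, so here is a quick sanity check: a minimal sketch, assuming the 44.1 kHz sample rate used by the Stable Audio Open models and the 2048-sample downsampling from the formula above; the helper name is hypothetical:

```python
# Hypothetical helper: convert a crop duration in seconds to latent units,
# using the relation from the docs above (latent length = audio_samples // 2048).
# The 44.1 kHz sample rate is an assumption matching the Stable Audio Open models.
def seconds_to_latent_length(seconds: float, sample_rate: int = 44100, downsample: int = 2048) -> int:
    return int(seconds * sample_rate) // downsample

# The example config's "latent_crop_length": 512 corresponds to roughly
# 512 * 2048 / 44100 ≈ 23.8 seconds of audio.
print(seconds_to_latent_length(23.8))  # -> 512
```

Since the example config also wires in a `custom_metadata_module`, here is a minimal sketch of such a module. The docs only require a `get_custom_metadata(info, audio)` function that returns a dictionary; the `"prompt"` key and the `relpath` field below are illustrative assumptions:

```python
# custom_metadata.py (sketch): the module referenced by "custom_metadata_module"
# in the example config above.
def get_custom_metadata(info, audio):
    # info carries per-sample metadata; audio is the sample's data.
    # Returning a "prompt" derived from the file path is an assumed example.
    return {"prompt": info.get("relpath", "")}
```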

docs/diffusion.md

Lines changed: 4 additions & 0 deletions
````diff
@@ -61,6 +61,10 @@ The `training` config in the diffusion model config file should have the followi
   - Optional, overrides `learning_rate`
 - `demo`
   - Configuration for the demos during training, including conditioning information
+- `pre_encoded`
+  - If true, indicates that the model should operate on [pre encoded latents](pre_encoding.md) instead of raw audio
+  - Required when training with [pre encoded datasets](datasets.md#pre-encoded-datasets)
+  - Optional. Default: `false`
 
 ## Example config
 ```json
````
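Putting the new flag in context with its sibling keys from the list above, a `training` block might look like the following sketch; the `learning_rate` value and the elided `demo` contents are placeholders, not values from this commit:

```json
"training": {
    "learning_rate": 1e-4,
    "pre_encoded": true,
    "demo": {
        ...
    }
}
```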

docs/pre_encoding.md

Lines changed: 16 additions & 4 deletions
````diff
@@ -6,6 +6,8 @@ When training models on encoded latents from a frozen pre-trained autoencoder, t
 
 To pre-encode audio to latents, you'll need a dataset config file, an autoencoder model config file, and an **unwrapped** autoencoder checkpoint file.
 
+**Note:** You can find a copy of the unwrapped VAE checkpoint (`vae_model.ckpt`) and config (`vae_config.json`) in the `stabilityai/stable-audio-open-1.0` Hugging Face [repo](https://huggingface.co/stabilityai/stable-audio-open-1.0). This is the same VAE used in `stable-audio-open-small`.
+
 ## Run the Pre Encoding Script
 
 To pre-encode latents from an autoencoder model, you can use `pre_encode.py`. This script will load a pre-trained autoencoder, encode the latents/tokens, and save them to disk in a format that can be easily loaded during training.
@@ -50,6 +52,8 @@ The `pre_encode.py` script accepts the following command line arguments:
   - If true, shuffles the dataset
   - Optional
 
+**Note:** When pre encoding, it's recommended to set `"drop_last": false` in your dataset config to ensure the last batch is processed even if it's not full.
+
 For example, if you wanted to encode latents with padding up to 30 seconds long in half precision, you could run the following:
 
 ```bash
@@ -81,7 +85,7 @@ Inside the numbered subdirectories, you will find the encoded latents as `.npy`
 
 ## Training on Pre Encoded Latents
 
-Once you have saved your latents to disk, you can use them to train a model by providing a dataset config file to `train.py` that points to the pre-encoded latents, specifying `"dataset_type"` is `"pre_encoded"`. Under the hood, this will configure a `stable_audio_tools.data.dataset.PreEncodedDataset`.
+Once you have saved your latents to disk, you can use them to train a model by providing a dataset config file to `train.py` that points to the pre-encoded latents, setting `"dataset_type"` to `"pre_encoded"`. Under the hood, this will configure a `stable_audio_tools.data.dataset.PreEncodedDataset`. For more information on configuring pre encoded datasets, see the [Pre Encoded Datasets](datasets.md#pre-encoded-datasets) section of the datasets docs.
 
 The dataset config file should look something like this:
 
@@ -91,10 +95,18 @@ The dataset config file should look something like this:
     "datasets": [
         {
             "id": "my_audio",
-            "path": "/path/to/output/dir",
-            "latent_crop_length": 645
+            "path": "/path/to/output/dir"
         }
     ],
     "random_crop": false
 }
-```
+```
+
+In your diffusion model config, you'll also need to specify `pre_encoded: true` in the [`training` section](diffusion.md#training-configs) to tell the training wrapper to operate on pre encoded latents instead of audio.
+
+```json
+"training": {
+    "pre_encoded": true,
+    ...
+}
+```
````
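To spot-check what the pre encoding script wrote to disk before training, a small sketch along these lines can load one latent and its sidecar metadata. It assumes, per the docs above, that each numbered subdirectory holds `.npy` latent files with matching `.json` metadata files; the output path is a placeholder and the shared-stem naming is an assumption:

```python
import json
from pathlib import Path

import numpy as np

# Placeholder path: wherever pre_encode.py wrote its output.
output_dir = Path("/path/to/pre_encoded/output/")

# Grab one encoded latent from the numbered subdirectories.
latent_path = next(output_dir.rglob("*.npy"))
latent = np.load(latent_path)

# Assumption: the metadata .json shares the latent file's stem.
with open(latent_path.with_suffix(".json")) as f:
    metadata = json.load(f)

# With a 2048x-downsampling autoencoder, latent length ≈ audio_samples // 2048.
print(latent.shape)
print(sorted(metadata.keys()))
```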
