}
```
## Pre Encoded Datasets
To use pre-encoded latents created with the [pre encoding script](pre_encoding.md), set the `dataset_type` property to `"pre_encoded"` and provide the path to the directory containing the pre-encoded `.npy` latent files and their corresponding `.json` metadata files.
You can optionally specify a `latent_crop_length`, in latent units (latent length = `audio_samples // 2048`), to crop the pre-encoded latents to a shorter length than they were encoded at. If not specified, the full pre-encoded length is used. When `random_crop` is set to `true`, the dataset randomly crops a window of `latent_crop_length` from the sequence, taking padding into account.
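For example, assuming the 2048x downsampling ratio above and 44.1 kHz audio (both illustrative values), a 10-second crop corresponds to:

```python
# Convert a desired crop length in seconds to latent units.
# Sample rate and crop length here are illustrative.
sample_rate = 44100
crop_seconds = 10
audio_samples = sample_rate * crop_seconds   # 441000 samples
latent_crop_length = audio_samples // 2048   # 215 latent frames
```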
**Note**: `random_crop` does not currently update `seconds_start`, so that conditioning value will be inaccurate when training or fine-tuning models that use it (e.g. `stable-audio-open-1.0`). It can still be used with models that do not use `seconds_start` (e.g. `stable-audio-open-small`).

For information on creating pre-encoded datasets, see [Pre Encoding](pre_encoding.md).
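A sketch of such a dataset config (the `id` and `path` values are placeholders, and the top-level placement of `latent_crop_length` is an assumption mirroring `random_crop`):

```json
{
    "dataset_type": "pre_encoded",
    "datasets": [
        {
            "id": "my_audio_latents",
            "path": "/path/to/latents/dir"
        }
    ],
    "random_crop": true,
    "latent_crop_length": 215
}
```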
# Custom metadata
To customize the metadata provided to the conditioners during model training, you can provide a separate custom metadata module to the dataset config. This metadata module should be a Python file containing a function called `get_custom_metadata` that takes two parameters, `info` and `audio`, and returns a dictionary.
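For example, a minimal custom metadata module might look like this (the `"relpath"` field of `info` and the `"prompt"` output key are illustrative assumptions, not guaranteed parts of the API; use whatever fields your conditioners expect):

```python
# custom_metadata.py -- a minimal sketch of a custom metadata module.
# NOTE: the "relpath" field of `info` and the "prompt" output key are
# illustrative assumptions.

def get_custom_metadata(info, audio):
    # `info` holds per-file metadata and `audio` is the loaded audio tensor.
    # Return a dictionary of metadata to pass to the conditioners.
    prompt = info.get("relpath", "unknown audio")
    return {"prompt": prompt}
```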
When training models on latents from a pre-trained autoencoder, the encoder is typically frozen, so its output for a given audio file never changes. Because of that, it is common to pre-encode audio to latents and store them on disk instead of computing them on the fly during training. This can improve training throughput and free up GPU memory that would otherwise be used for encoding.
## Prerequisites
To pre-encode audio to latents, you'll need a dataset config file, an autoencoder model config file, and an **unwrapped** autoencoder checkpoint file.
**Note:** You can find a copy of the unwrapped VAE checkpoint (`vae_model.ckpt`) and config (`vae_config.json`) in the `stabilityai/stable-audio-open-1.0` Hugging Face [repo](https://huggingface.co/stabilityai/stable-audio-open-1.0). This is the same VAE used in `stable-audio-open-small`.
## Run the Pre Encoding Script
To pre-encode latents from an autoencoder model, you can use `pre_encode.py`. This script loads a pre-trained autoencoder, encodes your dataset's audio to latents (or discrete tokens), and saves them to disk in a format that can be easily loaded during training.
The `pre_encode.py` script accepts the following command line arguments:
- `--model-config`
  - Path to model config
- `--ckpt-path`
  - Path to **unwrapped** autoencoder model checkpoint
- `--model-half`
  - If true, uses half precision for model weights
  - Optional
- `--dataset-config`
  - Path to dataset config file
  - Required
- `--output-path`
  - Path to output folder
  - Required
- `--batch-size`
  - Batch size for processing
  - Optional, defaults to 1
- `--sample-size`
  - Number of audio samples to pad/crop to for pre-encoding
  - Optional, defaults to 1320960 (~30 seconds)
- `--is-discrete`
  - If true, treats the model as discrete, saving discrete tokens instead of continuous latents
  - Optional
- `--num-nodes`
  - Number of nodes to use for distributed processing, if available
  - Optional, defaults to 1
- `--num-workers`
  - Number of dataloader workers
  - Optional, defaults to 4
- `--strategy`
  - PyTorch Lightning strategy
  - Optional, defaults to `'auto'`
- `--limit-batches`
  - Limits the number of batches processed
  - Optional
- `--shuffle`
  - If true, shuffles the dataset
  - Optional

**Note:** When pre-encoding, it's recommended to set `"drop_last": false` in your dataset config to ensure the last batch is processed even if it's not full.
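For example (the dataset type, id, and path below are placeholders; the key point is the `"drop_last"` flag):

```json
{
    "dataset_type": "audio_dir",
    "datasets": [
        {
            "id": "my_audio",
            "path": "/path/to/audio"
        }
    ],
    "drop_last": false
}
```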
For example, to encode latents padded to 30 seconds of audio in half precision, you could run the following:
```bash
$ python3 ./pre_encode.py \
    --model-config /path/to/model/config.json \
    --ckpt-path /path/to/autoencoder/model.ckpt \
    --model-half \
    --dataset-config /path/to/dataset/config.json \
    --output-path /path/to/output/dir \
    --sample-size 1320960
```
When you run the above, the `--output-path` directory will contain numbered subdirectories for each GPU process used to encode the latents, and a `details.json` file that keeps track of settings used when the script was run.
Inside the numbered subdirectories, you will find the encoded latents as `.npy` files, along with associated `.json` metadata files.
```bash
/path/to/output/dir/
├── 0
│   ├── 0000000000000.json
│   ├── 0000000000000.npy
│   ├── 0000000000001.json
│   ├── 0000000000001.npy
│   ├── 0000000000002.json
│   ├── 0000000000002.npy
│   ...
└── details.json
```
## Training on Pre Encoded Latents
Once you have saved your latents to disk, you can use them to train a model by providing `train.py` with a dataset config file that points to the pre-encoded latents and sets `"dataset_type"` to `"pre_encoded"`. Under the hood, this configures a `stable_audio_tools.data.dataset.PreEncodedDataset`. For more information on configuring pre-encoded datasets, see the [Pre Encoded Datasets](datasets.md#pre-encoded-datasets) section of the datasets docs.
The dataset config file should look something like this:
```json
{
    "dataset_type": "pre_encoded",
    "datasets": [
        {
            "id": "my_audio",
            "path": "/path/to/output/dir"
        }
    ],
    "random_crop": false
}
```
In your diffusion model config, you'll also need to specify `"pre_encoded": true` in the [`training` section](diffusion.md#training-configs) to tell the training wrapper to operate on pre-encoded latents instead of audio.
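A minimal sketch of the relevant fragment of the diffusion model config (all other training settings omitted):

```json
{
    "training": {
        "pre_encoded": true
    }
}
```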