Commit 3cca73f

Merge remote-tracking branch 'upstream/main' into patch-1
2 parents cad30cc + 3635b23

37 files changed: +2182 −1351 lines

README.md

Lines changed: 2 additions & 2 deletions

@@ -14,9 +14,9 @@ $ pip install .
 ```

 # Requirements
-Requires PyTorch 2.0 or later for Flash Attention support
+Requires PyTorch 2.5 or later for Flash Attention and Flex Attention support

-Development for the repo is done in Python 3.8.10
+Development for the repo is done in Python 3.10

 # Interface

docs/datasets.md

Lines changed: 25 additions & 0 deletions

@@ -42,6 +42,31 @@ To load audio files and related metadata from .tar files in the WebDataset forma
 }
 ```

+## Pre Encoded Datasets
+To use pre encoded latents created with the [pre encoding script](pre_encoding.md), set the `dataset_type` property to `"pre_encoded"`, and provide the path to the directory containing the pre encoded `.npy` latent files and corresponding `.json` metadata files.
+
+You can optionally specify a `latent_crop_length`, in latent units (latent length = `audio_samples // 2048`), to crop the pre encoded latents to a length shorter than the one they were encoded at. If not specified, the full pre encoded length is used. When `random_crop` is set to true, a random crop of `latent_crop_length` latents is taken from the sequence, taking padding into account.
+
+**Note**: `random_crop` does not currently update `seconds_start`, so it will be inaccurate when used to train or fine-tune models with that condition (e.g. `stable-audio-open-1.0`), but it can be used with models that do not use `seconds_start` (e.g. `stable-audio-open-small`).
+
+### Example config
+```json
+{
+    "dataset_type": "pre_encoded",
+    "datasets": [
+        {
+            "id": "my_pre_encoded_audio",
+            "path": "/path/to/pre_encoded/output/",
+            "latent_crop_length": 512,
+            "custom_metadata_module": "/path/to/custom_metadata.py"
+        }
+    ],
+    "random_crop": true
+}
+```
+
+For information on creating pre encoded datasets, see [Pre Encoding](pre_encoding.md).
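The latent-unit arithmetic above can be sanity-checked in a couple of lines; the 2048× downsampling factor comes from the formula above, and the 44.1 kHz sample rate is an illustrative assumption:

```python
# Convert between audio samples and latent units for a 2048x-downsampling
# autoencoder, as described above. 44100 Hz is an illustrative sample rate.
def samples_to_latents(audio_samples: int, downsample: int = 2048) -> int:
    return audio_samples // downsample

def seconds_to_latents(seconds: float, sample_rate: int = 44100, downsample: int = 2048) -> int:
    return int(seconds * sample_rate) // downsample

# The default pre-encoding sample size of 1320960 samples (~30 s):
print(samples_to_latents(1320960))   # 645 latents
# A latent_crop_length of 512 corresponds to roughly 23.8 seconds:
print(seconds_to_latents(23.8))      # 512 latents
```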
 # Custom metadata
 To customize the metadata provided to the conditioners during model training, you can provide a separate custom metadata module to the dataset config. This metadata module should be a Python file that must contain a function called `get_custom_metadata` that takes in two parameters, `info` and `audio`, and returns a dictionary.
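A minimal custom metadata module satisfying this contract might look like the following; the `relpath` key and `prompt` output are illustrative assumptions, not required fields:

```python
# custom_metadata.py - minimal example of the get_custom_metadata contract
# described above: takes `info` (a metadata dict) and `audio`, returns a dict.
def get_custom_metadata(info, audio):
    # Derive a text prompt from the file's relative path; "relpath" and
    # "prompt" are illustrative names, not a guaranteed schema.
    return {"prompt": info.get("relpath", "unknown audio")}
```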

docs/diffusion.md

Lines changed: 4 additions & 0 deletions

@@ -61,6 +61,10 @@ The `training` config in the diffusion model config file should have the followi
   - Optional, overrides `learning_rate`
 - `demo`
   - Configuration for the demos during training, including conditioning information
+- `pre_encoded`
+  - If true, indicates that the model should operate on [pre encoded latents](pre_encoding.md) instead of raw audio
+  - Required when training with [pre encoded datasets](datasets.md#pre-encoded-datasets)
+  - Optional. Default: `false`
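Conceptually, this flag just switches where the training wrapper gets its latents. A hypothetical sketch (names are illustrative; this is not the actual `stable_audio_tools` implementation):

```python
def get_training_latents(batch, pre_encoded, encode):
    """Hypothetical sketch: with pre_encoded=True the batch already holds
    latents loaded from disk; otherwise the frozen autoencoder encodes
    the raw audio on the fly."""
    if pre_encoded:
        return batch          # batch is already latents
    return encode(batch)      # frozen autoencoder encode step
```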

 ## Example config
 ```json

docs/pre_encoding.md

Lines changed: 112 additions & 0 deletions

@@ -0,0 +1,112 @@
# Pre Encoding

When training models on latents from a pre-trained autoencoder, the encoder is typically frozen. Because of that, it is common to pre-encode audio to latents and store them on disk instead of computing them on the fly during training. This can improve training throughput and free up GPU memory that would otherwise be used for encoding.

## Prerequisites

To pre-encode audio to latents, you'll need a dataset config file, an autoencoder model config file, and an **unwrapped** autoencoder checkpoint file.

**Note:** You can find a copy of the unwrapped VAE checkpoint (`vae_model.ckpt`) and config (`vae_config.json`) in the `stabilityai/stable-audio-open-1.0` Hugging Face [repo](https://huggingface.co/stabilityai/stable-audio-open-1.0). This is the same VAE used in `stable-audio-open-small`.

## Run the Pre Encoding Script

To pre-encode latents from an autoencoder model, you can use `pre_encode.py`. This script loads a pre-trained autoencoder, encodes the dataset to latents (or tokens), and saves them to disk in a format that can be easily loaded during training.

The `pre_encode.py` script accepts the following command line arguments:

- `--model-config`
  - Path to model config
- `--ckpt-path`
  - Path to **unwrapped** autoencoder model checkpoint
- `--model-half`
  - If true, uses half precision for model weights
  - Optional
- `--dataset-config`
  - Path to dataset config file
  - Required
- `--output-path`
  - Path to output folder
  - Required
- `--batch-size`
  - Batch size for processing
  - Optional, defaults to 1
- `--sample-size`
  - Number of audio samples to pad/crop to for pre-encoding
  - Optional, defaults to 1320960 (~30 seconds)
- `--is-discrete`
  - If true, treats the model as discrete, saving discrete tokens instead of continuous latents
  - Optional
- `--num-nodes`
  - Number of nodes to use for distributed processing, if available
  - Optional, defaults to 1
- `--num-workers`
  - Number of dataloader workers
  - Optional, defaults to 4
- `--strategy`
  - PyTorch Lightning strategy
  - Optional, defaults to `'auto'`
- `--limit-batches`
  - Limits the number of batches processed
  - Optional
- `--shuffle`
  - If true, shuffles the dataset
  - Optional

**Note:** When pre encoding, it's recommended to set `"drop_last": false` in your dataset config to ensure the last batch is processed even if it's not full.

For example, to encode latents padded up to 30 seconds in half precision, you could run:

```bash
$ python3 ./pre_encode.py \
    --model-config /path/to/model/config.json \
    --ckpt-path /path/to/autoencoder/model.ckpt \
    --model-half \
    --dataset-config /path/to/dataset/config.json \
    --output-path /path/to/output/dir \
    --sample-size 1320960
```

When you run the above, the `--output-path` directory will contain numbered subdirectories for each GPU process used to encode the latents, along with a `details.json` file that records the settings used when the script was run.

Inside the numbered subdirectories, you will find the encoded latents as `.npy` files, along with associated `.json` metadata files.

```bash
/path/to/output/dir/
├── 0
│   ├── 0000000000000.json
│   ├── 0000000000000.npy
│   ├── 0000000000001.json
│   ├── 0000000000001.npy
│   ├── 0000000000002.json
│   ├── 0000000000002.npy
...
└── details.json
```
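A saved latent/metadata pair can be read back with NumPy and the standard library. A minimal sketch, assuming the directory layout above (the `padding_mask` key is written by `pre_encode.py`; the latent shape shown is illustrative):

```python
import json
from pathlib import Path

import numpy as np

def load_pre_encoded(stem: Path):
    """Load one latent/metadata pair produced by pre_encode.py,
    e.g. stem = Path('/path/to/output/dir/0/0000000000000')."""
    latent = np.load(stem.with_suffix(".npy"))  # (latent_channels, latent_length)
    metadata = json.loads(stem.with_suffix(".json").read_text())
    return latent, metadata
```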
## Training on Pre Encoded Latents

Once you have saved your latents to disk, you can use them to train a model by providing a dataset config file to `train.py` that points to the pre-encoded latents, with `"dataset_type"` set to `"pre_encoded"`. Under the hood, this will configure a `stable_audio_tools.data.dataset.PreEncodedDataset`. For more information on configuring pre encoded datasets, see the [Pre Encoded Datasets](datasets.md#pre-encoded-datasets) section of the datasets docs.

The dataset config file should look something like this:

```json
{
    "dataset_type": "pre_encoded",
    "datasets": [
        {
            "id": "my_audio",
            "path": "/path/to/output/dir"
        }
    ],
    "random_crop": false
}
```

In your diffusion model config, you'll also need to set `"pre_encoded": true` in the [`training` section](diffusion.md#training-configs) to tell the training wrapper to operate on pre encoded latents instead of audio.

```json
"training": {
    "pre_encoded": true,
    ...
}
```

pre_encode.py

Lines changed: 189 additions & 0 deletions

@@ -0,0 +1,189 @@
import argparse
import gc
import json
from pathlib import Path

import numpy as np
import pytorch_lightning as pl
import torch
from torch.nn import functional as F

from stable_audio_tools.data.dataset import create_dataloader_from_config
from stable_audio_tools.models.factory import create_model_from_config
from stable_audio_tools.models.pretrained import get_pretrained_model
from stable_audio_tools.models.utils import load_ckpt_state_dict, copy_state_dict


def load_model(model_config=None, model_ckpt_path=None, pretrained_name=None, model_half=False):
    if pretrained_name is not None:
        print(f"Loading pretrained model {pretrained_name}")
        model, model_config = get_pretrained_model(pretrained_name)

    elif model_config is not None and model_ckpt_path is not None:
        print("Creating model from config")
        model = create_model_from_config(model_config)

        print(f"Loading model checkpoint from {model_ckpt_path}")
        copy_state_dict(model, load_ckpt_state_dict(model_ckpt_path))

    model.eval().requires_grad_(False)

    if model_half:
        model.to(torch.float16)

    print("Done loading model")

    return model, model_config


class PreEncodedLatentsInferenceWrapper(pl.LightningModule):
    def __init__(
        self,
        model,
        output_path,
        is_discrete=False,
        model_half=False,
        model_config=None,
        dataset_config=None,
        sample_size=1920000,
        args_dict=None
    ):
        super().__init__()
        self.save_hyperparameters(ignore=['model'])
        self.model = model
        self.output_path = Path(output_path)

    def prepare_data(self):
        # Runs on rank 0 only
        self.output_path.mkdir(parents=True, exist_ok=True)
        details_path = self.output_path / "details.json"
        if not details_path.exists():  # Only save if it doesn't exist
            details = {
                "model_config": self.hparams.model_config,
                "dataset_config": self.hparams.dataset_config,
                "sample_size": self.hparams.sample_size,
                "args": self.hparams.args_dict
            }
            details_path.write_text(json.dumps(details))

    def setup(self, stage=None):
        # Runs on each device
        process_dir = self.output_path / str(self.global_rank)
        process_dir.mkdir(parents=True, exist_ok=True)

    def validation_step(self, batch, batch_idx):
        audio, metadata = batch

        if audio.ndim == 4 and audio.shape[0] == 1:
            audio = audio[0]

        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        gc.collect()

        if self.hparams.model_half:
            audio = audio.to(torch.float16)

        with torch.no_grad():
            if not self.hparams.is_discrete:
                latents = self.model.encode(audio)
            else:
                _, info = self.model.encode(audio, return_info=True)
                latents = info[self.model.bottleneck.tokens_id]

        latents = latents.cpu().numpy()

        # Save each sample in the batch
        for i, latent in enumerate(latents):
            latent_id = f"{self.global_rank:03d}{batch_idx:06d}{i:04d}"

            # Save latent as numpy file
            latent_path = self.output_path / str(self.global_rank) / f"{latent_id}.npy"
            with open(latent_path, "wb") as f:
                np.save(f, latent)

            md = metadata[i]
            # Downsample the sample-level padding mask to latent resolution
            padding_mask = F.interpolate(
                md["padding_mask"].unsqueeze(0).unsqueeze(1).float(),
                size=latent.shape[1],
                mode="nearest"
            ).squeeze().int()
            md["padding_mask"] = padding_mask.cpu().numpy().tolist()

            # Convert remaining tensors in md to serializable types
            for k, v in md.items():
                if isinstance(v, torch.Tensor):
                    md[k] = v.cpu().numpy().tolist()

            # Save metadata to json file
            metadata_path = self.output_path / str(self.global_rank) / f"{latent_id}.json"
            with open(metadata_path, "w") as f:
                json.dump(md, f)

    def configure_optimizers(self):
        return None


def main(args):
    with open(args.model_config) as f:
        model_config = json.load(f)

    with open(args.dataset_config) as f:
        dataset_config = json.load(f)

    model, model_config = load_model(
        model_config=model_config,
        model_ckpt_path=args.ckpt_path,
        model_half=args.model_half
    )

    data_loader = create_dataloader_from_config(
        dataset_config,
        batch_size=args.batch_size,
        num_workers=args.num_workers,
        sample_rate=model_config["sample_rate"],
        sample_size=args.sample_size,
        audio_channels=model_config.get("audio_channels", 2),
        shuffle=args.shuffle
    )

    pl_module = PreEncodedLatentsInferenceWrapper(
        model=model,
        output_path=args.output_path,
        is_discrete=args.is_discrete,
        model_half=args.model_half,
        model_config=args.model_config,
        dataset_config=args.dataset_config,
        sample_size=args.sample_size,
        args_dict=vars(args)
    )

    trainer = pl.Trainer(
        accelerator="gpu",
        devices="auto",
        num_nodes=args.num_nodes,
        strategy=args.strategy,
        precision="16-true" if args.model_half else "32",
        max_steps=args.limit_batches if args.limit_batches else -1,
        logger=False,  # Disable logging since we're just doing inference
        enable_checkpointing=False,
    )
    trainer.validate(pl_module, data_loader)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Encode audio dataset to VAE latents using PyTorch Lightning')
    parser.add_argument('--model-config', type=str, help='Path to model config', required=False)
    parser.add_argument('--ckpt-path', type=str, help='Path to unwrapped autoencoder model checkpoint', required=False)
    parser.add_argument('--model-half', action='store_true', help='Whether to use half precision')
    parser.add_argument('--dataset-config', type=str, help='Path to dataset config file', required=True)
    parser.add_argument('--output-path', type=str, help='Path to output folder', required=True)
    parser.add_argument('--batch-size', type=int, help='Batch size', default=1)
    parser.add_argument('--sample-size', type=int, help='Number of audio samples to pad/crop to', default=1320960)
    parser.add_argument('--is-discrete', action='store_true', help='Whether the model is discrete')
    parser.add_argument('--num-nodes', type=int, help='Number of GPU nodes', default=1)
    parser.add_argument('--num-workers', type=int, help='Number of dataloader workers', default=4)
    parser.add_argument('--strategy', type=str, help='PyTorch Lightning strategy', default='auto')
    parser.add_argument('--limit-batches', type=int, help='Limit number of batches (optional)', default=None)
    parser.add_argument('--shuffle', action='store_true', help='Shuffle dataset')
    args = parser.parse_args()
    main(args)
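The 13-digit file names produced above come from the `latent_id` format string in `validation_step` (`{global_rank:03d}{batch_idx:06d}{i:04d}`). A small helper to build and parse these ids, added here for illustration:

```python
def make_latent_id(rank, batch_idx, i):
    # Mirrors the fixed-width format used in validation_step above:
    # 3 digits of rank, 6 of batch index, 4 of in-batch sample index.
    return f"{rank:03d}{batch_idx:06d}{i:04d}"

def parse_latent_id(latent_id):
    # Split the fixed-width fields back into (rank, batch_idx, sample index).
    return int(latent_id[:3]), int(latent_id[3:9]), int(latent_id[9:])
```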

run_gradio.py

Lines changed: 3 additions & 1 deletion

@@ -12,7 +12,8 @@ def main(args):
         ckpt_path=args.ckpt_path,
         pretrained_name=args.pretrained_name,
         pretransform_ckpt_path=args.pretransform_ckpt_path,
-        model_half=args.model_half
+        model_half=args.model_half,
+        gradio_title=args.title
     )
     interface.queue()
     interface.launch(share=args.share, auth=(args.username, args.password) if args.username is not None else None)
@@ -28,5 +29,6 @@ def main(args):
     parser.add_argument('--username', type=str, help='Gradio username', required=False)
     parser.add_argument('--password', type=str, help='Gradio password', required=False)
     parser.add_argument('--model-half', action='store_true', help='Whether to use half precision', required=False, default=True)
+    parser.add_argument('--title', type=str, help='Title to display at the top of the Gradio interface', required=False)
     args = parser.parse_args()
     main(args)
