# Stable Audio

FastVideo supports text-to-audio (T2A) generation via the **Stable Audio Open** model from Stability AI. This page describes supported models, installation, usage, and known limitations.

## Supported Models and Weights

| Model | HuggingFace ID | Local Path |
|-------|----------------|------------|
| Stable Audio Open 1.0 | `stabilityai/stable-audio-open-1.0` | `official_weights/stable-audio-open-1.0` |

**Weight format**:

- **HuggingFace**: Pass the model ID (e.g. `stabilityai/stable-audio-open-1.0`) to `VideoGenerator.from_pretrained()`. FastVideo will download and cache the model on first use.
- **Local**: Place a unified checkpoint (`model.safetensors` or `model.ckpt`) and `model_config.json` at the model root. Use the directory path as `model_path`.
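
A local directory can be sanity-checked for this layout before handing it to `from_pretrained()`. A minimal sketch; the helper name below is hypothetical, not part of FastVideo's API:

```python
from pathlib import Path


def has_stable_audio_layout(model_path: str) -> bool:
    """Return True if the directory holds a unified checkpoint
    (model.safetensors or model.ckpt) plus model_config.json at its root."""
    root = Path(model_path)
    has_weights = (root / "model.safetensors").is_file() or (root / "model.ckpt").is_file()
    return has_weights and (root / "model_config.json").is_file()
```

This catches the common mistake of pointing `model_path` at a parent folder, or at a diffusers-style multi-folder export, which FastVideo's unified-checkpoint loader does not expect here.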

## Installation and Dependencies

### Conflict with `stable-audio-tools` (Python 3.12)

The `stable-audio-tools` PyPI package has dependencies that **fail to build on Python 3.12** (e.g. PyWavelets). Do **not** run `pip install stable-audio-tools` directly.

Use the following two-step install (the extras specifier is quoted so shells like zsh don't expand the brackets):

```bash
# 1. Install stable-audio-tools without its dependencies
pip install stable-audio-tools --no-deps

# 2. Install compatible inference dependencies
pip install ".[stable-audio]"
# or: pip install k-diffusion v-diffusion-pytorch prefigure ema-pytorch local-attention alias-free-torch
```

If FastVideo is already installed:

```bash
pip install stable-audio-tools --no-deps
pip install "fastvideo[stable-audio]"
```
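
After installing, you can confirm the inference dependencies are importable. A small sketch; the module names below are assumptions derived from the PyPI package names (dashes mapped to underscores), not names confirmed by this document:

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the subset of module names that cannot be imported."""
    return [m for m in module_names if find_spec(m) is None]


# Assumed import names for the [stable-audio] extras;
# v-diffusion-pytorch is omitted since its import name may differ.
deps = ["k_diffusion", "prefigure", "ema_pytorch",
        "local_attention", "alias_free_torch"]
print(missing_modules(deps) or "all checked inference deps importable")
```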

### Dependencies Installed by `[stable-audio]`

- `k-diffusion>=0.1.1`
- `v-diffusion-pytorch>=0.0.2`
- `prefigure>=0.0.9`
- `ema-pytorch>=0.2.3`
- `local-attention>=1.8.6`
- `alias-free-torch>=0.0.6`

These versions are compatible with FastVideo. `stable-audio-tools` declares stricter pins; the `--no-deps` install avoids pulling in conflicting packages (PyWavelets, encodec, etc.) that are not required for inference.
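
If some of these packages are already present, their installed versions can be compared against the minimums above. A stdlib-only sketch using a naive dotted-integer comparison (it does not handle pre-release or `.post` suffixes):

```python
from importlib import metadata


def parse(version: str):
    """Split a dotted version into an integer tuple, e.g. '0.2.3' -> (0, 2, 3)."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    return parse(installed) >= parse(minimum)


MINIMUMS = {"k-diffusion": "0.1.1", "v-diffusion-pytorch": "0.0.2",
            "prefigure": "0.0.9", "ema-pytorch": "0.2.3",
            "local-attention": "1.8.6", "alias-free-torch": "0.0.6"}

for name, minimum in MINIMUMS.items():
    try:
        ok = meets_minimum(metadata.version(name), minimum)
    except metadata.PackageNotFoundError:
        ok = False          # not installed at all
    except ValueError:
        ok = True           # non-numeric tag (e.g. '.post1'); assume acceptable
    print(f"{name}: {'ok' if ok else 'missing or too old'}")
```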

## Running the Example

### Basic usage

```bash
python examples/inference/basic/stable_audio_basic.py
```

### With custom parameters

```bash
python examples/inference/basic/stable_audio_basic.py \
    --prompt "A gentle rain on a wooden roof" \
    --duration 10 \
    --steps 250 \
    --output my_audio.wav
```

### Main parameters

| Argument | Default | Description |
|----------|---------|-------------|
| `--model-path` | `stabilityai/stable-audio-open-1.0` | Model path or HuggingFace model ID |
| `--prompt` | `A beautiful piano arpeggio` | Text description of the audio to generate |
| `--duration` | `10.0` | Output duration in seconds |
| `--output` | `outputs_audio/stable_audio_output.wav` | Output WAV file path |
| `--steps` | `250` | Number of denoising steps (`num_inference_steps`) |
| `--guidance-scale` | `6.0` | Classifier-free guidance scale |
| `--seed` | `42` | Random seed |
| `--no-cpu-offload` | (flag) | Disable CPU offload for higher GPU utilization (requires more VRAM) |

### Programmatic usage

```python
from fastvideo import VideoGenerator

generator = VideoGenerator.from_pretrained(
    "stabilityai/stable-audio-open-1.0",
    num_gpus=1,
)

result = generator.generate_audio(
    prompt="A beautiful piano arpeggio",
    duration_seconds=10.0,
    num_inference_steps=250,
    guidance_scale=6.0,
    seed=42,
)

# result["audio"]: torch.Tensor (B, C, T)
# result["sample_rate"]: 44100
generator.shutdown()
```
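
The returned tensor can be written to disk with `torchaudio.save`, but a dependency-free sketch using the stdlib `wave` module works too. It assumes mono float samples in [-1, 1], e.g. `result["audio"][0, 0].tolist()` for batch 0, channel 0:

```python
import struct
import wave


def save_wav_mono(samples, path, sample_rate=44100):
    """Write float samples in [-1, 1] to a 16-bit PCM mono WAV file."""
    # Clamp and scale to signed 16-bit range
    pcm = [max(-32768, min(32767, int(s * 32767))) for s in samples]
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)              # 2 bytes = 16-bit samples
        f.setframerate(sample_rate)
        f.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
```

For stereo output you would interleave the two channels frame by frame and set `setnchannels(2)` instead.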

### Sampling parameters (`generate_audio` kwargs)

| Parameter | Default | Description |
|-----------|---------|-------------|
| `duration_seconds` | `10.0` | Output duration (seconds) |
| `num_inference_steps` | `250` | Denoising steps |
| `guidance_scale` | `6.0` | CFG scale |
| `seed` | `42` | Random seed |
| `seconds_start` | `0.0` | Conditioning start offset |
| `seconds_total` | Same as `duration_seconds` | Conditioning total duration |

`sample_rate` is fixed at **44.1 kHz** and comes from the pipeline config.

## Known Limitations

| Item | Description |
|------|-------------|
| **T2A only** | Only text-to-audio is supported. Audio-to-audio, stem separation, and other stable-audio-tools features are not implemented. |
| **Single model** | Only Stable Audio Open 1.0 is supported. |
| **VRAM** | ~6–8 GB for typical generation (10 s, 250 steps). Use `--no-cpu-offload` for higher GPU utilization; this increases VRAM use. |
| **Max duration** | ~47.5 s at 44.1 kHz (model `sample_size` limit). |
| **Differences from official** | Uses FastVideo’s pipeline layout and executor; sampling logic matches stable-audio-tools (k-diffusion v-prediction, DPM++ 2M SDE). Minor numerical differences may occur due to implementation details. |
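
The ~47.5 s ceiling follows directly from the model's fixed sample window. Assuming a `sample_size` of 2,097,152 samples (2^21, the window published for Stable Audio Open 1.0) at the fixed 44.1 kHz rate:

```python
SAMPLE_RATE = 44_100      # fixed by the pipeline config
SAMPLE_SIZE = 2_097_152   # assumed model window: 2**21 samples

max_duration = SAMPLE_SIZE / SAMPLE_RATE
print(f"max duration: {max_duration:.2f} s")  # → max duration: 47.55 s
```

Requests for a longer `duration_seconds` cannot exceed this window; shorter durations simply use fewer of the generated samples.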