Commit d9ffee4 (parent 31c0f1b)

harveyhappy-harvey authored and committed

[feat] Add stable_audio T2A Generation

32 files changed: 1849 additions & 16 deletions

docs/inference/inference_quick_start.md

Lines changed: 2 additions & 0 deletions

```diff
@@ -67,6 +67,8 @@ More inference example scripts can be found in `scripts/inference/`
 
 Please see the [support matrix](support_matrix.md) for the list of supported models and their available optimizations.
 
+For **text-to-audio** generation (Stable Audio), see [Stable Audio](stable_audio.md).
+
 ## Image-to-Video Generation
 
 You can generate a video starting from an initial image:
```

docs/inference/stable_audio.md

Lines changed: 126 additions & 0 deletions

# Stable Audio

FastVideo supports text-to-audio (T2A) generation via the **Stable Audio Open** model from Stability AI. This page describes supported models, installation, usage, and known limitations.

## Supported Models and Weights

| Model | HuggingFace ID | Local Path |
|-------|----------------|------------|
| Stable Audio Open 1.0 | `stabilityai/stable-audio-open-1.0` | `official_weights/stable-audio-open-1.0` |

**Weight format**:

- **HuggingFace**: Pass the model ID (e.g. `stabilityai/stable-audio-open-1.0`) to `VideoGenerator.from_pretrained()`. FastVideo will download and cache the model on first use.
- **Local**: Place a unified checkpoint (`model.safetensors` or `model.ckpt`) and `model_config.json` at the model root. Use the directory path as `model_path`.

## Installation and Dependencies

### Conflict with `stable-audio-tools` (Python 3.12)

The `stable-audio-tools` PyPI package has dependencies that **fail to build on Python 3.12** (e.g. PyWavelets). Do **not** run `pip install stable-audio-tools` directly.

Use the following two-step install:

```bash
# 1. Install stable-audio-tools without its dependencies
pip install stable-audio-tools --no-deps

# 2. Install compatible inference dependencies
pip install .[stable-audio]
# or: pip install k-diffusion v-diffusion-pytorch prefigure ema-pytorch local-attention alias-free-torch
```

If FastVideo is already installed:

```bash
pip install stable-audio-tools --no-deps
pip install fastvideo[stable-audio]
```

### Dependencies Installed by `[stable-audio]`

- `k-diffusion>=0.1.1`
- `v-diffusion-pytorch>=0.0.2`
- `prefigure>=0.0.9`
- `ema-pytorch>=0.2.3`
- `local-attention>=1.8.6`
- `alias-free-torch>=0.0.6`

These versions are compatible with FastVideo. `stable-audio-tools` declares stricter pins; the `--no-deps` install avoids pulling in conflicting packages (PyWavelets, encodec, etc.) that are not required for inference.
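To confirm the two-step install left nothing missing, a quick check can be run (a sketch; the module names below are the assumed import names of the listed packages, which replace the hyphens in their PyPI names with underscores):

```python
import importlib.util

# Assumed import names for the [stable-audio] packages listed above
# (PyPI names use hyphens; import names use underscores).
REQUIRED = ["k_diffusion", "prefigure", "ema_pytorch", "local_attention"]

def missing_packages(names: list[str]) -> list[str]:
    """Return the module names that cannot be found in the current environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]

if missing := missing_packages(REQUIRED):
    print("Missing packages:", ", ".join(missing))
```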
## Running the Example

### Basic usage

```bash
python examples/inference/basic/stable_audio_basic.py
```

### With custom parameters

```bash
python examples/inference/basic/stable_audio_basic.py \
    --prompt "A gentle rain on a wooden roof" \
    --duration 10 \
    --steps 250 \
    --output my_audio.wav
```

### Main parameters

| Argument | Default | Description |
|----------|---------|-------------|
| `--model-path` | `stabilityai/stable-audio-open-1.0` | Model path or HuggingFace model ID |
| `--prompt` | `A beautiful piano arpeggio` | Text description of the audio to generate |
| `--duration` | `10.0` | Output duration in seconds |
| `--output` | `outputs_audio/stable_audio_output.wav` | Output WAV file path |
| `--steps` | `250` | Number of denoising steps (`num_inference_steps`) |
| `--guidance-scale` | `6.0` | Classifier-free guidance scale |
| `--seed` | `42` | Random seed |
| `--no-cpu-offload` | (flag) | Disable CPU offload for higher GPU utilization (requires more VRAM) |

### Programmatic usage

```python
from fastvideo import VideoGenerator

generator = VideoGenerator.from_pretrained(
    "stabilityai/stable-audio-open-1.0",
    num_gpus=1,
)

result = generator.generate_audio(
    prompt="A beautiful piano arpeggio",
    duration_seconds=10.0,
    num_inference_steps=250,
    guidance_scale=6.0,
    seed=42,
)

# result["audio"]: torch.Tensor (B, C, T)
# result["sample_rate"]: 44100
generator.shutdown()
```

### Sampling parameters (`generate_audio` kwargs)

| Parameter | Default | Description |
|-----------|---------|-------------|
| `duration_seconds` | `10.0` | Output duration (seconds) |
| `num_inference_steps` | `250` | Denoising steps |
| `guidance_scale` | `6.0` | CFG scale |
| `seed` | `42` | Random seed |
| `seconds_start` | `0.0` | Conditioning start offset |
| `seconds_total` | Same as `duration_seconds` | Conditioning total duration |

`sample_rate` is fixed at **44.1 kHz** and comes from the pipeline config.
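Putting the table together, the timing-related kwargs can be collected like this (a sketch only; the names come from the table above, and `seconds_total` is written out even though it defaults to `duration_seconds`):

```python
# Sketch of generate_audio kwargs; built as a plain dict so the defaults
# are visible without calling into FastVideo.
sampling_kwargs = dict(
    duration_seconds=12.0,
    num_inference_steps=250,
    guidance_scale=6.0,
    seed=42,
    seconds_start=0.0,   # conditioning start offset
    seconds_total=12.0,  # defaults to duration_seconds when omitted
)
# generator.generate_audio(prompt="...", **sampling_kwargs)
```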
## Known Limitations

| Item | Description |
|------|-------------|
| **T2A only** | Only text-to-audio is supported. Audio-to-audio, stem separation, and other stable-audio-tools features are not implemented. |
| **Single model** | Only Stable Audio Open 1.0 is supported. |
| **VRAM** | ~6–8 GB for typical generation (10 s, 250 steps). Use `--no-cpu-offload` for higher GPU utilization; this increases VRAM use. |
| **Max duration** | ~47.5 s at 44.1 kHz (model `sample_size` limit). |
| **Differences from official** | Uses FastVideo's pipeline layout and executor; sampling logic matches stable-audio-tools (k-diffusion v-prediction, DPM++ 2M SDE). Minor numerical differences may occur due to implementation details. |

docs/inference/support_matrix.md

Lines changed: 4 additions & 0 deletions

```diff
@@ -60,6 +60,7 @@ The `HuggingFace Model ID` can be directly pass to `from_pretrained()` methods a
 | Matrix Game 2.0 Base | `FastVideo/Matrix-Game-2.0-Base-Diffusers` | 352x640 ||||||
 | Matrix Game 2.0 GTA | `FastVideo/Matrix-Game-2.0-GTA-Diffusers` | 352x640 ||||||
 | Matrix Game 2.0 TempleRun | `FastVideo/Matrix-Game-2.0-TempleRun-Diffusers` | 352x640 ||||||
+| Stable Audio Open 1.0 (T2A) | `stabilityai/stable-audio-open-1.0` | 44.1 kHz stereo ||||||
 
 **Note**: Wan2.2 TI2V 5B has some quality issues when performing I2V generation. We are working on fixing this issue.
@@ -80,3 +81,6 @@ The `HuggingFace Model ID` can be directly pass to `from_pretrained()` methods a
 - Image-to-video game world models with keyboard/mouse control input
 - Three variants available: Base (universal), GTA, and TempleRun
 - Each variant has different keyboard dimensions for control inputs
+
+### Stable Audio Open 1.0
+- Text-to-audio (T2A) only. See [Stable Audio](stable_audio.md) for installation and usage.
```
examples/inference/basic/stable_audio_basic.py

Lines changed: 128 additions & 0 deletions

```python
#!/usr/bin/env python3
# SPDX-License-Identifier: Apache-2.0
"""
Minimal example: generate audio from a text prompt using Stable Audio Open.

Install (once): pip install stable-audio-tools --no-deps && pip install .[stable-audio]

Usage:
    python examples/inference/basic/stable_audio_basic.py
    python examples/inference/basic/stable_audio_basic.py --prompt "A gentle rain" --duration 8
    python examples/inference/basic/stable_audio_basic.py --no-cpu-offload  # higher GPU utilization
"""
import argparse
import os

import numpy as np
import torch

from fastvideo import VideoGenerator


def save_audio_wav(audio: torch.Tensor, sample_rate: int, path: str) -> None:
    """Save audio tensor (B, C, T) to WAV file. Output is stereo interleaved."""
    import wave

    if audio.ndim == 3:
        audio = audio[0]
    audio_np = audio.detach().cpu().float().numpy()
    audio_np = np.clip(audio_np, -1.0, 1.0)
    audio_int16 = (audio_np * 32767.0).astype(np.int16)
    if audio_int16.ndim == 1:
        audio_int16 = audio_int16[:, None]
    num_channels = audio_int16.shape[0]
    # Transpose (C, T) -> (T, C) so samples are interleaved per frame
    frames_bytes = audio_int16.T.tobytes()

    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(num_channels)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(frames_bytes)


def main() -> None:
    parser = argparse.ArgumentParser(description="Stable Audio text-to-audio generation")
    parser.add_argument(
        "--model-path",
        type=str,
        default="stabilityai/stable-audio-open-1.0",
        help="Path to model or HuggingFace model ID (e.g. stabilityai/stable-audio-open-1.0)",
    )
    parser.add_argument(
        "--prompt",
        type=str,
        default="A beautiful piano arpeggio",
        help="Text description of the audio to generate",
    )
    parser.add_argument(
        "--duration",
        type=float,
        default=10.0,
        help="Duration in seconds (default: 10)",
    )
    parser.add_argument(
        "--output",
        type=str,
        default="outputs_audio/stable_audio_output.wav",
        help="Output WAV file path",
    )
    parser.add_argument(
        "--steps",
        type=int,
        default=250,
        help="Number of denoising steps (default: 250)",
    )
    parser.add_argument(
        "--guidance-scale",
        type=float,
        default=6.0,
        help="Classifier-free guidance scale (default: 6.0)",
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=42,
        help="Random seed",
    )
    parser.add_argument(
        "--no-cpu-offload",
        action="store_true",
        help="Disable CPU offload for higher GPU utilization (requires more VRAM)",
    )
    args = parser.parse_args()

    offload_kwargs = {}
    if args.no_cpu_offload:
        offload_kwargs = dict(
            dit_cpu_offload=False,
            text_encoder_cpu_offload=False,
            vae_cpu_offload=False,
        )

    generator = VideoGenerator.from_pretrained(
        args.model_path,
        num_gpus=1,
        **offload_kwargs,
    )

    result = generator.generate_audio(
        prompt=args.prompt,
        duration_seconds=args.duration,
        num_inference_steps=args.steps,
        guidance_scale=args.guidance_scale,
        seed=args.seed,
    )

    generator.shutdown()

    save_audio_wav(result["audio"], result["sample_rate"], args.output)
    print(f"Saved audio to {args.output}")
    print(f"  Shape: {result['audio'].shape}, sample_rate: {result['sample_rate']} Hz")
    if result.get("generation_time"):
        print(f"  Generation time: {result['generation_time']:.1f}s")


if __name__ == "__main__":
    main()
```
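The WAV layout `save_audio_wav` relies on (16-bit PCM, channels interleaved per frame) can be sanity-checked with the stdlib alone, no torch required. This is an illustrative sketch: it writes one stereo frame and reads it back unchanged.

```python
import os
import struct
import tempfile
import wave

# One stereo frame: left sample = 1000, right sample = -1000, as little-endian int16,
# matching the 2-byte sample width and interleaved ordering used above.
path = os.path.join(tempfile.mkdtemp(), "probe.wav")
frames = struct.pack("<2h", 1000, -1000)

with wave.open(path, "wb") as w:
    w.setnchannels(2)      # stereo
    w.setsampwidth(2)      # 2 bytes per sample -> int16
    w.setframerate(44100)  # Stable Audio's fixed sample rate
    w.writeframes(frames)

with wave.open(path, "rb") as r:
    assert r.getnchannels() == 2 and r.getframerate() == 44100
    roundtrip = r.readframes(1)
assert roundtrip == frames
```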

fastvideo/configs/models/dits/__init__.py

Lines changed: 12 additions & 1 deletion

```diff
@@ -3,15 +3,26 @@
 from fastvideo.configs.models.dits.hunyuanvideo import HunyuanVideoConfig
 from fastvideo.configs.models.dits.hunyuanvideo15 import HunyuanVideo15Config
 from fastvideo.configs.models.dits.lingbotworld import LingBotWorldVideoConfig
+from fastvideo.configs.models.dits.hyworld import HYWorldConfig
 from fastvideo.configs.models.dits.longcat import LongCatVideoConfig
 from fastvideo.configs.models.dits.ltx2 import LTX2VideoConfig
+from fastvideo.configs.models.dits.stable_audio import StableAudioDiTConfig
 from fastvideo.configs.models.dits.stepvideo import StepVideoConfig
 from fastvideo.configs.models.dits.wanvideo import WanVideoConfig
-from fastvideo.configs.models.dits.hyworld import HYWorldConfig
 
 __all__ = [
-    "HunyuanVideoConfig", "HunyuanVideo15Config", "WanVideoConfig",
-    "StepVideoConfig", "CosmosVideoConfig", "Cosmos25VideoConfig",
-    "LongCatVideoConfig", "LTX2VideoConfig", "HYWorldConfig",
-    "LingBotWorldVideoConfig"
+    "HunyuanVideoConfig",
+    "HunyuanVideo15Config",
+    "WanVideoConfig",
+    "StepVideoConfig",
+    "CosmosVideoConfig",
+    "Cosmos25VideoConfig",
+    "LongCatVideoConfig",
+    "LTX2VideoConfig",
+    "HYWorldConfig",
+    "LingBotWorldVideoConfig",
+    "StableAudioDiTConfig",
 ]
```
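The old `__all__` in this hunk left `"LingBotWorldVideoConfig"` without a trailing comma before the next entry, which is a classic hazard worth noting: Python concatenates adjacent string literals implicitly, silently fusing two list entries into one bogus export name.

```python
# A missing comma between adjacent string literals silently merges them:
entries = [
    "LingBotWorldVideoConfig"  # <- no comma here...
    "HunyuanVideoConfig",      # ...so the two lines fuse into one string
]
print(entries)  # a single fused entry, not two
```

Linters such as ruff flag this pattern (implicit string concatenation in collections) precisely because it usually hides a typo rather than intent.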
fastvideo/configs/models/dits/stable_audio.py

Lines changed: 39 additions & 0 deletions

```python
# SPDX-License-Identifier: Apache-2.0
"""
Stable Audio DiT config for FastVideo.
"""
from dataclasses import dataclass, field

from fastvideo.configs.models.dits.base import DiTArchConfig, DiTConfig


@dataclass
class StableAudioDiTArchConfig(DiTArchConfig):
    """Arch config for Stable Audio DiT."""

    # Iterator strips the "model.model." prefix; map inner keys to the
    # wrapper's "model.*" namespace.
    param_names_mapping: dict = field(default_factory=lambda: {
        r"^(.*)$": r"model.\1",
    })
    reverse_param_names_mapping: dict = field(default_factory=dict)
    lora_param_names_mapping: dict = field(default_factory=dict)
    _fsdp_shard_conditions: list = field(default_factory=list)

    # HF config fields (from transformer/config.json)
    attention_head_dim: int = 64
    cross_attention_dim: int = 768
    cross_attention_input_dim: int = 768
    global_states_input_dim: int = 1536
    num_key_value_attention_heads: int = 12
    num_layers: int = 24
    sample_size: int = 1024
    time_proj_dim: int = 256


@dataclass
class StableAudioDiTConfig(DiTConfig):
    """Config for Stable Audio DiffusionTransformer."""

    arch_config: DiTArchConfig = field(default_factory=StableAudioDiTArchConfig)
    unified_checkpoint_path: str | None = None
    transformer_key_prefix: str = "model.model."
```
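The `param_names_mapping` above can be illustrated with plain `re` (a hypothetical sketch; the checkpoint key and the prefix-stripping step are assumptions based on `transformer_key_prefix`, not FastVideo's actual loader code):

```python
import re

# Same mapping as in StableAudioDiTArchConfig: every stripped key is
# rewritten onto the wrapper's "model.*" namespace.
PARAM_NAMES_MAPPING = {r"^(.*)$": r"model.\1"}

def map_param_name(name: str) -> str:
    """Apply the first matching regex rule to a checkpoint parameter name."""
    for pattern, repl in PARAM_NAMES_MAPPING.items():
        if re.match(pattern, name):
            return re.sub(pattern, repl, name)
    return name

# Hypothetical checkpoint key; the loader is assumed to strip the
# "model.model." prefix before the mapping is applied.
key = "model.model.blocks.0.attn.to_q.weight"
inner = key.removeprefix("model.model.")
print(map_param_name(inner))  # -> model.blocks.0.attn.to_q.weight
```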
Lines changed: 22 additions & 0 deletions

```python
# SPDX-License-Identifier: Apache-2.0
"""Stable Audio pipeline config."""
from dataclasses import dataclass, field

from fastvideo.configs.models import DiTConfig
from fastvideo.configs.models.dits.stable_audio import StableAudioDiTConfig
from fastvideo.configs.pipelines.base import PipelineConfig


@dataclass
class StableAudioPipelineConfig(PipelineConfig):
    """Config for Stable Audio text-to-audio pipeline.

    Matches stable-audio-open-1.0: 44.1 kHz, Oobleck VAE, T5 + seconds conditioning.
    """

    dit_config: DiTConfig = field(default_factory=StableAudioDiTConfig)

    # Audio-specific
    sample_rate: int = 44100
    sample_size: int = 2097152  # max ~47.5 s at 44.1 kHz
    embedded_cfg_scale: float = 6.0
```
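The "~47.5 s" maximum duration quoted in the docs and in the `sample_size` comment follows directly from these two config values:

```python
# Max duration = samples per channel / samples per second
SAMPLE_RATE = 44100      # Hz, fixed by the pipeline config
SAMPLE_SIZE = 2_097_152  # max samples, i.e. 2**21

max_seconds = SAMPLE_SIZE / SAMPLE_RATE
print(f"{max_seconds:.2f} s")  # -> 47.55 s
```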
