
Commit e42e143

Add LongCat-AudioDiT 1B TTS model (#627)
* add longcat audio
* add readme
* bump version
* add tests
* format
* Implement model_quant_predicate method to skip quantization for VAE in longcat_audiodit.py
* Update README.md to reflect model name change and audio playback updates
  - Changed model loading from `meituan-longcat/LongCat-AudioDiT-1B` to `mlx-community/LongCat-AudioDiT-1B-bf16`.
  - Updated audio playback code to use `AudioPlayer` instead of `sounddevice`.
  - Enhanced the available models section with new formats and additional model options.
* Implement streaming audio generation in longcat_audiodit.py
  - Added _stream_decode method for chunked audio decoding with cosine crossfade, improving time-to-first-audio.
  - Updated generate method to support streaming with new parameters: stream, streaming_interval, chunk_seconds, and overlap_seconds.
  - Introduced _format_duration static method for consistent audio duration formatting.
* Update README.md to reflect model name change for LongCat-AudioDiT
  - Changed model loading reference from `meituan-longcat/LongCat-AudioDiT-1B` to `mlx-community/LongCat-AudioDiT-1B-bf16` for consistency with updated model repository.
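The chunked decoding with cosine crossfade mentioned in the commit message can be sketched as follows. This is a minimal standalone illustration, not the `_stream_decode` implementation itself: the function name `crossfade_concat`, the chunk shapes, and the sin²/cos² (equal-gain) windows are assumptions.

```python
import numpy as np

def crossfade_concat(chunks, overlap):
    """Concatenate audio chunks, blending each `overlap`-sample seam with a
    cosine crossfade: fade_out = cos^2, fade_in = sin^2. The two windows sum
    to 1 at every sample, so a constant signal passes through unchanged."""
    t = np.linspace(0.0, np.pi / 2, overlap, endpoint=False)
    fade_in = np.sin(t) ** 2
    fade_out = np.cos(t) ** 2
    out = np.asarray(chunks[0], dtype=np.float64)
    for chunk in chunks[1:]:
        chunk = np.asarray(chunk, dtype=np.float64)
        # Blend the tail of what we have with the head of the next chunk.
        seam = out[-overlap:] * fade_out + chunk[:overlap] * fade_in
        out = np.concatenate([out[:-overlap], seam, chunk[overlap:]])
    return out
```

Because the windows sum to one, consecutive chunks of a steady signal join without amplitude dips at the seams, which is why this is preferred over a hard cut for streaming decode.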
1 parent 91caeb9 commit e42e143

11 files changed

Lines changed: 2332 additions & 1 deletion

File tree

README.md

Lines changed: 27 additions & 0 deletions
@@ -103,6 +103,7 @@ for result in model.generate("Hello from MLX-Audio!", voice="af_heart"):
 | **Ming Omni TTS (Dense)** | Lightweight dense Ming Omni variant for voice cloning and style control | EN, ZH | [mlx-community/Ming-omni-tts-0.5B-bf16](https://huggingface.co/mlx-community/Ming-omni-tts-0.5B-bf16) |
 | **KugelAudio** | SOTA 7B AR+Diffusion TTS for European languages | EN, DE, FR, ES, IT, PT, NL, PL, RU, UK, + 14 more | [kugelaudio/kugelaudio-0-open](https://huggingface.co/kugelaudio/kugelaudio-0-open) |
 | **Voxtral TTS** | Mistral's 4B multilingual TTS (20 voices, 9 languages) | EN, FR, ES, DE, IT, PT, NL, AR, HI | [mlx-community/Voxtral-4B-TTS-2603-mlx-bf16](https://huggingface.co/mlx-community/Voxtral-4B-TTS-2603-mlx-bf16) |
+| **LongCat-AudioDiT** | SOTA diffusion TTS in waveform latent space with voice cloning | ZH, EN | [mlx-community/LongCat-AudioDiT-1B-bf16](https://huggingface.co/mlx-community/LongCat-AudioDiT-1B-bf16) |

 ### Speech-to-Text (STT)

@@ -392,6 +393,32 @@ python -m mlx_audio.convert \
 > **Note:** Requires ~17GB memory (7B params in bfloat16).
 > Pre-encoded voice presets (voice cloning) are not yet available in the upstream model — the model generates speech with a default voice.

+### LongCat-AudioDiT
+
+SOTA diffusion-based TTS operating in the waveform latent space. Uses Conditional Flow Matching with a DiT backbone and WAV-VAE codec at 24kHz. Supports zero-shot voice cloning.
+
+```python
+from mlx_audio.tts.utils import load
+
+model = load("mlx-community/LongCat-AudioDiT-1B-bf16")
+
+# Zero-shot TTS
+result = next(model.generate("Hello, this is a test of AudioDiT."))
+audio = result.audio  # mx.array, 24kHz
+
+# Voice cloning (use "apg" guidance for best similarity)
+result = next(model.generate(
+    text="Today is warm turning to rain.",
+    ref_audio="reference.wav",
+    ref_text="Transcript of the reference audio.",
+    guidance_method="apg",
+    cfg_strength=4.0,
+    steps=16,
+))
+```
+
+See the [LongCat-AudioDiT README](mlx_audio/tts/models/longcat_audiodit/README.md) for all parameters and CLI usage.
+
 ### Voxtral TTS

 Mistral's 4B multilingual text-to-speech with 20 voice presets across 9 languages.
Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
# LongCat-AudioDiT

State-of-the-art diffusion-based text-to-speech that operates directly in the waveform latent space. Uses Conditional Flow Matching with a DiT (Diffusion Transformer) backbone and a WAV-VAE audio codec at 24kHz. Supports zero-shot voice cloning with SOTA speaker similarity on the Seed benchmark.

**Paper:** [LongCat-AudioDiT](https://github.com/meituan-longcat/LongCat-AudioDiT/blob/main/LongCat-AudioDiT.pdf)

## Usage

Python API:

```python
from mlx_audio.tts.utils import load

model = load("mlx-community/LongCat-AudioDiT-1B-bf16")

result = next(model.generate("Hello, this is a test of AudioDiT."))
audio = result.audio  # mlx array, 24kHz
```

Play audio directly:

```python
from mlx_audio.tts.audio_player import AudioPlayer

player = AudioPlayer(sample_rate=24000)
result = next(model.generate("The quick brown fox jumps over the lazy dog."))
player.queue_audio(result.audio)
player.wait_for_drain()
player.stop()
```

## Voice Cloning

Clone any voice using a reference audio sample and its transcript. Use `guidance_method="apg"` for best voice cloning quality:

```python
result = next(model.generate(
    text="Today is warm turning to rain, with good air quality.",
    ref_audio="reference.wav",
    ref_text="Transcript of the reference audio.",
    guidance_method="apg",
    cfg_strength=4.0,
    steps=16,
))
```

## Zero-Shot Generation (Chinese)

```python
result = next(model.generate(
    text="今天晴暖转阴雨,空气质量优至良,空气相对湿度较低。",
    steps=16,
    cfg_strength=4.0,
))
```

## Generation Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `steps` | 16 | Euler ODE solver steps. Higher = better quality, slower |
| `cfg_strength` | 4.0 | Classifier-free guidance strength |
| `guidance_method` | `"cfg"` | `"cfg"` for TTS, `"apg"` for voice cloning |
| `seed` | 1024 | Random seed for reproducibility |
| `ref_audio` | `None` | Reference audio for voice cloning (24kHz) |
| `ref_text` | `None` | Transcript of the reference audio |

## CLI

```bash
# Zero-shot TTS
python -m mlx_audio.tts.generate \
    --model mlx-community/LongCat-AudioDiT-1B-bf16 \
    --text "Hello, this is a test of AudioDiT." \
    --play

# Voice cloning
python -m mlx_audio.tts.generate \
    --model mlx-community/LongCat-AudioDiT-1B-bf16 \
    --text "Today is warm turning to rain." \
    --ref_audio reference.wav \
    --ref_text "Transcript of the reference audio." \
    --play
```

## Available Models

| Model | Parameters | Format | Languages |
|-------|-----------|--------|-----------|
| `mlx-community/LongCat-AudioDiT-1B-bf16` | 1B | bf16 | Chinese, English |
| `mlx-community/LongCat-AudioDiT-1B-8bit` | 1B | 8-bit | Chinese, English |
| `mlx-community/LongCat-AudioDiT-1B-6bit` | 1B | 6-bit | Chinese, English |
| `mlx-community/LongCat-AudioDiT-1B-5bit` | 1B | 5-bit | Chinese, English |
| `mlx-community/LongCat-AudioDiT-1B-4bit` | 1B | 4-bit | Chinese, English |
| `mlx-community/LongCat-AudioDiT-1B-mxfp8` | 1B | MXFP8 | Chinese, English |
| `mlx-community/LongCat-AudioDiT-1B-mxfp4` | 1B | MXFP4 | Chinese, English |
| `mlx-community/LongCat-AudioDiT-1B-nvfp4` | 1B | NVFP4 | Chinese, English |
| `mlx-community/LongCat-AudioDiT-3.5B-bf16` | 3.5B | bf16 | Chinese, English |
| `mlx-community/LongCat-AudioDiT-3.5B-8bit` | 3.5B | 8-bit | Chinese, English |
| `mlx-community/LongCat-AudioDiT-3.5B-6bit` | 3.5B | 6-bit | Chinese, English |
| `mlx-community/LongCat-AudioDiT-3.5B-5bit` | 3.5B | 5-bit | Chinese, English |
| `mlx-community/LongCat-AudioDiT-3.5B-4bit` | 3.5B | 4-bit | Chinese, English |
| `mlx-community/LongCat-AudioDiT-3.5B-mxfp8` | 3.5B | MXFP8 | Chinese, English |
| `mlx-community/LongCat-AudioDiT-3.5B-mxfp4` | 3.5B | MXFP4 | Chinese, English |
| `mlx-community/LongCat-AudioDiT-3.5B-nvfp4` | 3.5B | NVFP4 | Chinese, English |

## Architecture

- **DiT backbone:** dim=1536, depth=24, heads=24 with RoPE and AdaLN
- **WAV-VAE codec:** latent_dim=64, 24kHz, runs in fp16
- **UMT5 text encoder:** 768-dim, 12 layers with per-layer relative position bias
- **Conditional Flow Matching** with Euler ODE solver

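The last bullet can be made concrete with a 1-D NumPy toy. This is not the model's learned velocity field: `euler_sample` and the closed-form target `x1` are illustrative assumptions. With the optimal-transport conditional field v(x, t) = (x1 − x)/(1 − t), fixed-step Euler integration lands exactly on the target after the final step.

```python
import numpy as np

def euler_sample(x0, velocity, steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = np.asarray(x0, dtype=np.float64)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so (1 - t) never vanishes
        x = x + dt * velocity(x, t)
    return x

# Toy stand-in for the trained network: the OT conditional velocity field
# pointing at a known target x1 (in the real model, x is a waveform latent).
x1 = np.array([0.3, -1.2, 2.0])
velocity = lambda x, t: (x1 - x) / (1.0 - t)
```

In the real sampler, `velocity` is the DiT's prediction conditioned on text (and optionally a reference latent), and `steps` is the `steps` generation parameter.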
## License

LongCat-AudioDiT weights and code are released under the [MIT License](https://github.com/meituan-longcat/LongCat-AudioDiT/blob/main/LICENSE).
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
from .config import ModelConfig
from .longcat_audiodit import Model
Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
import math
from dataclasses import dataclass, field
from typing import List, Optional

from mlx_audio.tts.models.base import BaseModelArgs


@dataclass
class VaeConfig:
    in_channels: int = 1
    channels: int = 128
    c_mults: List[int] = field(default_factory=lambda: [1, 2, 4, 8, 16])
    strides: List[int] = field(default_factory=lambda: [2, 4, 4, 8, 8])
    latent_dim: int = 64
    encoder_latent_dim: int = 128
    use_snake: bool = True
    downsample_shortcut: str = "averaging"
    upsample_shortcut: str = "duplicating"
    out_shortcut: str = "averaging"
    in_shortcut: str = "duplicating"
    final_tanh: bool = False
    downsampling_ratio: int = 2048
    sample_rate: int = 24000
    scale: float = 0.71
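As a sanity check on the VAE defaults above: `downsampling_ratio` (2048) is the product of the `strides`, i.e. each latent frame covers 2048 waveform samples. The standalone snippet below mirrors those config values as plain variables (it does not import the actual `VaeConfig`):

```python
import math

strides = [2, 4, 4, 8, 8]        # VaeConfig.strides
downsampling_ratio = 2048        # VaeConfig.downsampling_ratio
sample_rate = 24000              # VaeConfig.sample_rate

# Each decoder stage upsamples by its stride, so one latent frame
# covers prod(strides) waveform samples.
assert math.prod(strides) == downsampling_ratio

# At 24 kHz, one latent frame spans about 85.3 ms of audio.
frame_ms = 1000 * downsampling_ratio / sample_rate
```

Note that `ModelConfig.latent_hop` below uses the same value, 2048, keeping the DiT's latent grid aligned with the VAE's frame size.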


@dataclass
class TextEncoderConfig:
    vocab_size: int = 256384
    d_model: int = 768
    d_kv: int = 64
    d_ff: int = 2048
    num_layers: int = 12
    num_heads: int = 12
    relative_attention_num_buckets: int = 32
    relative_attention_max_distance: int = 128
    dropout_rate: float = 0.1
    layer_norm_epsilon: float = 1e-6
    is_gated_act: bool = True
    dense_act_fn: str = "gelu_new"


@dataclass
class ModelConfig(BaseModelArgs):
    model_type: str = "audiodit"
    dit_dim: int = 1536
    dit_depth: int = 24
    dit_heads: int = 24
    dit_ff_mult: float = 4.0
    dit_text_dim: int = 768
    dit_dropout: float = 0.0
    dit_bias: bool = True
    dit_cross_attn: bool = True
    dit_adaln_type: str = "global"
    dit_adaln_use_text_cond: bool = True
    dit_long_skip: bool = True
    dit_text_conv: bool = True
    dit_qk_norm: bool = True
    dit_cross_attn_norm: bool = False
    dit_eps: float = 1e-6
    dit_use_latent_condition: bool = True
    repa_dit_layer: int = 8
    latent_dim: int = 64
    sigma: float = 0.0
    sampling_rate: int = 24000
    latent_hop: int = 2048
    max_wav_duration: float = 30.0
    text_encoder_model: str = "google/umt5-base"
    text_add_embed: bool = True
    text_norm_feat: bool = True
    vae_config: Optional[VaeConfig] = None
    text_encoder_config: Optional[TextEncoderConfig] = None

    def __post_init__(self):
        if isinstance(self.vae_config, dict):
            self.vae_config = VaeConfig(
                **{
                    k: v
                    for k, v in self.vae_config.items()
                    if k in VaeConfig.__dataclass_fields__
                }
            )
        if self.vae_config is None:
            self.vae_config = VaeConfig()
        if isinstance(self.text_encoder_config, dict):
            self.text_encoder_config = TextEncoderConfig(
                **{
                    k: v
                    for k, v in self.text_encoder_config.items()
                    if k in TextEncoderConfig.__dataclass_fields__
                }
            )
        if self.text_encoder_config is None:
            self.text_encoder_config = TextEncoderConfig()
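The `__post_init__` pattern above (rebuild nested dataclasses from raw dicts, silently dropping unknown keys so older checkpoints with extra config entries still load) can be illustrated with a self-contained toy. `ToyConfig` and the `extra_key_from_hub` entry are hypothetical, standing in for `VaeConfig` and a stray key in a downloaded `config.json`:

```python
from dataclasses import dataclass

@dataclass
class ToyConfig:
    latent_dim: int = 64
    sample_rate: int = 24000

# A raw dict as it might arrive from a config file, with an unknown key.
raw = {"latent_dim": 128, "sample_rate": 24000, "extra_key_from_hub": True}

# Keep only keys that are declared dataclass fields, as __post_init__ does;
# passing the unknown key directly would raise a TypeError.
cfg = ToyConfig(**{k: v for k, v in raw.items() if k in ToyConfig.__dataclass_fields__})
```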
