
Commit 63d874d

Add LongCat-AudioDiT pipeline

Signed-off-by: Lancer <maruixiang6688@gmail.com>
1 parent e365d74, commit 63d874d

File tree

14 files changed: +1731, −0 lines


docs/source/en/_toctree.yml

Lines changed: 2 additions & 0 deletions

```diff
@@ -488,6 +488,8 @@
       title: AudioLDM 2
     - local: api/pipelines/stable_audio
       title: Stable Audio
+    - local: api/pipelines/longcat_audio_dit
+      title: LongCat-AudioDiT
   title: Audio
 - sections:
   - local: api/pipelines/animatediff
```
Lines changed: 63 additions & 0 deletions
<!--Copyright 2026 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# LongCat-AudioDiT

LongCat-AudioDiT is a text-to-audio diffusion model from Meituan LongCat. The diffusers integration exposes a standard [`DiffusionPipeline`] interface for text-conditioned audio generation.

This pipeline supports loading the original flat LongCat checkpoint layout from either a local directory or a Hugging Face Hub repository containing:

- `config.json`
- `model.safetensors`

The loader builds the text encoder, transformer, and VAE from `config.json`, restores component weights from `model.safetensors`, and ties the shared UMT5 embedding when needed.

This pipeline was adapted from the LongCat-AudioDiT reference implementation: https://github.com/meituan-longcat/LongCat-AudioDiT
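The split-by-prefix step of loading a flat checkpoint can be sketched as follows. This is an illustrative sketch, not the real loader: the component prefixes (`text_encoder.`, `transformer.`, `vae.`) and the toy state dict are assumptions for demonstration only.

```python
import torch


def split_flat_state_dict(state_dict, prefixes=("text_encoder.", "transformer.", "vae.")):
    """Group a flat state dict into per-component state dicts by key prefix.

    Hypothetical helper: the real loader's key layout may differ.
    """
    components = {p.rstrip("."): {} for p in prefixes}
    for key, tensor in state_dict.items():
        for prefix in prefixes:
            if key.startswith(prefix):
                # Strip the component prefix so each sub-dict matches the
                # component's own parameter names.
                components[prefix.rstrip(".")][key[len(prefix):]] = tensor
                break
    return components


# Toy flat checkpoint standing in for model.safetensors contents.
flat = {
    "text_encoder.shared.weight": torch.zeros(4, 2),
    "transformer.proj.weight": torch.zeros(2, 2),
    "vae.decoder.weight": torch.zeros(3, 3),
}
parts = split_flat_state_dict(flat)
```

Each sub-dict can then be passed to the matching component's `load_state_dict`, with the shared UMT5 embedding tied afterwards.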
## Usage

```py
import os

import torch

from diffusers import LongCatAudioDiTPipeline

repo_id = "<longcat-audio-dit-repo-id>"
tokenizer_path = os.environ["LONGCAT_AUDIO_DIT_TOKENIZER_PATH"]

pipe = LongCatAudioDiTPipeline.from_pretrained(
    repo_id,
    tokenizer=tokenizer_path,
    torch_dtype=torch.float16,
    local_files_only=True,
)
pipe = pipe.to("cuda")

audio = pipe(
    prompt="A calm ocean wave ambience with soft wind in the background.",
    audio_end_in_s=2.0,
    num_inference_steps=16,
    guidance_scale=4.0,
    output_type="pt",
).audios
```
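The returned `audios` tensor can be written to disk with a small stdlib helper. A minimal sketch, assuming the output is a float waveform in `[-1, 1]` to be stored as 16-bit PCM; the `44100` sample rate and the `demo` tensor below are placeholders, so use the model's actual sample rate instead of hard-coding one.

```python
import wave

import numpy as np
import torch


def save_wav(audio: torch.Tensor, path: str, sample_rate: int) -> None:
    """Write a (channels, samples) float tensor in [-1, 1] as 16-bit PCM WAV."""
    pcm = (audio.clamp(-1.0, 1.0).cpu().numpy() * 32767.0).astype(np.int16)
    with wave.open(path, "wb") as f:
        f.setnchannels(pcm.shape[0])
        f.setsampwidth(2)  # 16-bit samples
        f.setframerate(sample_rate)
        f.writeframes(pcm.T.tobytes())  # transpose -> interleaved frames


# Stand-in for one batch element, i.e. audio[0] from the pipeline above.
demo = torch.zeros(2, 1000)
save_wav(demo, "out.wav", sample_rate=44100)  # placeholder rate
```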
## Tips

- `audio_end_in_s` is the most direct way to control output duration.
- `output_type="pt"` returns a PyTorch tensor shaped `(batch, channels, samples)`.
- If your tokenizer path is local-only, pass it explicitly to `from_pretrained(...)`.

## LongCatAudioDiTPipeline

[[autodoc]] LongCatAudioDiTPipeline
  - all
  - __call__
  - from_pretrained

docs/source/en/api/pipelines/overview.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -29,6 +29,7 @@ The table below lists all the pipelines currently available in 🤗 Diffusers an
 |---|---|
 | [AnimateDiff](animatediff) | text2video |
 | [AudioLDM2](audioldm2) | text2audio |
+| [LongCat-AudioDiT](longcat_audio_dit) | text2audio |
 | [AuraFlow](aura_flow) | text2image |
 | [Bria 3.2](bria_3_2) | text2image |
 | [CogVideoX](cogvideox) | text2video |
```

src/diffusers/__init__.py

Lines changed: 6 additions & 0 deletions

```diff
@@ -212,6 +212,7 @@
     "AutoencoderKLTemporalDecoder",
     "AutoencoderKLWan",
     "AutoencoderOobleck",
+    "LongCatAudioDiTVae",
     "AutoencoderRAE",
     "AutoencoderTiny",
     "AutoencoderVidTok",
@@ -253,6 +254,7 @@
     "Kandinsky5Transformer3DModel",
     "LatteTransformer3DModel",
     "LongCatImageTransformer2DModel",
+    "LongCatAudioDiTTransformer",
     "LTX2VideoTransformer3DModel",
     "LTXVideoTransformer3DModel",
     "Lumina2Transformer2DModel",
@@ -594,6 +596,7 @@
     "LLaDA2PipelineOutput",
     "LongCatImageEditPipeline",
     "LongCatImagePipeline",
+    "LongCatAudioDiTPipeline",
     "LTX2ConditionPipeline",
     "LTX2ImageToVideoPipeline",
     "LTX2LatentUpsamplePipeline",
@@ -1007,6 +1010,7 @@
     AutoencoderKLTemporalDecoder,
     AutoencoderKLWan,
     AutoencoderOobleck,
+    LongCatAudioDiTVae,
     AutoencoderRAE,
     AutoencoderTiny,
     AutoencoderVidTok,
@@ -1048,6 +1052,7 @@
     Kandinsky5Transformer3DModel,
     LatteTransformer3DModel,
     LongCatImageTransformer2DModel,
+    LongCatAudioDiTTransformer,
     LTX2VideoTransformer3DModel,
     LTXVideoTransformer3DModel,
     Lumina2Transformer2DModel,
@@ -1365,6 +1370,7 @@
     LLaDA2PipelineOutput,
     LongCatImageEditPipeline,
     LongCatImagePipeline,
+    LongCatAudioDiTPipeline,
     LTX2ConditionPipeline,
     LTX2ImageToVideoPipeline,
     LTX2LatentUpsamplePipeline,
```

src/diffusers/models/__init__.py

Lines changed: 2 additions & 0 deletions

```diff
@@ -51,6 +51,7 @@
     _import_structure["autoencoders.autoencoder_kl_temporal_decoder"] = ["AutoencoderKLTemporalDecoder"]
     _import_structure["autoencoders.autoencoder_kl_wan"] = ["AutoencoderKLWan"]
     _import_structure["autoencoders.autoencoder_oobleck"] = ["AutoencoderOobleck"]
+    _import_structure["autoencoders.autoencoder_longcat_audio_dit"] = ["LongCatAudioDiTVae"]
     _import_structure["autoencoders.autoencoder_rae"] = ["AutoencoderRAE"]
     _import_structure["autoencoders.autoencoder_tiny"] = ["AutoencoderTiny"]
     _import_structure["autoencoders.autoencoder_vidtok"] = ["AutoencoderVidTok"]
@@ -112,6 +113,7 @@
     _import_structure["transformers.transformer_hunyuanimage"] = ["HunyuanImageTransformer2DModel"]
     _import_structure["transformers.transformer_kandinsky"] = ["Kandinsky5Transformer3DModel"]
     _import_structure["transformers.transformer_longcat_image"] = ["LongCatImageTransformer2DModel"]
+    _import_structure["transformers.transformer_longcat_audio_dit"] = ["LongCatAudioDiTTransformer"]
     _import_structure["transformers.transformer_ltx"] = ["LTXVideoTransformer3DModel"]
     _import_structure["transformers.transformer_ltx2"] = ["LTX2VideoTransformer3DModel"]
     _import_structure["transformers.transformer_lumina2"] = ["Lumina2Transformer2DModel"]
```
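These `_import_structure` entries feed diffusers' lazy-import machinery, which defers importing heavy submodules until one of their exported names is first accessed. A toy sketch of the pattern, using stdlib modules rather than diffusers' real `_LazyModule`:

```python
import importlib

# Maps module path -> names it exports, mirroring the registry shape above
# (stdlib modules here, purely for illustration).
_import_structure = {
    "json": ["dumps", "loads"],
    "math": ["sqrt"],
}


def lazy_getattr(name):
    """Resolve an exported name by importing its module only on first use."""
    for module_path, exported in _import_structure.items():
        if name in exported:
            module = importlib.import_module(module_path)
            return getattr(module, name)
    raise AttributeError(name)
```

Until `lazy_getattr` is called, none of the registered modules are imported, which keeps `import diffusers` cheap even as the registry grows.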

src/diffusers/models/autoencoders/__init__.py

Lines changed: 1 addition & 0 deletions

```diff
@@ -20,6 +20,7 @@
 from .autoencoder_kl_temporal_decoder import AutoencoderKLTemporalDecoder
 from .autoencoder_kl_wan import AutoencoderKLWan
 from .autoencoder_oobleck import AutoencoderOobleck
+from .autoencoder_longcat_audio_dit import LongCatAudioDiTVae
 from .autoencoder_rae import AutoencoderRAE
 from .autoencoder_tiny import AutoencoderTiny
 from .autoencoder_vidtok import AutoencoderVidTok
```
