Add ACE-Step pipeline for text-to-music generation #13095

ChuxiJ wants to merge 32 commits into huggingface:main
Conversation
Hi @ChuxiJ, thanks for the PR! As a preliminary comment, I tried the test script given above but got an error, which I think is due to the checkpoint not yet being available in the Diffusers format. If I convert the checkpoint locally from a local snapshot of the original repo:

```bash
python scripts/convert_ace_step_to_diffusers.py \
  --checkpoint_dir /path/to/acestep-v15-repo \
  --dit_config acestep-v15-turbo \
  --output_dir /path/to/acestep-v15-diffusers \
  --dtype bf16
```

and then test it using the following script:

```python
import torch
import soundfile as sf

from diffusers import AceStepPipeline

OUTPUT_SAMPLE_RATE = 48000

model_id = "/path/to/acestep-v15-diffusers"
device = "cuda"
dtype = torch.bfloat16
seed = 42

pipe = AceStepPipeline.from_pretrained(model_id, torch_dtype=dtype)
pipe = pipe.to(device)

generator = torch.Generator(device=device).manual_seed(seed)

# Text-to-music generation
audio = pipe(
    prompt="A beautiful piano piece with soft melodies",
    lyrics="[verse]\nSoft notes in the morning light\n[chorus]\nMusic fills the air tonight",
    audio_duration=30.0,
    num_inference_steps=8,
    bpm=120,
    keyscale="C major",
    generator=generator,
).audios

sf.write("acestep_t2m.wav", audio[0, 0].cpu().numpy(), OUTPUT_SAMPLE_RATE)
```

I get the following sample (audio attachment). The sample quality is lower than expected, so there is probably a bug. Could you look into it?
@ChuxiJ are you still working on it? or, should we do a final review now? |
* [agents docs] add pipelines.md and restructure review rules
  - Add .ai/pipelines.md: pipeline conventions and gotchas (config-derived values, no_grad discipline, reinventing scheduler logic, subclassing variants, # Copied from annotations).
  - models.md: add Attention masks subsection inside Attention pattern; fold reference-implementations skim into conventions; consolidate __init__.py / _import_structure gotchas; trim gotchas covered by AGENTS.md (silent fallbacks, config serialization gap) or pipelines.md (no_grad, guider/scheduler reuse).
  - review-rules.md: collapse to a short reviewer checklist that points into AGENTS / models / pipelines / modular gotchas; only LLM-specific pattern (ephemeral context) lives here directly.
  - AGENTS.md: collapse defensive-code / unused-params / backwards-compat / deprecation rules into one umbrella bullet; replace inline pipeline bullet list with a pointer to pipelines.md.
  - SKILL.md (model-integration): trim pre-PR self-review to a one-line pointer.

  Sourced from the ACE-Step PR (#13095) review.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Apply suggestions from code review

  Co-authored-by: YiYi Xu <yixu310@gmail.com>
* Apply suggestion from @yiyixuxu
* Apply suggestions from code review

  Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
  Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* fix capability-flags gotcha: drop fake attrs, tighten to real failure modes

  `_supports_quantization` and `_supports_cache_class` don't exist in diffusers (sayak flagged the first; the second was also fabricated). Replaced with the two flags where the "advertised but unbacked" pattern is a real mistake: `_supports_gradient_checkpointing` (needs `if self.gradient_checkpointing:` branches in forward) and `_no_split_modules` (needs correct block class names for `device_map`). Dropped `_supports_group_offloading` — its realistic failure mode is forgetting to opt out, not opt in.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: yiyi@huggingface.co <yiyi@ip-26-0-160-103.ec2.internal>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
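For context, the `_supports_gradient_checkpointing` gotcha the last commit describes boils down to the pattern below. This is a minimal sketch with illustrative names, not the actual ACE-Step code: the flag is only meaningful if `forward()` actually branches on `self.gradient_checkpointing`.

```python
import torch
from torch.utils.checkpoint import checkpoint

# Sketch of the "advertised but unbacked" capability-flag gotcha.
# Class and attribute names here are illustrative only.
class ExampleBlockStack(torch.nn.Module):
    _supports_gradient_checkpointing = True

    def __init__(self, blocks):
        super().__init__()
        self.blocks = torch.nn.ModuleList(blocks)
        # Flipped to True by enable_gradient_checkpointing().
        self.gradient_checkpointing = False

    def forward(self, hidden_states):
        for block in self.blocks:
            if torch.is_grad_enabled() and self.gradient_checkpointing:
                # Recompute activations during backward to trade compute for memory.
                hidden_states = checkpoint(block, hidden_states, use_reentrant=False)
            else:
                hidden_states = block(hidden_states)
        return hidden_states
```

If the `if self.gradient_checkpointing:` branch is missing, the flag silently advertises a capability the model never exercises.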
@claude |
Claude finished @yiyixuxu's task in 2m 26s. Review of ACE-Step Pipeline PR

Good progress since the last round. Many prior review comments have been addressed: the backward-compat aliases are gone and PR-process comments are removed.

Blockers

1. Missing `_no_split_modules`
| Category | Count |
|---|---|
| Blockers | 4 |
| Non-blockers | 6 |
| Nits | 4 |
The main items to address before merge are: (1) add `_no_split_modules`, (2) remove the nested `torch.no_grad()`, (3) clean up `_variant_defaults` → Flux2 pattern, and (4) add audio-to-audio task tests. A sketch of (1) follows.
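For blocker (1), this is a minimal sketch of what declaring `_no_split_modules` could look like. The block class names are taken from the PR description; the exact list in the final code may differ.

```python
from diffusers import ConfigMixin, ModelMixin

# Sketch only: _no_split_modules tells accelerate's device_map logic which
# submodules must never be split across devices. The entries must match the
# actual block class names, otherwise device_map="auto" can shard a
# transformer block and break its residual connections.
class AceStepDiTModel(ModelMixin, ConfigMixin):
    _supports_gradient_checkpointing = True
    _no_split_modules = ["AceStepDiTLayer", "AceStepEncoderLayer"]
```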
yiyixuxu left a comment
Thanks! I left a few more comments; I think we don't have any merge blockers left.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
What does this PR do?
This PR adds the ACE-Step 1.5 pipeline to Diffusers — a text-to-music generation model that produces high-quality stereo music with lyrics at 48kHz from text prompts.
New Components

- `AceStepDiTModel` (`src/diffusers/models/transformers/ace_step_transformer.py`): A Diffusion Transformer (DiT) model with RoPE, GQA, sliding window attention, and flow matching for denoising audio latents. Includes custom components: `AceStepRMSNorm`, `AceStepRotaryEmbedding`, `AceStepMLP`, `AceStepTimestepEmbedding`, `AceStepAttention`, `AceStepEncoderLayer`, and `AceStepDiTLayer`.
- `AceStepConditionEncoder` (`src/diffusers/pipelines/ace_step/modeling_ace_step.py`): Condition encoder that fuses text, lyric, and timbre embeddings into a unified cross-attention conditioning signal. Includes `AceStepLyricEncoder` and `AceStepTimbreEncoder` sub-modules.
- `AceStepPipeline` (`src/diffusers/pipelines/ace_step/pipeline_ace_step.py`): The main pipeline supporting 6 task types:
  - `text2music` — generate music from text and lyrics
  - `cover` — generate from audio semantic codes or with timbre transfer via reference audio
  - `repaint` — regenerate a time region within existing audio
  - `extract` — extract a specific track (vocals, drums, etc.) from audio
  - `lego` — generate a specific track given audio context
  - `complete` — complete audio with additional tracks
- Conversion script (`scripts/convert_ace_step_to_diffusers.py`): Converts original ACE-Step 1.5 checkpoint weights to Diffusers format.

Key Features
- Task instructions built per task type via `_get_task_instruction`
- `bpm`, `keyscale`, `timesignature` parameters formatted into the SFT prompt template
- Source audio (`src_audio`) and reference audio (`reference_audio`) inputs with VAE encoding
- Tiled encoding (`_tiled_encode`) and decoding (`_tiled_decode`) for long audio
- `guidance_scale`, `cfg_interval_start`, and `cfg_interval_end` (primarily for base/SFT models; turbo models have guidance distilled into weights)
- `audio_cover_strength`
- `_parse_audio_code_string` extracts semantic codes from `<|audio_code_N|>` tokens for cover tasks (see the sketch after this list)
- `_build_chunk_mask` creates time-region masks for repaint/lego tasks (also sketched below)
- Custom `timesteps` support, with `shift=3.0`
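To make the two helpers above concrete, here is a hypothetical, simplified re-implementation of both ideas; the actual helpers live in the pipeline and likely handle more edge cases. Both function names below are illustrative, not the pipeline's API.

```python
import re
import torch

# Hypothetical sketch of the _parse_audio_code_string idea: pull the integer
# N out of each <|audio_code_N|> token, preserving order, for the cover task.
def parse_audio_codes(text: str) -> list[int]:
    return [int(n) for n in re.findall(r"<\|audio_code_(\d+)\|>", text)]

# Hypothetical sketch of the _build_chunk_mask idea: a 0/1 mask over latent
# frames marking the time region to regenerate for repaint/lego tasks.
def build_chunk_mask(total_frames: int, start_frame: int, end_frame: int) -> torch.Tensor:
    mask = torch.zeros(total_frames)
    mask[start_frame:end_frame] = 1.0
    return mask

assert parse_audio_codes("<|audio_code_12|><|audio_code_7|>") == [12, 7]
assert build_chunk_mask(10, 2, 5).tolist() == [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]
```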
Architecture

ACE-Step 1.5 comprises three main components:
Tests
- Pipeline tests (`tests/pipelines/ace_step/test_ace_step.py`):
  - `AceStepDiTModelTests` — forward shape, return dict, gradient checkpointing
  - `AceStepConditionEncoderTests` — forward shape, save/load config
  - `AceStepPipelineFastTests` (extends `PipelineTesterMixin`) — 39 tests covering basic generation, batch processing, latent output, save/load, float16 inference, CPU/model offloading, encode_prompt, prepare_latents, timestep_schedule, format_prompt, and more
- Model tests (`tests/models/transformers/test_models_transformer_ace_step.py`):
  - `TestAceStepDiTModel` (extends `ModelTesterMixin`) — forward pass, dtype inference, save/load, determinism
  - `TestAceStepDiTModelMemory` (extends `MemoryTesterMixin`) — layerwise casting, group offloading
  - `TestAceStepDiTModelTraining` (extends `TrainingTesterMixin`) — training, EMA, gradient checkpointing, mixed precision

All 70 tests pass (39 pipeline + 31 model).
Documentation
- `docs/source/en/api/pipelines/ace_step.md` — Pipeline API documentation with usage examples
- `docs/source/en/api/models/ace_step_transformer.md` — Transformer model documentation

Usage
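A minimal usage sketch, mirroring the reviewer's test script earlier in this thread (the local model path is a placeholder for a converted checkpoint):

```python
import torch
import soundfile as sf
from diffusers import AceStepPipeline

# Placeholder path: a checkpoint converted with scripts/convert_ace_step_to_diffusers.py
pipe = AceStepPipeline.from_pretrained(
    "/path/to/acestep-v15-diffusers", torch_dtype=torch.bfloat16
).to("cuda")

audio = pipe(
    prompt="A beautiful piano piece with soft melodies",
    lyrics="[verse]\nSoft notes in the morning light",
    audio_duration=30.0,
    num_inference_steps=8,
    generator=torch.Generator(device="cuda").manual_seed(42),
).audios

# Write the first channel of the first sample at the 48 kHz output rate.
sf.write("acestep_t2m.wav", audio[0, 0].cpu().numpy(), 48000)
```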
Before submitting
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
References