ACE-Step 1.5

ACE-Step 1.5 was introduced in "ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation" by the ACE-Step Team (ACE Studio and StepFun). It is an open-source music foundation model that generates commercial-grade stereo music with lyrics from text prompts.

ACE-Step 1.5 generates variable-length stereo audio, from 10 seconds to 10 minutes, at 48 kHz from text prompts and optional lyrics. The full system pairs a Language Model planner with a Diffusion Transformer (DiT) synthesizer; this pipeline wraps only the DiT half of that stack. It consists of three components: an [AutoencoderOobleck] VAE that compresses waveforms into 25 Hz stereo latents, a Qwen3-based text encoder for prompt and lyric conditioning, and an [AceStepTransformer1DModel] DiT that operates in the VAE latent space using flow matching.
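The relationship between clip duration, waveform length, and latent length follows from the two rates above. The sketch below works through that arithmetic. The function name `latent_shape_for` is hypothetical, and rounding a fractional frame count up is an assumption rather than confirmed pipeline behavior.

```python
import math

SAMPLE_RATE = 48_000      # Hz, waveform rate stated above
LATENTS_PER_SECOND = 25   # Hz, VAE latent rate stated above


def latent_shape_for(duration_s: float) -> tuple[int, int, int]:
    """Return (latent_frames, waveform_samples, samples_per_latent_frame)."""
    # Each latent frame covers 48000 / 25 = 1920 waveform samples.
    samples_per_frame = SAMPLE_RATE // LATENTS_PER_SECOND
    # Assumption: a partial trailing frame is rounded up to a whole frame.
    latent_frames = math.ceil(duration_s * LATENTS_PER_SECOND)
    waveform_samples = latent_frames * samples_per_frame
    return latent_frames, waveform_samples, samples_per_frame


print(latent_shape_for(30.0))  # (750, 1440000, 1920)
```

Under these assumptions, a 30-second clip corresponds to 750 latent frames, and the 10-minute upper bound to 15,000.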

The model supports 50+ languages for lyrics, including English, Chinese, Japanese, Korean, French, German, Spanish, Italian, Portuguese, and Russian. It runs on consumer GPUs (under 4 GB of VRAM when offloaded).

This pipeline was contributed by the ACE-Step Team. The original codebase can be found at ace-step/ACE-Step-1.5.

Variants

ACE-Step 1.5 ships three DiT checkpoints that share the same transformer architecture but differ in sampling recipe; the pipeline auto-detects the variant from the loaded transformer config and applies the matching defaults.

| Variant | CFG | Default steps | Default `guidance_scale` | HF repo |
|---|---|---|---|---|
| `turbo` (guidance-distilled) | off | 8 | 1.0 | `ACE-Step/Ace-Step1.5` |
| `base` | on | 8 | 7.0 | `ACE-Step/acestep-v15-base` |
| `sft` | on | 8 | 7.0 | `ACE-Step/acestep-v15-sft` |

The `base` and `sft` variants use the learned `null_condition_emb` as the unconditional input for classifier-free guidance, which is applied as APG rather than vanilla CFG. Users commonly raise `num_inference_steps` to 30–60 on `base`/`sft` for higher quality.
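The way per-variant defaults are applied can be pictured as a lookup keyed by variant name, with explicit arguments taking precedence. This is an illustrative sketch only, not the pipeline's actual code: `VARIANT_DEFAULTS` and `resolve_sampling_args` are hypothetical names, and the values are copied from the table above.

```python
# Hypothetical lookup table; values copied from the variants table above.
VARIANT_DEFAULTS = {
    "turbo": {"num_inference_steps": 8, "guidance_scale": 1.0},
    "base": {"num_inference_steps": 8, "guidance_scale": 7.0},
    "sft": {"num_inference_steps": 8, "guidance_scale": 7.0},
}


def resolve_sampling_args(variant, num_inference_steps=None, guidance_scale=None):
    """Fill in any argument left as None with the variant's default."""
    defaults = VARIANT_DEFAULTS[variant]
    if num_inference_steps is None:
        num_inference_steps = defaults["num_inference_steps"]
    if guidance_scale is None:
        guidance_scale = defaults["guidance_scale"]
    return {
        "num_inference_steps": num_inference_steps,
        "guidance_scale": guidance_scale,
    }


# Defaults for turbo, and a higher step count override for base.
print(resolve_sampling_args("turbo"))
print(resolve_sampling_args("base", num_inference_steps=50))
```

Checking against `None` rather than truthiness keeps an explicit value from being silently replaced by a default.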

Tips

When constructing a prompt, keep in mind:

- Descriptive prompt inputs work best; use adjectives to describe the music style, instruments, mood, and tempo.
- The prompt should describe the overall musical characteristics (e.g., "upbeat pop song with electric guitar and drums").
- Lyrics should be structured with tags like `[verse]`, `[chorus]`, `[bridge]`, etc.
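Tagged lyrics are easy to assemble programmatically. The following is a minimal sketch, assuming that tags and lines are joined with newlines as in the usage example in this document; `format_lyrics` is a hypothetical helper, not part of the pipeline.

```python
def format_lyrics(sections):
    """Join (tag, lines) pairs into a single tagged lyrics string."""
    parts = []
    for tag, lines in sections:
        parts.append(f"[{tag}]")                        # section tag on its own line
        parts.extend(line.strip() for line in lines)    # one lyric line per row
    return "\n".join(parts)


lyrics = format_lyrics([
    ("verse", ["Soft notes in the morning light", "Dancing through the air so bright"]),
    ("chorus", ["Music fills the air tonight", "Every note feels just right"]),
])
print(lyrics)
```

The resulting string can be passed as the `lyrics` argument of the pipeline call.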

During inference:

- `num_inference_steps`, `guidance_scale`, and `shift` fall back to the variant-specific defaults shown above when left as `None`.
- The `audio_duration` parameter controls the length of the generated music in seconds.
- The `vocal_language` parameter should match the language of the lyrics.
- `pipe.sample_rate` and `pipe.latents_per_second` are sourced from the VAE config (48000 Hz and 25 fps for the released checkpoints).

import torch
import soundfile as sf
from diffusers import AceStepPipeline

pipe = AceStepPipeline.from_pretrained("ACE-Step/Ace-Step1.5", torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

audio = pipe(
    prompt="A beautiful piano piece with soft melodies and gentle rhythm",
    lyrics="[verse]\nSoft notes in the morning light\nDancing through the air so bright\n[chorus]\nMusic fills the air tonight\nEvery note feels just right",
    audio_duration=30.0,
).audios

sf.write("output.wav", audio[0].T.cpu().float().numpy(), pipe.sample_rate)

AceStepPipeline

[[autodoc]] AceStepPipeline
	- all
	- __call__
