ACE-Step 1.5

ACE-Step 1.5 was introduced in "ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation" by the ACE-Step Team (ACE Studio and StepFun). It is an open-source music foundation model that generates commercial-grade stereo music with lyrics from text prompts.

ACE-Step 1.5 generates variable-length stereo audio at 48 kHz (10 seconds to 10 minutes) from text prompts and optional lyrics. The full system pairs a language model planner with a Diffusion Transformer (DiT) synthesizer; this pipeline wraps only the DiT half of that stack. It consists of three components: an [AutoencoderOobleck] VAE that compresses waveforms into 25 Hz stereo latents, a Qwen3-based text encoder for prompt and lyric conditioning, and an [AceStepTransformer1DModel] DiT that operates in the VAE latent space using flow matching.
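To make the two rates above concrete, the sketch below works out how many latent frames and output samples a clip of a given duration occupies. The helper name latent_shape_for_duration is hypothetical and the rounding is an assumption; the pipeline's own bookkeeping may differ.

```python
SAMPLE_RATE = 48_000       # Hz, output audio rate stated above
LATENTS_PER_SECOND = 25    # Hz, VAE latent frame rate stated above


def latent_shape_for_duration(seconds: float) -> dict:
    """Hypothetical helper: rough size bookkeeping for a clip of `seconds`."""
    if not 10 <= seconds <= 600:
        raise ValueError("documented range is 10 seconds to 10 minutes")
    return {
        # Number of latent frames the DiT would denoise (assumed rounding).
        "latent_frames": round(seconds * LATENTS_PER_SECOND),
        # Number of samples per channel in the decoded waveform.
        "audio_samples": round(seconds * SAMPLE_RATE),
        # Temporal compression: audio samples covered by one latent frame.
        "samples_per_latent": SAMPLE_RATE // LATENTS_PER_SECOND,
    }


info = latent_shape_for_duration(30.0)
print(info)
```

Each latent frame therefore covers 1920 audio samples per channel, which is why the DiT can work on sequences far shorter than the raw waveform.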

The model supports lyrics in more than 50 languages, including English, Chinese, Japanese, Korean, French, German, Spanish, Italian, Portuguese, and Russian. It runs on consumer GPUs (under 4 GB of VRAM when offloaded).

This pipeline was contributed by the ACE-Step Team. The original codebase can be found at ace-step/ACE-Step-1.5.

Variants

ACE-Step 1.5 ships three DiT checkpoints that share the same transformer architecture but differ in sampling recipe; the pipeline auto-detects the variant from the loaded transformer config and applies the matching defaults.

| Variant | CFG | Default steps | Default guidance_scale | HF repo |
|---|---|---|---|---|
| turbo (guidance-distilled) | off | 8 | 1.0 | ACE-Step/Ace-Step1.5 |
| base | on | 8 | 7.0 | ACE-Step/acestep-v15-base |
| sft | on | 8 | 7.0 | ACE-Step/acestep-v15-sft |

Base and SFT use the learned null_condition_emb as the unconditional input for guidance, which is applied as APG (adaptive projected guidance) rather than vanilla classifier-free guidance. Users commonly override num_inference_steps to 30–60 on base/sft for higher quality.
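The difference between vanilla CFG and a projection-based scheme can be sketched on plain vectors. This is a simplified illustration of the general APG idea (down-weighting the part of the guidance update that is parallel to the conditional prediction), not the pipeline's implementation: the function names and the eta parameter are assumptions, and the momentum and norm-rescaling terms of full APG are left out. Here cond stands for the prediction with the prompt, and uncond for the prediction obtained with the null condition.

```python
def vanilla_cfg(cond, uncond, scale):
    """Plain classifier-free guidance: extrapolate from uncond towards cond."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def projected_guidance(cond, uncond, scale, eta=0.0):
    """Simplified APG-style update (illustrative only).

    The guidance direction (cond - uncond) is split into a component parallel
    to cond and a component orthogonal to it; the parallel part is scaled by
    eta (0 removes it, 1 recovers vanilla CFG).
    """
    diff = [c - u for c, u in zip(cond, uncond)]
    norm_sq = sum(c * c for c in cond) or 1.0  # guard against an all-zero cond
    coeff = sum(d * c for d, c in zip(diff, cond)) / norm_sq
    parallel = [coeff * c for c in cond]
    orthogonal = [d - p for d, p in zip(diff, parallel)]
    update = [o + eta * p for o, p in zip(orthogonal, parallel)]
    return [c + (scale - 1.0) * u for c, u in zip(cond, update)]


cond = [1.0, 0.0]
uncond = [0.5, -0.5]
print(vanilla_cfg(cond, uncond, 7.0))
print(projected_guidance(cond, uncond, 7.0))
```

In this toy case the projected variant leaves the component along cond untouched and only pushes along the orthogonal direction, which is the intuition behind preferring it at high guidance scales.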

Tips

When constructing a prompt, keep in mind:

  • Descriptive prompt inputs work best; use adjectives to describe the music style, instruments, mood, and tempo.
  • The prompt should describe the overall musical characteristics (e.g., "upbeat pop song with electric guitar and drums").
  • Lyrics should be structured with tags like [verse], [chorus], [bridge], etc.
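As a small illustration of the tagged lyric layout, the sketch below assembles a lyrics string from (tag, lines) pairs. The helper format_lyrics is hypothetical and not part of the pipeline; it only shows one way to produce the newline-separated, tag-prefixed text.

```python
def format_lyrics(sections):
    """Hypothetical helper: join (tag, lines) pairs into a tagged lyrics string."""
    blocks = []
    for tag, lines in sections:
        blocks.append(f"[{tag}]")                      # section marker, e.g. [verse]
        blocks.extend(line.strip() for line in lines)  # one lyric line per row
    return "\n".join(blocks)


lyrics = format_lyrics([
    ("verse", ["Soft notes in the morning light", "Dancing through the air so bright"]),
    ("chorus", ["Music fills the air tonight", "Every note feels just right"]),
])
print(lyrics)
```

The resulting string can be passed as the lyrics argument of the pipeline call.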

During inference:

  • num_inference_steps, guidance_scale, and shift fall back to the variant-specific defaults shown above when left as None.
  • The audio_duration parameter controls the length of the generated music in seconds.
  • The vocal_language parameter should match the language of the lyrics.
  • pipe.sample_rate and pipe.latents_per_second are sourced from the VAE config (48000 Hz and 25 latent frames per second for the released checkpoints).

```python
import torch
import soundfile as sf
from diffusers import AceStepPipeline

# Load the turbo checkpoint in bfloat16 and move it to the GPU.
pipe = AceStepPipeline.from_pretrained("ACE-Step/Ace-Step1.5", torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

# Generate 30 seconds of audio; sampling arguments left unset use the variant defaults.
audio = pipe(
    prompt="A beautiful piano piece with soft melodies and gentle rhythm",
    lyrics="[verse]\nSoft notes in the morning light\nDancing through the air so bright\n[chorus]\nMusic fills the air tonight\nEvery note feels just right",
    audio_duration=30.0,
).audios

# Transpose the first clip to the (samples, channels) layout soundfile expects.
sf.write("output.wav", audio[0].T.cpu().float().numpy(), pipe.sample_rate)
```
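The None fallback described in the inference tips can be pictured with a small lookup. This is a sketch, not the pipeline's code: VARIANT_DEFAULTS and resolve_sampling_args are hypothetical names, the values are copied from the Variants table, and shift is left out because its defaults are not listed there.

```python
# Defaults copied from the Variants table (hypothetical structure).
VARIANT_DEFAULTS = {
    "turbo": {"num_inference_steps": 8, "guidance_scale": 1.0},
    "base": {"num_inference_steps": 8, "guidance_scale": 7.0},
    "sft": {"num_inference_steps": 8, "guidance_scale": 7.0},
}


def resolve_sampling_args(variant, num_inference_steps=None, guidance_scale=None):
    """Hypothetical helper: use the caller's value, else the variant default."""
    defaults = VARIANT_DEFAULTS[variant]
    return {
        "num_inference_steps": (
            defaults["num_inference_steps"]
            if num_inference_steps is None
            else num_inference_steps
        ),
        "guidance_scale": (
            defaults["guidance_scale"] if guidance_scale is None else guidance_scale
        ),
    }


turbo_args = resolve_sampling_args("turbo")
base_args = resolve_sampling_args("base", num_inference_steps=40)
print(turbo_args)
print(base_args)
```

Passing an explicit num_inference_steps, as in the second call, mirrors the common practice of raising the step count on base/sft while keeping the default guidance scale.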

AceStepPipeline

[[autodoc]] AceStepPipeline
	- all
	- __call__