Feature: selectable full-frame backgrounds for the compositor via a shared
BackgroundSource API (ensure() / background_frame(t) / close()).
See also docs/technical/background-stills.md for the SDXL keyframe cache and
interpolation details, and docs/technical/background-keyframes-editor.md for
the Gradio timeline (edit timing/prompts, regen, uploads).
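The shared API can be pictured as a small structural protocol. The following is only a sketch: the three method names come from this document, but the argument and return types, and the toy `SolidColorBackground` class, are assumptions made for illustration and not the project's actual signatures.

```python
from typing import Any, Callable, Optional, Protocol, runtime_checkable


@runtime_checkable
class BackgroundSource(Protocol):
    """Sketch of the shared API; exact signatures are assumed."""

    def ensure(self, progress: Optional[Callable[[float], None]] = None) -> None:
        """Build or load whatever the source needs before frames are sampled."""

    def background_frame(self, t: float) -> Any:
        """Return the full-frame background for time t (in seconds)."""

    def close(self) -> None:
        """Release models, file handles, and VRAM."""


class SolidColorBackground:
    """Hypothetical toy source, used only to show the call order."""

    def __init__(self, width: int, height: int, rgb=(0, 0, 0)):
        self.width, self.height, self.rgb = width, height, rgb
        self._ready = False

    def ensure(self, progress=None):
        self._ready = True
        if progress is not None:
            progress(1.0)  # nothing to generate, so report completion at once

    def background_frame(self, t):
        if not self._ready:
            raise RuntimeError("call ensure() before background_frame()")
        # Rows of RGB tuples stand in for a real image array.
        return [[self.rgb] * self.width for _ in range(self.height)]

    def close(self):
        self._ready = False
```

The compositor only needs the call order `ensure()`, then any number of `background_frame(t)` calls, then `close()`, regardless of which concrete source is behind it.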
Status: AnimateDiff black-frame regression fixed. The SDXL stock VAE silently NaNs in fp16 once the (16-frame × latent) tensor passes through decode, which surfaced as all-black PNGs. Fixed by loading
`madebyollin/sdxl-vae-fp16-fix` (Apache-2.0, ~330 MB, drop-in replacement) and assigning it to `pipe.vae` after pipeline construction in `_load_pipe`. The default constructor uses `DEFAULT_FP16_VAE_ID = "madebyollin/sdxl-vae-fp16-fix"`; override via the `vae_id=` kwarg. Old (v2 / no-`vae_id`) caches automatically invalidate because `vae_id` participates in `_prompt_hash_segments`.
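The VAE swap described above might look roughly like the following. This is a hedged sketch, not the body of `_load_pipe` (which is not shown here): the helper name `swap_in_fp16_safe_vae` is hypothetical, and it assumes `pipe` is an already constructed diffusers pipeline. `AutoencoderKL.from_pretrained` with `torch_dtype` is the standard diffusers loader.

```python
DEFAULT_FP16_VAE_ID = "madebyollin/sdxl-vae-fp16-fix"


def swap_in_fp16_safe_vae(pipe, vae_id=DEFAULT_FP16_VAE_ID):
    """Replace pipe.vae with an fp16-safe VAE (hypothetical helper name)."""
    # Imported lazily so this module can be imported without a GPU stack.
    import torch
    from diffusers import AutoencoderKL

    vae = AutoencoderKL.from_pretrained(vae_id, torch_dtype=torch.float16)
    # Assign after pipeline construction, on the same device as the pipeline.
    pipe.vae = vae.to(pipe.device)
    return pipe
```

Because the swap happens after construction, the rest of the pipeline (UNet, text encoders, scheduler) is unaffected; only the decode path changes.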
| Canonical ID | Module / class | Notes |
|---|---|---|
| `sdxl-stills` | `pipeline.background_stills.BackgroundStills` | Default. `manifest.json` + `keyframe_*.png`. Optional RIFE morph (`sdxl_rife_morph`, `rife_exp`); see `rife-morph-background.md`. |
| `static-kenburns` | `pipeline.background_kenburns.StaticKenBurnsBackground` | `manifest_static_kenburns.json`, `source.<ext>`; RMS from `analysis.json` drives zoom/pan/tilt. |
| `animatediff` | `pipeline.background_animatediff.AnimateDiffBackground` | AnimateDiff seeded by SDXL stills. Wraps a `BackgroundStills` instance as `init_image_source` and uses each closest SDXL keyframe as the init latent for the matching AnimateDiff segment: the output is the SDXL still, animated. No sample-time blending / overlay. `manifest_animatediff.json` (schema v2), `anim_{seg}_{f}.png`; requires CUDA + `diffusers` with `AnimateDiffSDXLPipeline`. Loads `madebyollin/sdxl-vae-fp16-fix` for fp16-safe decode. |
- Source modules: `pipeline/background.py` (protocol + factory), `background_stills.py`, `background_kenburns.py`, `background_animatediff.py`; UI/inputs: `app.py`, `orchestrator.py` (`background_mode`, `static_background_image`, `sdxl_ken_burns`, `sdxl_rife_morph`, `rife_exp` for stills).
- `pipeline.background.create_background_source(...)` returns a concrete source (`sdxl_rife_morph` / `rife_exp` apply only to `sdxl-stills`).
- `pipeline.background.normalize_background_mode` maps UI labels to canonical IDs (raises on unknown).
- `OrchestratorInputs.background_mode` and `OrchestratorInputs.static_background_image` carry user choices into full render / preview (non-cache-key metadata).
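As an illustration of the label-to-ID mapping, here is a minimal sketch. The three canonical IDs are taken from the table above; the UI label strings, the `ValueError` type, and the `_sketch` function name are assumptions, since the real labels and exception live in `app.py` / `pipeline/background.py`.

```python
CANONICAL_MODES = ("sdxl-stills", "static-kenburns", "animatediff")

# Hypothetical UI labels; the real Gradio labels are not listed in this doc.
_UI_LABEL_TO_MODE = {
    "sdxl stills": "sdxl-stills",
    "static image (ken burns)": "static-kenburns",
    "animatediff": "animatediff",
}


def normalize_background_mode_sketch(value: str) -> str:
    """Map a UI label or canonical ID to a canonical ID; raise on unknown."""
    key = value.strip().lower()
    if key in CANONICAL_MODES:
        return key
    try:
        return _UI_LABEL_TO_MODE[key]
    except KeyError:
        # No silent fallback: an unknown mode is an explicit error.
        raise ValueError(f"unknown background mode: {value!r}") from None
```

Raising instead of defaulting to `sdxl-stills` keeps a typo in the UI wiring from silently rendering the wrong background.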
Where it applies
- `static-kenburns`: entire plate is RMS-driven Ken Burns (`StaticKenBurnsBackground`).
- `sdxl-stills`: optional; when `sdxl_ken_burns` is on, each sampled still (interpolated and/or RIFE-morphed) passes through `apply_ken_burns_to_rgb_array` (`BackgroundStills`).
Motion recipe
- Progress `u` from smoothstep(`t/duration`), normalized RMS from `analysis.json` (`rms.values` at RMS fps inside `_rms_envelope_stats` / `_interp_scalar_series`).
- `_ken_burns_transform` pans/zooms/tilts a crop from `_cover_canvas` scaling (`DEFAULT_MARGIN` 1.38 by default → headroom; override via `ken_burns_margin` in the factory / orchestrator wiring).
- Effects timeline optional `ken_burns_rms_automation`: RMS drive is `sdxl_ken_burns_rms_reactivity` × interpolated envelope (see `docs/technical/effects-timeline.md`, `docs/technical/orchestrator-effects-timeline-wiring.md`).
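The two scalar inputs of the recipe can be sketched in plain Python. This is an illustration under assumptions: the function names are hypothetical, the cubic smoothstep is the common `3x² − 2x³` form, and min-max normalization of the RMS series is a guess at what `_rms_envelope_stats` does, not a description of it.

```python
def smoothstep(x: float) -> float:
    """Classic cubic smoothstep on [0, 1], clamped outside that range."""
    x = min(1.0, max(0.0, x))
    return x * x * (3.0 - 2.0 * x)


def interp_series(values, fps: float, t: float) -> float:
    """Linearly interpolate samples taken at `fps` at time t, clamped to the ends."""
    if not values:
        raise ValueError("empty series")
    pos = min(max(t * fps, 0.0), len(values) - 1)
    i = int(pos)
    j = min(i + 1, len(values) - 1)
    frac = pos - i
    return values[i] * (1.0 - frac) + values[j] * frac


def normalized_rms(values, fps: float, t: float) -> float:
    """Min-max normalize the interpolated RMS (normalization scheme assumed)."""
    lo, hi = min(values), max(values)
    if hi <= lo:
        return 0.0
    return (interp_series(values, fps, t) - lo) / (hi - lo)


def ken_burns_drivers(t, duration, rms_values, rms_fps, reactivity=1.0):
    """Return (progress u, RMS drive) for time t."""
    u = smoothstep(t / duration)
    drive = reactivity * normalized_rms(rms_values, rms_fps, t)
    return u, drive
```

Here `u` would steer the slow pan/zoom path across the whole song, while the RMS drive adds loudness-reactive motion on top, scaled by the reactivity setting.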
Post-rotate edge trim (zoom-in)
Tilt calls Pillow `rotate(..., expand=True, fillcolor=black)`. Resizing the full expanded bounding box onto the compositor rectangle used to shear black padding into thin edge slivers. The pipeline now calls `_crop_center_cover_resize` after rotate: a centered crop matching the output aspect ratio, zoomed in by `ROT_TRIM_OVERSCAN` (default ~1.032×) relative to the maximal inscribed crop, followed by a resize to the final width × height. Increase `ROT_TRIM_OVERSCAN` slightly in code if traces remain under extreme RMS.
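The crop geometry can be sketched as follows. This is not the body of `_crop_center_cover_resize`: the function name and parameters are hypothetical, and it assumes the size of the padding-free inscribed region (`inscribed_w`, `inscribed_h`) has already been computed for the current tilt angle.

```python
def center_crop_box(src_w, src_h, inscribed_w, inscribed_h,
                    out_w, out_h, overscan=1.032):
    """Return a centered (left, top, right, bottom) crop box.

    src_*       : size of the rotated, expanded image
    inscribed_* : size of the padding-free region (assumed precomputed)
    out_*       : final compositor size, used only for its aspect ratio
    overscan    : >= 1.0; shrinks the crop so the result is zoomed in slightly
    """
    if overscan < 1.0:
        raise ValueError("overscan must be >= 1.0")
    # Largest out_w:out_h rectangle that fits inside the inscribed region.
    if inscribed_w * out_h > inscribed_h * out_w:
        crop_h = float(inscribed_h)
        crop_w = crop_h * out_w / out_h
    else:
        crop_w = float(inscribed_w)
        crop_h = crop_w * out_h / out_w
    # Zoom in a little more to hide any residual black traces at the edges.
    crop_w /= overscan
    crop_h /= overscan
    left = (src_w - crop_w) / 2.0
    top = (src_h - crop_h) / 2.0
    return (round(left), round(top), round(left + crop_w), round(top + crop_h))
```

With Pillow, the box would then be applied as `img.crop(box).resize((out_w, out_h))`, so the black fill from `expand=True` never reaches the compositor.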
AnimateDiff uses a dedicated motion-prompt builder (_build_motion_prompt in
pipeline/background_animatediff.py), separate from the SDXL stills keyframe
builder (_build_keyframe_prompt in background_stills.py):
- Stills append structural hints (`scene N of M`, `song section K of C`, `t=X.Xs`) to diversify still keyframes; harmless for image diffusion.
- AnimateDiff skips structural hints because the motion adapter treats them as
  content and drifts off-topic. Instead every loop gets:
  - The scene prompt from the orchestrator (Gradio Visual style textbox, or the shader's built-in example when empty).
  - A motion flavor: for cache ids `style-<shader_stem>`, from `pipeline.visual_style.motion_flavor_for_style_preset`; otherwise from legacy `MOTION_FLAVORS` by preset id, or `DEFAULT_MOTION_FLAVOR` when unknown.
  - A pacing cue that varies by song position: `establishing shot, slow motion` in the first quartile, `steady motion` through the middle, `slower fade-out motion` in the last quartile (`_pacing_cue`).
  - A short quality tail (`cinematic, high detail, coherent frames`).
- `DEFAULT_NUM_INFERENCE_STEPS = 35` (bumped from the stills default of 28). The SDXL-beta motion adapter needs more denoising steps before temporal attention settles; below ~32 steps, frames often look soft or ghosted.
- `ANIMATEDIFF_NEGATIVE_PROMPT` extends the stills `DEFAULT_NEGATIVE_PROMPT` with motion-specific failure terms: `static frame, frozen motion, stutter, duplicate frames, jerky camera, hard cut, scene cut, flickering, morphing shapes, distorted proportions, rolling shutter`.
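The prompt recipe above can be sketched as a small pure function. The cue strings and the quality tail are quoted from this document; everything else is an assumption for illustration, including the function names, the exact quartile boundaries (0.25 / 0.75 of song position), the comma-joined layout, and the placeholder value of `DEFAULT_MOTION_FLAVOR`.

```python
# Placeholder value; the real DEFAULT_MOTION_FLAVOR string is not given here.
DEFAULT_MOTION_FLAVOR = "gentle drifting camera"
QUALITY_TAIL = "cinematic, high detail, coherent frames"


def pacing_cue(position: float) -> str:
    """Pick a pacing cue from normalized song position in [0, 1]."""
    if position < 0.25:
        return "establishing shot, slow motion"
    if position >= 0.75:
        return "slower fade-out motion"
    return "steady motion"


def build_motion_prompt(scene_prompt: str, segment_start: float, duration: float,
                        motion_flavor: str = DEFAULT_MOTION_FLAVOR) -> str:
    """Join scene prompt, motion flavor, pacing cue, and quality tail."""
    position = segment_start / duration if duration > 0 else 0.0
    parts = [scene_prompt.strip(), motion_flavor, pacing_cue(position), QUALITY_TAIL]
    # No structural hints (scene N of M, t=X.Xs): only content and motion terms.
    return ", ".join(p for p in parts if p)
```

For example, `build_motion_prompt("neon city", 0.0, 100.0, "slow pan")` yields a prompt that opens with the scene text and ends with the quality tail, with no scene counters for the motion adapter to misread as content.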
Any change to prompts, motion flavors, inference steps, negative prompt,
model ids, VAE id, init-image keyframe times, or resolution bumps the
hash or schema and invalidates existing manifest_animatediff.json files.
Phase 4 ships schema v2; v1 caches are ignored by matches_key and
regenerated on first run. The VAE-fix patch adds vae_id to
_prompt_hash_segments, so any cache produced before the fp16-safe VAE
swap (which would have contained black frames) automatically regenerates.
The cross-segment morph patch hashes each segment as the pair
"<prompt>|||<prompt_2>" so toggling prompt travel also invalidates the
cache. The init-image path hashes init_key as
img2img-v1|s=<strength>|t=<comma-separated keyframe times> (or
init=none when no stills cache), so changing img2img strength, keyframe
timing, or disabling stills invalidates cleanly.
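To make the invalidation rules concrete, here is a minimal sketch of how such a key could be assembled. The `|||` pair separator and the `img2img-v1|s=…|t=…` / `init=none` layouts come from the text above; the function names, the three-decimal number formatting, the choice of SHA-256, and the exact set of hashed fields are assumptions and will not reproduce the real manifest hash.

```python
import hashlib
from typing import Optional, Sequence


def init_key(strength: float, keyframe_times: Optional[Sequence[float]]) -> str:
    """Key for the init-image path; 'init=none' when no stills cache exists."""
    if not keyframe_times:
        return "init=none"
    times = ",".join(f"{t:.3f}" for t in keyframe_times)  # precision assumed
    return f"img2img-v1|s={strength:.3f}|t={times}"


def manifest_hash(prompts: Sequence[str], prompts_2: Sequence[str], vae_id: str,
                  num_inference_steps: int, init_key_value: str) -> str:
    """Hash every input whose change should invalidate the cache."""
    # Each segment contributes the pair "<prompt>|||<prompt_2>".
    segments = [f"{p}|||{p2}" for p, p2 in zip(prompts, prompts_2)]
    h = hashlib.sha256()
    for part in [*segments, vae_id, str(num_inference_steps), init_key_value]:
        h.update(part.encode("utf-8"))
        h.update(b"\x00")  # separator so adjacent fields cannot run together
    return h.hexdigest()
```

Because every tunable is folded into one digest, a cache check reduces to comparing the stored hash with a freshly computed one.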
AnimateDiffBackground accepts an optional
init_image_source: BackgroundSource. The factory wires a BackgroundStills
instance as the source whenever mode == "animatediff", so the user-facing
behaviour matches the spec:
load SDXL → make stills → dump SDXL → load AnimateDiff → animate the stills.
There is no sample-time blending or overlay. The AnimateDiff frames that go
to disk are the output; the SDXL still seeds img2img-style temporal
diffusion (see render step 5 below). Segment count follows analysis.json
musical sections (_segments_from_analysis), not SDXL keyframe count — you may
see ~8 AnimateDiff clips on a long song because the beat/spectrogram segmenter
produced eight sections; each gets one 16-frame loop seeded from the nearest
SDXL keyframe in time.
1. `ensure(progress)` runs the init-image source first with progress mapped to `[0.0, 0.4]`. This generates `keyframe_*.png` + `manifest.json` under `cache/<song_hash>/background/`.
2. The init source is closed inline (so its SDXL pipeline releases its
   ~5 GB of VRAM) and its `keyframe_times` are snapshotted for disk lookup.
3. AnimateDiff loads its pipeline (also fp16, also gets the fp16-safe VAE
   swap) with progress mapped to `[0.4, 1.0]`.
4. For each musical section, `_init_image_for_segment` picks the SDXL keyframe PNG whose time is closest to the segment start.
5. Img2img init (critical): Passing raw VAE latents as `pipe(latents=…)` does not preserve the still. `AnimateDiffSDXLPipeline.prepare_latents` assumes Gaussian noise scaled by `init_noise_sigma`, then runs the full timestep schedule, so clean latents land in the wrong diffusion state and the UNet hallucinates unrelated shapes (random smears / colors unrelated to the SDXL PNG). We instead mirror `StableDiffusionXLImg2ImgPipeline` inside `_animatediff_img2img_generate`: slice `scheduler.timesteps` by `init_image_strength`, run `scheduler.add_noise` on the VAE-encoded still at the first kept timestep (latent tiled across `num_frames`), align latent scale with `init_noise_sigma` (`1.0` on DDIM), then run only the suffix of the schedule in the same UNet loop as upstream AnimateDiff SDXL.
6. `init_image_strength` defaults to `DEFAULT_INIT_IMAGE_STRENGTH` (0.38). Higher ⇒ more denoising steps / more motion / freer deviation from the still; lower ⇒ sticks closer to the keyframe. Allowed range on the class: `[0.05, 1.0]`.
7. If the SDXL cache is missing, `_init_image_for_segment` returns `None` and the segment falls back to plain text-to-video (logged).
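Two pieces of the flow above are simple enough to sketch without a GPU: choosing the nearest keyframe and slicing the timestep schedule by strength. The function names are hypothetical. The slicing arithmetic follows the usual diffusers img2img convention (`int(num_steps * strength)` steps are kept from the end of the schedule) and assumes a first-order scheduler. Noise injection via `scheduler.add_noise`, latent tiling, and the UNet loop are deliberately left out.

```python
from typing import Optional, Sequence


def nearest_keyframe_index(keyframe_times: Sequence[float],
                           segment_start: float) -> Optional[int]:
    """Index of the keyframe closest in time to the segment start, or None."""
    if not keyframe_times:
        return None  # no stills cache: caller falls back to text-to-video
    return min(range(len(keyframe_times)),
               key=lambda i: abs(keyframe_times[i] - segment_start))


def kept_timesteps(timesteps: Sequence[int], strength: float) -> Sequence[int]:
    """Return the suffix of the schedule that img2img actually runs.

    `timesteps` is the scheduler's descending schedule after set_timesteps.
    """
    if not 0.05 <= strength <= 1.0:
        raise ValueError("strength must be within [0.05, 1.0]")
    num_steps = len(timesteps)
    init_steps = min(int(num_steps * strength), num_steps)
    t_start = max(num_steps - init_steps, 0)
    # The still is noised to timesteps[t_start]; denoising starts from there.
    return timesteps[t_start:]
```

With 35 inference steps and the default strength of 0.38, `int(35 * 0.38)` is 13, so only the last 13 steps run. That is why a low strength keeps the loop close to the still, while `strength=1.0` runs the full schedule and behaves like text-to-video.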
Each AnimateDiff segment also passes the next segment's motion prompt
to SDXL's second text encoder via pipe(prompt=..., prompt_2=...). The
dual-encoder design conditions the loop on a blend of the two, softening
section boundaries on top of the init-image seed. The mapping lives in
_prompt_2_for_index and participates in the manifest hash. The last
segment uses its own prompt as prompt_2 (no morph past the song end).
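The mapping is small enough to show as a sketch. The function name mirrors `_prompt_2_for_index`, but the signature and the `IndexError` on out-of-range input are assumptions.

```python
from typing import Sequence


def prompt_2_for_index(prompts: Sequence[str], index: int) -> str:
    """Return the next segment's prompt, or the segment's own for the last one."""
    if not 0 <= index < len(prompts):
        raise IndexError(f"segment index out of range: {index}")
    nxt = index + 1
    return prompts[nxt] if nxt < len(prompts) else prompts[index]


def hashed_pairs(prompts: Sequence[str]) -> list:
    """The '<prompt>|||<prompt_2>' strings that feed the manifest hash."""
    return [f"{p}|||{prompt_2_for_index(prompts, i)}" for i, p in enumerate(prompts)]
```

For three segments with prompts `a`, `b`, `c`, the pairs are `a|||b`, `b|||c`, and `c|||c`, so editing any one section's prompt changes the hash input of that segment and of the one before it.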
- Drop `init_image_source` from the factory for classic text-to-video AnimateDiff (no SDXL pre-pass).
- Adjust `init_image_strength` on `AnimateDiffBackground` when constructing the source (not exposed in Gradio yet); it participates in `init_key` so caches regenerate when it changes.
- `init_image_source` is not held after `ensure()` returns; keyframes are read from disk per segment.
No silent fallback between modes: missing uploads, missing analysis.json, missing
CUDA (AnimateDiff / SDXL), missing diffusers motion stack, or OOM all surface as
explicit exceptions per project policy.