
Harden shared cache lifecycle and add optional DiT tiling#551

Open
xmarre wants to merge 37 commits into numz:main from xmarre:main

Conversation


@xmarre xmarre commented Mar 15, 2026

Summary

This PR now has two parts:

  1. harden SeedVR2’s shared runner / DiT / VAE cache reuse path around claim-release, teardown ordering, and failure handling
  2. add optional spatial DiT tiling controls to reduce peak VRAM during the final SeedVR2 diffusion upscaling phase

The original freeze investigation turned out to be broader than a single early-release bug. The underlying problem was cache ownership: claimed cached runners/models could become reusable too early, get rewritten under stale references, or get evicted without proof that the current execution still owned that exact cached object.

This branch fixes that by making cache reuse identity-safe end to end.

In addition, this branch now exposes optional DiT tiling:

  • dit_tiled
  • dit_tile_size (latent-space pixels, default 128)
  • dit_tile_overlap (latent-space pixels, default 16)

When enabled, the final DiT inference phase runs in overlapping latent-space tiles instead of one full-frame pass. That is slower than full-frame DiT inference, but it provides a separate VRAM relief valve for large outputs / large crops where VAE tiling alone is not enough.

Root cause

The violated invariant was:

a claimed cached runner/template or cached model must remain exclusively owned until full outer cleanup has finished, the final reusable object identity has been safely written back into cache, and only that final published object is marked reusable

Earlier revisions still violated that invariant in several ways:

  • cached DiT/VAE claims could become reusable before outer cleanup had fully finished
  • finalization could rewrite the cache slot to a released post-cleanup model object, while later claim-release logic still referenced only the originally claimed object
  • newly cached models were not always fed back into the same claim/finalize/release bookkeeping
  • failure paths could remove cached runner templates without proving the cache still contained the same claimed runner instance

That created several bad windows:

  • another execution could begin reusing a model while teardown was still in progress
  • a refreshed cached model could remain stuck claimed forever because only the old pre-refresh object got unclaimed
  • one execution could remove a newer healthy runner/template published by another execution
  • newly inserted cached objects could miss the intended ownership-release path entirely

What changed

1. Cache ownership / teardown hardening

This branch now keeps the ownership rule consistent end to end:

  • claim the exact cached runner/model object for exclusive use
  • keep that claim through outer cleanup
  • refresh the cache with the final released post-cleanup object identity
  • only then mark that final published object reusable again
  • on failure, evict only if the cache still contains the same claimed object instance

Concretely, this includes:

  • explicit synchronized runner claiming
  • treating active cached runners as busy instead of invalid
  • identity-safe final publication of cached DiT/VAE objects after teardown
  • releasing the claim on the final refreshed object, not only the original claimed reference
  • threading newly cached models back into the same cache bookkeeping
  • atomic identity-safe runner taint / eviction on failure
  • cleanup coverage for claimed cached runners when setup aborts after claim acquisition
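The identity-safe claim/publish/evict rules above can be sketched as follows. This is a minimal illustration, not the PR's actual API: the class and method names here are hypothetical, while the real code uses helpers such as replace_dit(...)/replace_vae(...) with an expected_model guard.

```python
import threading

class ModelCache:
    """Hypothetical sketch of identity-guarded cache slots."""

    def __init__(self):
        self._lock = threading.Lock()
        self._slots = {}       # key -> cached model object
        self._claimed = set()  # id()s of exclusively claimed objects

    def insert(self, key, obj):
        with self._lock:
            self._slots[key] = obj

    def claim(self, key):
        # Claim the exact cached object for exclusive use; None if busy.
        with self._lock:
            obj = self._slots.get(key)
            if obj is None or id(obj) in self._claimed:
                return None    # active cached entries are busy, not invalid
            self._claimed.add(id(obj))
            return obj

    def publish(self, key, expected, final_obj):
        # Refresh the slot with the post-cleanup object only if the slot
        # still holds the instance we claimed (identity, not equality).
        with self._lock:
            if self._slots.get(key) is not expected:
                self._slots.pop(key, None)  # slot was republished elsewhere
                self._claimed.discard(id(expected))
                return False
            self._slots[key] = final_obj
            self._claimed.discard(id(expected))
            self._claimed.discard(id(final_obj))  # only now reusable again
            return True

    def evict_if_same(self, key, expected):
        # On failure, evict only if the cache still holds the same instance.
        with self._lock:
            if self._slots.get(key) is expected:
                del self._slots[key]
                self._claimed.discard(id(expected))
                return True
            return False
```

The identity comparisons (`is`, not `==`) are the point: a slot is only rewritten or evicted when it still contains the exact object this execution claimed.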

2. Cold-cache reactivation / reuse correctness

Cached models that stay in cold cached form now rebuild execution state correctly when reused, instead of assuming they are already fully live and configured for the current run.

This helps the cache path behave consistently across hot-cache reuse, cold-cache reactivation, and fresh insertion into cache.

3. Optional DiT tiling support

The DiT loader node now exposes optional tiling controls and passes them through the runner/inference path:

  • SeedVR2DiTModelLoader stores dit_tiled, dit_tile_size, and dit_tile_overlap
  • the main node forwards those settings into runner preparation
  • InferPipeline.inference() now supports tiled single-sample DiT inference with overlap blending

The implementation splits the latent spatial plane into overlapping tiles, runs the existing DiT inference path per tile, and blends tiles back together with edge-aware weights.

This is intended as a practical VRAM reduction tool for the final diffusion phase. It is not meant as a claim that tiled DiT is always quality-equivalent to full-frame DiT. Smaller tiles can reduce global consistency.
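As a rough illustration of the tiling-plus-blending idea (not the actual InferPipeline code, whose tile layout and weighting may differ), the latent plane can be split into overlapping tiles whose per-tile outputs are blended with linear ramp weights at interior seams:

```python
import numpy as np

def _ramp(n, overlap, fade_in, fade_out):
    # Edge-aware 1D weight: flat at image borders, linear ramp at seams.
    w = np.ones(n)
    r = np.linspace(0.0, 1.0, overlap + 2)[1:-1]  # strictly positive ramp
    if fade_in:
        w[:overlap] = r
    if fade_out:
        w[-overlap:] = r[::-1]
    return w

def tiled_infer(latent, infer_fn, tile=128, overlap=16):
    """Run infer_fn per overlapping spatial tile and blend the results.

    Sketch only; assumes tile > 2 * overlap and a channels-first latent.
    """
    h, w = latent.shape[-2:]
    out = np.zeros_like(latent)
    weight = np.zeros((h, w), dtype=latent.dtype)
    stride = tile - overlap
    for y0 in range(0, max(h - overlap, 1), stride):
        for x0 in range(0, max(w - overlap, 1), stride):
            y1, x1 = min(y0 + tile, h), min(x0 + tile, w)
            patch = infer_fn(latent[..., y0:y1, x0:x1])
            wy = _ramp(y1 - y0, overlap, y0 > 0, y1 < h)
            wx = _ramp(x1 - x0, overlap, x0 > 0, x1 < w)
            wgt = wy[:, None] * wx[None, :]
            out[..., y0:y1, x0:x1] += patch * wgt
            weight[y0:y1, x0:x1] += wgt
    return out / np.maximum(weight, 1e-8)  # normalize overlapped regions
```

With an identity `infer_fn`, the blended output reproduces the input exactly, which is a convenient sanity check that the weights cover the plane without seams.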

4. Supporting hardening around the affected path

This branch also includes smaller supporting corrections around the same execution path, including:

  • safer synchronization before critical teardown / device-move operations
  • removal of unsafe in-place storage invalidation during cleanup
  • cleanup/sync coverage for runtime tensors stored in module.memory
  • keeping reused modules attached to the current run’s debug object
  • target-dimension validation and Phase 4 reconstruction-path hardening in the affected workflow path

Scope / non-goals

This PR is about correctness of shared cache reuse plus optional DiT tiling support.

It does not change the intended residency policy of healthy cached models on the hot-cache path. In particular, if cache_model=True and offload_device=cuda, healthy cached SeedVR2 models may still remain GPU-resident by design.

Likewise, the new DiT tiling controls are optional. Full-frame DiT remains the default unless dit_tiled is enabled.

Expected impact

This should improve reliability in the paths that were previously most fragile:

  • repeated queued generations
  • cancel / requeue after prior work
  • overlapping executions sharing cache slots
  • setup-abort paths after a cached runner has already been claimed
  • high-VRAM offload_device=cuda cache-reuse sessions
  • WSL/CUDA sessions where teardown mistakes could escalate into a hard wedge

And it adds an additional user-facing VRAM control for the final upscaling phase:

  • optional DiT tiling with configurable tile size / overlap
  • overlap blending to reduce visible tile boundaries
  • a slower but lower-peak-memory alternative to full-frame DiT inference


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1d02d47f43


Comment thread src/core/model_configuration.py

xmarre commented Mar 15, 2026

I found one more ownership bug in the earlier revision and addressed it here.

The issue was that the model-cache claim could still be released too early from inside cleanup_dit() / cleanup_vae(). That left a window where another execution could begin reusing cached models before the owning execution had fully finished outer teardown.

This update moves claim release to the outer cleanup boundary, after complete_cleanup(...) has returned, in both the node execution path and the CLI path.

So the intended lifecycle is now:

  • cached runner/model is claimed for exclusive use
  • teardown runs while that claim is still held
  • healthy cached models are preserved in reusable cached form
  • only after full outer cleanup returns are cached models marked reusable again
  • interrupted setup/runtime paths evict or taint claimed cached entries instead of preserving potentially dirty state

The important fix is that claim/release boundaries now match the actual teardown lifetime, instead of exposing cached entries as reusable too early.

I’m still stress-testing queued reuse / cancel / requeue paths, but this is the first revision where the ownership boundary itself matches the real lifetime of teardown.

@xmarre xmarre changed the title Harden runner reuse and accelerator cleanup against async teardown races Fix early cached-model claim release during teardown Mar 16, 2026

xmarre commented Mar 20, 2026

Since my last comment, I found a couple more cache-lifecycle issues in the same area and updated the PR accordingly.

The earlier fix in f0376ea was still directionally right — model claims must not be released before outer teardown finishes — but the implementation needed a few more ownership fixes around what exact object gets published back into cache, what gets unclaimed afterward, and how reused runner templates are invalidated on failure.

What changed since then:

  • Healthy hot-cache reuse is preserved again.
    The model cache no longer rejects a reusable DiT/VAE just because it is not back in a fully cold CPU form yet. Reused runners also now reset their per-run cleanup phase flags correctly, so a valid hot-cache hit does not get treated like already-cleaned state.

  • Claimed model finalization is now identity-safe.
    After teardown, the cache is refreshed with the runner-held released model object using replace_dit(...) / replace_vae(...) guarded by expected_model=claimed_*. If the claimed entry no longer matches, that cache slot is removed instead of rewriting whatever happens to be there.

  • Runner-template failure eviction is now atomic.
    On exception paths, reused cached runners are no longer removed through a separate non-atomic sequence. The cache now does a single taint+remove operation on the exact expected runner template, which avoids one execution accidentally invalidating a newer template published by another execution.

  • Claim release now follows the final refreshed object, not just the originally claimed one.
    One remaining bug was that finalize could replace the cached entry with a different post-cleanup model object, while the finally block only unclaimed the original claimed reference. That could leave the refreshed cached object stuck in claimed state. The cleanup path now tracks the refreshed DiT/VAE and clears the claimed flag on those objects as well when they differ.

  • Newly cached models are now tracked too.
    If the current run is the one that first inserts the DiT/VAE into cache, those newly cached objects are now recorded back into cache_context so the same claim-release logic also applies to them.

So the ownership rule this PR now enforces is:

the cached runner/template and cached DiT/VAE stay exclusively owned by the current execution until full cleanup finishes, the final reusable object identities are written back into cache, and only those final published objects are marked reusable.

That is the behavior I was actually aiming for from the start. The freeze/hang risk still looks most plausible on the GPU-resident hot-cache path (offload_device=cuda), but the underlying fixes here are broader than just that one manifestation.


xmarre commented Mar 21, 2026

Quick update since my last PR comment.

The cache-lifecycle / ownership fixes added after the earlier teardown change are holding up much better now, including the cases where later SeedVR2 executions in the same workflow were previously tripping over claimed cache entries.

While stress-testing that, I found that the remaining freeze no longer lined up with the cache claim/reuse path itself. To narrow that down, I added the two breadcrumb commits now in this PR around:

  • SeedVR2 model preparation
  • video transform setup / pre-generation dimension planning

That let me reduce the remaining freeze window from “somewhere after cached reuse” down to a very small pre-generation path.

The latest repro made it all the way through:

  • cached runner reuse
  • cached DiT/VAE reuse
  • model preparation
  • text embedding load
  • memory logging
  • entry into compute_generation_info(...)
  • the resize-only probe inside setup_video_transform(...)

and then stopped specifically on the full cached ctx['video_transform'](sample_frame) call that was being used only to derive padded target dimensions before generation started.

So at this point, the remaining hang looks like a separate issue from the earlier cache ownership bugs. The current evidence points at the pre-generation transform/dimension-planning path, not at the claimed cache lifecycle itself.

Based on that, I changed the dimension-planning path so it no longer runs the full live transform pipeline on a real sample tensor just to compute padded dimensions. It now keeps the resize-only probe needed to determine the resized target size, computes the padded dimensions directly from that resized shape, caches those dims in context, and clears them again during normal cleanup.
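Deriving the padded target dimensions arithmetically might look like the sketch below. The resize policy (scale by the short side) and the padding multiple are assumptions for illustration; SeedVR2's actual rules may differ.

```python
import math

def planned_dims(src_h, src_w, target_short, multiple=16):
    """Compute resized and padded dims without running the live transform.

    Hypothetical helper: assumes short-side scaling and padding each
    dimension up to the next multiple of `multiple`.
    """
    scale = target_short / min(src_h, src_w)
    rh, rw = round(src_h * scale), round(src_w * scale)
    ph = math.ceil(rh / multiple) * multiple   # pad up to model multiple
    pw = math.ceil(rw / multiple) * multiple
    return (rh, rw), (ph, pw)
```

The point of the change is that this is pure arithmetic on shapes, so no sample tensor ever has to flow through the full transform pipeline just to plan dimensions.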

I’m keeping the breadcrumbs in for now.

The latest patch is behaving well so far, but this freeze can take a while to reproduce, so I want a longer stress-test / soak period before removing the breadcrumbs and before claiming that the remaining bug is fully squashed.

If the current testing stays clean, I’ll do one final cleanup pass afterward to remove the temporary breadcrumbs and keep only the actual correctness fixes.

@xmarre xmarre changed the title Fix early cached-model claim release during teardown Harden shared cache lifecycle and add optional DiT tiling Apr 17, 2026
xmarre and others added 2 commits April 17, 2026 09:38
…s-handoff

Remove temporary SeedVR2 breadcrumb tracing from generation paths
