
Harden shared cache lifecycle and add optional DiT tiling#551

Open
xmarre wants to merge 37 commits into numz:main from xmarre:main

Conversation


@xmarre xmarre commented Mar 15, 2026

Summary

This PR now has two parts:

  1. harden SeedVR2’s shared runner / DiT / VAE cache reuse path around claim-release, teardown ordering, and failure handling
  2. add optional spatial DiT tiling controls to reduce peak VRAM during the final SeedVR2 diffusion upscaling phase

The original freeze investigation turned out to be broader than a single early-release bug. The underlying problem was cache ownership: claimed cached runners/models could become reusable too early, get rewritten under stale references, or get evicted without proof that the current execution still owned that exact cached object.

This branch fixes that by making cache reuse identity-safe end to end.

In addition, this branch now exposes optional DiT tiling:

  • dit_tiled
  • dit_tile_size (latent-space pixels, default 128)
  • dit_tile_overlap (latent-space pixels, default 16)

When enabled, the final DiT inference phase runs in overlapping latent-space tiles instead of one full-frame pass. That is slower than full-frame DiT inference, but it provides a separate VRAM relief valve for large outputs / large crops where VAE tiling alone is not enough.

Root cause

The violated invariant was:

a claimed cached runner/template or cached model must remain exclusively owned until full outer cleanup has finished, the final reusable object identity has been safely written back into cache, and only that final published object is marked reusable

Earlier revisions still violated that invariant in several ways:

  • cached DiT/VAE claims could become reusable before outer cleanup had fully finished
  • finalization could rewrite the cache slot to a released post-cleanup model object, while later claim-release logic still referenced only the originally claimed object
  • newly cached models were not always fed back into the same claim/finalize/release bookkeeping
  • failure paths could remove cached runner templates without proving the cache still contained the same claimed runner instance

That created several bad windows:

  • another execution could begin reusing a model while teardown was still in progress
  • a refreshed cached model could remain stuck claimed forever because only the old pre-refresh object got unclaimed
  • one execution could remove a newer healthy runner/template published by another execution
  • newly inserted cached objects could miss the intended ownership-release path entirely

What changed

1. Cache ownership / teardown hardening

This branch now keeps the ownership rule consistent end to end:

  • claim the exact cached runner/model object for exclusive use
  • keep that claim through outer cleanup
  • refresh the cache with the final released post-cleanup object identity
  • only then mark that final published object reusable again
  • on failure, evict only if the cache still contains the same claimed object instance

Concretely, this includes:

  • explicit synchronized runner claiming
  • treating active cached runners as busy instead of invalid
  • identity-safe final publication of cached DiT/VAE objects after teardown
  • releasing the claim on the final refreshed object, not only the original claimed reference
  • threading newly cached models back into the same cache bookkeeping
  • atomic identity-safe runner taint / eviction on failure
  • cleanup coverage for claimed cached runners when setup aborts after claim acquisition
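The identity-safe claim/publish/evict rules above can be sketched as follows. This is a minimal illustration, not the PR's actual API: the class and method names here are hypothetical, while the real code uses helpers such as replace_dit(...)/replace_vae(...) with an expected_model guard.

```python
import threading

class ModelCache:
    """Hypothetical sketch of identity-guarded cache slots."""

    def __init__(self):
        self._lock = threading.Lock()
        self._slots = {}       # key -> cached model object
        self._claimed = set()  # id()s of exclusively claimed objects

    def insert(self, key, obj):
        with self._lock:
            self._slots[key] = obj

    def claim(self, key):
        # Claim the exact cached object for exclusive use; None if busy.
        with self._lock:
            obj = self._slots.get(key)
            if obj is None or id(obj) in self._claimed:
                return None    # active cached entries are busy, not invalid
            self._claimed.add(id(obj))
            return obj

    def publish(self, key, expected, final_obj):
        # Refresh the slot with the post-cleanup object only if the slot
        # still holds the instance we claimed (identity, not equality).
        with self._lock:
            if self._slots.get(key) is not expected:
                self._slots.pop(key, None)  # slot was republished elsewhere
                self._claimed.discard(id(expected))
                return False
            self._slots[key] = final_obj
            self._claimed.discard(id(expected))
            self._claimed.discard(id(final_obj))  # only now reusable again
            return True

    def evict_if_same(self, key, expected):
        # On failure, evict only if the cache still holds the same instance.
        with self._lock:
            if self._slots.get(key) is expected:
                del self._slots[key]
                self._claimed.discard(id(expected))
                return True
            return False
```

The identity comparisons (`is`, not `==`) are the point: a slot is only rewritten or evicted when it still contains the exact object this execution claimed.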

2. Cold-cache reactivation / reuse correctness

Cached models that stay in cold cached form now rebuild execution state correctly when reused, instead of assuming they are already fully live and configured for the current run.

This helps the cache path behave consistently across hot-cache reuse, cold-cache reactivation, and fresh insertion into cache.

3. Optional DiT tiling support

The DiT loader node now exposes optional tiling controls and passes them through the runner/inference path:

  • SeedVR2DiTModelLoader stores dit_tiled, dit_tile_size, and dit_tile_overlap
  • the main node forwards those settings into runner preparation
  • InferPipeline.inference() now supports tiled single-sample DiT inference with overlap blending

The implementation splits the latent spatial plane into overlapping tiles, runs the existing DiT inference path per tile, and blends tiles back together with edge-aware weights.

This is intended as a practical VRAM reduction tool for the final diffusion phase. It is not meant as a claim that tiled DiT is always quality-equivalent to full-frame DiT. Smaller tiles can reduce global consistency.
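As a rough illustration of the tiling-plus-blending idea (not the actual InferPipeline code, whose tile layout and weighting may differ), the latent plane can be split into overlapping tiles whose per-tile outputs are blended with linear ramp weights at interior seams:

```python
import numpy as np

def _ramp(n, overlap, fade_in, fade_out):
    # Edge-aware 1D weight: flat at image borders, linear ramp at seams.
    w = np.ones(n)
    r = np.linspace(0.0, 1.0, overlap + 2)[1:-1]  # strictly positive ramp
    if fade_in:
        w[:overlap] = r
    if fade_out:
        w[-overlap:] = r[::-1]
    return w

def tiled_infer(latent, infer_fn, tile=128, overlap=16):
    """Run infer_fn per overlapping spatial tile and blend the results.

    Sketch only; assumes tile > 2 * overlap and a channels-first latent.
    """
    h, w = latent.shape[-2:]
    out = np.zeros_like(latent)
    weight = np.zeros((h, w), dtype=latent.dtype)
    stride = tile - overlap
    for y0 in range(0, max(h - overlap, 1), stride):
        for x0 in range(0, max(w - overlap, 1), stride):
            y1, x1 = min(y0 + tile, h), min(x0 + tile, w)
            patch = infer_fn(latent[..., y0:y1, x0:x1])
            wy = _ramp(y1 - y0, overlap, y0 > 0, y1 < h)
            wx = _ramp(x1 - x0, overlap, x0 > 0, x1 < w)
            wgt = wy[:, None] * wx[None, :]
            out[..., y0:y1, x0:x1] += patch * wgt
            weight[y0:y1, x0:x1] += wgt
    return out / np.maximum(weight, 1e-8)  # normalize overlapped regions
```

With an identity `infer_fn`, the blended output reproduces the input exactly, which is a convenient sanity check that the weights cover the plane without seams.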

4. Supporting hardening around the affected path

This branch also includes smaller supporting corrections around the same execution path, including:

  • safer synchronization before critical teardown / device-move operations
  • removal of unsafe in-place storage invalidation during cleanup
  • cleanup/sync coverage for runtime tensors stored in module.memory
  • keeping reused modules attached to the current run’s debug object
  • target-dimension validation and Phase 4 reconstruction-path hardening in the affected workflow path

Scope / non-goals

This PR is about correctness of shared cache reuse plus optional DiT tiling support.

It does not change the intended residency policy of healthy cached models on the hot-cache path. In particular, if cache_model=True and offload_device=cuda, healthy cached SeedVR2 models may still remain GPU-resident by design.

Likewise, the new DiT tiling controls are optional. Full-frame DiT remains the default unless dit_tiled is enabled.

Expected impact

This should improve reliability in the paths that were previously most fragile:

  • repeated queued generations
  • cancel / requeue after prior work
  • overlapping executions sharing cache slots
  • setup-abort paths after a cached runner has already been claimed
  • high-VRAM offload_device=cuda cache-reuse sessions
  • WSL/CUDA sessions where teardown mistakes could escalate into a hard wedge

And it adds an additional user-facing VRAM control for the final upscaling phase:

  • optional DiT tiling with configurable tile size / overlap
  • overlap blending to reduce visible tile boundaries
  • a slower but lower-peak-memory alternative to full-frame DiT inference


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1d02d47f43


Comment thread src/core/model_configuration.py

xmarre commented Mar 15, 2026

I found one more ownership bug in the earlier revision and addressed it here.

The issue was that the model-cache claim could still be released too early from inside cleanup_dit() / cleanup_vae(). That left a window where another execution could begin reusing cached models before the owning execution had fully finished outer teardown.

This update moves claim release to the outer cleanup boundary, after complete_cleanup(...) has returned, in both the node execution path and the CLI path.

So the intended lifecycle is now:

  • cached runner/model is claimed for exclusive use
  • teardown runs while that claim is still held
  • healthy cached models are preserved in reusable cached form
  • only after full outer cleanup returns are cached models marked reusable again
  • interrupted setup/runtime paths evict or taint claimed cached entries instead of preserving potentially dirty state

The important fix is that claim/release boundaries now match the actual teardown lifetime, instead of exposing cached entries as reusable too early.

I’m still stress-testing queued reuse / cancel / requeue paths, but this is the first revision where the ownership boundary itself matches the real lifetime of teardown.

@xmarre xmarre changed the title Harden runner reuse and accelerator cleanup against async teardown races Fix early cached-model claim release during teardown Mar 16, 2026

xmarre commented Mar 20, 2026

Since my last comment, I found a couple more cache-lifecycle issues in the same area and updated the PR accordingly.

The earlier fix in f0376ea was still directionally right — model claims must not be released before outer teardown finishes — but the implementation needed a few more ownership fixes around what exact object gets published back into cache, what gets unclaimed afterward, and how reused runner templates are invalidated on failure.

What changed since then:

  • Healthy hot-cache reuse is preserved again.
    The model cache no longer rejects a reusable DiT/VAE just because it is not back in a fully cold CPU form yet. Reused runners also now reset their per-run cleanup phase flags correctly, so a valid hot-cache hit does not get treated like already-cleaned state.

  • Claimed model finalization is now identity-safe.
    After teardown, the cache is refreshed with the runner-held released model object using replace_dit(...) / replace_vae(...) guarded by expected_model=claimed_*. If the claimed entry no longer matches, that cache slot is removed instead of rewriting whatever happens to be there.

  • Runner-template failure eviction is now atomic.
    On exception paths, reused cached runners are no longer removed through a separate non-atomic sequence. The cache now does a single taint+remove operation on the exact expected runner template, which avoids one execution accidentally invalidating a newer template published by another execution.

  • Claim release now follows the final refreshed object, not just the originally claimed one.
    One remaining bug was that finalize could replace the cached entry with a different post-cleanup model object, while the finally block only unclaimed the original claimed reference. That could leave the refreshed cached object stuck in claimed state. The cleanup path now tracks the refreshed DiT/VAE and clears the claimed flag on those objects as well when they differ.

  • Newly cached models are now tracked too.
    If the current run is the one that first inserts the DiT/VAE into cache, those newly cached objects are now recorded back into cache_context so the same claim-release logic also applies to them.

So the ownership rule this PR now enforces is:

the cached runner/template and cached DiT/VAE stay exclusively owned by the current execution until full cleanup finishes, the final reusable object identities are written back into cache, and only those final published objects are marked reusable.

That is the behavior I was actually aiming for from the start. The freeze/hang risk still looks most plausible on the GPU-resident hot-cache path (offload_device=cuda), but the underlying fixes here are broader than just that one manifestation.


xmarre commented Mar 21, 2026

Quick update since my last PR comment.

The cache-lifecycle / ownership fixes added after the earlier teardown change are holding up much better now, including the cases where later SeedVR2 executions in the same workflow were previously tripping over claimed cache entries.

While stress-testing that, I found that the remaining freeze no longer lined up with the cache claim/reuse path itself. To narrow that down, I added the two breadcrumb commits now in this PR around:

  • SeedVR2 model preparation
  • video transform setup / pre-generation dimension planning

That let me reduce the remaining freeze window from “somewhere after cached reuse” down to a very small pre-generation path.

The latest repro made it all the way through:

  • cached runner reuse
  • cached DiT/VAE reuse
  • model preparation
  • text embedding load
  • memory logging
  • entry into compute_generation_info(...)
  • the resize-only probe inside setup_video_transform(...)

and then stopped specifically on the full cached ctx['video_transform'](sample_frame) call that was being used only to derive padded target dimensions before generation started.

So at this point, the remaining hang looks like a separate issue from the earlier cache ownership bugs. The current evidence points at the pre-generation transform/dimension-planning path, not at the claimed cache lifecycle itself.

Based on that, I changed the dimension-planning path so it no longer runs the full live transform pipeline on a real sample tensor just to compute padded dimensions. It now keeps the resize-only probe needed to determine the resized target size, computes the padded dimensions directly from that resized shape, caches those dims in context, and clears them again during normal cleanup.
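Deriving the padded target dimensions arithmetically might look like the sketch below. The resize policy (scale by the short side) and the padding multiple are assumptions for illustration; SeedVR2's actual rules may differ.

```python
import math

def planned_dims(src_h, src_w, target_short, multiple=16):
    """Compute resized and padded dims without running the live transform.

    Hypothetical helper: assumes short-side scaling and padding each
    dimension up to the next multiple of `multiple`.
    """
    scale = target_short / min(src_h, src_w)
    rh, rw = round(src_h * scale), round(src_w * scale)
    ph = math.ceil(rh / multiple) * multiple   # pad up to model multiple
    pw = math.ceil(rw / multiple) * multiple
    return (rh, rw), (ph, pw)
```

The point of the change is that this is pure arithmetic on shapes, so no sample tensor ever has to flow through the full transform pipeline just to plan dimensions.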

I’m keeping the breadcrumbs in for now.

The latest patch is behaving well so far, but this freeze can take a while to reproduce, so I want a longer stress-test / soak period before removing the breadcrumbs and before claiming that the remaining bug is fully squashed.

If the current testing stays clean, I’ll do one final cleanup pass afterward to remove the temporary breadcrumbs and keep only the actual correctness fixes.

@xmarre xmarre changed the title Fix early cached-model claim release during teardown Harden shared cache lifecycle and add optional DiT tiling Apr 17, 2026
xmarre and others added 2 commits April 17, 2026 09:38
…s-handoff

Remove temporary SeedVR2 breadcrumb tracing from generation paths
