Harden shared cache lifecycle and add optional DiT tiling#551
Harden shared cache lifecycle and add optional DiT tiling#551
Conversation
Fix CUDA cleanup state handling to prevent runner hangs
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1d02d47f43
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
I found one more ownership bug in the earlier revision and addressed it here. The issue was that the model-cache claim could still be released too early from inside This update moves claim release to the outer cleanup boundary, after So the intended lifecycle is now:
So the important fix here is that claim/release boundaries now match the actual teardown lifetime, instead of exposing cached entries as reusable too early. I’m still stress-testing queued reuse / cancel / requeue paths, but this is the first revision where the ownership boundary itself matches the real lifetime of teardown. |
Fix cached model claim ownership during runner cleanup
|
Since my last comment, I found a couple more cache-lifecycle issues in the same area and updated the PR accordingly. The earlier fix in What changed since then:
So the ownership rule this PR now enforces is:
That is the behavior I was actually aiming for from the start. The freeze/hang risk still looks most plausible on the GPU-resident hot-cache path ( |
|
Quick update since my last PR comment. The cache-lifecycle / ownership fixes added after the earlier teardown change are holding up much better now, including the cases where later SeedVR2 executions in the same workflow were previously tripping over claimed cache entries. While stress-testing that, I found that the remaining freeze no longer lined up with the cache claim/reuse path itself. To narrow that down, I added the two breadcrumb commits now in this PR around:
That let me reduce the remaining freeze window from “somewhere after cached reuse” down to a very small pre-generation path. The latest repro made it all the way through:
and then stopped specifically on the full cached So at this point, the remaining hang looks like a separate issue from the earlier cache ownership bugs. The current evidence points at the pre-generation transform/dimension-planning path, not at the claimed cache lifecycle itself. Based on that, I changed the dimension-planning path so it no longer runs the full live transform pipeline on a real sample tensor just to compute padded dimensions. It now keeps the resize-only probe needed to determine the resized target size, computes the padded dimensions directly from that resized shape, caches those dims in context, and clears them again during normal cleanup. I’m keeping the breadcrumbs in for now. The latest patch is behaving well so far, but this freeze can take a while to reproduce, so I want a longer stress-test / soak period before removing the breadcrumbs and before claiming that the remaining bug is fully squashed. If the current testing stays clean, I’ll do one final cleanup pass afterward to remove the temporary breadcrumbs and keep only the actual correctness fixes. |
Refactor target-dimension probe logging in generation utils
Instrument Phase 4 pre-LAB post-processing freeze with targeted breadcrumbs
…s-handoff Remove temporary SeedVR2 breadcrumb tracing from generation paths
Summary
This PR now has two parts:
The original freeze investigation ended up being broader than a single early-release bug in isolation. The underlying problem was cache ownership: claimed cached runners/models could become reusable too early, get rewritten under stale references, or get evicted without proving that the current execution still owned that exact cached object.
This branch fixes that by making cache reuse identity-safe end to end.
In addition, this branch now exposes optional DiT tiling:
dit_tileddit_tile_size(latent-space pixels, default128)dit_tile_overlap(latent-space pixels, default16)When enabled, the final DiT inference phase runs in overlapping latent-space tiles instead of one full-frame pass. That is slower than full-frame DiT inference, but it provides a separate VRAM relief valve for large outputs / large crops where VAE tiling alone is not enough.
Root cause
The violated invariant was:
Earlier revisions still violated that invariant in several ways:
That created several bad windows:
What changed
1. Cache ownership / teardown hardening
This branch now keeps the ownership rule consistent end to end:
Concretely, this includes:
2. Cold-cache reactivation / reuse correctness
Cached models that stay in cold cached form now rebuild execution state correctly when reused, instead of assuming they are already fully live and configured for the current run.
This helps the cache path behave consistently across hot-cache reuse, cold-cache reactivation, and fresh insertion into cache.
3. Optional DiT tiling support
The DiT loader node now exposes optional tiling controls and passes them through the runner/inference path:
SeedVR2DiTModelLoaderstoresdit_tiled,dit_tile_size, anddit_tile_overlapInferPipeline.inference()now supports tiled single-sample DiT inference with overlap blendingThe implementation splits the latent spatial plane into overlapping tiles, runs the existing DiT inference path per tile, and blends tiles back together with edge-aware weights.
This is intended as a practical VRAM reduction tool for the final diffusion phase. It is not meant as a claim that tiled DiT is always quality-equivalent to full-frame DiT. Smaller tiles can reduce global consistency.
4. Supporting hardening around the affected path
This branch also includes smaller supporting corrections around the same execution path, including:
module.memoryScope / non-goals
This PR is about correctness of shared cache reuse plus optional DiT tiling support.
It does not change the intended residency policy of healthy cached models on the hot-cache path. In particular, if
cache_model=Trueandoffload_device=cuda, healthy cached SeedVR2 models may still remain GPU-resident by design.Likewise, the new DiT tiling controls are optional. Full-frame DiT remains the default unless
dit_tiledis enabled.Expected impact
This should improve reliability in the paths that were previously most fragile:
offload_device=cudacache-reuse sessionsAnd it adds an additional user-facing VRAM control for the final upscaling phase: