Living document. Tracks shipped features, near-term work, and longer-term research directions. Items are ordered by value, not calendar date.
- AUTO inpaint mode / per-frame quality tier (roadmap #5 Inpainting) -- new `InpaintMode.AUTO` routes each TBE batch between temporal (STTN) and spatial (LaMa) inpainting based on a per-batch exposure score vs a configurable threshold.
- Keyframe-driven detection (roadmap #1 Detection) -- ffprobe enumerates I-frames; OCR runs only at keyframes, and Kalman filtering propagates masks between them. Falls back to pHash skip when ffprobe is missing.
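The exposure-score routing can be sketched in a few lines of numpy. The function names (`exposure_score`, `route_batch`) and the 0.85 default threshold are illustrative placeholders, not the shipped implementation: the score is the fraction of ever-masked pixels that are exposed (unmasked) in at least one other frame of the batch, which is exactly what temporal inpainting needs to succeed.

```python
import numpy as np

def exposure_score(masks: np.ndarray) -> float:
    """Fraction of ever-masked pixels that are unmasked in at least
    one frame of the batch (masks: [T, H, W] bool)."""
    ever_masked = masks.any(axis=0)
    if not ever_masked.any():
        return 1.0  # nothing to inpaint; either path is fine
    ever_unmasked = (~masks).any(axis=0)
    recoverable = ever_masked & ever_unmasked
    return float(recoverable.sum()) / float(ever_masked.sum())

def route_batch(masks: np.ndarray, threshold: float = 0.85) -> str:
    """Pick temporal (STTN/TBE) when enough background is exposed
    somewhere in the batch; otherwise fall back to spatial (LaMa)."""
    return "sttn" if exposure_score(masks) >= threshold else "lama"
```

A batch where the subtitle sits on a static background (same pixels masked in every frame) scores low and routes to LaMa; a batch with motion or changing text scores high and routes to STTN.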
- Deinterlace on ingest (roadmap #9 Preprocessing) -- ffprobe idet detects combing; ffmpeg yadif writes a temp progressive source. Auto + manual toggles.
- PSNR / SSIM quality report (roadmap #E Testing) -- 10-frame deterministic sample after each run. Pure-numpy SSIM, cv2 PSNR, no new deps. CLI + GUI toggle.
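For reference, the no-new-deps metric pair can be sketched as below. The single-window (global-statistics) SSIM is a coarse pure-numpy approximation of the sliding-window variant, and both function names are illustrative rather than the app's actual API:

```python
import numpy as np

def psnr(a: np.ndarray, b: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; inf for identical frames."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def ssim_global(a: np.ndarray, b: np.ndarray, peak: float = 255.0) -> float:
    """Global-statistics SSIM: one window covering the whole frame.
    Cheaper and coarser than the standard 11x11 sliding window."""
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    c1, c2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    va, vb = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (va + vb + c2))
```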
- Glob + config-file batch CLI (roadmap #8) -- `--pattern` / `--out-dir` / `--config`; the JSON overlay supports any `ProcessingConfig` field; mutual validation with `--input`.
- Crash-resume checkpointing (roadmap #9) -- SHA-256-fingerprinted `.done` markers under `%APPDATA%\VideoSubtitleRemoverPro\checkpoints\`. Input size/mtime is part of the key, so re-encoded sources are correctly re-processed; `--no-resume` bypasses the check.
- Preset JSON export / import (roadmap #15 workflow) -- share tuned recipes as single JSON files with a `vsr_preset_format` version tag. Collisions with built-ins auto-rename.
- Live processing preview (roadmap #7) -- backend `on_preview_frame` callback pipes the latest inpainted frame into the preview pane at 15 FPS while a batch is running.
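The crash-resume fingerprint described above amounts to hashing the input's identity plus its settings, so a re-encoded file (new size/mtime) gets a fresh key. A minimal sketch, with `checkpoint_key` as a hypothetical helper name:

```python
import hashlib
import os

def checkpoint_key(path: str, config_json: str) -> str:
    """SHA-256 over absolute path + size + mtime + serialized settings.
    Any change to the source file or the config yields a new key, so a
    stale .done marker never suppresses a needed re-run."""
    st = os.stat(path)
    h = hashlib.sha256()
    h.update(os.path.abspath(path).encode("utf-8"))
    h.update(str(st.st_size).encode("ascii"))
    h.update(str(int(st.st_mtime)).encode("ascii"))
    h.update(config_json.encode("utf-8"))
    return h.hexdigest()  # used as the .done marker filename
```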
- Kalman box tracking (roadmap #7) -- constant-velocity filter per subtitle line smooths OCR jitter, fills single-frame misses, stabilises TBE coverage.
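The constant-velocity smoothing can be illustrated with a steady-state (alpha-beta) form of the Kalman filter applied per box coordinate. The class name and the gain values are illustrative, not the tuned production filter:

```python
import numpy as np

class BoxTracker:
    """Constant-velocity smoothing of a subtitle box [x1, y1, x2, y2].
    Alpha-beta (steady-state Kalman) gains; values are illustrative."""

    def __init__(self, box, alpha: float = 0.5, beta: float = 0.1):
        self.x = np.asarray(box, dtype=float)  # position estimate
        self.v = np.zeros(4)                   # velocity estimate
        self.alpha, self.beta = alpha, beta

    def update(self, measured_box=None) -> np.ndarray:
        """Call once per frame; measured_box=None means the detector
        missed this frame, so we coast on the predicted position."""
        pred = self.x + self.v                 # constant-velocity predict
        if measured_box is None:
            self.x = pred                      # fill single-frame OCR misses
            return self.x
        r = np.asarray(measured_box, dtype=float) - pred  # innovation
        self.x = pred + self.alpha * r
        self.v = self.v + self.beta * r
        return self.x
```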
- Perceptual-hash adaptive mask reuse (roadmap #8) -- skip OCR on frames pHash-close to the last detected one; adapts to scene content rather than a fixed interval.
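A difference hash (a close cousin of pHash) is enough to show the skip logic. This pure-numpy sketch block-averages a grayscale frame and compares horizontally adjacent cells; `dhash` and the 6-bit reuse threshold are illustrative stand-ins for the app's actual hash and tuning:

```python
import numpy as np

def dhash(gray: np.ndarray, size: int = 8) -> np.ndarray:
    """Difference hash: shrink to (size, size+1) cells by block
    averaging, then compare horizontally adjacent cells -> 64 bits."""
    h, w = gray.shape
    rows = np.linspace(0, h, size + 1, dtype=int)
    cols = np.linspace(0, w, size + 2, dtype=int)
    small = np.array([[gray[rows[i]:rows[i + 1], cols[j]:cols[j + 1]].mean()
                       for j in range(size + 1)] for i in range(size)])
    return (small[:, 1:] > small[:, :-1]).ravel()

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    return int((a != b).sum())
```

OCR is skipped (and the previous mask reused) while `hamming(dhash(frame), last_hash)` stays below a small bit budget; a scene change blows past it and forces a fresh detection.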
- Colour-tuned mask expansion (roadmap #2) -- Lab-space cluster split inside each detected box; extends the mask to cover serifs, drop shadows, decorative strokes.
- Preset library (roadmap #26) -- six built-ins (YouTube, Anime, Motion-heavy, TikTok, VHS, News chyron) plus user-saved presets persisted to `%APPDATA%\VideoSubtitleRemoverPro\presets.json`.
- Scene-cut-aware TBE (roadmap #11) -- histogram-based scene-cut detection inside the TBE batch; each segment aggregates independently so background is never mixed across a cut.
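The histogram-based cut detector can be sketched as a Bhattacharyya-style similarity between consecutive frame histograms; the 0.5 threshold and function name are illustrative:

```python
import numpy as np

def scene_cuts(frames, threshold: float = 0.5, bins: int = 32):
    """Return indices where a new segment starts. Consecutive frames
    are compared by the Bhattacharyya coefficient of their intensity
    histograms; a low similarity marks a cut."""
    cuts, prev = [], None
    for i, f in enumerate(frames):
        hist, _ = np.histogram(f, bins=bins, range=(0, 256))
        hist = hist / max(hist.sum(), 1)
        if prev is not None:
            sim = np.sum(np.sqrt(hist * prev))  # 1.0 = identical, 0.0 = disjoint
            if sim < threshold:
                cuts.append(i)
        prev = hist
    return cuts
```

TBE then aggregates each `[start, cut)` segment independently, so background pixels are never averaged across a cut.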
- Flow-warped TBE (roadmap #6) -- optional Farneback dense optical flow aligns every frame in a segment to a reference before aggregation. Dramatically cleaner on pans and zooms.
- Edge-ring colour match (roadmap #17) -- post-inpaint colour correction using a 1-pixel ring just outside the mask. Kills the faint colour seam on gradient backgrounds.
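The ring correction reduces to: measure the per-channel mean just outside the mask (original pixels) vs just inside it (inpainted pixels), and shift the whole inpainted region by the difference. A dependency-free sketch, with the morphology done by hand and all names hypothetical:

```python
import numpy as np

def _dilate(m: np.ndarray) -> np.ndarray:
    p = np.pad(m, 1)
    return p[:-2, 1:-1] | p[2:, 1:-1] | p[1:-1, :-2] | p[1:-1, 2:] | p[1:-1, 1:-1]

def _erode(m: np.ndarray) -> np.ndarray:
    p = np.pad(m, 1, constant_values=True)
    return p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:] & p[1:-1, 1:-1]

def ring_color_correct(original, inpainted, mask):
    """Shift the inpainted region per channel so the 1-px ring just
    inside the mask matches the 1-px ring just outside it."""
    m = mask.astype(bool)
    outer = _dilate(m) & ~m            # untouched original pixels
    inner = m & ~_erode(m)             # inpainted boundary pixels
    out = inpainted.astype(np.float64)
    for c in range(out.shape[2]):
        offset = original[..., c][outer].mean() - out[..., c][inner].mean()
        out[..., c][m] += offset       # kill the seam in one shift
    return np.clip(out, 0, 255).astype(np.uint8)
```

A single global offset per channel is deliberately conservative; it fixes the faint seam on smooth gradients without re-introducing structure into the fill.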
- Auto subtitle-band detection (roadmap #1) -- 30-frame probe + vertical clustering pins the dominant band as `subtitle_area`.
- Multi-region masks (roadmap #15) -- `subtitle_areas` field accepts a list of rects unioned with per-frame detections.
- SRT sidecar export (roadmap #12) -- OCR text collected per frame, collapsed into SRT cues with ~0.5 s gap tolerance.
- Debug mask video export (roadmap #16) -- optional `.mask.mp4` of the per-frame binary mask.
- Adaptive batch sizing (roadmap #20) -- NVML free-VRAM probe scales `sttn_max_load_num` to [8, 512] based on available memory.
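The VRAM-to-batch mapping is a clamp over a per-frame memory estimate. In this sketch the 24 MiB per-frame cost and 0.8 headroom factor are illustrative placeholders, not measured values:

```python
def pick_sttn_batch(free_vram_bytes: int,
                    per_frame_bytes: int = 24 * 1024 * 1024,
                    lo: int = 8, hi: int = 512,
                    headroom: float = 0.8) -> int:
    """Scale sttn_max_load_num to free VRAM, clamped to [lo, hi].
    headroom leaves room for the model weights and decode buffers."""
    n = int(free_vram_bytes * headroom // per_frame_bytes)
    return max(lo, min(hi, n))
```

The `free_vram_bytes` input would come from NVML (`nvmlDeviceGetMemoryInfo(...).free` via pynvml) on NVIDIA hardware, with the `lo` floor guaranteeing the pipeline still runs when the probe fails.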
- RapidOCR default detector -- PP-OCR via ONNX Runtime. ~4-5x faster than paddlepaddle, no memory leaks, smaller install. Falls through to PaddleOCR / Surya / EasyOCR / OpenCV.
- Temporal Background Exposure (TBE) -- STTN mode now real video inpainting. For each masked pixel, samples neighbouring frames where the same pixel is unmasked and reconstructs the true background. Residual (always-masked) pixels fall back to cv2.
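The core TBE operation described above can be sketched as a masked temporal mean. This is a simplification (an even average over all exposed samples; a real implementation might weight by temporal distance), with `tbe_reconstruct` as a hypothetical name:

```python
import numpy as np

def tbe_reconstruct(frames: np.ndarray, masks: np.ndarray):
    """For each pixel, average the values from frames where it is
    unmasked. Returns (background, residual): residual marks pixels
    masked in every frame, which go to the spatial (cv2) fallback.
    frames: [T, H, W, C] float, masks: [T, H, W] bool."""
    frames = frames.astype(np.float64)
    unmasked = ~masks.astype(bool)                       # [T, H, W]
    counts = unmasked.sum(axis=0)                        # exposures per pixel
    summed = (frames * unmasked[..., None]).sum(axis=0)
    background = summed / np.maximum(counts, 1)[..., None]
    residual = counts == 0                               # never exposed
    return background, residual
```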
- Hybrid ProPainter -- TBE with higher coverage bar + LaMa residual refinement blended 65/35. ProPainter-tier quality on motion-heavy footage without the 10+ GB VRAM footprint.
- Gaussian mask feathering -- alpha-blended boundary compositing eliminates visible cut lines at the edge of the removal region. Applies to every inpaint path.
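The feathered composite is an alpha blend with a blurred mask. To keep this sketch dependency-free it approximates the Gaussian with three iterated box blurs per axis (the app itself would use `cv2.GaussianBlur`); the function name and `sigma_px` default are illustrative:

```python
import numpy as np

def feather_composite(original, inpainted, mask, sigma_px: int = 3):
    """Alpha-blend the inpainted result over the original with a
    softened mask edge, removing the hard cut line at the boundary."""
    alpha = mask.astype(np.float64)
    k = 2 * sigma_px + 1
    kernel = np.ones(k) / k
    for axis in (0, 1):
        for _ in range(3):  # three box passes per axis ~ one Gaussian
            alpha = np.apply_along_axis(
                lambda v: np.convolve(v, kernel, mode="same"), axis, alpha)
    alpha = np.clip(alpha, 0.0, 1.0)[..., None]
    out = alpha * inpainted.astype(np.float64) \
        + (1.0 - alpha) * original.astype(np.float64)
    return np.clip(out, 0, 255).astype(np.uint8)
```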
- Engine badge -- About dialog surfaces "Temporal BG (TBE)" as an always-available inpainting backend and reflects RapidOCR in the detection chain.
- Premium polish pass -- design-token system, custom widgets (ModernToggle / ModernSlider / SegmentedPicker), onboarding, toast system, batch summary modal, taskbar progress. See CHANGELOG for the full list.
- Surya detector, detection frame-skip, mask dilation, hardware encoding (NVENC / QSV / AMF), PaddleOCR PP-OCRv5, DirectML support, time-range processing, right-click mask preview, CLI flags. See CHANGELOG for historical detail.
Ordered by impact. Most of these are additive and do not require external model-weight downloads.
- Florence-2 / Qwen2-VL experimental detector -- gated optional dependency. Better multi-lang, layout-aware, but heavy. Sits behind a settings toggle for users who want maximum accuracy on stylized text.
- LaMa via ONNX Runtime -- drop `simple-lama-inpainting` (PyTorch) in favour of `Carve/LaMa-ONNX`. 3-5x faster, 70% smaller install, CPU-friendly. Keep the PyTorch path as a fallback until ONNX is verified on all three GPU vendors.
- MI-GAN fast mode -- optional mode for users who want single-frame inpainting at mobile-grade speed. ONNX-based, ~10 ms per 512x512 on CPU. New `InpaintMode.MI_GAN` enum value.
- Whisper fallback for hard-coded video without clear text -- when OCR confidence drops below a threshold, call `faster-whisper` on the audio to generate a synthetic SRT and use its timing to mask the bottom-third region. Covers videos where the subtitle is anti-aliased past OCR's reliability floor.
- Pre-detect denoise -- noisy sources (VHS rips, low-light phone clips) starve OCR of contrast and leak flicker into TBE. An optional `FastDVDnet` pass on the detection frames (not the output) boosts OCR hit rate by ~15% without altering the final inpainted video. Reference: m-tassano/fastdvdnet.
- PySceneDetect-backed scene splitter -- swap our histogram scene-cut heuristic (roadmap #11) for `scenedetect`. Its adaptive detector handles dissolves / fades / flashes correctly; the current threshold approach mis-fires on bright cuts. Reference: Breakthrough/PySceneDetect.
- Proxy-file workflow -- render a low-res proxy for GUI preview and tuning; apply the final pass at full resolution only when the user commits. Matches the way Premiere / DaVinci handle 4K-8K source footage.
- Raw frame-sequence input -- accept DPX / EXR / PNG sequences as input (and output) for VFX pipelines that never touch an mp4. Wrap the existing video codepath with a virtual "frames-as-video" adapter.
- "Repeat last job" shortcut -- one-click resubmit of the most recent settings on a new file. The queue already stores `ProcessingConfig` per item, so this is surfacing what's already persisted.
- Manga / anime mode -- route detection through `manga-ocr` (vertical Japanese) + `comic-text-detector` (irregular speech-bubble shapes). OCR that understands panel layout beats generic text detection on anime subs and sign translations. Reference: kha-white/manga-ocr, dmMaze/comic-text-detector.
- Karaoke / animated subtitle tracking -- music-video karaoke captions morph per-syllable; the current detector sees them as constantly-new text. Treat the whole karaoke line as one moving object via optical-flow association and mask the union bounds.
- Vertical text mode -- Japanese tategaki and classical Chinese read top-to-bottom. Pass a `vertical=True` flag through to the OCR backends that support it (RapidOCR, PaddleOCR) and rotate the detected boxes accordingly before masking.
- Chyron vs subtitle distinction -- news graphics (station logos, persistent lower-thirds, breaking-news tickers) are conceptually different from dialogue subtitles. Auto-classify each detected box and let users toggle which categories to remove.
Heavier integrations; most require model-weight downloads or extra install steps.
- Real ProPainter (ICCV 2023) -- the actual `sczhou/ProPainter` reference with RAFT optical-flow propagation and the recurrent transformer. Needs ~400 MB of model weights auto-downloaded on first use. Wire as a new enum value `InpaintMode.PROPAINTER_REAL`; keep the current TBE-based "ProPainter" as the default fast path. Reference: sczhou/ProPainter.
- DiffuEraser (2025, diffusion) -- state-of-the-art for temporally consistent video inpainting. Combines BrushNet + ProPainter + AnimateDiff. Very heavy (8+ GB VRAM, ~5 GB weights) -- ship as an opt-in "DiffuEraser" mode for users with the hardware. Reference: lixiaowen-xw/DiffuEraser.
- Wan2.1-VACE (Alibaba, ICCV 2025) -- "all-in-one" video creation / editing. The 1.3 B-parameter variant is feasible on consumer GPUs and its inpainting mode beats ProPainter on motion coherence. Opt-in mode with weights auto-fetched from ModelScope. Reference: ali-vilab/VACE.
- EraserDiT -- diffusion-transformer video inpainting tuned for high-res subtitle removal specifically. Too new to commit to; track through 2026 and integrate once weights are stable. Reference: arXiv:2506.12853.
- FloED -- Optical-Flow guided Efficient Diffusion. Mid-weight alternative to DiffuEraser when VACE is overkill but TBE leaves visible residuals on fast pans. Reference: NevSNev/FloED.
- SAM 2 mask refinement -- user clicks inside the (already-detected) subtitle region, SAM 2 promotes that click into a clean object mask that follows the text exactly. Eliminates the need to dilate aggressively to catch serifs. SAM 2 also has consistent memory propagation, so one seed click carries the mask through the whole clip. Reference: facebookresearch/sam2.
- SAM 3 text-prompt segmentation -- SAM 3 (released Nov 2025) accepts natural-language prompts. One-click "segment all burned-in text" without any bounding boxes. Effectively auto-mask. Reference: Meta SAM 3.
- MatAnyone 2 (CVPR 2026) -- state-of-the-art video matting. Not directly a subtitle tool, but the same memory-propagation architecture is exactly what we want for tracking a thin, mobile mask (rolling subtitle line) across a clip. Use its trimap-free inference as an alternative mask generator for stylised/decorated subtitles that OCR+SAM can't clean up. Reference: pq-yang/MatAnyone2.
- TensorRT inpainter path -- compile the LaMa ONNX model to a TensorRT engine on first GPU run. Expected ~2-3x further speedup on top of ONNX Runtime alone for NVIDIA hardware. Cache the compiled engine under `%APPDATA%`.
- INT8 quantization of the OCR detector -- RapidOCR ships FP32. Post-training quantization via the ONNX Runtime quantizer should cut detection cost ~50% on CPU with <0.5% accuracy loss.
- Batched LaMa inference -- `simple-lama-inpainting` runs one frame at a time. Re-expose the underlying model and pipe full batches (our batch size is 30) through a single forward pass. Dominant speedup for LAMA and ProPainter-hybrid modes.
- PyNvVideoCodec hardware decode -- replace `cv2.VideoCapture` with NVIDIA's PyNvVideoCodec for NVIDIA users. Measured ~6x faster decode (91 fps vs 15 fps on reference hardware), plus decoded frames live on the GPU already -- zero CPU-GPU copies when the subsequent detect/inpaint is also on GPU. Reference: PyNvVideoCodec.
- Prefetch / pipeline parallelism -- read frame N+1 on a worker thread while frame N is being inpainted. The GIL is released in cv2/numpy/onnxruntime calls, so simple threading is enough. Current code is strictly serial -- an easy ~20% win on any hardware.
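The prefetch idea fits in a small generator: a daemon thread keeps a bounded queue topped up while the consumer inpaints. The `reader` callable and `depth` bound are illustrative; in the app `reader` would wrap `cv2.VideoCapture.read`:

```python
import queue
import threading

def prefetch_frames(reader, depth: int = 2):
    """Yield frames from reader() while a worker thread reads ahead.
    reader() returns the next frame, or None at end of stream. The
    bounded queue caps memory if inpainting falls behind decode."""
    q = queue.Queue(maxsize=depth)

    def worker():
        while True:
            frame = reader()
            q.put(frame)          # blocks when the queue is full
            if frame is None:
                return            # EOF sentinel delivered; stop reading

    threading.Thread(target=worker, daemon=True).start()
    while True:
        frame = q.get()
        if frame is None:
            return
        yield frame
```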
- RIFE-interpolated fast mode -- detect+inpaint every Nth frame, synthesise intermediates with RIFE from the cleaned keyframes. Equivalent to 2-4x throughput when the background is smooth (dialogue scenes). Falls back gracefully on scene cuts (which RIFE handles by duplicating the nearer frame). Reference: hzwer/Practical-RIFE.
- 10-bit / HDR pipeline -- the current cv2 pipeline processes in 8-bit BGR, clamping HDR10 / HLG / Dolby Vision to SDR. Re-plumb as 16-bit numpy preserving `_Matrix` / `_Transfer` / `_Primaries` metadata; output via `ffmpeg -c:v libx265 -pix_fmt yuv420p10le -color_primaries bt2020`. Biggest single quality win for anyone editing modern phone footage.
- AV1 + VP9 ingest/egress -- YouTube's native web format is AV1; older YouTube and most WebM is VP9. Verify decode across all codepaths and expose `output_format=av1` (`libsvtav1` for speed, `libaom-av1` for quality).
- VapourSynth bridge -- optional bridge that accepts a `.vpy` script as input and emits one as output. Lets advanced users slot VSR into a larger chain (QTGMC deinterlace, Waifu2x upscale, SMDegrain denoise). Reference: vapoursynth/vapoursynth.
- NLE round-trip -- when input is an EDL / XML from Premiere or DaVinci, emit a matching sidecar with the cleaned video substituted. Keeps VSR inside a post pipeline.
- Reference-clip regression harness -- a `tests/clips/` directory of 10-20 short clips covering edge cases (motion-heavy, static, karaoke, vertical JP, HDR, thin font, thick font, dissolve cuts). A nightly GHA run compares output hashes + metric scores to committed baselines.
- Community edge-case corpus -- GitHub Discussions thread where users submit 10-second problem clips + settings. We periodically fold the worst performers into the regression harness and ship baseline fixes.
- Multi-track audio passthrough -- mux all N input audio streams unchanged (today we merge only the first). DVD / Bluray rips routinely ship 3-5 language tracks; current code silently drops them.
- Loudness normalisation -- optional `ffmpeg -af loudnorm=I=-16` pass to bring batches to a consistent EBU R128 target. Useful for platform-specific output (YouTube -14 LUFS, Apple -16 LUFS, broadcast -23 LUFS).
- Audio-guided subtitle validation -- cross-reference faster-whisper word timestamps against detected text boxes: boxes with no nearby speech are likely chyrons / captions, not dialogue. Feeds the chyron-vs-subtitle classifier.
- Real-ESRGAN output upscale -- optional "enhance" stage after inpainting. Users processing SD-era footage often want 2x / 4x upscale afterwards; bundling the step means one pipeline instead of hand-offs between tools. ONNX Runtime friendly, already fits our infra. Reference: xinntao/Real-ESRGAN.
- SwinIR restoration pass -- alternative to Real-ESRGAN for footage where the subtitle removal leaves subtle blur (happens on dense text); SwinIR restores local detail better than generic upscalers. Reference: JingyunLiang/SwinIR.
- Film-grain re-synthesis -- cinematic footage has uniform grain; inpainted regions come back grain-free and stand out on motion. A cheap additive-noise pass matched to the per-frame grain profile (estimated from an unmasked crop) re-integrates the inpainted region.
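The grain-matching step can be sketched as: estimate the noise sigma from a high-pass residual of an unmasked crop, then add zero-mean Gaussian noise of that sigma inside the mask. The `regrain` name and the 4-neighbour high-pass are illustrative simplifications (real grain is spatially correlated and per-channel):

```python
import numpy as np

def regrain(inpainted, mask, clean_crop, seed: int = 0):
    """Add synthetic grain to the (grain-free) inpainted region,
    matched to the noise level of an unmasked grayscale crop."""
    crop = clean_crop.astype(np.float64)
    # cheap high-pass: pixel minus the mean of its 4 neighbours
    hp = crop[1:-1, 1:-1] - 0.25 * (crop[:-2, 1:-1] + crop[2:, 1:-1] +
                                    crop[1:-1, :-2] + crop[1:-1, 2:])
    sigma = hp.std()                      # grain strength estimate
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, inpainted.shape)
    out = inpainted.astype(np.float64) + noise * mask[..., None]
    return np.clip(out, 0, 255).astype(np.uint8)
```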
- Subtitle area drag-refinement in-GUI -- current region selector is modal and one-shot. Allow live-drag adjustment on the preview pane with instant mask re-render, so the user sees the mask update as they nudge edges.
- Per-file overrides surface -- every queue item already carries its own `ProcessingConfig`. Expose this in a themed "edit this item" popover so users can set a different language / region / mode per file without cloning the global config.
- Drag-drop onto the app icon -- register VSR as a file handler for the target video extensions; double-clicking an `.mp4` launches it straight into the queue.
- A/B flicker-scrubber -- horizontal slider in the preview pane reveals original vs cleaned video, scrubbable with mouse drag. Makes it easy to spot residual flicker or colour mismatch before exporting.
- Quality self-test -- for each finished file, sample 10 random frames, render the mask region side-by-side (original / cleaned) into a summary sheet, and show an inline "Quality: Good / Review" tag based on a simple SSIM threshold on unmasked regions. Flags batches that need re-runs without full manual review.
Speculative. Anchored to projects that are under active research right now and may not be practical to integrate for 12+ months.
- Plugin architecture -- `backend/plugins/*.py` auto-discovery for custom detectors and inpainters. Lets us ship the core app as a thin runtime and iterate on model integrations independently.
- Distributed / multi-GPU -- split the frame batch across GPUs, with a coordinator that handles the temporal window crossing device boundaries. Pays off for 4K / feature-length content.
- Headless REST server mode -- `python VideoSubtitleRemover.py --serve 8888` exposes a local HTTP API (submit job, poll status, stream preview). Sets up the app for integration into other desktop tools and NAS workflows.
- Web build -- port the GUI to a tauri / webview shell backed by the existing Python core. Keeps the native Windows feel while making macOS / Linux parity realistic.
- Subtitle translation pipeline -- after SRT export, route through a local LLM (gemma3 / qwen3) for translation, then optionally re-burn translated subtitles into the cleaned video. End-to-end "re-dub in another language" workflow.
- Streaming mode -- process a file while still being written (e.g. OBS recording, DVR capture). Tail the input and emit a cleaned output with a fixed N-second lag. Requires re-architecting the temporal window to be a streaming buffer rather than a fixed batch, but unlocks live-broadcast workflows.
- WebGPU / WASM fallback -- for environments where installing Python is impossible (locked-down corporate, chromebooks), compile the ONNX inpaint + detect path to WebGPU via `onnxruntime-web`. Ship as a drag-and-drop web page. Not a replacement for the desktop app -- an escape hatch.
- Font-aware inpainting -- classify the subtitle font style (Hanzi vs Latin vs Cyrillic vs decorative) via a small classifier on the detected boxes; apply per-class mask dilation and feather presets. Chinese characters need tighter masks than handwritten Latin karaoke captions.
- Logo / watermark mode -- specialised mode that persists the mask across the entire video (static watermark case) with TBE using the full video as the temporal window rather than a 30-frame batch. Subsumes the current "fixed subtitle area" workflow and extends it to full-video aggregation.
- Android port -- Kotlin + Compose shell around the Python core via Chaquopy, with `ffmpeg-kit` for video I/O and NNAPI / Qualcomm AIP hooks for the inpainter. Detection runs on the device NPU if present (Tensor / Pixel Visual Core / QNN). Target: v1 lite supports image-only, v2 adds video. Reference: arthenica/ffmpeg-kit.
- macOS / Linux parity -- the PyInstaller build is Windows-only; port the build workflow to a three-OS GHA matrix. tkinter ships on both platforms, and FFmpeg and OpenCV work identically -- the blockers are Windows-specific pieces (taskbar progress, winsound beeps, console hiding).
- iOS / iPadOS port -- long shot. Likely via a BeeWare / Pyto-style bridge rather than Chaquopy (which is Android-only). Realistic only once Apple silicon + NeuralEngine inpainting is plumbed.
- Screen-reader support -- wire tkinter widgets to Windows UI Automation (UIA) providers so NVDA / Narrator can announce status changes (items queued, progress, completion). Today none of the custom widgets are announceable.
- High-contrast theme variant -- optional theme preset respecting Windows' `HighContrast` active scheme. The existing design-token system makes this cheap: swap the palette constants, keep the widget code.
- GUI localization -- `gettext`-based string externalisation with `.mo` files for the top ~10 languages we already detect in subtitles (ironically, the GUI itself is English-only). Community-contributable via Weblate or a simple PR flow.
- Right-to-left UI support -- Arabic / Hebrew mirroring for text, button ordering, and queue scroll direction. Secondary to #F.
- Restyle mode -- after subtitle removal + SRT export, optionally re-burn a translated / restyled subtitle track at the same position using ffmpeg's `subtitles` filter with an .ass template. Users get "clean + re-dub-in-text" in one round-trip.
- Watermark addition pipeline -- content-creator mode: after cleaning an input, burn a user-specified PNG watermark at a configurable corner with fade-in/out. Lets the tool also ship content for platforms that require branding.
- AI chat interface -- optional side-panel wired to a local LLM (gemma3 / qwen3 via llama.cpp) that accepts natural language like "remove the bottom banner between 0:15 and 0:45, keep the song title top-right." The LLM translates to `ProcessingConfig` mutations. Useful for power users with complex per-segment overrides.
- Live performance dashboard -- in-app panel showing current detection FPS, inpaint FPS, GPU VRAM, disk I/O, queued / completed / failed counts. Pure local; no telemetry leaves the machine.
- Cloud API integrations (OpenAI Whisper API, Anthropic, etc.). VSR is offline-first by design. Local-only Whisper / Qwen are fine; remote APIs add friction and privacy cost.
- Docker / containerised builds. The tool is Windows-first and ships as a PyInstaller binary. Containers add install complexity for the 95% Windows user base.
- Subscription model / telemetry. The project is a free single-binary tool. No analytics, no phone-home.
Research sources used to compile this roadmap, grouped by category:
- YaoFANGUK/video-subtitle-remover -- upstream project
- sczhou/ProPainter (ICCV 2023) -- real video inpainting reference
- lixiaowen-xw/DiffuEraser -- 2025 diffusion-based video inpainting
- ali-vilab/VACE (ICCV 2025) -- Alibaba Wan2.1 all-in-one video edit
- NevSNev/FloED -- optical-flow guided diffusion inpainting
- EraserDiT (arXiv:2506.12853) -- diffusion-transformer video inpainting
- hitachinsk/FGT (ECCV 2022) -- flow-guided transformer
- TencentARC/BrushNet (ECCV 2024) -- plug-and-play dual-branch
- advimman/lama (WACV 2022) -- LaMa reference
- Carve/LaMa-ONNX -- LaMa ONNX export
- Picsart-AI-Research/MI-GAN (ICCV 2023) -- mobile-fast inpainting
- TurboFill (2025) -- few-step diffusion inpaint
- RapidAI/RapidOCR -- PP-OCR ONNX runtime
- PaddleOCR 3.0 Technical Report
- Florence-2 -- VLM with OCR + grounding
- Qwen2-VL -- VLM with bounding-box output
- kha-white/manga-ocr -- Japanese manga OCR
- dmMaze/comic-text-detector -- irregular bubble detection
- facebookresearch/sam2 -- SAM 2 video segmentation
- Meta SAM 3 -- text-promptable segmentation (Nov 2025)
- pq-yang/MatAnyone2 (CVPR 2026) -- video matting with memory propagation
- PeterL1n/RobustVideoMatting -- real-time RVM
- faster-whisper -- 4x faster Whisper inference
- NVIDIA TensorRT -- GPU inference acceleration
- PyNvVideoCodec -- NVIDIA GPU-resident video decode
- ONNX Runtime -- cross-vendor inference runtime
- hzwer/Practical-RIFE -- real-time frame interpolation
- xinntao/Real-ESRGAN -- video super-resolution
- JingyunLiang/SwinIR -- image restoration
- m-tassano/fastdvdnet -- real-time video denoising
- Breakthrough/PySceneDetect -- scene-cut detection
- vapoursynth/vapoursynth -- scriptable video processing
- chaofengc/IQA-PyTorch -- PSNR / SSIM / LPIPS toolbox
- JunyaoHu/common_metrics_on_video_quality -- FVD + friends
- richzhang/PerceptualSimilarity -- LPIPS reference
- arthenica/ffmpeg-kit -- cross-platform FFmpeg wrappers
- Chaquopy -- Python on Android
- https://github.com/YaoFANGUK/video-subtitle-remover -- reference STTN/LAMA/ProPainter pipeline
- https://github.com/YaoFANGUK/video-subtitle-extractor -- companion OCR for SRT extraction
- https://github.com/JollyToday/GhostCut_Remove_Video_Text -- multilingual OCR + inpainting (EN/CN/JA/KO/AR)
- https://github.com/rainwl/VideoRemoveText -- lightweight LaMa + fixed ROI + color threshold
- https://github.com/Rats20/EraseSubtitles -- multilingual text detection + inpainting research
- https://github.com/Keaneo/Scrubtitles -- minimal script reference
- https://github.com/advimman/lama -- upstream LaMa inpainting weights
- https://github.com/sczhou/ProPainter -- upstream ProPainter (high-VRAM, best motion)
- https://github.com/researchmm/STTN -- upstream STTN reference
- Multi-backend inpaint switch: sttn-auto / sttn-det / lama / propainter / opencv (YaoFANGUK VSR)
- DirectML execution provider for AMD/Intel GPUs in addition to CUDA (YaoFANGUK VSR)
- Docker images keyed by hardware variant (CUDA 11.8/12.6/12.8, DirectML, CPU) (YaoFANGUK VSR)
- Paired extractor → remover workflow: OCR to SRT first, then remove (YaoFANGUK pair)
- Fixed-ROI + color-threshold fast path for speed when text location is stable (rainwl)
- Auto device pick: cuda → mps → cpu with CLI override (rainwl)
- Multilingual text detection module covering CJK + Arabic scripts (GhostCut, EraseSubtitles)
- Reference-frame picker UI so user can hint STTN which frames have clean background
- VRAM budget slider that down-selects available backends at runtime
- Batch mode for a folder of clips sharing the same subtitle ROI
- Pluggable inpainter interface -- each backend implements (detect, inpaint, dispose) so new models drop in (YaoFANGUK VSR)
- First-run weight download with checksum verify, cache under %LOCALAPPDATA% (rainwl, LaMa)
- ONNXRuntime + DirectML for non-NVIDIA GPUs rather than forcing CUDA (YaoFANGUK VSR)
- Frame-sequence temp cache as PNG chunks so a crashed run resumes from last completed frame
- Offload inference to a child process so GUI stays responsive and OOM doesn't kill the app
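The pluggable inpainter interface listed above can be sketched as a `typing.Protocol` plus one minimal reference backend. Everything here is hypothetical scaffolding: the method trio follows the roadmap bullet, and the fallback backend uses a trivial mean fill in place of OpenCV so the sketch stays dependency-free.

```python
from typing import Protocol

import numpy as np

class Inpainter(Protocol):
    """Contract a drop-in backend must satisfy."""
    def detect(self, frame: np.ndarray) -> np.ndarray: ...
    def inpaint(self, frames: np.ndarray, masks: np.ndarray) -> np.ndarray: ...
    def dispose(self) -> None: ...

class MeanFillInpainter:
    """Minimal reference backend: no detection of its own, and the
    masked region is filled with the mean of the unmasked pixels."""

    def detect(self, frame):
        return np.zeros(frame.shape[:2], dtype=bool)  # detection handled upstream

    def inpaint(self, frames, masks):
        out = frames.copy()
        for f, m in zip(out, masks.astype(bool)):
            if m.any() and (~m).any():
                f[m] = f[~m].mean(axis=0).astype(f.dtype)
        return out

    def dispose(self):
        pass  # a real backend would release model weights / VRAM here
```

Auto-discovery would then just scan a plugins directory for classes satisfying `Inpainter` and register them against `InpaintMode` values.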