Context
Now that the build runs on free-threaded Python 3.13t (commit 02ccec0) with `PYTHON_GIL=0` (commit 07d8e26), real CPU parallelism is finally available: `numpy`, `scipy`, `astropy`, and `astroalign` all release the GIL during their C-level work, and with the GIL globally off they can run truly in parallel across threads. Until now the renderer pipeline ran every CPU-heavy step strictly sequentially in a single `asyncio.to_thread`. This epic catalogs where we can spread that work across cores.
Real-world baseline (from a recent vulpecula20260422 render, 13 keyframes, linear-pan)
```
Alignment phase: 159.4 s (12 pairs × ~13 s each, fully sequential)
Stretch phase: 60 s (13 frames × ~4.5 s each, fully sequential)
Transition gen: ~10 s (288 transition frames, vector ops, sequential)
PNG write: ~10 s (≈300 frames, I/O dominated)
ffmpeg encode: 1 s
Total: ~235 s
```
So alignment and stretch account for ≈ 220 s of the ~235 s wall time, all of it spent in pure CPU loops with no parallelism. This is the obvious target.
Parallelization candidates (ranked by expected impact)
1. Alignment phase — `astroalign.find_transform` per pair (Highest impact)
Each pair (i, i+1) is fully independent of the others; nothing about pair 5→6 needs the result of pair 4→5. With 12 pairs and ~13 s per pair, a 4-worker `ThreadPoolExecutor` would cut alignment from 159 s to ~40 s. An 8-worker pool would push toward I/O / cache contention but might still help.
Caveats:
Memory: each pair loads two 4168×6224 uint16 arrays (≈ 52 MB per frame, ≈ 104 MB per pair). 4 workers × 2 frames × 52 MB ≈ 415 MB peak. Acceptable on workstations (the renderer's intended deployment).
`astroalign` uses `scikit-image` and `sep` internally. `sep` requests GIL re-enable on import (we know this from issue #107, "Renderer: try Python 3.13t (free-threaded) by removing orjson"), so we need to verify that `PYTHON_GIL=0` keeps it off across worker threads under load.
Disk I/O: 12 pairs × 2 frames × 35 MB = 840 MB total read. SSD handles this; spinning disk would bottleneck.
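A minimal sketch of the per-pair fan-out described above, assuming a hypothetical `align_pair(a, b)` callable that loads both frames and wraps `astroalign.find_transform`. The name `align_pairs` and its signature are illustrative and are not taken from `pipeline.py`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence, TypeVar

Frame = TypeVar("Frame")
Transform = TypeVar("Transform")


def align_pairs(
    frames: Sequence[Frame],
    align_pair: Callable[[Frame, Frame], Transform],  # hypothetical wrapper
    max_workers: int = 4,
) -> list[Transform]:
    """Align every consecutive pair (i, i+1) concurrently.

    Results come back in pair order, so transforms[i] maps frame i to i+1.
    """
    pairs = list(zip(frames, frames[1:]))
    if not pairs:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in submission order,
        # regardless of which pair finishes first.
        return list(pool.map(lambda pair: align_pair(*pair), pairs))
```

Passing frame paths rather than loaded arrays keeps the peak memory tied to the worker count, because each worker only loads its two frames when it starts the pair.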
2. Stretch phase — debayer + tone map per frame (High impact)
`stretch_frame()` is `load_frame → debayer_frame → apply_stretch`. Each frame is independent. 13 frames × ~4.5 s sequential → ~60 s. With 4 workers and adequate memory, could drop to ~15-20 s.
Caveats:
Memory: per frame, the debayered RGB array is 6224×4168×3 bytes = ~78 MB. 4 workers × 2 stages × 78 MB ≈ 600 MB peak. Tighter than alignment but still fine on workstations.
`colour-demosaicing` calls `scipy.ndimage.convolve` heavily — releases GIL, should parallelize well.
The current pipeline streams frames (max 2 in memory) for a reason: large captures could exhaust RAM. We'd lose that property; need a pool sizing knob.
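The alignment and stretch estimates quoted above (about 40 s and 15 to 20 s with 4 workers) are consistent with a simple wave model: N equal-length independent tasks on W workers finish in ceil(N / W) waves. The helper name `ideal_wall_time` is hypothetical, and the model ignores memory bandwidth, disk contention, and any GIL re-enable, so the figures are best-case bounds rather than predictions.

```python
import math


def ideal_wall_time(tasks: int, seconds_per_task: float, workers: int) -> float:
    """Best-case wall time when equal-length independent tasks run in waves."""
    if tasks < 0 or workers < 1:
        raise ValueError("tasks must be >= 0 and workers must be >= 1")
    return math.ceil(tasks / workers) * seconds_per_task


# Figures from the baseline table: 12 pairs at ~13 s, 13 frames at ~4.5 s.
alignment_4 = ideal_wall_time(12, 13.0, 4)  # 3 waves of 13 s
stretch_4 = ideal_wall_time(13, 4.5, 4)     # 4 waves of 4.5 s
```

Note that 13 frames on 4 workers need a fourth wave for a single leftover frame, which is why the stretch estimate does not drop to a clean quarter of the sequential time.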
3. Crossfade / linear-pan transition frames (Medium impact)
Within a transition, each of the N interpolated frames is a pure function of the two key frames and the parameter t. The N frames inside a single transition are independent. ~10 s of the render is spent here.
Caveats:
The frames are large (78 MB each). Generating many in parallel multiplies memory by worker count.
Linear-pan uses `scipy.ndimage.shift` which does release GIL.
Only ~10 s of the total render — even a 4× speedup saves ~7 s. Lower priority unless we tackle this for free as part of refactoring.
4. PNG writing (Low impact)
PIL's PNG encoder is slow but releases GIL. ~300 frames × ~30 ms each = ~10 s. Could batch-parallelize but the marginal saving is small.
Caveats:
Disk I/O bound on traditional storage.
Could be merged with parallel transition generation (worker writes its frame straight to disk).
5. Thumbnail generation (already in scope; revisit)
`_make_thumbnail` was already tested with `ThreadPoolExecutor` (1.1× speedup; disk-bound). Now with free-threading verified, retest to see if there's a CPU-side win we missed. Lowest priority — we already have a working disk cache.
Cross-cutting infrastructure
A. Worker pool sizing
Default to `min(cpu_count(), 4)` to avoid swap on 8GB systems
Add `NC_RENDER_WORKERS` env var / CLI flag for override
Document tradeoffs in README
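One possible shape for the sizing knob, as a sketch only. The function name `resolve_worker_count` and the choice to raise on invalid values are assumptions; `NC_RENDER_WORKERS` and the `min(cpu_count(), 4)` default come from the list above.

```python
import os
from typing import Mapping, Optional

DEFAULT_WORKER_CAP = 4  # cap proposed above to avoid swapping on 8 GB systems


def resolve_worker_count(env: Optional[Mapping[str, str]] = None) -> int:
    """Return the render worker count, honouring NC_RENDER_WORKERS if set."""
    env = os.environ if env is None else env
    raw = env.get("NC_RENDER_WORKERS", "").strip()
    if raw:
        try:
            requested = int(raw)
        except ValueError:
            raise ValueError(
                f"NC_RENDER_WORKERS must be an integer, got {raw!r}"
            ) from None
        if requested < 1:
            raise ValueError("NC_RENDER_WORKERS must be >= 1")
        return requested
    # os.cpu_count() can return None on unusual platforms; fall back to 1.
    return max(1, min(os.cpu_count() or 1, DEFAULT_WORKER_CAP))
```

An explicit override is deliberately not capped here, so a user on a large workstation can go past 4; a CLI flag could feed the same function by passing a small mapping instead of `os.environ`.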
B. GIL re-enable detection at runtime
We have `scripts/freethread_smoke.py` for import-time detection. Need a runtime check: log a warning if `sys._is_gil_enabled()` flips to `True` during a render. Useful for debugging and for catching regressions when a dep adds a new C-extension.
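A sketch of what the runtime check could look like. The helper names and the logger name are hypothetical; `sys._is_gil_enabled()` exists on Python 3.13 and later, so the code probes for it and treats older interpreters as having the GIL on.

```python
import logging
import sys

logger = logging.getLogger("renderer.gil")  # hypothetical logger name


def gil_enabled() -> bool:
    """True if the GIL is currently active (always True before Python 3.13)."""
    probe = getattr(sys, "_is_gil_enabled", None)
    return True if probe is None else bool(probe())


def warn_if_gil_enabled(stage: str) -> bool:
    """Log a warning when the GIL is on; return the observed state."""
    enabled = gil_enabled()
    if enabled:
        logger.warning(
            "GIL is enabled during %s; worker threads will not run in parallel",
            stage,
        )
    return enabled
```

Calling `warn_if_gil_enabled("alignment")` once at the start and once at the end of each stage would be enough to spot a flip caused by a lazily imported extension, without adding per-frame overhead.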
C. Memory pressure handling
A naïve `ThreadPoolExecutor.map` over all 13 stretch tasks submits everything up front, and completed results queue up until they are consumed, so it can hold 13 × 78 MB ≈ 1 GB at peak. We need a streaming variant (e.g., `as_completed` with a bounded queue, or chunking) to keep peak memory bounded regardless of total frame count.
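One way to bound the number of outstanding tasks, sketched with `concurrent.futures.wait` rather than `as_completed`. The name `bounded_map` is hypothetical. Results are yielded in completion order, so a caller that needs frame order would have `fn` return `(index, result)`.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def bounded_map(
    fn: Callable[[T], R],
    items: Iterable[T],
    max_workers: int = 4,
    max_in_flight: Optional[int] = None,
) -> Iterator[R]:
    """Yield fn(item) in completion order with a cap on submitted tasks."""
    limit = max_in_flight or max_workers
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = set()
        for item in items:
            if len(pending) >= limit:
                # Block until at least one task finishes before submitting more.
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                for future in done:
                    yield future.result()
            pending.add(pool.submit(fn, item))
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                yield future.result()
```

Peak memory stays near `limit` results only if the consumer releases each result promptly, for example by writing the stretched frame to disk before pulling the next one.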
D. Test on real hardware
Current measurements come from `obelix` (modern dev workstation). Rendering on RPi (capture machine) will hit different bottlenecks — `astroalign` may already be I/O bound there. Should not assume linear scaling.
E. Progress reporting under parallelism
Today's progress callback is implicitly sequential: `on_progress(current, total)` is called after each frame finishes in order, and the UI label says "Rendering frame 5/13". With parallel workers, frames finish in arbitrary order — frame 7 may complete before frame 3. The progress signal needs to change shape:
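Counter, not index. `on_progress` becomes "increment by one" rather than "I just finished frame N". Implementation: a thread-safe counter, preferably a `threading.Lock` around an int. `itertools.count` is an alternative only if its atomicity is verified on the free-threaded build, since that guarantee has historically come from the GIL.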
Status text wording. "Processing frame 5/13" implies a current frame and stops being honest under parallelism — better is "5/13 done" or "5 of 13 aligned" / "5 of 13 stretched". The label should reflect a count, not a position.
Phase awareness. Different render stages have different totals (12 alignment pairs, then 13 stretch frames, then ~288 transition frames). The progress bar today already conflates these via `total_estimated`; under parallelism that stays the same, but completions within a stage no longer arrive in frame order. The aggregate counter still grows monotonically, and that is what matters for the bar.
Update rate / coalescing. Many small completions in parallel could spam UI updates. The existing `ui.timer(0.5, ...)` polling pattern in `_render` already coalesces; we just need to make sure the shared `progress_state` dict is read/written under a lock.
No surprises if a worker fails. Each parallel sub-issue must keep the counter consistent on exception (increment in a `finally`, or use `as_completed` to count only successful completions — explicit decision per stage).
This is cross-cutting because every parallel sub-issue (#1, #2, #3) touches it. Easiest path: do it once as a small refactor before #1, then each parallel sub-issue just plugs into the new counter.
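A minimal sketch of the counter refactor, assuming the existing callback keeps its `on_progress(current, total)` shape. The class name `ProgressCounter` and its methods are hypothetical and are not taken from the codebase.

```python
import threading
from typing import Callable, Optional


class ProgressCounter:
    """Thread-safe completion counter for one render stage."""

    def __init__(
        self,
        total: int,
        on_progress: Optional[Callable[[int, int], None]] = None,
    ) -> None:
        # RLock so a callback may read .done without deadlocking.
        self._lock = threading.RLock()
        self._done = 0
        self.total = total
        self._on_progress = on_progress

    def increment(self) -> int:
        """Record one finished task and report the new count."""
        with self._lock:
            self._done += 1
            done = self._done
            # Called under the lock so reported counts never go backwards;
            # the callback is assumed to be cheap.
            if self._on_progress is not None:
                self._on_progress(done, self.total)
        return done

    @property
    def done(self) -> int:
        with self._lock:
            return self._done

    def label(self, verb: str = "done") -> str:
        """Count-style status text, e.g. '5 of 13 aligned'."""
        return f"{self.done} of {self.total} {verb}"
```

Whether a failed task calls `increment()` from a `finally` block or is left uncounted remains the per-stage decision described above; the counter itself is neutral on that.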
Ordering proposal
`pipeline.py`'s alignment loop
Items 4 (transitions) and 5 (thumbnail) are noted but deferred; they become sub-issues if and when the impact justifies the work.
Out of scope