Epic: Performance and parallelization #109

@zworkb

Context

Now that the build runs on free-threaded Python 3.13t (commit 02ccec0) with `PYTHON_GIL=0` (commit 07d8e26), real CPU parallelism is finally available — `numpy`, `scipy`, `astropy`, `astroalign` all release the GIL during their C-level work, and with the GIL globally off they can run truly in parallel across threads. Until now the renderer pipeline ran every CPU-heavy step strictly sequentially in a single `asyncio.to_thread`. This epic catalogs where we can spread that work across cores.

Real-world baseline (from a recent vulpecula20260422 render, 13 keyframes, linear-pan)

```
Alignment phase: 159.4 s (12 pairs × ~13 s each, fully sequential)
Stretch phase: 60 s (13 frames × ~4.5 s each, fully sequential)
Transition gen: ~10 s (288 transition frames, vector ops, sequential)
PNG write: ~10 s (≈300 frames, I/O dominated)
ffmpeg encode: 1 s
Total: ~235 s
```

So alignment + stretch account for ≈ 220 s of the ~235 s wall time, all of it spent in pure CPU loops with no parallelism. This is the obvious target.
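As a rough sanity check on the potential gain, the baseline can be re-added under the assumption that only alignment and stretch are parallelized and that they scale ideally (no memory, cache, or I/O contention). The per-item timings are the rounded figures from the table above; the `PHASES` table and `estimate_wall_time` are illustrative names, not part of the renderer.

```python
import math

# (items, seconds per item, parallelized?) taken from the baseline above.
# Transitions, PNG write and ffmpeg are treated as single sequential blocks.
PHASES = {
    "alignment": (12, 13.0, True),
    "stretch": (13, 4.5, True),
    "transitions": (1, 10.0, False),
    "png_write": (1, 10.0, False),
    "ffmpeg": (1, 1.0, False),
}


def estimate_wall_time(workers: int) -> float:
    """Ideal wall time: parallel phases run in ceil(items / workers) rounds."""
    total = 0.0
    for items, seconds_each, parallel in PHASES.values():
        rounds = math.ceil(items / workers) if parallel else items
        total += rounds * seconds_each
    return total


for workers in (1, 2, 4, 8):
    print(f"{workers} worker(s): ~{estimate_wall_time(workers):.0f} s")
```

This is an upper bound on the benefit, not a prediction; real scaling will be worse once workers compete for memory bandwidth and disk.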

Parallelization candidates (ranked by expected impact)

1. Alignment phase — `astroalign.find_transform` per pair (Highest impact)

Each pair (i, i+1) is fully independent of the others; nothing about pair 5→6 needs the result of pair 4→5. With 12 pairs and ~13 s per pair, a 4-worker `ThreadPoolExecutor` would cut alignment from 159 s to ~40 s. An 8-worker pool would push toward I/O / cache contention but might still help.

Caveats:

  • Memory: each pair loads two 4168×6224 uint16 arrays (≈ 52 MB each, ≈ 104 MB per pair). 4 workers × 2 frames × 52 MB ≈ 415 MB peak. Acceptable on workstations (the renderer's intended deployment).
  • `astroalign` uses `scikit-image` and `sep` internally. `sep` requests GIL re-enable on import (known from issue #107, "Renderer: try Python 3.13t (free-threaded) by removing orjson"). We still need to verify that `PYTHON_GIL=0` keeps the GIL off across worker threads under load.
  • Disk I/O: 12 pairs × 2 frames × 35 MB = 840 MB total read. SSD handles this; spinning disk would bottleneck.
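A minimal sketch of the pool-based alignment loop, assuming the per-pair work can be wrapped in a single callable. `align_pairs_parallel` and the injected `align_pair` are hypothetical names; in the real pipeline `align_pair` would load both frames and call `astroalign.find_transform`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence, TypeVar

Frame = TypeVar("Frame")
Transform = TypeVar("Transform")


def align_pairs_parallel(
    frames: Sequence[Frame],
    align_pair: Callable[[Frame, Frame], Transform],
    max_workers: int = 4,
) -> list[Transform]:
    """Align each consecutive pair (i, i+1) on a thread pool.

    `align_pair` is a stand-in for the real per-pair alignment step; it is
    injected so this sketch does not depend on the pipeline's internals.
    """
    pairs = list(zip(frames, frames[1:]))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in submission order, so result[i] always
        # belongs to pair (i, i+1) even when workers finish out of order.
        return list(pool.map(lambda pair: align_pair(*pair), pairs))
```

Because the transforms come back in pair order, downstream code that chains them can stay unchanged. The pool only holds as many frame pairs in memory as there are workers, provided `align_pair` loads its frames itself rather than receiving preloaded arrays.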

2. Stretch phase — debayer + tone map per frame (High impact)

`stretch_frame()` is `load_frame → debayer_frame → apply_stretch`. Each frame is independent. 13 frames × ~4.5 s sequential → ~60 s. With 4 workers and adequate memory, could drop to ~15-20 s.

Caveats:

  • Memory: per frame, the debayered RGB array is 6224×4168×3 bytes = ~78 MB. 4 workers × 2 stages × 78 MB ≈ 600 MB peak. Tighter than alignment but still fine on workstations.
  • `colour-demosaicing` calls `scipy.ndimage.convolve` heavily — releases GIL, should parallelize well.
  • `astropy.io.fits` already came up in issue #107 ("Renderer: try Python 3.13t (free-threaded) by removing orjson") as a GIL re-enable suspect; needs verification.
  • The current pipeline streams frames (max 2 in memory) for a reason: large captures could exhaust RAM. We'd lose that property; need a pool sizing knob.
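The memory figures in the caveats for candidates 1 and 2 can be reproduced with a few lines of arithmetic. This is a worst-case sketch that assumes every worker holds all of its arrays at the same moment and ignores temporaries allocated inside the libraries; the helper names are illustrative.

```python
def array_mb(height: int, width: int, channels: int, bytes_per_sample: int) -> float:
    """Size of one dense image array in decimal megabytes."""
    return height * width * channels * bytes_per_sample / 1e6


RAW_MB = array_mb(4168, 6224, 1, 2)  # uint16 sensor frame
RGB_MB = array_mb(4168, 6224, 3, 1)  # debayered RGB, 1 byte per channel


def peak_mb(workers: int, arrays_per_worker: int, mb_per_array: float) -> float:
    """Worst case: every worker holds all of its arrays simultaneously."""
    return workers * arrays_per_worker * mb_per_array


alignment_peak = peak_mb(4, 2, RAW_MB)  # two raw frames per pair
stretch_peak = peak_mb(4, 2, RGB_MB)    # two pipeline stages per frame

print(f"raw frame {RAW_MB:.0f} MB, RGB frame {RGB_MB:.0f} MB")
print(f"alignment peak {alignment_peak:.0f} MB, stretch peak {stretch_peak:.0f} MB")
```

If intermediate arrays turn out to be float32 or float64 rather than single bytes, the stretch figure grows by 4× or 8×, which is worth measuring before fixing the default pool size.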

3. Crossfade / linear-pan transition frames (Medium impact)

Within a transition, each of the N interpolated frames is a pure function of the two key frames and the parameter t. The N frames inside a single transition are independent. ~10 s of the render is spent here.

Caveats:

  • The frames are large (78 MB each). Generating many in parallel multiplies memory by worker count.
  • Linear-pan uses `scipy.ndimage.shift` which does release GIL.
  • Only ~10 s of the total render — even a 4× speedup saves ~7 s. Lower priority unless we tackle this for free as part of refactoring.

4. PNG writing (Low impact)

PIL's PNG encoder is slow but releases GIL. ~300 frames × ~30 ms each = ~10 s. Could batch-parallelize but the marginal saving is small.

Caveats:

  • Disk I/O bound on traditional storage.
  • Could be merged with parallel transition generation (worker writes its frame straight to disk).

5. Thumbnail generation (already in scope; revisit)

`_make_thumbnail` was already tested with `ThreadPoolExecutor` (1.1× speedup; disk-bound). Now with free-threading verified, retest to see if there's a CPU-side win we missed. Lowest priority — we already have a working disk cache.

Cross-cutting infrastructure

A. Worker pool sizing

  • Default to `min(cpu_count(), 4)` to avoid swapping on 8 GB systems
  • Add `NC_RENDER_WORKERS` env var / CLI flag for override
  • Document tradeoffs in README
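One possible shape for the sizing logic, assuming the `NC_RENDER_WORKERS` override is read from the environment and that invalid values silently fall back to the default. `resolve_worker_count` is a hypothetical helper name.

```python
import os


def resolve_worker_count(default_cap: int = 4) -> int:
    """Pool size: NC_RENDER_WORKERS if set and valid, else min(cpu_count, cap)."""
    override = os.environ.get("NC_RENDER_WORKERS", "").strip()
    if override:
        try:
            value = int(override)
        except ValueError:
            value = 0  # unparsable input falls through to the default
        if value >= 1:
            return value
    # os.cpu_count() may return None when the count cannot be determined.
    return min(os.cpu_count() or 1, default_cap)
```

Whether a bad override should fall back quietly or fail loudly is a design choice; failing loudly may be preferable for a CLI flag, while a quiet fallback is friendlier for an env var set once and forgotten.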

B. GIL re-enable detection at runtime

We have `scripts/freethread_smoke.py` for import-time detection. Need a runtime check: log a warning if `sys._is_gil_enabled()` flips to `True` during a render. Useful for debugging and for catching regressions when a dep adds a new C-extension.
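A sketch of what the runtime check could look like. It assumes the check is called at stage boundaries (or from a periodic timer); the logger name and both function names are placeholders. `sys._is_gil_enabled()` only exists from Python 3.13, hence the `getattr` fallback.

```python
import logging
import sys

log = logging.getLogger("renderer.gil")  # placeholder logger name


def gil_is_enabled() -> bool:
    """True if the GIL is currently active.

    On interpreters older than 3.13 the function is missing and the GIL
    is always on, so default to True.
    """
    check = getattr(sys, "_is_gil_enabled", None)
    return True if check is None else bool(check())


def warn_if_gil_enabled(stage: str) -> bool:
    """Log a warning naming the render stage if the GIL is on; return the state."""
    enabled = gil_is_enabled()
    if enabled:
        log.warning("GIL is enabled during %s; free-threading has regressed", stage)
    return enabled
```

Calling `warn_if_gil_enabled("alignment")` before and after each phase would narrow down which import or call flipped the state.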

C. Memory pressure handling

A naïve `ThreadPoolExecutor.map` over all 13 stretch tasks submits every task up front and can end up holding all 13 results in memory at once (13 × 78 MB ≈ 1 GB peak). We need a streaming variant (e.g., `as_completed` with a bounded queue, or chunking) to keep peak memory bounded regardless of total frame count.
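A minimal sketch of one such streaming variant, using `concurrent.futures.wait` with `FIRST_COMPLETED` instead of `as_completed` so that new tasks are only submitted as old ones are drained. `bounded_map` is a hypothetical helper; results are yielded in completion order, not input order.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def bounded_map(
    fn: Callable[[T], R],
    items: Iterable[T],
    max_workers: int = 4,
    max_in_flight: int = 4,
) -> Iterator[R]:
    """Yield fn(item) results as they complete.

    At most max_in_flight tasks are submitted but not yet handed to the
    caller, so peak memory no longer depends on the total item count.
    """
    pending = set()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for item in items:
            if len(pending) >= max_in_flight:
                # Block until at least one task finishes, then drain it.
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                for future in done:
                    yield future.result()
            pending.add(pool.submit(fn, item))
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                yield future.result()
```

If a stage needs results in frame order, each task can return `(index, result)` and the consumer can reorder, at the cost of buffering whatever arrives early.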

D. Test on real hardware

Current measurements come from `obelix` (modern dev workstation). Rendering on RPi (capture machine) will hit different bottlenecks — `astroalign` may already be I/O bound there. Should not assume linear scaling.

E. Progress reporting under parallelism

Today's progress callback is implicitly sequential: `on_progress(current, total)` is called after each frame finishes in order, and the UI label says "Rendering frame 5/13". With parallel workers, frames finish in arbitrary order — frame 7 may complete before frame 3. The progress signal needs to change shape:

  • Counter, not index. `on_progress` becomes "increment by one" rather than "I just finished frame N". Implementation: a thread-safe counter, e.g. a `threading.Lock` around an int. (`itertools.count` is an alternative only if its thread safety under free-threading is verified first.)
  • Status text wording. "Processing frame 5/13" implies a current frame and stops being honest under parallelism — better is "5/13 done" or "5 of 13 aligned" / "5 of 13 stretched". The label should reflect a count, not a position.
  • Phase awareness. Different render stages have different totals (12 alignment pairs, then 13 stretch frames, then ~288 transition frames). The progress bar today already conflates these via `total_estimated`; under parallelism this stays the same but the per-stage updates are now non-monotonic per worker. The aggregate counter still grows monotonically — that's what matters for the bar.
  • Update rate / coalescing. Many small completions in parallel could spam UI updates. The existing `ui.timer(0.5, ...)` polling pattern in `_render` already coalesces; we just need to make sure the shared `progress_state` dict is read/written under a lock.
  • No surprises if a worker fails. Each parallel sub-issue must keep the counter consistent on exception (increment in a `finally`, or use `as_completed` to count only successful completions — explicit decision per stage).

This is cross-cutting because every parallel candidate (1, 2, 3) touches it. Easiest path: do it once as a small refactor before candidate 1, then each parallel sub-issue just plugs into the new counter.
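The counter described above could be sketched as follows. `ProgressCounter` and `run_counted` are illustrative names, and this version takes the "increment in a `finally`" option, so failed tasks still advance the bar; a stage that prefers to count only successes would increment after `fn` returns instead.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ProgressCounter:
    """Thread-safe 'N of total done' counter."""

    def __init__(self, total: int) -> None:
        self.total = total
        self._done = 0
        self._lock = threading.Lock()

    def increment(self) -> int:
        with self._lock:
            self._done += 1
            return self._done

    def snapshot(self) -> tuple[int, int]:
        with self._lock:
            return self._done, self.total

    def label(self, verb: str = "done") -> str:
        done, total = self.snapshot()
        return f"{done} of {total} {verb}"


def run_counted(counter: ProgressCounter, fn, item):
    """Run fn(item) and count the task as finished even if it raises."""
    try:
        return fn(item)
    finally:
        counter.increment()


# Usage: 13 dummy "stretch" tasks on 4 workers.
counter = ProgressCounter(total=13)
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(lambda i: run_counted(counter, str, i), range(13)))
print(counter.label("stretched"))
```

A polling UI timer would call `counter.label(...)` or `counter.snapshot()` on its own schedule, which keeps update coalescing in the UI layer rather than in the workers.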

Ordering proposal

  1. #116 Progress counter refactor: small, prerequisite for the parallel issues; landing it first means each parallel issue is just a thread-pool change
  2. #117 Alignment parallel: biggest single win, isolated change in `pipeline.py`'s alignment loop
  3. #118 Stretch parallel (debayer + tonemap per frame): second biggest win, but requires memory-bounding work
  4. #119 GIL-runtime watchdog: small, defensive (logs a warning if free-threading regresses), makes future changes safer
  5. #120 Worker-pool config & docs: cross-cutting, do once

Candidates 3 (transitions) and 5 (thumbnails) are noted but deferred; they become sub-issues if and when the impact justifies the work.

Out of scope

  • Multiprocessing / subprocess architecture (was the fallback plan if free-threading didn't pan out — no longer needed)
  • GPU acceleration (separate epic if ever)
  • Distributed rendering across machines (overengineering for current use case)

Metadata


Labels: enhancement, epic
