Epic: Performance and parallelization #109

## Context

Now that the build runs on free-threaded Python 3.13t (commit 02ccec0) with `PYTHON_GIL=0` (commit 07d8e26), real CPU parallelism is finally available: `numpy`, `scipy`, `astropy`, and `astroalign` all release the GIL during their C-level work, and with the GIL globally off they can run truly in parallel across threads. Until now the renderer pipeline ran every CPU-heavy step strictly sequentially inside a single `asyncio.to_thread` call. This epic catalogs where we can spread that work across cores.

## Real-world baseline (from a recent vulpecula20260422 render, 13 keyframes, linear-pan)

```
Alignment phase:    159.4 s  (12 pairs × ~13 s each, fully sequential)
Stretch phase:       60   s  (13 frames × ~4.5 s each, fully sequential)
Transition gen:     ~10   s  (288 transition frames, vector ops, sequential)
PNG write:          ~10   s  (≈300 frames, I/O dominated)
ffmpeg encode:        1   s
Total:              ~235  s
```

So **alignment + stretch ≈ 220 s of the 235 s** is wall-time spent in pure CPU loops with no parallelism. This is the obvious target.
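As a rough upper bound on what parallelizing these two phases could buy, the sketch below assumes ideal scaling: tasks run in full rounds of `workers` at a time, with no memory, I/O, or cache contention. The per-task timings are derived from the baseline table above; the function names are illustrative only. The listed phases sum to about 240 s rather than ~235 s, presumably because several entries are rounded.

```python
import math


def ideal_phase_time(tasks: int, seconds_per_task: float, workers: int) -> float:
    """Best-case wall time: tasks run in ceil(tasks / workers) full rounds."""
    return math.ceil(tasks / workers) * seconds_per_task


# Per-task timings derived from the baseline table.
ALIGN = (12, 159.4 / 12)   # 12 pairs, ~13.3 s each
STRETCH = (13, 60.0 / 13)  # 13 frames, ~4.6 s each
OTHER = 10 + 10 + 1        # transitions + PNG write + ffmpeg, left sequential


def estimated_total(workers: int) -> float:
    """Idealized render time with alignment and stretch parallelized."""
    return (
        ideal_phase_time(*ALIGN, workers)
        + ideal_phase_time(*STRETCH, workers)
        + OTHER
    )


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(f"{n} workers: ~{estimated_total(n):.0f} s")
```

Because 13 frames do not divide evenly into 4 workers, the stretch phase needs four rounds rather than three; real numbers will be worse than this model once contention shows up.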

## Parallelization candidates (ranked by expected impact)

### 1. Alignment phase: `astroalign.find_transform` per pair (Highest impact)
Each pair (i, i+1) is fully independent of the others; nothing about pair 5→6 needs the result of pair 4→5. With 12 pairs at ~13 s each, a 4-worker `ThreadPoolExecutor` would run three rounds of four pairs and cut alignment from 159 s to ~40 s, assuming near-linear scaling. An 8-worker pool still needs two rounds (~27 s at best) and is more likely to hit I/O and cache contention, but might still help.

Caveats:
- Memory: each pair loads two 4168×6224 uint16 arrays (≈ 52 MB per frame, ≈ 104 MB per pair). 4 workers × 2 frames × 52 MB ≈ 400 MB peak. Acceptable on workstations (the renderer's intended deployment).
- `astroalign` uses `scikit-image` and `sep` internally. `sep` requests GIL re-enable on import (we know this from issue #107), so we need to verify that `PYTHON_GIL=0` keeps the GIL off across worker threads under load.
- Disk I/O: 12 pairs × 2 frames × 35 MB = 840 MB total read (each interior frame is read twice, once per pair it belongs to). SSD handles this; spinning disk would bottleneck.
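A minimal sketch of the pairwise fan-out, assuming the per-pair work is wrapped in a callable that loads both frames inside the worker. `align_pairs` and `align_pair` are hypothetical names, not the current `pipeline.py` API; in the renderer the callable would load the two frames and call `astroalign.find_transform`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")  # e.g. a frame path
R = TypeVar("R")  # e.g. a transform


def align_pairs(
    frames: Sequence[T],
    align_pair: Callable[[T, T], R],
    max_workers: int = 4,
) -> list[R]:
    """Return one result per consecutive pair (i, i+1), in pair order."""
    pairs = list(zip(frames, frames[1:]))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order even when pairs finish
        # out of order, so downstream code can keep indexing by pair.
        return list(pool.map(lambda pair: align_pair(*pair), pairs))
```

Passing paths rather than loaded arrays keeps peak memory tied to the worker count instead of the frame count, since each worker only holds its own two frames.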

### 2. Stretch phase: debayer + tone map per frame (High impact)
`stretch_frame()` is `load_frame → debayer_frame → apply_stretch`. Each frame is independent. 13 frames × ~4.5 s sequential → ~60 s. With 4 workers and adequate memory, this could drop to ~15-20 s (four rounds of up to four frames).

Caveats:
- Memory: per frame, the debayered RGB array is 6224×4168×3 bytes = ~78 MB. 4 workers × 2 stages × 78 MB ≈ 600 MB peak. Tighter than alignment but still fine on workstations.
- `colour-demosaicing` calls `scipy.ndimage.convolve` heavily, which releases the GIL and should parallelize well.
- `astropy.io.fits` already came up in #107 as a GIL re-enable suspect and needs verification.
- The current pipeline streams frames (max 2 in memory) for a reason: large captures could exhaust RAM. We'd lose that property; need a pool sizing knob.
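The memory figures in the caveats can be checked with a few lines of arithmetic. This sketch assumes the raw frame is single-channel uint16 and the debayered output is 8-bit RGB (as the "×3 bytes" figure implies); the helper names are illustrative.

```python
def array_mb(width: int, height: int, channels: int, bytes_per_sample: int) -> float:
    """Size of a dense image array in megabytes (10**6 bytes)."""
    return width * height * channels * bytes_per_sample / 1e6


def peak_mb(workers: int, arrays_per_worker: int, mb_per_array: float) -> float:
    """Worst case: every worker holds all of its arrays at the same time."""
    return workers * arrays_per_worker * mb_per_array


raw_mb = array_mb(6224, 4168, channels=1, bytes_per_sample=2)  # uint16 Bayer frame
rgb_mb = array_mb(6224, 4168, channels=3, bytes_per_sample=1)  # 8-bit RGB (assumed)

if __name__ == "__main__":
    print(f"raw frame:      {raw_mb:.1f} MB")
    print(f"debayered RGB:  {rgb_mb:.1f} MB")
    print(f"alignment peak: {peak_mb(4, 2, raw_mb):.0f} MB")
    print(f"stretch peak:   {peak_mb(4, 2, rgb_mb):.0f} MB")
```

If any intermediate array is held as floating point, the same formula applies with a larger `bytes_per_sample` (4 for float32, 8 for float64), so the peak scales accordingly.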

### 3. Crossfade / linear-pan transition frames (Medium impact)
Within a transition, each of the N interpolated frames is a pure function of the two key frames and the parameter t. The N frames inside a single transition are independent. ~10 s of the render is spent here.

Caveats:
- The frames are large (78 MB each). Generating many in parallel multiplies memory by worker count.
- Linear-pan uses `scipy.ndimage.shift`, which releases the GIL.
- Only ~10 s of the total render — even a 4× speedup saves ~7 s. Lower priority unless we tackle this for free as part of refactoring.
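To make the "pure function of the two key frames and t" property concrete, here is a minimal crossfade sketch. It assumes 8-bit frames and a plain linear blend; the function names are hypothetical, and the linear-pan case (which also shifts the frames) is not shown.

```python
import numpy as np


def crossfade_frame(a: np.ndarray, b: np.ndarray, t: float) -> np.ndarray:
    """Linear blend of two uint8 key frames; t=0 gives a, t=1 gives b."""
    blended = (1.0 - t) * a.astype(np.float32) + t * b.astype(np.float32)
    return np.clip(np.rint(blended), 0, 255).astype(np.uint8)


def transition_params(n_frames: int) -> list[float]:
    """Evenly spaced t values strictly between the two key frames."""
    return [(i + 1) / (n_frames + 1) for i in range(n_frames)]
```

Since each output depends only on `(a, b, t)`, the values from `transition_params` can be handed to workers in any order. Note that the float32 intermediates are four times the size of the 8-bit frame, which adds to the memory multiplication mentioned above.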

### 4. PNG writing (Low impact)
PIL's PNG encoder is slow but releases the GIL. ~300 frames × ~30 ms each = ~10 s. Could batch-parallelize, but the marginal saving is small.

Caveats:
- Disk I/O bound on traditional storage.
- Could be merged with parallel transition generation (worker writes its frame straight to disk).

### 5. Thumbnail generation (already in scope; revisit)
`_make_thumbnail` was already tested with `ThreadPoolExecutor` (1.1× speedup; disk-bound). Now with free-threading verified, retest to see if there's a CPU-side win we missed. **Lowest priority**: we already have a working disk cache.

## Cross-cutting infrastructure

### A. Worker pool sizing
- Default to `min(cpu_count(), 4)` to avoid swap on 8 GB systems
- Add `NC_RENDER_WORKERS` env var / CLI flag for override
- Document tradeoffs in README
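One possible shape for the sizing rule, as a sketch. Only the `NC_RENDER_WORKERS` variable name and the cap of 4 come from the list above; the function name, the fallback on malformed values, and the floor of 1 are assumptions.

```python
import os
from typing import Mapping

DEFAULT_CAP = 4  # keeps peak memory modest on 8 GB machines


def render_worker_count(environ: Mapping[str, str] = os.environ) -> int:
    """Resolve the worker count: explicit override first, capped default otherwise."""
    override = environ.get("NC_RENDER_WORKERS")
    if override is not None:
        try:
            return max(1, int(override))
        except ValueError:
            pass  # malformed value: fall back to the default (assumed behavior)
    # os.cpu_count() may return None when the count cannot be determined.
    return max(1, min(os.cpu_count() or 1, DEFAULT_CAP))
```

The override is deliberately not capped, so a user with plenty of RAM can go beyond 4; a CLI flag would simply take precedence over the environment lookup.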

### B. GIL re-enable detection at runtime
We have `scripts/freethread_smoke.py` for import-time detection. We also need a runtime check: log a warning if `sys._is_gil_enabled()` flips to `True` during a render. This is useful for debugging and for catching regressions when a dependency adds a new C extension.
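A runtime check could be as small as a polling thread wrapped in a context manager. The sketch below is one way to do it, not an existing module: `GilWatchdog`, the logger name, and the one-second interval are all assumptions. `sys._is_gil_enabled()` exists from Python 3.13 on; on older interpreters the helper reports the GIL as enabled.

```python
import logging
import sys
import threading
from typing import Callable

log = logging.getLogger("render.gil")  # hypothetical logger name


def gil_enabled() -> bool:
    """Report whether the GIL is active; interpreters without the probe always have one."""
    probe = getattr(sys, "_is_gil_enabled", None)
    return True if probe is None else bool(probe())


class GilWatchdog:
    """Poll the GIL state in a background thread for the duration of a render."""

    def __init__(self, interval: float = 1.0, probe: Callable[[], bool] = gil_enabled) -> None:
        self._interval = interval
        self._probe = probe
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, name="gil-watchdog", daemon=True)
        self.tripped = False

    def _run(self) -> None:
        # Event.wait returns False on timeout and True once stop is requested.
        while not self._stop.wait(self._interval):
            if self._probe():
                self.tripped = True
                log.warning("GIL was re-enabled during render")
                return

    def __enter__(self) -> "GilWatchdog":
        if self._probe():
            log.warning("GIL already enabled at render start; watchdog not started")
        else:
            self._thread.start()
        return self

    def __exit__(self, *exc_info: object) -> None:
        self._stop.set()
        if self._thread.is_alive():
            self._thread.join()
```

Usage would be `with GilWatchdog(): ...` around the render call. Polling only catches the flip after the fact, which is enough for a log line but not for preventing the slowdown.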

### C. Memory pressure handling
A naïve `ThreadPoolExecutor.map` over all 13 stretch tasks submits every task up front and holds each result until it is consumed, so up to 13 × 78 MB ≈ 1 GB can be live at once. We need a streaming variant (e.g., `as_completed` with a bounded queue, or chunking) to keep peak memory bounded regardless of total frame count.
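One way to bound the number of live results is to submit lazily and refill only as the consumer drains completed work. This is a sketch of that idea using `concurrent.futures.wait`; `bounded_map` and its parameters are hypothetical names.

```python
from concurrent.futures import FIRST_COMPLETED, Future, ThreadPoolExecutor, wait
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def bounded_map(
    fn: Callable[[T], R],
    items: Iterable[T],
    max_workers: int = 4,
    max_in_flight: Optional[int] = None,
) -> Iterator[R]:
    """Yield fn(item) in completion order with at most max_in_flight live futures."""
    limit = max_in_flight or max_workers
    pending: set[Future] = set()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for item in items:
            if len(pending) >= limit:
                # Block until at least one task finishes, hand its result to
                # the consumer, and only then submit more work.
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                for fut in done:
                    yield fut.result()
            pending.add(pool.submit(fn, item))
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                yield fut.result()
```

Results arrive in completion order, so a caller that needs frame indices should have `fn` return `(index, result)`. Peak memory then scales with `max_in_flight` rather than with the total frame count, provided the consumer releases each result (for example by writing it to disk) before pulling the next one.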

### D. Test on real hardware
Current measurements come from `obelix` (modern dev workstation). Rendering on the RPi (capture machine) will hit different bottlenecks; `astroalign` may already be I/O bound there. We should not assume linear scaling.

### E. Progress reporting under parallelism
Today's progress callback is implicitly sequential: `on_progress(current, total)` is called after each frame finishes in order, and the UI label says "Rendering frame 5/13". With parallel workers, frames finish in arbitrary order, so frame 7 may complete before frame 3. The progress signal needs to change shape:

- **Counter, not index.** `on_progress` becomes "increment by one" rather than "I just finished frame N". Implementation: a thread-safe counter, preferably a `threading.Lock` around an int. `itertools.count` is an alternative, but it is not documented as thread-safe and its current value cannot be read without advancing it.
- **Status text wording.** A label like "Rendering frame 5/13" implies a current frame and stops being honest under parallelism. Better is "5/13 done" or "5 of 13 aligned" / "5 of 13 stretched". The label should reflect a count, not a position.
- **Phase awareness.** Different render stages have different totals (12 alignment pairs, then 13 stretch frames, then ~288 transition frames). The progress bar today already combines these via `total_estimated`, and that stays the same under parallelism. What changes is that completions within a stage arrive out of frame order. The aggregate counter still grows monotonically, and that is what matters for the bar.
- **Update rate / coalescing.** Many small completions in parallel could spam UI updates. The existing `ui.timer(0.5, ...)` polling pattern in `_render` already coalesces; we just need to make sure the shared `progress_state` dict is read and written under a lock.
- **No surprises if a worker fails.** Each parallel sub-issue must keep the counter consistent on exception: either increment in a `finally`, or use `as_completed` and count only successful completions. This is an explicit decision per stage.

This is cross-cutting because every parallel candidate (1, 2, and 3) touches it. Easiest path: do it once as a small refactor before the alignment work, then each parallel sub-issue just plugs into the new counter.

## Ordering proposal

1. **#116 Progress counter refactor** — small, prerequisite for the parallel issues; landing it first means each parallel issue is just a thread-pool change
2. **#117 Alignment parallel** — biggest win, isolated change in `pipeline.py`'s alignment loop
3. **#118 Stretch parallel** — second biggest win, but requires memory-bounding work
4. **#119 GIL-runtime watchdog** — small, defensive, makes future changes safer
5. **#120 Worker-pool config & docs** — cross-cutting, do once

Candidates 3 (transitions), 4 (PNG writing), and 5 (thumbnails) are noted but deferred; they become sub-issues if and when the impact justifies the work.


## Out of scope

- Multiprocessing / subprocess architecture (was the fallback plan if free-threading didn't pan out — no longer needed)
- GPU acceleration (separate epic if ever)
- Distributed rendering across machines (overengineering for current use case)


