RFC: opt-in GPU backend for invert_network (torch.linalg.lstsq, CUDA)
Motivation
invert_network is consistently the longest single step in time-series InSAR processing. Within run_ifgram_inversion_patch, the dominant cost on real data is per-pixel Python looping.
The current CPU path splits pixels into three cases:
1. OLS pixels with all ifgrams valid → solved in a single scipy.linalg.lstsq call (good).
2. OLS pixels with per-pixel NaN masks → Python loop, one scipy.linalg.lstsq call per pixel.
3. WLS pixels (per-pixel weights) → Python loop, one scipy.linalg.lstsq call per pixel, regardless of NaN pattern.
On real data, NaN observations (atmosphere, decorrelation) are common and WLS is preferred for accuracy, so cases 2 and 3 dominate. scipy.linalg.lstsq cannot vectorize these because pixels effectively have different design matrices (different row masks, different weights).
torch.linalg.lstsq accepts a batched stack of independent (A_k, y_k) systems and dispatches to CUDA. This eliminates the Python loop in cases 2 and 3 by solving them in a single GPU call.
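To make the batching idea concrete, here is a minimal sketch of how per-pixel NaN masks and weights can be folded into fixed-shape systems: invalid rows get weight zero, so every pixel keeps the same (m, n) shape and the stack can go to one batched solve. NumPy has no batched lstsq, so the sketch uses the batched np.linalg.pinv as a stand-in; on the proposed backend the stacked arrays would instead be passed to torch.linalg.lstsq. The function names, array shapes, and synthetic data are illustrative assumptions, not MintPy code.

```python
import numpy as np


def batched_masked_wls(A, y, w):
    """Solve one weighted least-squares system per pixel in a single batched call.

    A : (m, n) design matrix shared by all pixels (hypothetical layout)
    y : (m, p) observations, NaN where an interferogram is invalid
    w : (m, p) non-negative weights
    Returns x : (n, p)
    """
    valid = np.isfinite(y)
    # Square-root weights; zero for invalid rows so they drop out of the objective.
    sw = np.where(valid, np.sqrt(w), 0.0)
    # Replace NaN before multiplying, because 0 * NaN is still NaN.
    y0 = np.where(valid, y, 0.0)
    A_b = sw.T[:, :, None] * A[None, :, :]   # (p, m, n): one scaled design matrix per pixel
    y_b = (sw * y0).T[:, :, None]            # (p, m, 1)
    # Stand-in for the batched solver (torch.linalg.lstsq in the proposal).
    x_b = np.linalg.pinv(A_b) @ y_b          # (p, n, 1)
    return x_b[:, :, 0].T


def looped_masked_wls(A, y, w):
    """Reference: the per-pixel Python loop that the batched call replaces."""
    n, p = A.shape[1], y.shape[1]
    x = np.empty((n, p))
    for k in range(p):
        ok = np.isfinite(y[:, k])
        sw = np.sqrt(w[ok, k])
        x[:, k] = np.linalg.lstsq(A[ok] * sw[:, None], y[ok, k] * sw, rcond=None)[0]
    return x


# Synthetic data: 12 interferograms, 4 unknowns, 50 pixels, 2 NaN rows per pixel.
rng = np.random.default_rng(0)
m, n, p = 12, 4, 50
A = rng.normal(size=(m, n))
y = A @ rng.normal(size=(n, p)) + 0.01 * rng.normal(size=(m, p))
w = rng.uniform(0.5, 2.0, size=(m, p))
for k in range(p):
    y[[k % m, (k + 3) % m], k] = np.nan

x_batched = batched_masked_wls(A, y, w)
x_looped = looped_masked_wls(A, y, w)
print(np.allclose(x_batched, x_looped, atol=1e-8))
```

Zero-weighting a row gives the same solution as deleting it, provided the remaining rows still have full column rank; that condition is why the rank restriction in the scope matters.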
Proposal
Add an opt-in CUDA-only GPU backend for invert_network, selectable via template (or --backend CLI flag):
mintpy.networkInversion.backend = auto | cpu | torch
- auto (default): resolves to cpu via the existing check_template_auto_value static lookup. Behavior unchanged for any user who does not modify their template.
- cpu (explicit): existing scipy path, byte-for-byte unchanged.
- torch (explicit opt-in): batched torch.linalg.lstsq on CUDA.
PyTorch is gated behind a new [gpu] extras group in pyproject.toml. Users who do not install the extras pay nothing: no added dependency and no runtime cost.
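The selection logic could look roughly like the sketch below. It is an illustration under stated assumptions, not the actual implementation: resolve_backend and _cuda_available are hypothetical names, and the real auto handling would go through check_template_auto_value rather than the inline mapping shown here. torch is imported lazily, so the cpu and auto paths never touch it.

```python
import importlib.util

VALID_BACKENDS = ("auto", "cpu", "torch")


def _cuda_available():
    """Return True only if torch is installed and reports a usable CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch  # lazy import: the CPU path never pays for it
    return torch.cuda.is_available()


def resolve_backend(value, cuda_available=_cuda_available):
    """Map the template value to a concrete backend name (hypothetical helper).

    The cuda_available callable is injectable so the logic can be tested
    on machines without a GPU.
    """
    value = str(value).strip().lower()
    if value not in VALID_BACKENDS:
        raise ValueError(f"unknown backend {value!r}; expected one of {VALID_BACKENDS}")
    if value in ("auto", "cpu"):
        # 'auto' resolves to 'cpu', so default behavior is unchanged.
        return "cpu"
    # Explicit opt-in: fail fast instead of silently falling back to the CPU.
    if not cuda_available():
        raise RuntimeError(
            "backend='torch' was requested but CUDA is unavailable; "
            "install the [gpu] extras and check the driver, or use backend=auto"
        )
    return "torch"
```

For example, resolve_backend("auto") returns "cpu", while resolve_backend("torch") on a machine without CUDA raises RuntimeError.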
Scope (intentionally narrow)
- CUDA only. When the user requests backend='torch' and CUDA is unavailable, the run fails fast with an explicit error rather than silently falling back. Rationale: the user explicitly opted in; silent fallback would mask configuration / driver issues. Users without CUDA simply leave backend = auto (default cpu).
- No CPU torch backend. Out of scope. Could be a separate proposal later if there is demand; it would expand the test surface and maintenance burden.
- Full-rank pixels only on GPU. The gels driver, the only one torch.linalg.lstsq offers on CUDA, does not handle rank-deficient systems. Such pixels are rare in real SBAS networks; any that occur produce NaN in the output. The CPU path retains its existing rank handling.
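Since rank-deficient pixels are outside the GPU scope, one possible way to identify them is a batched rank check on the masked design matrices. The sketch below is an assumption about how this could be done, not a description of the proposed code: full_rank_pixels is a hypothetical helper, and the toy network (4 dates, 5 interferograms, unknowns as per-interval increments) is made up for illustration.

```python
import numpy as np


def full_rank_pixels(A, valid):
    """Flag pixels whose NaN-masked design matrix keeps full column rank.

    A     : (m, n) design matrix shared by all pixels
    valid : (m, p) boolean mask, False where an interferogram is invalid
    Returns a (p,) boolean array.
    """
    # Zero out invalid rows per pixel, giving a (p, m, n) stack.
    A_b = valid.T[:, :, None] * A[None, :, :]
    # np.linalg.matrix_rank operates on each matrix in the stack.
    return np.linalg.matrix_rank(A_b) == A.shape[1]


# Toy network: date pairs (0,1), (1,2), (2,3), (0,2), (1,3).
A = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
], dtype=float)

valid = np.ones((5, 3), dtype=bool)
valid[[2, 4], 1] = False   # pixel 1 loses every interferogram touching the last date
valid[0, 2] = False        # pixel 2 loses one interferogram but stays connected

ok = full_rank_pixels(A, valid)

x = np.zeros((A.shape[1], valid.shape[1]))  # placeholder for the solver output
x[:, ~ok] = np.nan                          # rank-deficient pixels are reported as NaN
```

Here pixel 1 is flagged because the last interval is no longer constrained by any valid interferogram, while pixels 0 and 2 remain solvable.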
Evidence (reference run, single machine)
On FernandinaSenDT128 with an RTX 5080 (16 GiB VRAM, warm SSD), I observed a 1.43× wall-time speedup for the invert_network step versus the CPU path. Numerical equivalence was verified at float32 round-off. This is a modest figure on a tutorial dataset, but on production-scale runs where invert_network dominates wall time, the absolute time saved should be meaningful. A larger-scene benchmark is planned to confirm the scaling story before the PR.
Open questions for maintainers
- API surface: template flag (current) vs CLI flag vs env var — preference?
- Extras layout: [gpu] (current) vs more specific name ([cuda] / [torch-cuda])?
- CI: comfortable adding a GPU-tagged CI job (or keeping it manual)? A no-op smoke import test is cheap and would catch packaging regressions.
- Docs scope: install steps in existing docs/installation.md, or a separate docs/gpu.md modelled on docs/dask.md?
I am following the Dask integration playbook (#349 → #351 → #357) as a reference for staged contribution, and am happy to split this into smaller issues or PRs as preferred.
https://github.com/s-sasaki-earthsea-wizard/MintPy