RFC: opt-in GPU backend for invert_network (torch.linalg.lstsq, CUDA) #1489

@s-sasaki-earthsea-wizard

Motivation

invert_network is consistently the longest single step in time-series InSAR processing. Within run_ifgram_inversion_patch, the dominant cost on real data is per-pixel Python looping.

The current CPU path splits pixels into three cases:

  1. OLS pixels with all ifgrams valid → solved in a single scipy.linalg.lstsq call (good).
  2. OLS pixels with per-pixel NaN masks → Python loop, one scipy.linalg.lstsq call per pixel.
  3. WLS pixels (per-pixel weights) → Python loop, one scipy.linalg.lstsq call per pixel, regardless of NaN pattern.

On real data, NaN observations (atmosphere, decorrelation) are common and WLS is preferred for accuracy, so cases 2 and 3 dominate. scipy.linalg.lstsq cannot vectorize these because pixels effectively have different design matrices (different row masks, different weights).

torch.linalg.lstsq accepts a batched stack of independent (A_k, y_k) systems and dispatches to CUDA. This eliminates the Python loop in cases 2 and 3 by solving them in a single GPU call.
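
To illustrate why cases 2 and 3 can share one batched solve, here is a minimal sketch of the idea. It is not the proposed implementation. It assumes invalid observations can be dropped by giving them zero weight, so every pixel keeps the same matrix shape, and it uses batched normal equations in NumPy as a CPU stand-in for the batched torch.linalg.lstsq call. The function names are hypothetical.

```python
import numpy as np


def solve_masked_wls_batched(A, y, w):
    """Solve all pixels at once (hypothetical helper, NumPy stand-in).

    A: (n_ifg, n_unknown) design matrix shared by all pixels
    y: (n_pix, n_ifg) observations, NaN where invalid
    w: (n_pix, n_ifg) non-negative weights
    Returns x with shape (n_pix, n_unknown). Assumes every pixel is full rank.
    """
    valid = np.isfinite(y)
    # A zero weight removes a row's influence without changing the shape.
    sw = np.sqrt(np.where(valid, w, 0.0))
    A_k = sw[:, :, None] * A[None, :, :]      # (n_pix, n_ifg, n_unknown)
    y_k = sw * np.where(valid, y, 0.0)        # (n_pix, n_ifg), NaN replaced by 0
    # Batched normal equations: (A_k^T A_k) x_k = A_k^T y_k
    N = A_k.transpose(0, 2, 1) @ A_k          # (n_pix, n_unknown, n_unknown)
    b = A_k.transpose(0, 2, 1) @ y_k[:, :, None]
    return np.linalg.solve(N, b)[:, :, 0]


def solve_masked_wls_loop(A, y, w):
    """Reference: one least-squares call per pixel, as in cases 2 and 3."""
    x = np.empty((y.shape[0], A.shape[1]))
    for k in range(y.shape[0]):
        valid = np.isfinite(y[k])
        sw = np.sqrt(w[k, valid])
        x[k] = np.linalg.lstsq(sw[:, None] * A[valid], sw * y[k, valid],
                               rcond=None)[0]
    return x
```

For full-rank pixels the two functions should agree to round-off. On the GPU path, the stacked A_k and y_k would be handed to torch.linalg.lstsq instead of forming normal equations.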

Proposal

Add an opt-in CUDA-only GPU backend for invert_network, selectable via template (or --backend CLI flag):

mintpy.networkInversion.backend = auto | cpu | torch
  • auto (default): resolves to cpu via the existing check_template_auto_value static lookup. Behavior is unchanged for any user who does not modify their template.
  • cpu (explicit): existing scipy path, byte-for-byte unchanged.
  • torch (explicit opt-in): batched torch.linalg.lstsq on CUDA.

PyTorch is gated behind a new [gpu] extras group in pyproject.toml. Users who do not install the extras pay nothing: no added dependency and no runtime cost.
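
As a rough sketch of how the backend choice could be resolved while keeping torch optional, consider the following. The names resolve_backend and cuda_is_available are hypothetical and not existing MintPy functions. The lazy import is an assumption about how CPU-only installs would avoid touching torch, and the fail-fast branch mirrors the behavior described under Scope.

```python
def cuda_is_available():
    """Return True only if torch imports and reports a usable CUDA device."""
    try:
        import torch  # imported lazily so CPU-only installs never need it
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


def resolve_backend(requested="auto", cuda_check=cuda_is_available):
    """Map the template value to the backend that will actually run.

    Hypothetical helper: 'auto' and 'cpu' both resolve to the scipy path;
    'torch' requires CUDA and raises instead of silently falling back.
    """
    requested = requested.strip().lower()
    if requested in ("auto", "cpu"):
        return "cpu"
    if requested == "torch":
        if not cuda_check():
            raise RuntimeError(
                "backend='torch' was requested but CUDA is not available; "
                "install the [gpu] extras and check the driver, or use backend='cpu'."
            )
        return "torch"
    raise ValueError(f"unknown backend: {requested!r}")
```

Passing cuda_check as a parameter is only a convenience for testing without a GPU; an actual implementation might check CUDA directly.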

Scope (intentionally narrow)

  • CUDA only. When the user requests backend='torch' and CUDA is unavailable, the run fails fast with an explicit error rather than silently falling back. Rationale: the user explicitly opted in; silent fallback would mask configuration / driver issues. Users without CUDA simply leave backend = auto (default cpu).
  • No CPU torch backend. Out of scope. It could be a separate proposal later if there is demand, but it would expand the test surface and maintenance burden.
  • Full-rank pixels only on GPU. CUDA's gels driver does not handle rank-deficient systems. Such pixels are rare in real SBAS networks; when they occur, they produce NaN in the output. The CPU path retains its existing rank handling.
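
One way to make the full-rank restriction explicit is to flag rank-deficient pixels before the batched solve and write NaN for them afterwards. The sketch below is an assumption about how that could be done, not a description of the proposed backend. It works in NumPy, ignores weights, and uses hypothetical helper names.

```python
import numpy as np


def full_rank_mask(A, valid):
    """Flag pixels whose valid rows of A still have full column rank.

    A:     (n_ifg, n_unknown) shared design matrix
    valid: (n_pix, n_ifg) boolean mask of usable observations
    Returns a boolean array of shape (n_pix,).
    """
    # Zeroing the invalid rows keeps one shape for every pixel.
    A_k = valid[:, :, None] * A[None, :, :]
    # matrix_rank operates on the last two axes of a stacked array.
    return np.linalg.matrix_rank(A_k) == A.shape[1]


def fill_rank_deficient_with_nan(x, ok):
    """Return a copy of x with rows of rank-deficient pixels set to NaN."""
    out = np.array(x, dtype=float, copy=True)
    out[~ok] = np.nan
    return out
```

A pixel that loses every interferogram touching one acquisition date, for example, would be flagged as rank deficient and receive NaN.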

Evidence (reference run, single machine)

On FernandinaSenDT128 with an RTX 5080 (16 GiB VRAM, warm SSD), I observed a 1.43× wall-time speedup for the invert_network step compared with the CPU path. Numerical equivalence with the CPU result was verified to float32 round-off. This is a modest figure on a tutorial dataset, but on production-scale runs where invert_network dominates wall time, the absolute time saved should be practical. A larger-scene benchmark is planned to confirm the scaling story before a PR.

Open questions for maintainers

  1. API surface: template flag (current) vs CLI flag vs env var — preference?
  2. Extras layout: [gpu] (current) vs more specific name ([cuda] / [torch-cuda])?
  3. CI: comfortable adding a GPU-tagged CI job (or keeping it manual)? A no-op smoke import test is cheap and would catch packaging regressions.
  4. Docs scope: install steps in existing docs/installation.md, or a separate docs/gpu.md modelled on docs/dask.md?

I am following the Dask integration playbook (#349, #351, #357) as a reference for staged contribution. Happy to split this into smaller issues or PRs as preferred.
