Skip to content

Add opt-in torch GPU solver for invert_network#1490

Open
s-sasaki-earthsea-wizard wants to merge 3 commits into
insarlab:mainfrom
s-sasaki-earthsea-wizard:gpu_torch_solver
Open

Add opt-in torch GPU solver for invert_network#1490
s-sasaki-earthsea-wizard wants to merge 3 commits into
insarlab:mainfrom
s-sasaki-earthsea-wizard:gpu_torch_solver

Conversation

@s-sasaki-earthsea-wizard
Copy link
Copy Markdown

@s-sasaki-earthsea-wizard s-sasaki-earthsea-wizard commented May 6, 2026

Description of proposed changes

This PR adds an opt-in CUDA-accelerated path for the per-pixel weighted
least-squares inversion in the invert_network step (ifgram_inversion.py).
The fork has been running this code on tutorial-scale and large-scale scenes
for several weeks; this submission consolidates the implementation as it
currently stands.

The default path is unchanged. mintpy.networkInversion.solver = auto
resolves to cpu, and the existing CPU code path is byte-for-byte identical
to upstream — every other step in smallbaselineApp.py continues to run on
the CPU regardless of this setting.

The aim is to contribute faster InSAR time-series processing for NVIDIA GPU
users, since invert_network is the dominant CPU bottleneck on typical
workflows and the gap widens with scene size.

Closes #1489 (RFC).

Implementation summary

  • New module src/mintpy/ifgram_inversion_gpu.py batches the per-pixel WLS
    systems on a single CUDA device. The solver is normal-equations +
    Cholesky
    via torch.linalg.cholesky_ex, which (a) is significantly
    faster than torch.linalg.lstsq on the matrix shapes encountered here and
    (b) lets us detect rank-deficient pixels through the returned info codes
    rather than via post-hoc residual checks.
  • ifgram_inversion.py dispatches to the GPU module only when
    solver = torch is explicitly requested; the CPU loop is untouched.
  • New [gpu] extras in pyproject.toml, sourced from requirements-gpu.txt
    (just torch>=2.11). Install requires the PyTorch CUDA wheel index:
    pip install -e ".[gpu]" --extra-index-url https://download.pytorch.org/whl/cu128
    (documented in docs/installation.md §2.4).
  • Tests in tests/test_ifgram_inversion_gpu.py cover the dispatch logic and
    the GPU fast paths with synthetic NaN / rank-deficient fixtures.

Behavior notes

  • VRAM auto-sizinggpuChunkSize = 0 (the default) probes free GPU
    memory at runtime and chooses a per-chunk pixel count with a fixed
    headroom factor; passing a positive integer overrides this for
    reproducible chunking across hosts with different VRAM.
  • Rank-deficient pixels are detected via cholesky_ex info codes and
    zeroed so NaN/Inf cannot propagate downstream; a warning line reports
    the count per chunk.
  • Per-pixel NaN observations are handled by zeroing the corresponding
    row weight, which is mathematically equivalent to dropping that row from
    the WLS system.
  • No silent CPU fallback — selecting solver = torch on a host without
    a visible CUDA device raises immediately rather than silently falling
    back to CPU; this keeps any performance regression visible.

Design pivot vs the original RFC

The RFC (#1489) originally described torch.linalg.lstsq as the GPU solver.
During development the path was switched to normal-equations + Cholesky
after a side-by-side benchmark showed it preserves output equivalence to
float32 round-off (RMS ~1e-5) while running ~16× faster than lstsq on the
same matrix shapes (tutorial dataset: FernandinaSenDT128). An RMS difference
on the order of 1e-5 in the displacement field is well below the typical
InSAR noise floor — sub-millimeter on a per-pixel basis — so the two solvers
are operationally equivalent for the geophysical use case. The lstsq path
was removed before this submission so there is only one supported GPU code
path to reason about.

Performance

Indicative numbers measured on an NVIDIA RTX 5080 (Blackwell sm_120, CUDA
12.8, PyTorch 2.11). Speedup will vary with scene size, GPU class, and
chunk-size tuning.

Scene Pixels ifgs invert_network internal step wall
FernandinaSenDT128 (tutorial) 270k 288 ~16× faster ~4.5× faster
GalapagosSenDT128 (large) 3.4M 475 ~44× faster ~36× faster

Large-scene absolute timings: CPU 6189 s → torch 170 s on the same machine.
Numerical equivalence between the cpu and torch solvers holds to
float32 round-off in both cases (RMS on the order of 1e-5; absolute RMS
max ~16 µm on the large-scene case).

Reproduction artifacts (harness scripts, raw logs, full reports) live in a
separate repository. Links below are pinned to a single sibling commit so
the data does not move during review:

Numbers are from a single development machine; absolute timings will vary
across hardware, but the qualitative findings (Cholesky > lstsq; GPU > CPU
at this matrix scale; speedup grows with scene size) should hold for any
recent NVIDIA CUDA-class device. Harness scripts and raw logs in the
mintpy-benchmark repository above let other GPU users reproduce on their
own data.

Local validation

Run on the PR branch (upstream/main + the three commits in this PR), against
FernandinaSenDT128:

  • pre-commit run --all-files exits clean (13 hooks pass + 1 skip on json,
    per the upstream .pre-commit-config.yaml).

  • smallbaselineApp.py end-to-end with default settings (solver = auto
    resolves to cpu): all 18 steps, Normal end of smallbaselineApp processing!,
    total wall time 2 h 2 m (the correct_troposphere step's CDS download
    dominated; the actual computation portion is small). All standard output
    products generated (timeseries.h5, timeseries_ERA5.h5,
    timeseries_ERA5_ramp.h5, timeseries_ERA5_ramp_demErr.h5, velocity.h5,
    velocityERA5.h5, geo/, etc.).

  • smallbaselineApp.py end-to-end with solver = torch (same dataset,
    ERA5 grib + tropo product reused via symlink so this run only re-exercises
    invert_network and the post-tropo steps): all 18 steps, Normal end of smallbaselineApp processing!, total wall time 6 m 57 s. Log confirms the
    GPU path was actually entered:

    mintpy.networkInversion.solver: auto --> torch
    estimating time-series via torch solver (batched, GPU)
    GPU auto chunk_size = 19403 pixels (free VRAM 15.1 GiB)
    estimating time-series via torch batched WLS in 14 chunk(s) of up to 19403 pixels ...
    

    Same set of standard output products as the CPU run.

Disclosure

This work was developed with the assistance of Claude Opus 4.7 (Anthropic's coding
assistant). All design decisions, benchmark execution, and review of the
generated code were performed by me. Per project convention the
Assisted-by: Claude Opus 4.7 trailers used during fork development have been stripped
from this branch's commit history; this paragraph is the canonical
disclosure.

If the AI-assisted aspect raises review or maintenance concerns for the
project, I'm happy to discuss — including whether to keep the GPU module
opt-in / under a feature flag.

Reminders

  • Fix RFC: opt-in GPU backend for invert_network (torch.linalg.lstsq, CUDA) #1489
  • Pass Pre-commit check (green) — verified locally with the upstream .pre-commit-config.yaml; CI to confirm.
  • Pass Codacy code review (green)
  • Pass Circle CI test (green)
  • Make sure that your code follows our style. Use the other functions/files as a basis.
  • If modifying functionality, describe changes to function behavior and arguments in a comment below the function declaration.
  • If adding new functionality, add a detailed description to the documentation and/or an example.

Summary by Sourcery

Add an opt-in GPU-accelerated solver for the invert_network step using PyTorch CUDA while keeping the existing CPU behavior as the default.

New Features:

  • Introduce a CUDA-batched weighted least-squares solver for invert_network backed by PyTorch, selectable via a new solver option.
  • Add configuration options and CLI flags to choose the WLS solver (CPU or torch) and control GPU chunk size.
  • Provide user documentation describing GPU setup, configuration, and performance characteristics for the new solver.

Enhancements:

  • Extend ifgram inversion to dispatch to a batched GPU path when the torch solver is selected, including NaN handling and rank-deficient pixel safeguards.

Build:

  • Define a new [gpu] optional dependency group in pyproject.toml sourced from requirements-gpu.txt for installing CUDA-enabled PyTorch.

Documentation:

  • Document GPU installation and configuration in installation.md and a new gpu.md guide, and cross-link it from the main docs and Dask documentation.

Tests:

  • Add GPU-focused tests that compare the batched torch solver against the existing CPU reference implementation and verify behavior across chunk sizes and invalid configurations.

Add a CUDA-accelerated path for the per-pixel weighted least-squares
inversion in `ifgram_inversion.py`, batched as normal-equations + Cholesky
on a single CUDA device via PyTorch. The solver is opt-in and the default
(`mintpy.networkInversion.solver = auto`) resolves to `cpu`, so existing
setups are unaffected and the CPU path is byte-for-byte unchanged.

Surface
- cfg keys: `mintpy.networkInversion.solver = cpu|torch` (default `auto`),
  `mintpy.networkInversion.gpuChunkSize = <int>` (default 0 = auto-size).
- CLI flags: `--solver {cpu,torch}` and `--gpu-chunk-size N` on
  `ifgram_inversion.py`.
- New module `src/mintpy/ifgram_inversion_gpu.py` holds the torch path;
  `ifgram_inversion.py` dispatches to it only when `solver=torch` is
  explicitly requested.

Behavior
- VRAM auto-sizing probes free GPU memory and chooses a per-chunk pixel
  count with a fixed headroom factor; `gpuChunkSize > 0` overrides.
- Rank-deficient pixels are detected via `torch.linalg.cholesky_ex` info
  codes and zeroed so NaN/Inf cannot propagate downstream.
- Per-pixel NaN observations are handled by zeroing the corresponding row
  weight, which is mathematically equivalent to dropping that row from the
  WLS system.
- Selecting `solver=torch` on a host without a visible CUDA device raises
  immediately rather than silently falling back to CPU, keeping any
  performance regression visible.

Packaging
- Adds `[gpu]` extras in `pyproject.toml`, sourced from
  `requirements-gpu.txt`. The PyTorch CUDA wheels live on a separate index;
  `installation.md` documents the install command in a follow-up commit.
The opt-in GPU solver in `ifgram_inversion_gpu.py` is implemented entirely
on top of `torch.linalg.cholesky_ex`, with no cupy entry point. Listing
`cupy-cuda12x` in `requirements-gpu.txt` therefore pulls a multi-hundred-MB
runtime that no code path imports. Drop it.

Pin `torch>=2.11` to match the version exercised in the bench matrix used
during development (Blackwell sm_120 wheel from the cu128 index). Earlier
torch releases have not been validated against this code path.
Document the new opt-in `torch` GPU solver added in the previous commits:

- `docs/gpu.md` — setup, CLI / template surface, behavior notes (VRAM
  auto-sizing, rank-deficient pixel handling, NaN observations, hard-fail
  on missing CUDA), and indicative performance numbers.
- `docs/installation.md` §2.4 — install the `[gpu]` extras together with
  the matching PyTorch CUDA wheel index.
- `docs/README.md` and `docs/dask.md` — add cross-links so readers can
  reach the GPU page from the documentation root and from the Dask page
  (since the two parallelism paths are orthogonal and need to be picked
  one or the other).

Performance numbers in `gpu.md` §4 are stated inline without any external
repository links so the page stays self-contained.
@sourcery-ai
Copy link
Copy Markdown
Contributor

sourcery-ai Bot commented May 6, 2026

Reviewer's Guide

Adds an opt-in PyTorch/CUDA-based batched weighted least-squares solver for the invert_network step, wires it through the CLI and templates while preserving the existing CPU default behavior, and documents GPU installation, configuration, and performance, along with CUDA-gated tests for numerical equivalence and behavior.

Sequence diagram for torch-based GPU solver dispatch in invert_network

sequenceDiagram
    actor User
    participant CLI as ifgram_inversion_cli
    participant Inv as run_ifgram_inversion
    participant Patch as run_ifgram_inversion_patch
    participant GPU as estimate_timeseries_batch
    participant Torch as torch

    User->>CLI: ifgram_inversion.py --solver torch --gpu-chunk-size 0
    CLI->>CLI: create_parser()
    CLI->>CLI: parse_args()
    CLI->>Inv: run_ifgram_inversion(inps)
    Inv->>Patch: run_ifgram_inversion_patch(
    Inv->>Patch: ... solver='torch', gpu_chunk_size=0 ...)

    Patch->>GPU: estimate_timeseries_batch(A, B, y,
    Patch->>GPU: tbase_diff, weight_sqrt,
    Patch->>GPU: min_norm_velocity,
    Patch->>GPU: rcond, min_redundancy,
    Patch->>GPU: inv_quality_name,
    Patch->>GPU: chunk_size=gpu_chunk_size,
    Patch->>GPU: solver='torch')

    GPU->>Torch: _get_torch_device('torch')
    GPU->>Torch: torch.cuda.is_available()
    alt chunk_size<=0
        GPU->>Torch: torch.cuda.mem_get_info()
    end

    loop for each chunk
        GPU->>Torch: _solve_cholesky(G_dev, w_dev, y_dev)
        Torch->>Torch: torch.linalg.cholesky_ex(N)
        Torch->>Torch: torch.cholesky_solve(r, L)
    end

    GPU-->>Patch: ts_sub, inv_quality_sub, num_inv_obs_sub
    Patch->>Patch: ts[:, idx_pixel2inv] = ts_sub
    Patch->>Patch: inv_quality[idx_pixel2inv] = inv_quality_sub
    Patch->>Patch: num_inv_obs[idx_pixel2inv] = num_inv_obs_sub
    Patch-->>Inv: ts, ts_cov, inv_quality
    Inv-->>CLI: write outputs
    CLI-->>User: Completed invert_network with GPU solver
Loading

File-Level Changes

Change Details Files
Introduce a GPU-batched WLS solver implementation for invert_network using PyTorch and Cholesky on normal equations, including chunking, NaN handling, and rank-deficiency handling.
  • Add new ifgram_inversion_gpu module providing estimate_timeseries_batch with batched Cholesky-based WLS/OLS solving on CUDA.
  • Implement VRAM-based automatic chunk-size selection with configurable override, and helper functions for solver/device availability.
  • Handle per-pixel NaNs via zeroed weights, detect rank-deficient pixels via cholesky_ex info codes and zero out their solutions, and compute temporal coherence / residual quality metrics on GPU.
src/mintpy/ifgram_inversion_gpu.py
Wire the GPU solver into the existing invert_network pipeline as an opt-in path, leaving CPU behavior and defaults unchanged.
  • Extend run_ifgram_inversion_patch to accept solver and gpu_chunk_size parameters and dispatch to the GPU batch solver when solver!='cpu'.
  • Keep the original CPU per-pixel inversion logic under the else branch, ensuring backward-compatible behavior when solver is cpu or unspecified.
  • Pass solver and gpuChunkSize from CLI/template inputs into run_ifgram_inversion_patch within run_ifgram_inversion.
src/mintpy/ifgram_inversion.py
Expose configuration knobs for selecting the solver and GPU chunk size via CLI and template defaults.
  • Add --solver and --gpu-chunk-size options to the ifgram_inversion CLI, including help text and valid choices.
  • Teach template parsing to recognize mintpy.networkInversion.solver and mintpy.networkInversion.gpuChunkSize keys and cast gpuChunkSize to int.
  • Set explicit defaults for solver and gpuChunkSize in smallbaselineApp_auto.cfg and add auto-resolving defaults with documentation comments in smallbaselineApp.cfg.
src/mintpy/cli/ifgram_inversion.py
src/mintpy/defaults/smallbaselineApp.cfg
src/mintpy/defaults/smallbaselineApp_auto.cfg
Add optional GPU extras and requirements to the build configuration for installing CUDA-enabled PyTorch.
  • Define a [tool.setuptools.dynamic.optional-dependencies.gpu] extra in pyproject.toml that pulls from requirements-gpu.txt.
  • Introduce requirements-gpu.txt to hold the torch (CUDA) dependency; the diff shows its creation as an empty placeholder, but it is referenced by the new gpu extra.
pyproject.toml
requirements-gpu.txt
Document GPU acceleration setup, configuration, and behavior for invert_network, and cross-link from existing docs.
  • Add a new section 2.4 to installation.md describing the optional PyTorch CUDA GPU path, installation of the [gpu] extras, verification, and enabling via CLI/template.
  • Create gpu.md with end-user documentation on enabling the GPU solver, CLI/template usage, behavior notes (VRAM sizing, NaNs, rank deficiency, no CPU fallback), and performance expectations.
  • Update docs index/readmes and Dask documentation to reference the new GPU docs and clarify that the GPU solver is orthogonal to Dask parallelism.
docs/installation.md
docs/gpu.md
docs/README.md
docs/dask.md
Add CUDA-gated tests to validate numerical equivalence between the CPU and GPU solvers and to cover solver selection behavior.
  • Create test_ifgram_inversion_gpu.py with synthetic SBAS-like network generators and fixtures to compare GPU estimate_timeseries_batch against the CPU estimate_timeseries reference across WLS/OLS, NaN presence, min_norm_velocity modes, and varying chunk sizes.
  • Gate tests on torch availability and CUDA using pytest.importorskip and skipif, and include a negative test ensuring unsupported solver values raise an error.
  • Verify that the GPU solver maintains finite outputs, preserves number-of-observation counts, and matches CPU timeseries/quality within float32-level tolerances.
tests/test_ifgram_inversion_gpu.py

Assessment against linked issues

Issue Objective Addressed Explanation
#1489 Implement an opt-in CUDA-based GPU backend for the invert_network step (using PyTorch), with configuration (template/CLI) to choose between auto/cpu/torch while keeping the existing CPU behavior as the default and unchanged.
#1489 Provide packaging and configuration support for the GPU backend via a new [gpu] optional dependency group and document installation, configuration, and usage of the GPU solver.
#1489 Ensure the GPU backend is CUDA-only (no CPU Torch path, no silent fallback when CUDA is unavailable) and correctly handles per-pixel NaNs/weights in a batched fashion.

Possibly linked issues


Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@s-sasaki-earthsea-wizard
Copy link
Copy Markdown
Author

Sharing additional benchmark data for the GPU torch solver. The PR description's Performance section reports per-step invert_network numbers on FernandinaSenDT128 and GalapagosSenDT128 (both ISCE2 / Sentinel-1 C-band). The bench has since been extended to 5 scenes spanning 4 InSAR processors (ISCE2, GMTSAR, ARIA, ROI_PAC) and 2 sensors / wavelengths (Sentinel-1 C-band, ALOS-1 L-band):

Scene Processor / Sensor Pixels Ifgs (K) Dates (D) invert_network cpu wall torch wall speedup
FernandinaSenDT128 ISCE2 / S1 (C) 270 k 288 98 645.12 s 6.88 s 93.77×
GalapagosSenDT128 ISCE2 / S1 (C) 3.40 M 490 98 2976.72 s 79.40 s 37.49×
SanFranBaySenD42 GMTSAR / S1 (C) 326 k 1297 333 1080.38 s 17.42 s 62.02×
KujuAlosAT422F650 ROI_PAC / ALOS-1 (L) 226 k 167 24 31.01 s 4.53 s 6.85×
SanFranSenDT42 ARIA / S1 (C) 1.04 M 505 114 58.85 s 11.07 s 5.32×

Configuration: warm SSD, NVIDIA RTX 5080 (16 GiB), float32, mintpy.networkInversion.solver = torch. Each run was end-to-end smallbaselineApp.py (full 18-step pipeline), not a direct call to the solver — wall numbers above are extracted from /usr/bin/time -v capture of the invert_network segment. The cpu-only steps in the same pipeline (load_data, modify_network, correct_SET, correct_troposphere, deramp, save_hdfeos5) stay within ±5 % between the cpu and torch runs of each scene, serving as an I/O / cache control.

The 5 — 94× range is structurally driven by per-pixel solve cost (∝ K · D²): Kuju (K=167, D=24) sits at the floor, SanFranBay (K=1297, D=333) at the ceiling. The Fernandina and Galapagos figures above are consistent with the ~16× internal / ~4.5× step-wall (Fernandina) and ~44× internal / ~36× step-wall (Galapagos) numbers in the original PR description; the larger headline here reflects the warm-SSD scene root and the per-step wall extracted from a fresh end-to-end run with both cpu and torch using identical fixtures.

Numerical agreement: the float32 round-off gate (rms / |cpu|.max < 1e-5) is met for the user-visible final products (velocity.h5, geocoded outputs) in all 5 scenes. Two scenes (Kuju, SanFranSF) show divergence on radar-coordinate intermediate products at rms/scale 1 — 7 %, but Kuju's geocoded velocity (filtered through maskTempCoh.h5) passes at 1.38e-7 — consistent with the divergence being confined to pixels that the downstream maskTempCoh.h5 mask drops anyway. Diagnosed in the report as cpu scipy.linalg.lstsq min-norm fill vs torch cholesky_ex fill for near-rank-deficient masked-out pixels; making the radar-coord diff tool mask-aware is queued as a sibling-repo follow-up.

Full per-step wall breakdown, fixture parity verification (cpu and torch fixtures verified byte-identical except for the 2 mintpy.*.solver = torch lines), and the numerical comparison methodology are in the report:

reports/report_end_to_end_bench.md @ 0fbf71b

Dataset records (Zenodo):

@yunjunz yunjunz requested a review from huchangyang May 27, 2026 08:00
@s-sasaki-earthsea-wizard
Copy link
Copy Markdown
Author

@sourcery-ai Please review this PR.

@sourcery-ai
Copy link
Copy Markdown
Contributor

sourcery-ai Bot commented Jun 4, 2026

Sure! I'm generating a new review now.

Copy link
Copy Markdown
Contributor

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 4 issues, and left some high level feedback:

  • The inline comment in run_ifgram_inversion_patch above the GPU branch still states that rank-deficient pixels are not handled and NaN/Inf may propagate, which conflicts with the actual behavior in ifgram_inversion_gpu._solve_cholesky; consider updating this comment to accurately describe the current rank-deficiency handling.
  • The newly added requirements-gpu.txt is currently empty in the diff; if this file is meant to drive the [gpu] extra, it should explicitly declare the CUDA-enabled torch requirement so that pip install -e '.[gpu]' works as documented.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The inline comment in `run_ifgram_inversion_patch` above the GPU branch still states that rank-deficient pixels are not handled and NaN/Inf may propagate, which conflicts with the actual behavior in `ifgram_inversion_gpu._solve_cholesky`; consider updating this comment to accurately describe the current rank-deficiency handling.
- The newly added `requirements-gpu.txt` is currently empty in the diff; if this file is meant to drive the `[gpu]` extra, it should explicitly declare the CUDA-enabled `torch` requirement so that `pip install -e '.[gpu]'` works as documented.

## Individual Comments

### Comment 1
<location path="tests/test_ifgram_inversion_gpu.py" line_range="17-18" />
<code_context>
+from mintpy.ifgram_inversion import estimate_timeseries
+from mintpy.ifgram_inversion_gpu import estimate_timeseries_batch
+
+requires_cuda = pytest.mark.skipif(
+    not torch.cuda.is_available(),
+    reason="CUDA-capable GPU required for ifgram_inversion_gpu tests",
+)
</code_context>
<issue_to_address>
**suggestion (testing):** Add unit tests for CUDA-availability / solver-availability gating instead of only skipping tests when CUDA is absent.

Because this module is entirely skipped when CUDA is unavailable, we never actually test the CUDA-gating logic in `ifgram_inversion_gpu` (`is_solver_available`, `_get_torch_device`, and the error paths when CUDA is missing). Please add small unit tests that:

- Exercise `is_solver_available('cpu')` and `is_solver_available('torch')` under different conditions by monkeypatching `torch.cuda.is_available()` and `HAS_TORCH`.
- Check that `_get_torch_device('torch')` raises a clear error when PyTorch is missing or `torch.cuda.is_available()` is False.

These can be done by monkeypatching `torch` in the module namespace so they run on CPU-only CI without a real GPU.

Suggested implementation:

```python
import pytest

torch = pytest.importorskip("torch")

from mintpy.ifgram_inversion import estimate_timeseries
from mintpy import ifgram_inversion_gpu as ifgram_inversion_gpu_mod
from mintpy.ifgram_inversion_gpu import (
    estimate_timeseries_batch,
    is_solver_available,
    _get_torch_device,
)

```

```python
requires_cuda = pytest.mark.skipif(
    not torch.cuda.is_available(),
    reason="CUDA-capable GPU required for ifgram_inversion_gpu tests",
)


def test_is_solver_available_cpu_and_unknown_solver():
    # CPU solver should always be reported as available
    assert is_solver_available("cpu")
    # Unknown solver names should be rejected
    assert not is_solver_available("nonexistent-solver-name")


def test_is_solver_available_torch_with_and_without_cuda(monkeypatch):
    # Ensure we behave correctly when PyTorch is present and CUDA toggles
    monkeypatch.setattr(ifgram_inversion_gpu_mod, "HAS_TORCH", True, raising=False)

    # Simulate CUDA available
    monkeypatch.setattr(
        ifgram_inversion_gpu_mod.torch.cuda,
        "is_available",
        lambda: True,
        raising=True,
    )
    assert is_solver_available("torch")

    # Simulate CUDA not available
    monkeypatch.setattr(
        ifgram_inversion_gpu_mod.torch.cuda,
        "is_available",
        lambda: False,
        raising=True,
    )
    assert not is_solver_available("torch")


def test_is_solver_available_torch_without_pytorch(monkeypatch):
    # Simulate environment where PyTorch is not available to the module
    monkeypatch.setattr(ifgram_inversion_gpu_mod, "HAS_TORCH", False, raising=False)
    # torch attribute may still exist in the module; we explicitly mark it as unavailable
    assert not is_solver_available("torch")


def test_get_torch_device_raises_without_pytorch(monkeypatch):
    # Force the module to think PyTorch is missing
    monkeypatch.setattr(ifgram_inversion_gpu_mod, "HAS_TORCH", False, raising=False)
    with pytest.raises(RuntimeError, match="PyTorch"):
        _get_torch_device("torch")


def test_get_torch_device_raises_without_cuda(monkeypatch):
    # Force the module to think PyTorch is present but CUDA is not available
    monkeypatch.setattr(ifgram_inversion_gpu_mod, "HAS_TORCH", True, raising=False)
    monkeypatch.setattr(
        ifgram_inversion_gpu_mod.torch.cuda,
        "is_available",
        lambda: False,
        raising=True,
    )
    with pytest.raises(RuntimeError, match="CUDA"):
        _get_torch_device("torch")


def make_redundant_network(num_date, num_pair, *, max_span=4, seed=0):

```
</issue_to_address>

### Comment 2
<location path="tests/test_ifgram_inversion_gpu.py" line_range="126-127" />
<code_context>
+    return make_redundant_network(num_date=98, num_pair=288, max_span=6, seed=0)
+
+
+def _all_pixels_full_rank(A, y):
+    """Return True if every pixel's design matrix (after dropping NaN rows)
+    is still full-rank. Used to keep tests off the rank-deficient edge case
+    (which is handled separately at runtime by a CPU fallback path).
</code_context>
<issue_to_address>
**issue (testing):** Add a positive test for the rank-deficient pixel path handled by the GPU solver.

Current tests explicitly avoid rank-deficient cases via `_all_pixels_full_rank`, but `_solve_cholesky` has specific behavior for them (detect via `cholesky_ex` info, warn, zero solutions) that isn’t covered.

Please add a test that forces a rank-deficient pixel (e.g., duplicate rows in `A`/`B` or strong NaN masking) such that `info != 0` for at least one pixel, and assert that:
- A warning is emitted.
- The corresponding `ts` values are all zeros.
- `inv_quality` remains finite and non-NaN.

This will verify the documented GPU failure-handling path behaves as intended.
</issue_to_address>

### Comment 3
<location path="tests/test_ifgram_inversion_gpu.py" line_range="139-148" />
<code_context>
+    return True
+
+
+@requires_cuda
+def test_wls_no_nan(network):
+    """WLS, no NaN observations — expect ~ float32 round-off match."""
+    A, B, tbase_diff = network
+    y, w = synthesize_observations(A, B, num_pixel=64, nan_frac=0.0, seed=1)
+    cpu = cpu_reference(A, B, y, w, tbase_diff)
+    gpu = estimate_timeseries_batch(
+        A=A, B=B, y=y, tbase_diff=tbase_diff, weight_sqrt=w,
+        min_norm_velocity=True,
+        chunk_size=64, solver='torch', print_msg=False,
+    )
+    assert_equivalent(cpu, gpu, ts_rel_tol=1e-5, tcoh_abs_tol=1e-5)
+
+
+@requires_cuda
+def test_wls_with_nan_redundant(network):
+    """WLS with low NaN rate on a redundant network — float32 round-off match.
+
</code_context>
<issue_to_address>
**suggestion (testing):** Extend tests to cover the network-redundancy cutoff branch (`min_redundancy`).

All tests currently use `min_redundancy=1.0` and fully redundant networks, so the under-redundancy early-return branch (`np.min(np.sum(A != 0., axis=0)) < min_redundancy`) is never exercised.

Please add a test that constructs an under-redundant design matrix (e.g., one column with a single non-zero), sets `min_redundancy` above that count, and verifies that `ts`, `inv_quality`, and `num_inv_obs` are all zeros and match the CPU `estimate_timeseries` result. This will ensure the redundancy cutoff logic is covered and protected against regressions.

Suggested implementation:

```python
torch = pytest.importorskip("torch")


@requires_cuda
def test_min_redundancy_under_redundant_network(network):
    """Under-redundant design matrix triggers min_redundancy early return.

    Constructs a network with a column that has a single non-zero entry,
    sets ``min_redundancy`` above that count, and ensures GPU and CPU
    paths both return zero-valued results.
    """
    A, B, tbase_diff = network

    # Make the first column under-redundant: keep only a single non-zero.
    A_under = A.copy()
    col0_nz = np.where(A_under[:, 0] != 0.0)[0]
    if col0_nz.size < 1:
        pytest.skip(
            "Network fixture does not have a non-zero in the first column; "
            "cannot construct an under-redundant column."
        )
    A_under[col0_nz[1:], 0] = 0.0

    # Choose a redundancy threshold strictly greater than the actual count (1).
    min_redundancy = 2.0

    # Synthetic observations for a small number of pixels.
    y, w = synthesize_observations(
        A_under, B, num_pixel=8, nan_frac=0.0, seed=7
    )

    cpu = cpu_reference(
        A_under, B, y, w, tbase_diff, min_redundancy=min_redundancy
    )
    gpu = estimate_timeseries_batch(
        A=A_under,
        B=B,
        y=y,
        tbase_diff=tbase_diff,
        weight_sqrt=w,
        min_norm_velocity=True,
        min_redundancy=min_redundancy,
        chunk_size=8,
        solver="torch",
        print_msg=False,
    )

    # Both CPU and GPU paths should have early-returned zeros.
    for key in ("ts", "inv_quality", "num_inv_obs"):
        assert np.all(cpu[key] == 0.0)
        assert np.all(gpu[key] == 0.0)

    # And they must still agree exactly with each other.
    np.testing.assert_array_equal(cpu["ts"], gpu["ts"])
    np.testing.assert_array_equal(cpu["inv_quality"], gpu["inv_quality"])
    np.testing.assert_array_equal(cpu["num_inv_obs"], gpu["num_inv_obs"])

```

This test assumes:

1. `cpu_reference` accepts a `min_redundancy` keyword argument and forwards it to `mintpy.ifgram_inversion.estimate_timeseries`. If it does not yet accept this parameter, you will need to:
   - Update `cpu_reference` to take `min_redundancy` and pass it through to the underlying CPU implementation.
   - Ensure that the CPU implementation uses the same early-return behavior (zeroed `ts`, `inv_quality`, `num_inv_obs`) under the `min_redundancy` cutoff.

2. `estimate_timeseries_batch` already has a `min_redundancy` parameter that controls the `np.min(np.sum(A != 0., axis=0)) < min_redundancy` early-return branch mentioned in your comment. If the parameter name differs, adjust the keyword in the test accordingly.

3. If the keys in the output dictionaries differ from `("ts", "inv_quality", "num_inv_obs")`, adapt the key names used in the assertions to match your actual result structure.
</issue_to_address>

### Comment 4
<location path="tests/test_ifgram_inversion_gpu.py" line_range="15" />
<code_context>
+    # for the full-rank case. Rank-deficient pixels (rare on real SBAS networks)
+    # are not handled here; if encountered, NaN/Inf will propagate downstream.
+    if solver != 'cpu':
+        from mintpy.ifgram_inversion_gpu import estimate_timeseries_batch
+        print(f'estimating time-series via {solver} solver (batched, GPU)')
+        ts_sub, q_sub, n_sub = estimate_timeseries_batch(
</code_context>
<issue_to_address>
**suggestion (testing):** Consider adding lightweight tests for the high-level `solver` dispatch in `run_ifgram_inversion_patch`.

The new GPU path is only exercised via `estimate_timeseries_batch` tests; the `run_ifgram_inversion_patch` dispatch itself (solver selection and `gpu_chunk_size` threading) isn’t covered.

A compact integration test could:
- Use a small synthetic ifgram stack or in-memory HDF5.
- Monkeypatch `mintpy.ifgram_inversion_gpu.estimate_timeseries_batch` with a stub that records inputs and returns fixed outputs (no real GPU needed).
- Call `run_ifgram_inversion_patch` with `solver='torch'` and a non-default `gpu_chunk_size`, then assert the stub is called once with the expected arguments, including `chunk_size`.
- Verify `solver='cpu'` takes the CPU path and never calls the stub.

This would confirm the new CLI/template options are correctly wired into the inversion pipeline.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment on lines +17 to +18
requires_cuda = pytest.mark.skipif(
not torch.cuda.is_available(),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Add unit tests for CUDA-availability / solver-availability gating instead of only skipping tests when CUDA is absent.

Because this module is entirely skipped when CUDA is unavailable, we never actually test the CUDA-gating logic in ifgram_inversion_gpu (is_solver_available, _get_torch_device, and the error paths when CUDA is missing). Please add small unit tests that:

  • Exercise is_solver_available('cpu') and is_solver_available('torch') under different conditions by monkeypatching torch.cuda.is_available() and HAS_TORCH.
  • Check that _get_torch_device('torch') raises a clear error when PyTorch is missing or torch.cuda.is_available() is False.

These can be done by monkeypatching torch in the module namespace so they run on CPU-only CI without a real GPU.

Suggested implementation:

import pytest

torch = pytest.importorskip("torch")

from mintpy.ifgram_inversion import estimate_timeseries
from mintpy import ifgram_inversion_gpu as ifgram_inversion_gpu_mod
from mintpy.ifgram_inversion_gpu import (
    estimate_timeseries_batch,
    is_solver_available,
    _get_torch_device,
)
requires_cuda = pytest.mark.skipif(
    not torch.cuda.is_available(),
    reason="CUDA-capable GPU required for ifgram_inversion_gpu tests",
)


def test_is_solver_available_cpu_and_unknown_solver():
    # CPU solver should always be reported as available
    assert is_solver_available("cpu")
    # Unknown solver names should be rejected
    assert not is_solver_available("nonexistent-solver-name")


def test_is_solver_available_torch_with_and_without_cuda(monkeypatch):
    # Ensure we behave correctly when PyTorch is present and CUDA toggles
    monkeypatch.setattr(ifgram_inversion_gpu_mod, "HAS_TORCH", True, raising=False)

    # Simulate CUDA available
    monkeypatch.setattr(
        ifgram_inversion_gpu_mod.torch.cuda,
        "is_available",
        lambda: True,
        raising=True,
    )
    assert is_solver_available("torch")

    # Simulate CUDA not available
    monkeypatch.setattr(
        ifgram_inversion_gpu_mod.torch.cuda,
        "is_available",
        lambda: False,
        raising=True,
    )
    assert not is_solver_available("torch")


def test_is_solver_available_torch_without_pytorch(monkeypatch):
    # Simulate environment where PyTorch is not available to the module
    monkeypatch.setattr(ifgram_inversion_gpu_mod, "HAS_TORCH", False, raising=False)
    # torch attribute may still exist in the module; we explicitly mark it as unavailable
    assert not is_solver_available("torch")


def test_get_torch_device_raises_without_pytorch(monkeypatch):
    # Force the module to think PyTorch is missing
    monkeypatch.setattr(ifgram_inversion_gpu_mod, "HAS_TORCH", False, raising=False)
    with pytest.raises(RuntimeError, match="PyTorch"):
        _get_torch_device("torch")


def test_get_torch_device_raises_without_cuda(monkeypatch):
    # Force the module to think PyTorch is present but CUDA is not available
    monkeypatch.setattr(ifgram_inversion_gpu_mod, "HAS_TORCH", True, raising=False)
    monkeypatch.setattr(
        ifgram_inversion_gpu_mod.torch.cuda,
        "is_available",
        lambda: False,
        raising=True,
    )
    with pytest.raises(RuntimeError, match="CUDA"):
        _get_torch_device("torch")


def make_redundant_network(num_date, num_pair, *, max_span=4, seed=0):

Comment on lines +126 to +127
def _all_pixels_full_rank(A, y):
"""Return True if every pixel's design matrix (after dropping NaN rows)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (testing): Add a positive test for the rank-deficient pixel path handled by the GPU solver.

Current tests explicitly avoid rank-deficient cases via _all_pixels_full_rank, but _solve_cholesky has specific behavior for them (detect via cholesky_ex info, warn, zero solutions) that isn’t covered.

Please add a test that forces a rank-deficient pixel (e.g., duplicate rows in A/B or strong NaN masking) such that info != 0 for at least one pixel, and assert that:

  • A warning is emitted.
  • The corresponding ts values are all zeros.
  • inv_quality remains finite and non-NaN.

This will verify the documented GPU failure-handling path behaves as intended.

Comment on lines +139 to +148
@requires_cuda
def test_wls_no_nan(network):
"""WLS, no NaN observations — expect ~ float32 round-off match."""
A, B, tbase_diff = network
y, w = synthesize_observations(A, B, num_pixel=64, nan_frac=0.0, seed=1)
cpu = cpu_reference(A, B, y, w, tbase_diff)
gpu = estimate_timeseries_batch(
A=A, B=B, y=y, tbase_diff=tbase_diff, weight_sqrt=w,
min_norm_velocity=True,
chunk_size=64, solver='torch', print_msg=False,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Extend tests to cover the network-redundancy cutoff branch (min_redundancy).

All tests currently use min_redundancy=1.0 and fully redundant networks, so the under-redundancy early-return branch (np.min(np.sum(A != 0., axis=0)) < min_redundancy) is never exercised.

Please add a test that constructs an under-redundant design matrix (e.g., one column with a single non-zero), sets min_redundancy above that count, and verifies that ts, inv_quality, and num_inv_obs are all zeros and match the CPU estimate_timeseries result. This will ensure the redundancy cutoff logic is covered and protected against regressions.

Suggested implementation:

torch = pytest.importorskip("torch")


@requires_cuda
def test_min_redundancy_under_redundant_network(network):
    """Under-redundant design matrix triggers min_redundancy early return.

    Constructs a network with a column that has a single non-zero entry,
    sets ``min_redundancy`` above that count, and ensures GPU and CPU
    paths both return zero-valued results.
    """
    A, B, tbase_diff = network

    # Make the first column under-redundant: keep only a single non-zero.
    A_under = A.copy()
    col0_nz = np.where(A_under[:, 0] != 0.0)[0]
    if col0_nz.size < 1:
        pytest.skip(
            "Network fixture does not have a non-zero in the first column; "
            "cannot construct an under-redundant column."
        )
    A_under[col0_nz[1:], 0] = 0.0

    # Choose a redundancy threshold strictly greater than the actual count (1).
    min_redundancy = 2.0

    # Synthetic observations for a small number of pixels.
    y, w = synthesize_observations(
        A_under, B, num_pixel=8, nan_frac=0.0, seed=7
    )

    cpu = cpu_reference(
        A_under, B, y, w, tbase_diff, min_redundancy=min_redundancy
    )
    gpu = estimate_timeseries_batch(
        A=A_under,
        B=B,
        y=y,
        tbase_diff=tbase_diff,
        weight_sqrt=w,
        min_norm_velocity=True,
        min_redundancy=min_redundancy,
        chunk_size=8,
        solver="torch",
        print_msg=False,
    )

    # Both CPU and GPU paths should have early-returned zeros.
    for key in ("ts", "inv_quality", "num_inv_obs"):
        assert np.all(cpu[key] == 0.0)
        assert np.all(gpu[key] == 0.0)

    # And they must still agree exactly with each other.
    np.testing.assert_array_equal(cpu["ts"], gpu["ts"])
    np.testing.assert_array_equal(cpu["inv_quality"], gpu["inv_quality"])
    np.testing.assert_array_equal(cpu["num_inv_obs"], gpu["num_inv_obs"])

This test assumes:

  1. cpu_reference accepts a min_redundancy keyword argument and forwards it to mintpy.ifgram_inversion.estimate_timeseries. If it does not yet accept this parameter, you will need to:

    • Update cpu_reference to take min_redundancy and pass it through to the underlying CPU implementation.
    • Ensure that the CPU implementation uses the same early-return behavior (zeroed ts, inv_quality, num_inv_obs) under the min_redundancy cutoff.
  2. estimate_timeseries_batch already has a min_redundancy parameter that controls the np.min(np.sum(A != 0., axis=0)) < min_redundancy early-return branch mentioned in your comment. If the parameter name differs, adjust the keyword in the test accordingly.

  3. If the keys in the output dictionaries differ from ("ts", "inv_quality", "num_inv_obs"), adapt the key names used in the assertions to match your actual result structure.

torch = pytest.importorskip("torch")

from mintpy.ifgram_inversion import estimate_timeseries
from mintpy.ifgram_inversion_gpu import estimate_timeseries_batch
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Consider adding lightweight tests for the high-level solver dispatch in run_ifgram_inversion_patch.

The new GPU path is only exercised via estimate_timeseries_batch tests; the run_ifgram_inversion_patch dispatch itself (solver selection and gpu_chunk_size threading) isn’t covered.

A compact integration test could:

  • Use a small synthetic ifgram stack or in-memory HDF5.
  • Monkeypatch mintpy.ifgram_inversion_gpu.estimate_timeseries_batch with a stub that records inputs and returns fixed outputs (no real GPU needed).
  • Call run_ifgram_inversion_patch with solver='torch' and a non-default gpu_chunk_size, then assert the stub is called once with the expected arguments, including chunk_size.
  • Verify solver='cpu' takes the CPU path and never calls the stub.

This would confirm the new CLI/template options are correctly wired into the inversion pipeline.

@sourcery-ai
Copy link
Copy Markdown
Contributor

sourcery-ai Bot commented Jun 4, 2026

Hey @s-sasaki-earthsea-wizard, I've posted a new review for you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

RFC: opt-in GPU backend for invert_network (torch.linalg.lstsq, CUDA)

1 participant