Add opt-in torch GPU solver for invert_network#1490
Add opt-in torch GPU solver for invert_network#1490s-sasaki-earthsea-wizard wants to merge 3 commits into
Conversation
Add a CUDA-accelerated path for the per-pixel weighted least-squares
inversion in `ifgram_inversion.py`, batched as normal-equations + Cholesky
on a single CUDA device via PyTorch. The solver is opt-in and the default
(`mintpy.networkInversion.solver = auto`) resolves to `cpu`, so existing
setups are unaffected and the CPU path is byte-for-byte unchanged.
Surface
- cfg keys: `mintpy.networkInversion.solver = cpu|torch` (default `auto`),
`mintpy.networkInversion.gpuChunkSize = <int>` (default 0 = auto-size).
- CLI flags: `--solver {cpu,torch}` and `--gpu-chunk-size N` on
`ifgram_inversion.py`.
- New module `src/mintpy/ifgram_inversion_gpu.py` holds the torch path;
`ifgram_inversion.py` dispatches to it only when `solver=torch` is
explicitly requested.
Behavior
- VRAM auto-sizing probes free GPU memory and chooses a per-chunk pixel
count with a fixed headroom factor; `gpuChunkSize > 0` overrides.
- Rank-deficient pixels are detected via `torch.linalg.cholesky_ex` info
codes and zeroed so NaN/Inf cannot propagate downstream.
- Per-pixel NaN observations are handled by zeroing the corresponding row
weight, which is mathematically equivalent to dropping that row from the
WLS system.
- Selecting `solver=torch` on a host without a visible CUDA device raises
immediately rather than silently falling back to CPU, keeping any
performance regression visible.
Packaging
- Adds `[gpu]` extras in `pyproject.toml`, sourced from
`requirements-gpu.txt`. The PyTorch CUDA wheels live on a separate index;
`installation.md` documents the install command in a follow-up commit.
The opt-in GPU solver in `ifgram_inversion_gpu.py` is implemented entirely on top of `torch.linalg.cholesky_ex`, with no cupy entry point. Listing `cupy-cuda12x` in `requirements-gpu.txt` therefore pulls a multi-hundred-MB runtime that no code path imports. Drop it. Pin `torch>=2.11` to match the version exercised in the bench matrix used during development (Blackwell sm_120 wheel from the cu128 index). Earlier torch releases have not been validated against this code path.
Document the new opt-in `torch` GPU solver added in the previous commits: - `docs/gpu.md` — setup, CLI / template surface, behavior notes (VRAM auto-sizing, rank-deficient pixel handling, NaN observations, hard-fail on missing CUDA), and indicative performance numbers. - `docs/installation.md` §2.4 — install the `[gpu]` extras together with the matching PyTorch CUDA wheel index. - `docs/README.md` and `docs/dask.md` — add cross-links so readers can reach the GPU page from the documentation root and from the Dask page (since the two parallelism paths are orthogonal and need to be picked one or the other). Performance numbers in `gpu.md` §4 are stated inline without any external repository links so the page stays self-contained.
Reviewer's GuideAdds an opt-in PyTorch/CUDA-based batched weighted least-squares solver for the invert_network step, wires it through the CLI and templates while preserving the existing CPU default behavior, and documents GPU installation, configuration, and performance, along with CUDA-gated tests for numerical equivalence and behavior. Sequence diagram for torch-based GPU solver dispatch in invert_networksequenceDiagram
actor User
participant CLI as ifgram_inversion_cli
participant Inv as run_ifgram_inversion
participant Patch as run_ifgram_inversion_patch
participant GPU as estimate_timeseries_batch
participant Torch as torch
User->>CLI: ifgram_inversion.py --solver torch --gpu-chunk-size 0
CLI->>CLI: create_parser()
CLI->>CLI: parse_args()
CLI->>Inv: run_ifgram_inversion(inps)
Inv->>Patch: run_ifgram_inversion_patch(
Inv->>Patch: ... solver='torch', gpu_chunk_size=0 ...)
Patch->>GPU: estimate_timeseries_batch(A, B, y,
Patch->>GPU: tbase_diff, weight_sqrt,
Patch->>GPU: min_norm_velocity,
Patch->>GPU: rcond, min_redundancy,
Patch->>GPU: inv_quality_name,
Patch->>GPU: chunk_size=gpu_chunk_size,
Patch->>GPU: solver='torch')
GPU->>Torch: _get_torch_device('torch')
GPU->>Torch: torch.cuda.is_available()
alt chunk_size<=0
GPU->>Torch: torch.cuda.mem_get_info()
end
loop for each chunk
GPU->>Torch: _solve_cholesky(G_dev, w_dev, y_dev)
Torch->>Torch: torch.linalg.cholesky_ex(N)
Torch->>Torch: torch.cholesky_solve(r, L)
end
GPU-->>Patch: ts_sub, inv_quality_sub, num_inv_obs_sub
Patch->>Patch: ts[:, idx_pixel2inv] = ts_sub
Patch->>Patch: inv_quality[idx_pixel2inv] = inv_quality_sub
Patch->>Patch: num_inv_obs[idx_pixel2inv] = num_inv_obs_sub
Patch-->>Inv: ts, ts_cov, inv_quality
Inv-->>CLI: write outputs
CLI-->>User: Completed invert_network with GPU solver
File-Level Changes
Assessment against linked issues
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
|
Sharing additional benchmark data for the GPU torch solver. The PR description's Performance section reports per-step
Configuration: warm SSD, NVIDIA RTX 5080 (16 GiB), float32, The 5 — 94× range is structurally driven by per-pixel solve cost (∝ K · D²): Kuju (K=167, D=24) sits at the floor, SanFranBay (K=1297, D=333) at the ceiling. The Fernandina and Galapagos figures above are consistent with the Numerical agreement: the float32 round-off gate (rms / |cpu|.max < 1e-5) is met for the user-visible final products ( Full per-step wall breakdown, fixture parity verification (cpu and torch fixtures verified byte-identical except for the 2 → Dataset records (Zenodo):
|
|
@sourcery-ai Please review this PR. |
|
Sure! I'm generating a new review now. |
There was a problem hiding this comment.
Hey - I've found 4 issues, and left some high level feedback:
- The inline comment in
run_ifgram_inversion_patchabove the GPU branch still states that rank-deficient pixels are not handled and NaN/Inf may propagate, which conflicts with the actual behavior inifgram_inversion_gpu._solve_cholesky; consider updating this comment to accurately describe the current rank-deficiency handling. - The newly added
requirements-gpu.txtis currently empty in the diff; if this file is meant to drive the[gpu]extra, it should explicitly declare the CUDA-enabledtorchrequirement so thatpip install -e '.[gpu]'works as documented.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The inline comment in `run_ifgram_inversion_patch` above the GPU branch still states that rank-deficient pixels are not handled and NaN/Inf may propagate, which conflicts with the actual behavior in `ifgram_inversion_gpu._solve_cholesky`; consider updating this comment to accurately describe the current rank-deficiency handling.
- The newly added `requirements-gpu.txt` is currently empty in the diff; if this file is meant to drive the `[gpu]` extra, it should explicitly declare the CUDA-enabled `torch` requirement so that `pip install -e '.[gpu]'` works as documented.
## Individual Comments
### Comment 1
<location path="tests/test_ifgram_inversion_gpu.py" line_range="17-18" />
<code_context>
+from mintpy.ifgram_inversion import estimate_timeseries
+from mintpy.ifgram_inversion_gpu import estimate_timeseries_batch
+
+requires_cuda = pytest.mark.skipif(
+ not torch.cuda.is_available(),
+ reason="CUDA-capable GPU required for ifgram_inversion_gpu tests",
+)
</code_context>
<issue_to_address>
**suggestion (testing):** Add unit tests for CUDA-availability / solver-availability gating instead of only skipping tests when CUDA is absent.
Because this module is entirely skipped when CUDA is unavailable, we never actually test the CUDA-gating logic in `ifgram_inversion_gpu` (`is_solver_available`, `_get_torch_device`, and the error paths when CUDA is missing). Please add small unit tests that:
- Exercise `is_solver_available('cpu')` and `is_solver_available('torch')` under different conditions by monkeypatching `torch.cuda.is_available()` and `HAS_TORCH`.
- Check that `_get_torch_device('torch')` raises a clear error when PyTorch is missing or `torch.cuda.is_available()` is False.
These can be done by monkeypatching `torch` in the module namespace so they run on CPU-only CI without a real GPU.
Suggested implementation:
```python
import pytest
torch = pytest.importorskip("torch")
from mintpy.ifgram_inversion import estimate_timeseries
from mintpy import ifgram_inversion_gpu as ifgram_inversion_gpu_mod
from mintpy.ifgram_inversion_gpu import (
estimate_timeseries_batch,
is_solver_available,
_get_torch_device,
)
```
```python
requires_cuda = pytest.mark.skipif(
not torch.cuda.is_available(),
reason="CUDA-capable GPU required for ifgram_inversion_gpu tests",
)
def test_is_solver_available_cpu_and_unknown_solver():
# CPU solver should always be reported as available
assert is_solver_available("cpu")
# Unknown solver names should be rejected
assert not is_solver_available("nonexistent-solver-name")
def test_is_solver_available_torch_with_and_without_cuda(monkeypatch):
# Ensure we behave correctly when PyTorch is present and CUDA toggles
monkeypatch.setattr(ifgram_inversion_gpu_mod, "HAS_TORCH", True, raising=False)
# Simulate CUDA available
monkeypatch.setattr(
ifgram_inversion_gpu_mod.torch.cuda,
"is_available",
lambda: True,
raising=True,
)
assert is_solver_available("torch")
# Simulate CUDA not available
monkeypatch.setattr(
ifgram_inversion_gpu_mod.torch.cuda,
"is_available",
lambda: False,
raising=True,
)
assert not is_solver_available("torch")
def test_is_solver_available_torch_without_pytorch(monkeypatch):
# Simulate environment where PyTorch is not available to the module
monkeypatch.setattr(ifgram_inversion_gpu_mod, "HAS_TORCH", False, raising=False)
# torch attribute may still exist in the module; we explicitly mark it as unavailable
assert not is_solver_available("torch")
def test_get_torch_device_raises_without_pytorch(monkeypatch):
# Force the module to think PyTorch is missing
monkeypatch.setattr(ifgram_inversion_gpu_mod, "HAS_TORCH", False, raising=False)
with pytest.raises(RuntimeError, match="PyTorch"):
_get_torch_device("torch")
def test_get_torch_device_raises_without_cuda(monkeypatch):
# Force the module to think PyTorch is present but CUDA is not available
monkeypatch.setattr(ifgram_inversion_gpu_mod, "HAS_TORCH", True, raising=False)
monkeypatch.setattr(
ifgram_inversion_gpu_mod.torch.cuda,
"is_available",
lambda: False,
raising=True,
)
with pytest.raises(RuntimeError, match="CUDA"):
_get_torch_device("torch")
def make_redundant_network(num_date, num_pair, *, max_span=4, seed=0):
```
</issue_to_address>
### Comment 2
<location path="tests/test_ifgram_inversion_gpu.py" line_range="126-127" />
<code_context>
+ return make_redundant_network(num_date=98, num_pair=288, max_span=6, seed=0)
+
+
+def _all_pixels_full_rank(A, y):
+ """Return True if every pixel's design matrix (after dropping NaN rows)
+ is still full-rank. Used to keep tests off the rank-deficient edge case
+ (which is handled separately at runtime by a CPU fallback path).
</code_context>
<issue_to_address>
**issue (testing):** Add a positive test for the rank-deficient pixel path handled by the GPU solver.
Current tests explicitly avoid rank-deficient cases via `_all_pixels_full_rank`, but `_solve_cholesky` has specific behavior for them (detect via `cholesky_ex` info, warn, zero solutions) that isn’t covered.
Please add a test that forces a rank-deficient pixel (e.g., duplicate rows in `A`/`B` or strong NaN masking) such that `info != 0` for at least one pixel, and assert that:
- A warning is emitted.
- The corresponding `ts` values are all zeros.
- `inv_quality` remains finite and non-NaN.
This will verify the documented GPU failure-handling path behaves as intended.
</issue_to_address>
### Comment 3
<location path="tests/test_ifgram_inversion_gpu.py" line_range="139-148" />
<code_context>
+ return True
+
+
+@requires_cuda
+def test_wls_no_nan(network):
+ """WLS, no NaN observations — expect ~ float32 round-off match."""
+ A, B, tbase_diff = network
+ y, w = synthesize_observations(A, B, num_pixel=64, nan_frac=0.0, seed=1)
+ cpu = cpu_reference(A, B, y, w, tbase_diff)
+ gpu = estimate_timeseries_batch(
+ A=A, B=B, y=y, tbase_diff=tbase_diff, weight_sqrt=w,
+ min_norm_velocity=True,
+ chunk_size=64, solver='torch', print_msg=False,
+ )
+ assert_equivalent(cpu, gpu, ts_rel_tol=1e-5, tcoh_abs_tol=1e-5)
+
+
+@requires_cuda
+def test_wls_with_nan_redundant(network):
+ """WLS with low NaN rate on a redundant network — float32 round-off match.
+
</code_context>
<issue_to_address>
**suggestion (testing):** Extend tests to cover the network-redundancy cutoff branch (`min_redundancy`).
All tests currently use `min_redundancy=1.0` and fully redundant networks, so the under-redundancy early-return branch (`np.min(np.sum(A != 0., axis=0)) < min_redundancy`) is never exercised.
Please add a test that constructs an under-redundant design matrix (e.g., one column with a single non-zero), sets `min_redundancy` above that count, and verifies that `ts`, `inv_quality`, and `num_inv_obs` are all zeros and match the CPU `estimate_timeseries` result. This will ensure the redundancy cutoff logic is covered and protected against regressions.
Suggested implementation:
```python
torch = pytest.importorskip("torch")
@requires_cuda
def test_min_redundancy_under_redundant_network(network):
"""Under-redundant design matrix triggers min_redundancy early return.
Constructs a network with a column that has a single non-zero entry,
sets ``min_redundancy`` above that count, and ensures GPU and CPU
paths both return zero-valued results.
"""
A, B, tbase_diff = network
# Make the first column under-redundant: keep only a single non-zero.
A_under = A.copy()
col0_nz = np.where(A_under[:, 0] != 0.0)[0]
if col0_nz.size < 1:
pytest.skip(
"Network fixture does not have a non-zero in the first column; "
"cannot construct an under-redundant column."
)
A_under[col0_nz[1:], 0] = 0.0
# Choose a redundancy threshold strictly greater than the actual count (1).
min_redundancy = 2.0
# Synthetic observations for a small number of pixels.
y, w = synthesize_observations(
A_under, B, num_pixel=8, nan_frac=0.0, seed=7
)
cpu = cpu_reference(
A_under, B, y, w, tbase_diff, min_redundancy=min_redundancy
)
gpu = estimate_timeseries_batch(
A=A_under,
B=B,
y=y,
tbase_diff=tbase_diff,
weight_sqrt=w,
min_norm_velocity=True,
min_redundancy=min_redundancy,
chunk_size=8,
solver="torch",
print_msg=False,
)
# Both CPU and GPU paths should have early-returned zeros.
for key in ("ts", "inv_quality", "num_inv_obs"):
assert np.all(cpu[key] == 0.0)
assert np.all(gpu[key] == 0.0)
# And they must still agree exactly with each other.
np.testing.assert_array_equal(cpu["ts"], gpu["ts"])
np.testing.assert_array_equal(cpu["inv_quality"], gpu["inv_quality"])
np.testing.assert_array_equal(cpu["num_inv_obs"], gpu["num_inv_obs"])
```
This test assumes:
1. `cpu_reference` accepts a `min_redundancy` keyword argument and forwards it to `mintpy.ifgram_inversion.estimate_timeseries`. If it does not yet accept this parameter, you will need to:
- Update `cpu_reference` to take `min_redundancy` and pass it through to the underlying CPU implementation.
- Ensure that the CPU implementation uses the same early-return behavior (zeroed `ts`, `inv_quality`, `num_inv_obs`) under the `min_redundancy` cutoff.
2. `estimate_timeseries_batch` already has a `min_redundancy` parameter that controls the `np.min(np.sum(A != 0., axis=0)) < min_redundancy` early-return branch mentioned in your comment. If the parameter name differs, adjust the keyword in the test accordingly.
3. If the keys in the output dictionaries differ from `("ts", "inv_quality", "num_inv_obs")`, adapt the key names used in the assertions to match your actual result structure.
</issue_to_address>
### Comment 4
<location path="tests/test_ifgram_inversion_gpu.py" line_range="15" />
<code_context>
+ # for the full-rank case. Rank-deficient pixels (rare on real SBAS networks)
+ # are not handled here; if encountered, NaN/Inf will propagate downstream.
+ if solver != 'cpu':
+ from mintpy.ifgram_inversion_gpu import estimate_timeseries_batch
+ print(f'estimating time-series via {solver} solver (batched, GPU)')
+ ts_sub, q_sub, n_sub = estimate_timeseries_batch(
</code_context>
<issue_to_address>
**suggestion (testing):** Consider adding lightweight tests for the high-level `solver` dispatch in `run_ifgram_inversion_patch`.
The new GPU path is only exercised via `estimate_timeseries_batch` tests; the `run_ifgram_inversion_patch` dispatch itself (solver selection and `gpu_chunk_size` threading) isn’t covered.
A compact integration test could:
- Use a small synthetic ifgram stack or in-memory HDF5.
- Monkeypatch `mintpy.ifgram_inversion_gpu.estimate_timeseries_batch` with a stub that records inputs and returns fixed outputs (no real GPU needed).
- Call `run_ifgram_inversion_patch` with `solver='torch'` and a non-default `gpu_chunk_size`, then assert the stub is called once with the expected arguments, including `chunk_size`.
- Verify `solver='cpu'` takes the CPU path and never calls the stub.
This would confirm the new CLI/template options are correctly wired into the inversion pipeline.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| requires_cuda = pytest.mark.skipif( | ||
| not torch.cuda.is_available(), |
There was a problem hiding this comment.
suggestion (testing): Add unit tests for CUDA-availability / solver-availability gating instead of only skipping tests when CUDA is absent.
Because this module is entirely skipped when CUDA is unavailable, we never actually test the CUDA-gating logic in ifgram_inversion_gpu (is_solver_available, _get_torch_device, and the error paths when CUDA is missing). Please add small unit tests that:
- Exercise
is_solver_available('cpu')andis_solver_available('torch')under different conditions by monkeypatchingtorch.cuda.is_available()andHAS_TORCH. - Check that
_get_torch_device('torch')raises a clear error when PyTorch is missing ortorch.cuda.is_available()is False.
These can be done by monkeypatching torch in the module namespace so they run on CPU-only CI without a real GPU.
Suggested implementation:
import pytest
torch = pytest.importorskip("torch")
from mintpy.ifgram_inversion import estimate_timeseries
from mintpy import ifgram_inversion_gpu as ifgram_inversion_gpu_mod
from mintpy.ifgram_inversion_gpu import (
estimate_timeseries_batch,
is_solver_available,
_get_torch_device,
)requires_cuda = pytest.mark.skipif(
not torch.cuda.is_available(),
reason="CUDA-capable GPU required for ifgram_inversion_gpu tests",
)
def test_is_solver_available_cpu_and_unknown_solver():
# CPU solver should always be reported as available
assert is_solver_available("cpu")
# Unknown solver names should be rejected
assert not is_solver_available("nonexistent-solver-name")
def test_is_solver_available_torch_with_and_without_cuda(monkeypatch):
# Ensure we behave correctly when PyTorch is present and CUDA toggles
monkeypatch.setattr(ifgram_inversion_gpu_mod, "HAS_TORCH", True, raising=False)
# Simulate CUDA available
monkeypatch.setattr(
ifgram_inversion_gpu_mod.torch.cuda,
"is_available",
lambda: True,
raising=True,
)
assert is_solver_available("torch")
# Simulate CUDA not available
monkeypatch.setattr(
ifgram_inversion_gpu_mod.torch.cuda,
"is_available",
lambda: False,
raising=True,
)
assert not is_solver_available("torch")
def test_is_solver_available_torch_without_pytorch(monkeypatch):
# Simulate environment where PyTorch is not available to the module
monkeypatch.setattr(ifgram_inversion_gpu_mod, "HAS_TORCH", False, raising=False)
# torch attribute may still exist in the module; we explicitly mark it as unavailable
assert not is_solver_available("torch")
def test_get_torch_device_raises_without_pytorch(monkeypatch):
# Force the module to think PyTorch is missing
monkeypatch.setattr(ifgram_inversion_gpu_mod, "HAS_TORCH", False, raising=False)
with pytest.raises(RuntimeError, match="PyTorch"):
_get_torch_device("torch")
def test_get_torch_device_raises_without_cuda(monkeypatch):
# Force the module to think PyTorch is present but CUDA is not available
monkeypatch.setattr(ifgram_inversion_gpu_mod, "HAS_TORCH", True, raising=False)
monkeypatch.setattr(
ifgram_inversion_gpu_mod.torch.cuda,
"is_available",
lambda: False,
raising=True,
)
with pytest.raises(RuntimeError, match="CUDA"):
_get_torch_device("torch")
def make_redundant_network(num_date, num_pair, *, max_span=4, seed=0):| def _all_pixels_full_rank(A, y): | ||
| """Return True if every pixel's design matrix (after dropping NaN rows) |
There was a problem hiding this comment.
issue (testing): Add a positive test for the rank-deficient pixel path handled by the GPU solver.
Current tests explicitly avoid rank-deficient cases via _all_pixels_full_rank, but _solve_cholesky has specific behavior for them (detect via cholesky_ex info, warn, zero solutions) that isn’t covered.
Please add a test that forces a rank-deficient pixel (e.g., duplicate rows in A/B or strong NaN masking) such that info != 0 for at least one pixel, and assert that:
- A warning is emitted.
- The corresponding
tsvalues are all zeros. inv_qualityremains finite and non-NaN.
This will verify the documented GPU failure-handling path behaves as intended.
| @requires_cuda | ||
| def test_wls_no_nan(network): | ||
| """WLS, no NaN observations — expect ~ float32 round-off match.""" | ||
| A, B, tbase_diff = network | ||
| y, w = synthesize_observations(A, B, num_pixel=64, nan_frac=0.0, seed=1) | ||
| cpu = cpu_reference(A, B, y, w, tbase_diff) | ||
| gpu = estimate_timeseries_batch( | ||
| A=A, B=B, y=y, tbase_diff=tbase_diff, weight_sqrt=w, | ||
| min_norm_velocity=True, | ||
| chunk_size=64, solver='torch', print_msg=False, |
There was a problem hiding this comment.
suggestion (testing): Extend tests to cover the network-redundancy cutoff branch (min_redundancy).
All tests currently use min_redundancy=1.0 and fully redundant networks, so the under-redundancy early-return branch (np.min(np.sum(A != 0., axis=0)) < min_redundancy) is never exercised.
Please add a test that constructs an under-redundant design matrix (e.g., one column with a single non-zero), sets min_redundancy above that count, and verifies that ts, inv_quality, and num_inv_obs are all zeros and match the CPU estimate_timeseries result. This will ensure the redundancy cutoff logic is covered and protected against regressions.
Suggested implementation:
torch = pytest.importorskip("torch")
@requires_cuda
def test_min_redundancy_under_redundant_network(network):
"""Under-redundant design matrix triggers min_redundancy early return.
Constructs a network with a column that has a single non-zero entry,
sets ``min_redundancy`` above that count, and ensures GPU and CPU
paths both return zero-valued results.
"""
A, B, tbase_diff = network
# Make the first column under-redundant: keep only a single non-zero.
A_under = A.copy()
col0_nz = np.where(A_under[:, 0] != 0.0)[0]
if col0_nz.size < 1:
pytest.skip(
"Network fixture does not have a non-zero in the first column; "
"cannot construct an under-redundant column."
)
A_under[col0_nz[1:], 0] = 0.0
# Choose a redundancy threshold strictly greater than the actual count (1).
min_redundancy = 2.0
# Synthetic observations for a small number of pixels.
y, w = synthesize_observations(
A_under, B, num_pixel=8, nan_frac=0.0, seed=7
)
cpu = cpu_reference(
A_under, B, y, w, tbase_diff, min_redundancy=min_redundancy
)
gpu = estimate_timeseries_batch(
A=A_under,
B=B,
y=y,
tbase_diff=tbase_diff,
weight_sqrt=w,
min_norm_velocity=True,
min_redundancy=min_redundancy,
chunk_size=8,
solver="torch",
print_msg=False,
)
# Both CPU and GPU paths should have early-returned zeros.
for key in ("ts", "inv_quality", "num_inv_obs"):
assert np.all(cpu[key] == 0.0)
assert np.all(gpu[key] == 0.0)
# And they must still agree exactly with each other.
np.testing.assert_array_equal(cpu["ts"], gpu["ts"])
np.testing.assert_array_equal(cpu["inv_quality"], gpu["inv_quality"])
np.testing.assert_array_equal(cpu["num_inv_obs"], gpu["num_inv_obs"])This test assumes:
-
cpu_referenceaccepts amin_redundancykeyword argument and forwards it tomintpy.ifgram_inversion.estimate_timeseries. If it does not yet accept this parameter, you will need to:- Update
cpu_referenceto takemin_redundancyand pass it through to the underlying CPU implementation. - Ensure that the CPU implementation uses the same early-return behavior (zeroed
ts,inv_quality,num_inv_obs) under themin_redundancycutoff.
- Update
-
estimate_timeseries_batchalready has amin_redundancyparameter that controls thenp.min(np.sum(A != 0., axis=0)) < min_redundancyearly-return branch mentioned in your comment. If the parameter name differs, adjust the keyword in the test accordingly. -
If the keys in the output dictionaries differ from
("ts", "inv_quality", "num_inv_obs"), adapt the key names used in the assertions to match your actual result structure.
| torch = pytest.importorskip("torch") | ||
|
|
||
| from mintpy.ifgram_inversion import estimate_timeseries | ||
| from mintpy.ifgram_inversion_gpu import estimate_timeseries_batch |
There was a problem hiding this comment.
suggestion (testing): Consider adding lightweight tests for the high-level solver dispatch in run_ifgram_inversion_patch.
The new GPU path is only exercised via estimate_timeseries_batch tests; the run_ifgram_inversion_patch dispatch itself (solver selection and gpu_chunk_size threading) isn’t covered.
A compact integration test could:
- Use a small synthetic ifgram stack or in-memory HDF5.
- Monkeypatch
mintpy.ifgram_inversion_gpu.estimate_timeseries_batchwith a stub that records inputs and returns fixed outputs (no real GPU needed). - Call
run_ifgram_inversion_patchwithsolver='torch'and a non-defaultgpu_chunk_size, then assert the stub is called once with the expected arguments, includingchunk_size. - Verify
solver='cpu'takes the CPU path and never calls the stub.
This would confirm the new CLI/template options are correctly wired into the inversion pipeline.
|
Hey @s-sasaki-earthsea-wizard, I've posted a new review for you! |
Description of proposed changes
This PR adds an opt-in CUDA-accelerated path for the per-pixel weighted
least-squares inversion in the
invert_networkstep (ifgram_inversion.py).The fork has been running this code on tutorial-scale and large-scale scenes
for several weeks; this submission consolidates the implementation as it
currently stands.
The default path is unchanged.
mintpy.networkInversion.solver = autoresolves to
cpu, and the existing CPU code path is byte-for-byte identicalto upstream — every other step in
smallbaselineApp.pycontinues to run onthe CPU regardless of this setting.
The aim is to contribute faster InSAR time-series processing for NVIDIA GPU
users, since
invert_networkis the dominant CPU bottleneck on typicalworkflows and the gap widens with scene size.
Closes #1489 (RFC).
Implementation summary
src/mintpy/ifgram_inversion_gpu.pybatches the per-pixel WLSsystems on a single CUDA device. The solver is normal-equations +
Cholesky via
torch.linalg.cholesky_ex, which (a) is significantlyfaster than
torch.linalg.lstsqon the matrix shapes encountered here and(b) lets us detect rank-deficient pixels through the returned
infocodesrather than via post-hoc residual checks.
ifgram_inversion.pydispatches to the GPU module only whensolver = torchis explicitly requested; the CPU loop is untouched.[gpu]extras inpyproject.toml, sourced fromrequirements-gpu.txt(just
torch>=2.11). Install requires the PyTorch CUDA wheel index:pip install -e ".[gpu]" --extra-index-url https://download.pytorch.org/whl/cu128(documented in
docs/installation.md§2.4).tests/test_ifgram_inversion_gpu.pycover the dispatch logic andthe GPU fast paths with synthetic NaN / rank-deficient fixtures.
Behavior notes
gpuChunkSize = 0(the default) probes free GPUmemory at runtime and chooses a per-chunk pixel count with a fixed
headroom factor; passing a positive integer overrides this for
reproducible chunking across hosts with different VRAM.
cholesky_exinfo codes andzeroed so NaN/Inf cannot propagate downstream; a warning line reports
the count per chunk.
row weight, which is mathematically equivalent to dropping that row from
the WLS system.
solver = torchon a host withouta visible CUDA device raises immediately rather than silently falling
back to CPU; this keeps any performance regression visible.
Design pivot vs the original RFC
The RFC (#1489) originally described
torch.linalg.lstsqas the GPU solver.During development the path was switched to normal-equations + Cholesky
after a side-by-side benchmark showed it preserves output equivalence to
float32 round-off (RMS ~1e-5) while running ~16× faster than
lstsqon thesame matrix shapes (tutorial dataset: FernandinaSenDT128). An RMS difference
on the order of
1e-5in the displacement field is well below the typicalInSAR noise floor — sub-millimeter on a per-pixel basis — so the two solvers
are operationally equivalent for the geophysical use case. The
lstsqpathwas removed before this submission so there is only one supported GPU code
path to reason about.
Performance
Indicative numbers measured on an NVIDIA RTX 5080 (Blackwell sm_120, CUDA
12.8, PyTorch 2.11). Speedup will vary with scene size, GPU class, and
chunk-size tuning.
invert_networkinternalLarge-scene absolute timings: CPU 6189 s → torch 170 s on the same machine.
Numerical equivalence between the
cpuandtorchsolvers holds tofloat32 round-off in both cases (RMS on the order of
1e-5; absolute RMSmax ~16 µm on the large-scene case).
Reproduction artifacts (harness scripts, raw logs, full reports) live in a
separate repository. Links below are pinned to a single sibling commit so
the data does not move during review:
cpuvstorchend-to-end on Fernandina:https://github.com/s-sasaki-earthsea-wizard/mintpy-benchmark/blob/c20ca8bb/reports/report_torch.md
lstsqvs Cholesky equivalence + per-step speedup:https://github.com/s-sasaki-earthsea-wizard/mintpy-benchmark/blob/c20ca8bb/reports/report_solver_comparison.md
https://github.com/s-sasaki-earthsea-wizard/mintpy-benchmark/blob/c20ca8bb/reports/report_chunk_sweep.md
torch.profilerGPU kernel breakdown:https://github.com/s-sasaki-earthsea-wizard/mintpy-benchmark/blob/c20ca8bb/reports/report_profile.md
https://github.com/s-sasaki-earthsea-wizard/mintpy-benchmark/blob/c20ca8bb/reports/report_large_scene.md
Numbers are from a single development machine; absolute timings will vary
across hardware, but the qualitative findings (Cholesky > lstsq; GPU > CPU
at this matrix scale; speedup grows with scene size) should hold for any
recent NVIDIA CUDA-class device. Harness scripts and raw logs in the
mintpy-benchmark repository above let other GPU users reproduce on their
own data.
Local validation
Run on the PR branch (
upstream/main+ the three commits in this PR), againstFernandinaSenDT128:
pre-commit run --all-filesexits clean (13 hooks pass + 1 skip on json,per the upstream
.pre-commit-config.yaml).smallbaselineApp.pyend-to-end with default settings (solver = autoresolves to
cpu): all 18 steps,Normal end of smallbaselineApp processing!,total wall time 2 h 2 m (the
correct_tropospherestep's CDS downloaddominated; the actual computation portion is small). All standard output
products generated (
timeseries.h5,timeseries_ERA5.h5,timeseries_ERA5_ramp.h5,timeseries_ERA5_ramp_demErr.h5,velocity.h5,velocityERA5.h5,geo/, etc.).smallbaselineApp.pyend-to-end withsolver = torch(same dataset,ERA5 grib + tropo product reused via symlink so this run only re-exercises
invert_networkand the post-tropo steps): all 18 steps,Normal end of smallbaselineApp processing!, total wall time 6 m 57 s. Log confirms theGPU path was actually entered:
Same set of standard output products as the CPU run.
Disclosure
This work was developed with the assistance of Claude Opus 4.7 (Anthropic's coding
assistant). All design decisions, benchmark execution, and review of the
generated code were performed by me. Per project convention the
Assisted-by: Claude Opus 4.7trailers used during fork development have been strippedfrom this branch's commit history; this paragraph is the canonical
disclosure.
If the AI-assisted aspect raises review or maintenance concerns for the
project, I'm happy to discuss — including whether to keep the GPU module
opt-in / under a feature flag.
Reminders
.pre-commit-config.yaml; CI to confirm.Summary by Sourcery
Add an opt-in GPU-accelerated solver for the invert_network step using PyTorch CUDA while keeping the existing CPU behavior as the default.
New Features:
Enhancements:
Build:
Documentation:
Tests: