# Sweep-Performance: Parallel Performance Triage and Fix Workflow

**Date:** 2026-03-31
**Status:** Draft

## Overview

A `/sweep-performance` slash command that audits every xrspatial module for
performance bottlenecks, OOM risk under large-scale dask workloads, and
backend-specific anti-patterns. It uses parallel subagents for fast static
triage, then a sequential ralph-loop to benchmark and fix confirmed
HIGH-severity issues.

The central question for every dask backend: "If the data on disk were 30TB
and the machine had only 16GB of RAM, would this tool cause an out-of-memory
error?"

## Scope

All `.py` modules under `xrspatial/`, plus the `geotiff/` and `reproject/`
subpackages. Excludes `__init__.py`, `_version.py`, `__main__.py`, `utils.py`,
`accessor.py`, `preview.py`, `dataset_support.py`, `diagnostics.py`, and
`analytics.py`.

## Architecture

Two phases in a single invocation:

```
/sweep-performance
  |
  +-- Phase 1: Parallel Static Triage
  |     |-- Score & rank modules (git metadata + complexity heuristics)
  |     |-- Dispatch one subagent per module
  |     |     |-- Static analysis (dask, GPU, memory, Numba patterns)
  |     |     |-- 30TB/16GB OOM simulation (task graph construction, no compute)
  |     |     +-- Return structured JSON findings
  |     |-- Merge results into ranked report
  |     +-- Update state file
  |
  +-- Phase 2: Ralph-Loop (HIGH severity only)
        |-- Generate /ralph-loop command targeting HIGH modules
        |-- Each iteration:
        |     |-- Real benchmarks (wall time, tracemalloc, RSS, CuPy pool)
        |     |-- Confirm finding is not a false positive
        |     |-- /rockout to fix
        |     |-- Post-fix benchmark comparison
        |     +-- Update state file
        +-- User pastes command to start
```

---

## Phase 1: Module Scoring

For every module in scope, collect via git:

| Field                | Source                                            |
|----------------------|---------------------------------------------------|
| `last_modified`      | `git log -1 --format=%aI -- <path>`               |
| `total_commits`      | `git log --oneline -- <path> \| wc -l`            |
| `loc`                | `wc -l < <path>`                                  |
| `has_dask_backend`   | grep for `_run_dask`, `map_overlap`, `map_blocks` |
| `has_cuda_backend`   | grep for `@cuda.jit`, `import cupy`               |
| `is_io_module`       | module is in `geotiff/` or `reproject/`           |
| `has_existing_bench` | matching file exists in `benchmarks/benchmarks/`  |

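The last four fields are cheap text heuristics rather than git queries; a minimal sketch of how they might be derived from a module's source (the function name and exact regexes are illustrative, not part of the command spec):

```python
import re

def collect_static_fields(source: str, path: str) -> dict:
    """Derive the non-git heuristic fields from a module's source text.

    `source` is the file contents; `path` is its repo-relative path.
    `last_modified` and `total_commits` come from the git commands in
    the table and are collected separately.
    """
    return {
        "loc": len(source.splitlines()),
        "has_dask_backend": bool(
            re.search(r"_run_dask|map_overlap|map_blocks", source)),
        "has_cuda_backend": bool(
            re.search(r"@cuda\.jit|import cupy", source)),
        "is_io_module": path.startswith(("geotiff/", "reproject/")),
    }

fields = collect_static_fields(
    "import cupy\n\ndef f(arr):\n    return arr.map_blocks(g)\n",
    "xrspatial/slope.py",
)
# both backend flags are True here; is_io_module is False
```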
### Scoring Formula

```
days_since_inspected = (today - last_perf_inspected).days  # 9999 if never
days_since_modified  = (today - last_modified).days

score = (days_since_inspected * 3)
      + (loc * 0.1)
      + (total_commits * 0.5)
      + (has_dask_backend * 200)
      + (has_cuda_backend * 150)
      + (is_io_module * 300)
      - (days_since_modified * 0.2)
      - (has_existing_bench * 100)
```

Rationale:
- Never-inspected modules dominate (9999 * 3 ≈ 30,000).
- Dask and CUDA backends are boosted: that is where OOM and performance bugs live.
- I/O modules get the highest boost: they are the most relevant to the 30TB question.
- Larger modules are more likely to contain issues.
- Existing ASV benchmarks slightly deprioritize a module (its performance has already received attention).
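The formula translates directly to code; a sketch, assuming per-module metadata dicts shaped like the table above, with `None` marking never-inspected modules:

```python
from datetime import date

def score_module(m: dict, today: date) -> float:
    """Compute the Phase 1 priority score for one module's metadata dict."""
    never = m.get("last_perf_inspected") is None
    days_since_inspected = 9999 if never else (today - m["last_perf_inspected"]).days
    days_since_modified = (today - m["last_modified"]).days
    return (days_since_inspected * 3
            + m["loc"] * 0.1
            + m["total_commits"] * 0.5
            + m["has_dask_backend"] * 200
            + m["has_cuda_backend"] * 150
            + m["is_io_module"] * 300
            - days_since_modified * 0.2
            - m["has_existing_bench"] * 100)

s = score_module(
    {"last_perf_inspected": None, "last_modified": date(2026, 3, 1),
     "loc": 500, "total_commits": 40, "has_dask_backend": True,
     "has_cuda_backend": False, "is_io_module": False,
     "has_existing_bench": True},
    date(2026, 3, 31),
)
# 29997 + 50 + 20 + 200 - 6 - 100 = 30161.0
```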

---

## Phase 1: Subagent Static Analysis

One subagent per module. Each performs the checks below and returns a
structured JSON blob.

### Dask Path Analysis

- `.values` on a dask-backed DataArray (premature materialization) — **HIGH**
- `.compute()` inside a loop — **HIGH**
- `np.array()` / `np.asarray()` wrapping a dask or CuPy array — **HIGH**
- `da.stack()` without `.rechunk()` — **MEDIUM**
- `map_overlap` with depth >= chunk_size / 4 — **MEDIUM**
- Missing `boundary` argument in `map_overlap` — **MEDIUM**
- Redundant computation (same function called twice on the same input) — **MEDIUM**
- Python loops over dask chunks (serializes the graph) — **MEDIUM**

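A subagent can approximate the first few of these checks with line-oriented pattern matching before reading the code in depth. A rough sketch (the patterns are deliberately simplified, and the loop tracking is naive; real triage would consider AST context and indentation):

```python
import re

# (pattern, severity, description) -- simplified forms of the checks above
CHECKS = [
    (re.compile(r"\.values\b"), "HIGH", ".values on dask-backed DataArray"),
    (re.compile(r"np\.(as)?array\("), "HIGH", "np.array()/np.asarray() on lazy array"),
    (re.compile(r"da\.stack\("), "MEDIUM", "da.stack() without .rechunk()"),
]

def scan(source: str, filename: str) -> list:
    """Return finding dicts (file, line, severity, description) for one module."""
    findings = []
    in_loop = False  # naive: never resets when the loop body ends
    for lineno, line in enumerate(source.splitlines(), start=1):
        if line.strip().startswith(("for ", "while ")):
            in_loop = True
        if ".compute()" in line and in_loop:
            findings.append({"file": filename, "line": lineno,
                             "severity": "HIGH",
                             "description": ".compute() inside a loop"})
        for pat, sev, desc in CHECKS:
            if pat.search(line):
                findings.append({"file": filename, "line": lineno,
                                 "severity": sev, "description": desc})
    return findings

found = scan("for c in chunks:\n    x = arr.compute()\n", "slope.py")
# one HIGH finding at line 2: .compute() inside a loop
```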
### 30TB / 16GB OOM Verdict

Two-part analysis for each dask code path:

**Part 1 — Static trace.** Follow the dask code path and answer: does peak
memory scale with total array size, or with chunk size? If any step forces
full materialization, the verdict is WILL OOM.

**Part 2 — Task graph simulation.** Write and execute a script that:

```python
import dask.array as da
import xarray as xr

# Use a representative grid (2560x2560 in 10x10 = 100 chunks) to inspect
# graph structure. The pattern is identical at any scale -- what matters
# is whether the graph fans out, materializes, or stays chunk-local.
arr = da.zeros((2560, 2560), chunks=(256, 256), dtype='float64')
raster = xr.DataArray(arr, dims=['y', 'x'])

# Call the function lazily. module_function and default_args are
# placeholders for the function under audit and its default arguments.
result = module_function(raster, **default_args)

# Inspect the graph without executing it
graph = result.__dask_graph__()
task_count = len(graph)
tasks_per_chunk = task_count / 100  # normalize to per-chunk

# Check for fan-out patterns or full-materialization nodes.
# Extrapolate to 30TB: ~57 million chunks at 256x256 float64.
# If tasks_per_chunk is constant => graph scales linearly => SAFE.
# If any node depends on all chunks => full materialization => WILL OOM.
```

The script constructs the graph only and never calls `.compute()`. It reports:
- Task count and the tasks-per-chunk ratio
- Estimated peak memory per chunk (MB)
- Whether the graph contains fan-out or materialization nodes
- Extrapolation to 30TB: linear graph growth (SAFE) vs fan-out (WILL OOM)

**Verdict**: `SAFE`, `RISKY` (bounded but tight), or `WILL OOM` (unbounded
or materializes).

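The final mapping from graph statistics to a verdict can be reduced to a small rule. A minimal sketch of how a subagent might encode it; the in-flight multiplier and head-room factor are illustrative assumptions, not part of the spec:

```python
def oom_verdict(tasks_per_chunk: float, has_full_materialization: bool,
                peak_per_chunk_mb: float, ram_gb: float = 16.0) -> str:
    """Map task-graph statistics to SAFE / RISKY / WILL OOM.

    A constant tasks-per-chunk ratio means the graph grows linearly with
    chunk count; any node depending on all chunks means materialization.
    """
    if has_full_materialization:
        return "WILL OOM"
    # Assume the scheduler keeps roughly tasks_per_chunk results in flight
    # per chunk; the 8x factor is an illustrative safety margin.
    if peak_per_chunk_mb * tasks_per_chunk * 8 > ram_gb * 1024:
        return "RISKY"
    return "SAFE"

verdict = oom_verdict(tasks_per_chunk=37.2,
                      has_full_materialization=False,
                      peak_per_chunk_mb=0.5)
# chunk-local graph with half-megabyte chunks => "SAFE"
```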
### GPU Transfer Analysis

- `.data.get()` followed by CuPy ops (GPU-CPU-GPU round-trip) — **HIGH**
- `cupy.asarray()` inside a hot loop — **HIGH**
- Mixing NumPy/CuPy ops without reason — **MEDIUM**
- Register pressure: more than 20 float64 locals in a `@cuda.jit` kernel — **MEDIUM**
- Thread blocks larger than 16x16 on register-heavy kernels — **MEDIUM**

### Memory Allocation Patterns

- Unnecessary `.copy()` on arrays that are never mutated — **MEDIUM**
- `np.zeros_like()` followed by a fill loop (could be `np.empty()`) — **LOW**
- Large temporary arrays that could be fused into the kernel — **MEDIUM**

### Numba Anti-Patterns

- Missing `@ngjit` on nested for-loops over `.data` arrays — **MEDIUM**
- `@jit` without `nopython=True` (object-mode fallback risk) — **MEDIUM**
- Type instability (int/float mixing in Numba functions) — **LOW**
- Column-major iteration over row-major arrays (cache-unfriendly) — **LOW**

### Bottleneck Classification

Based on static analysis, classify the module as one of:
- **IO-bound** — dominated by disk reads/writes or serialization
- **Memory-bound** — peak allocation is the limiting factor
- **Compute-bound** — CPU/GPU time dominates; memory is fine
- **Graph-bound** — dask task-graph overhead dominates (too many small tasks)

### Subagent Output Schema

```json
{
  "module": "slope",
  "files_read": ["xrspatial/slope.py"],
  "findings": [
    {
      "severity": "HIGH",
      "category": "dask_materialization",
      "file": "slope.py",
      "line": 142,
      "description": ".values on dask input in _run_dask",
      "fix": "Use .data.compute() or restructure to stay lazy",
      "backends_affected": ["dask+numpy", "dask+cupy"]
    }
  ],
  "oom_verdict": {
    "dask_numpy": "SAFE",
    "dask_cupy": "SAFE",
    "reasoning": "map_overlap with depth=1, memory bounded by chunk size",
    "estimated_peak_per_chunk_mb": 0.5,
    "task_count": 3721,
    "graph_simulation_ran": true
  },
  "bottleneck": "compute-bound",
  "bottleneck_reasoning": "3x3 kernel with Numba JIT, no I/O, small overlap"
}
```

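Merging these per-module blobs into the ranked report is then a severity count plus a sort on the Phase 1 score; a sketch (the `rank` helper is illustrative):

```python
def rank(results: list, scores: dict) -> list:
    """Turn subagent JSON blobs into rows for the risk-ranking table."""
    rows = []
    for r in results:
        counts = {"HIGH": 0, "MEDIUM": 0, "LOW": 0}
        for f in r["findings"]:
            counts[f["severity"]] += 1
        rows.append({
            "module": r["module"],
            "score": scores.get(r["module"], 0),
            "oom_verdict": r["oom_verdict"],
            "bottleneck": r["bottleneck"],
            **counts,
        })
    # Highest-scoring (most urgent) module first
    rows.sort(key=lambda row: row["score"], reverse=True)
    return rows

rows = rank(
    [{"module": "slope", "findings": [{"severity": "HIGH"}],
      "oom_verdict": {"dask_numpy": "SAFE"}, "bottleneck": "compute-bound"}],
    {"slope": 30161},
)
```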
---

## Phase 1: Merged Report

After all subagents return, print a consolidated report.

### Module Risk Ranking Table

```
| Rank | Module   | Score | OOM Verdict     | Bottleneck   | HIGH | MED | LOW |
|------|----------|-------|-----------------|--------------|------|-----|-----|
| 1    | geotiff  | 31200 | WILL OOM (d+np) | IO-bound     | 3    | 1   | 0   |
| 2    | viewshed | 30050 | RISKY (d+np)    | memory-bound | 2    | 2   | 1   |
| ...  | ...      | ...   | ...             | ...          | ...  | ... | ... |
```

### 30TB / 16GB Verdict Summary

Grouped by verdict:

- **WILL OOM (fix required):** list modules with reasoning
- **RISKY (bounded but tight):** list modules with reasoning
- **SAFE (memory bounded by chunk size):** list modules

### Detailed Findings

A per-module table of all findings grouped by severity (file:line, pattern,
description, fix).

### Actionable Rockout Commands

For each HIGH-severity finding, a ready-to-paste `/rockout` command.

### State File Update

Write `.claude/performance-sweep-state.json`:

```json
{
  "last_triage": "2026-03-31T14:00:00Z",
  "modules": {
    "slope": {
      "last_inspected": "2026-03-31T14:00:00Z",
      "oom_verdict": "SAFE",
      "bottleneck": "compute-bound",
      "high_count": 0,
      "issue": null
    }
  }
}
```

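A sketch of the state update itself, writing the schema above (atomic-write and locking details omitted; the helper name is illustrative):

```python
import datetime
import json
import tempfile
from pathlib import Path

def update_state(path, module, verdict, bottleneck, high_count, issue=None):
    """Merge one module's triage result into the sweep state file."""
    path = Path(path)
    state = json.loads(path.read_text()) if path.exists() else {"modules": {}}
    now = datetime.datetime.now(datetime.timezone.utc).isoformat()
    state["last_triage"] = now
    state["modules"][module] = {
        "last_inspected": now,
        "oom_verdict": verdict,
        "bottleneck": bottleneck,
        "high_count": high_count,
        "issue": issue,  # GitHub issue number once Phase 2 files a fix
    }
    path.write_text(json.dumps(state, indent=2))

# Demonstration against a temporary file rather than .claude/
state_path = Path(tempfile.mkdtemp()) / "performance-sweep-state.json"
update_state(state_path, "slope", "SAFE", "compute-bound", 0)
```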
---

## Phase 2: Ralph-Loop for HIGH Severity Fixes

Collect all modules with at least one HIGH-severity finding. Generate a
`/ralph-loop` command targeting them in priority order.

### Each Iteration

1. **Benchmark** the module on a moderate array (512x512 by default) across all
   available backends. Measure four metrics per backend per function:
   - Wall time: `timeit.repeat(number=1, repeat=3)`, median
   - Python memory: `tracemalloc.get_traced_memory()` peak
   - Process memory: `resource.getrusage(RUSAGE_SELF).ru_maxrss` delta
   - GPU memory (if CuPy): `cupy.get_default_memory_pool().used_bytes()` delta

2. **Confirm the static finding** from Phase 1 is real. If the benchmark
   shows the issue does not manifest (false positive), downgrade it to MEDIUM
   in the report and skip to the next module.

3. **Classify the bottleneck** with measured data:
   - IO-bound: wall time dominated by reads/writes, low CPU
   - Memory-bound: peak RSS much larger than expected for the chunk size
   - Compute-bound: CPU pegged, memory stable
   - Graph-bound: dask task count extremely high, scheduler overhead visible

4. **Run `/rockout`** to fix the confirmed issue (GitHub issue, worktree,
   implementation, tests, docs).

5. **Post-fix benchmark** — rerun the same benchmark and report the
   before/after delta.

6. **Update state** — record the fix in
   `.claude/performance-sweep-state.json` with the issue number.

7. Output `<promise>ITERATION DONE</promise>`.

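The CPU-side measurements in step 1 fit in a short harness; a sketch (`sorted` over a large list stands in for the module function under benchmark, and the CuPy pool delta would be added the same way when a GPU is present):

```python
import resource
import statistics
import timeit
import tracemalloc

def bench(fn, *args, repeat=3):
    """Median wall time plus Python-heap and RSS peaks for one call."""
    times = timeit.repeat(lambda: fn(*args), number=1, repeat=repeat)
    rss_before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    tracemalloc.start()
    fn(*args)
    _, py_peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    rss_after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return {
        "wall_s": statistics.median(times),
        "py_peak_bytes": py_peak,
        # ru_maxrss is kB on Linux, bytes on macOS; it is monotonic,
        # so the delta is the growth attributable to this call at most
        "rss_delta": rss_after - rss_before,
    }

stats = bench(sorted, list(range(100_000)))
```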
### Generated Command Shape

```
/ralph-loop "Performance sweep Phase 2: benchmark and fix HIGH-severity findings.

**Target modules in priority order:**
1. geotiff (3 HIGH findings, WILL OOM) -- eager .values materialization
2. cost_distance (1 HIGH finding, WILL OOM) -- iterative solver unbounded memory

**For each module:**
1. Write and run a benchmark script measuring wall time and peak memory
   (tracemalloc + RSS + CuPy pool) across all available backends
2. Confirm the HIGH finding from Phase 1 triage is real
3. If confirmed: run /rockout to fix it end-to-end
4. After rockout: rerun the benchmark and report the before/after delta
5. Update .claude/performance-sweep-state.json
6. Output <promise>ITERATION DONE</promise>

If all targets are addressed: <promise>ALL PERFORMANCE ISSUES FIXED</promise>."
--max-iterations {N+2} --completion-promise "ALL PERFORMANCE ISSUES FIXED"
```

### Reminder Text

```
Phase 1 triage complete. To proceed with fixes:
  Copy the ralph-loop command above and paste it.

Other options:
  Fix one manually: copy any /rockout command from the report above
  Rerun triage only: /sweep-performance --report-only
  Skip Phase 1: /sweep-performance --skip-phase1 (reuses last triage)
  Reset all tracking: /sweep-performance --reset-state
```

---

## Arguments

| Argument          | Effect                                                         |
|-------------------|----------------------------------------------------------------|
| `--top N`         | Limit Phase 1 subagents to the top N scored modules (default: all) |
| `--exclude m1,m2` | Remove the named modules from scope                            |
| `--only-terrain`  | slope, aspect, curvature, terrain, terrain_metrics, hillshade, sky_view_factor |
| `--only-focal`    | focal, convolution, morphology, bilateral, edge_detection, glcm |
| `--only-hydro`    | flood, cost_distance, geodesic, surface_distance, viewshed, erosion, diffusion |
| `--only-io`       | geotiff, reproject, rasterize, polygonize                      |
| `--reset-state`   | Delete the state file and start fresh                          |
| `--skip-phase1`   | Reuse the last triage state and go straight to ralph-loop generation |
| `--report-only`   | Run Phase 1 only; no ralph-loop command                        |
| `--size small`    | Benchmark at 128x128                                           |
| `--size large`    | Benchmark at 2048x2048                                         |
| `--high-only`     | Report only HIGH-severity findings                             |

Default (no arguments): audit all modules, benchmark at 512x512, and generate
a ralph-loop for HIGH items.

---

## General Rules

- Phase 1 subagents do NOT modify source files; analysis is read-only.
- The Phase 2 ralph-loop modifies code only through `/rockout`.
- Temporary benchmark scripts go in `/tmp/` with unique names.
- Only flag patterns actually present in the code; no hypothetical issues.
- Include the exact file path and line number for every finding.
- False positives are worse than missed issues.
- The 30TB simulation constructs the dask graph only; it never calls `.compute()`.
- The state file (`.claude/performance-sweep-state.json`) is gitignored by convention.