Commit 09b92b3

Add sweep-performance design spec

Parallel subagent triage + ralph-loop workflow for auditing all xrspatial modules for performance bottlenecks, OOM risk under 30TB dask workloads, and backend-specific anti-patterns.

1 parent 0ad1feb commit 09b92b3

1 file changed: +368 −0
# Sweep-Performance: Parallel Performance Triage and Fix Workflow

**Date:** 2026-03-31
**Status:** Draft

## Overview

A `/sweep-performance` slash command that audits every xrspatial module for
performance bottlenecks, OOM risk under large-scale dask workloads, and
backend-specific anti-patterns. It uses parallel subagents for fast static
triage, then a sequential ralph-loop to benchmark and fix confirmed
HIGH-severity issues.

The central question for every dask backend: "If the data on disk were 30TB
and the machine had only 16GB of RAM, would this tool cause an out-of-memory
error?"

## Scope

All `.py` modules under `xrspatial/` plus the `geotiff/` and `reproject/`
subpackages. Excludes `__init__.py`, `_version.py`, `__main__.py`, `utils.py`,
`accessor.py`, `preview.py`, `dataset_support.py`, `diagnostics.py`, and
`analytics.py`.

## Architecture

Two phases in a single invocation:

```
/sweep-performance
 |
 +-- Phase 1: Parallel Static Triage
 |     |-- Score & rank modules (git metadata + complexity heuristics)
 |     |-- Dispatch one subagent per module
 |     |     |-- Static analysis (dask, GPU, memory, Numba patterns)
 |     |     |-- 30TB/16GB OOM simulation (task graph construction, no compute)
 |     |     +-- Return structured JSON findings
 |     |-- Merge results into ranked report
 |     +-- Update state file
 |
 +-- Phase 2: Ralph-Loop (HIGH severity only)
       |-- Generate /ralph-loop command targeting HIGH modules
       |-- Each iteration:
       |     |-- Real benchmarks (wall time, tracemalloc, RSS, CuPy pool)
       |     |-- Confirm finding is not false positive
       |     |-- /rockout to fix
       |     |-- Post-fix benchmark comparison
       |     +-- Update state file
       +-- User pastes command to start
```

---

## Phase 1: Module Scoring

For every module in scope, collect via git:

| Field                | Source                                             |
|----------------------|----------------------------------------------------|
| `last_modified`      | `git log -1 --format=%aI -- <path>`                |
| `total_commits`      | `git log --oneline -- <path> \| wc -l`             |
| `loc`                | `wc -l < <path>`                                   |
| `has_dask_backend`   | grep for `_run_dask`, `map_overlap`, `map_blocks`  |
| `has_cuda_backend`   | grep for `@cuda.jit`, `import cupy`                |
| `is_io_module`       | module is in `geotiff/` or `reproject/`            |
| `has_existing_bench` | matching file exists in `benchmarks/benchmarks/`   |

### Scoring Formula

```
days_since_inspected = (today - last_perf_inspected).days   # 9999 if never
days_since_modified  = (today - last_modified).days

score = (days_since_inspected * 3)
      + (loc * 0.1)
      + (total_commits * 0.5)
      + (has_dask_backend * 200)
      + (has_cuda_backend * 150)
      + (is_io_module * 300)
      - (days_since_modified * 0.2)
      - (has_existing_bench * 100)
```

Rationale:
- Never-inspected modules dominate (9999 * 3 ≈ 30,000).
- Dask and CUDA backends are boosted: that is where OOM and perf bugs live.
- I/O modules get the highest boost: they are most relevant to the 30TB question.
- Larger, more frequently changed modules are more likely to contain issues.
- Existing ASV benchmarks slightly deprioritize a module (perf already considered).
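
The formula can be sketched as a small Python helper (illustrative only — the
field names follow the table above, and `last_perf_inspected` is assumed to be
`None` for never-inspected modules):

```python
from datetime import date

def score_module(m, today=date(2026, 3, 31)):
    """Apply the scoring formula to one module's collected fields."""
    inspected = m["last_perf_inspected"]  # None => never inspected => 9999
    days_since_inspected = 9999 if inspected is None else (today - inspected).days
    days_since_modified = (today - m["last_modified"]).days
    return (days_since_inspected * 3
            + m["loc"] * 0.1
            + m["total_commits"] * 0.5
            + m["has_dask_backend"] * 200
            + m["has_cuda_backend"] * 150
            + m["is_io_module"] * 300
            - days_since_modified * 0.2
            - m["has_existing_bench"] * 100)

# Never-inspected dask module, modified 10 days ago, with an existing benchmark:
m = {"last_perf_inspected": None, "last_modified": date(2026, 3, 21),
     "loc": 500, "total_commits": 20, "has_dask_backend": 1,
     "has_cuda_backend": 0, "is_io_module": 0, "has_existing_bench": 1}
print(score_module(m))  # 29997 + 50 + 10 + 200 - 2 - 100 = 30155.0
```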

---

## Phase 1: Subagent Static Analysis

One subagent per module. Each performs the checks below and returns a
structured JSON blob.

### Dask Path Analysis

- `.values` on a dask-backed DataArray (premature materialization) — **HIGH**
- `.compute()` inside a loop — **HIGH**
- `np.array()` / `np.asarray()` wrapping a dask or CuPy array — **HIGH**
- `da.stack()` without `.rechunk()` — **MEDIUM**
- `map_overlap` with depth >= chunk_size / 4 — **MEDIUM**
- Missing `boundary` argument in `map_overlap` — **MEDIUM**
- Redundant computation (same function called twice on the same input) — **MEDIUM**
- Python loops over dask chunks (serializes the graph) — **MEDIUM**
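
To make the first three HIGH patterns concrete, here is a minimal sketch
(generic dask code, not taken from any xrspatial module) contrasting eager
materialization with a chunk-local formulation:

```python
import numpy as np
import dask.array as da

arr = da.zeros((1024, 1024), chunks=(256, 256), dtype="float64")

# Anti-pattern: np.asarray() (like `.values` on a DataArray) realizes the
# ENTIRE array in RAM at once — at 30TB this is an immediate OOM.
eager = np.asarray(arr)
assert isinstance(eager, np.ndarray)

# Chunk-local alternative: express the op lazily; peak memory stays
# proportional to a single 256x256 chunk, not the whole array.
lazy = da.map_blocks(np.sqrt, arr)
assert isinstance(lazy, da.Array)  # still a graph; nothing computed yet
```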

### 30TB / 16GB OOM Verdict

Two-part analysis for each dask code path:

**Part 1 — Static trace.** Follow the dask code path and answer: does peak
memory scale with total array size, or with chunk size? If any step forces
full materialization, the verdict is WILL OOM.

**Part 2 — Task graph simulation.** Write and execute a script that:

```python
import dask.array as da
import xarray as xr

# Use a representative grid (2560x2560, 10x10 = 100 chunks) to inspect
# graph structure. The pattern is identical at any scale — what matters
# is whether the graph fans out, materializes, or stays chunk-local.
arr = da.zeros((2560, 2560), chunks=(256, 256), dtype='float64')
raster = xr.DataArray(arr, dims=['y', 'x'])

# Call the function lazily
result = module_function(raster, **default_args)

# Inspect the graph without executing
graph = result.__dask_graph__()
task_count = len(graph)
tasks_per_chunk = task_count / 100  # normalize to per-chunk

# Check for fan-out patterns or full-materialization nodes.
# Extrapolate to 30TB: ~57 million chunks at 256x256 float64.
# If tasks_per_chunk is constant => graph scales linearly => SAFE
# If any node depends on all chunks => full materialization => WILL OOM
```

The script constructs the graph only and never calls `.compute()`. It reports:
- Task count and the tasks-per-chunk ratio
- Estimated peak memory per chunk (MB)
- Whether the graph contains fan-out or materialization nodes
- Extrapolation to 30TB: linear graph growth (SAFE) vs fan-out (WILL OOM)

**Verdict**: `SAFE`, `RISKY` (bounded but tight), or `WILL OOM` (unbounded
or materializes).

### GPU Transfer Analysis

- `.data.get()` followed by CuPy ops (GPU-CPU-GPU round trip) — **HIGH**
- `cupy.asarray()` inside a hot loop — **HIGH**
- Mixing NumPy/CuPy ops without reason — **MEDIUM**
- Register pressure: >20 float64 locals in a `@cuda.jit` kernel — **MEDIUM**
- Thread blocks >16x16 on register-heavy kernels — **MEDIUM**

### Memory Allocation Patterns

- Unnecessary `.copy()` on arrays that are never mutated — **MEDIUM**
- `np.zeros_like()` + fill loop (could be `np.empty()`) — **LOW**
- Large temporary arrays that could be fused into the kernel — **MEDIUM**
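
A small illustration of the `np.zeros_like()` finding (illustrative only): when
every element is overwritten anyway, the zero-fill pass is a wasted full write
over the buffer.

```python
import numpy as np

src = np.random.rand(512, 512)

# Flagged pattern: zeros_like() writes the whole buffer once to zero it,
# then the fill writes every element a second time.
out1 = np.zeros_like(src)
out1[:] = src * 2

# Cheaper: empty_like() allocates without the redundant zeroing pass.
out2 = np.empty_like(src)
out2[:] = src * 2

assert np.array_equal(out1, out2)  # identical results, one fewer full write
```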

### Numba Anti-Patterns

- Missing `@ngjit` on nested for-loops over `.data` arrays — **MEDIUM**
- `@jit` without `nopython=True` (object-mode fallback risk) — **MEDIUM**
- Type instability (int/float mixing in Numba functions) — **LOW**
- Column-major iteration over row-major arrays (cache-unfriendly) — **LOW**

### Bottleneck Classification

Based on the static analysis, classify each module as one of:

- **IO-bound** — dominated by disk reads/writes or serialization
- **Memory-bound** — peak allocation is the limiting factor
- **Compute-bound** — CPU/GPU time dominates, memory is fine
- **Graph-bound** — dask task-graph overhead dominates (too many small tasks)
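
The classification can be expressed as a simple decision rule; the function and
thresholds below are invented for illustration and are not part of the spec:

```python
def classify_bottleneck(io_fraction, peak_rss_mb, chunk_mb, tasks_per_chunk):
    """Map measured signals to one of the four classes (illustrative thresholds)."""
    if io_fraction > 0.5:              # wall time dominated by read/write
        return "IO-bound"
    if peak_rss_mb > 4 * chunk_mb:     # peak allocation far above chunk size
        return "memory-bound"
    if tasks_per_chunk > 100:          # scheduler overhead dominates
        return "graph-bound"
    return "compute-bound"             # CPU/GPU time is the limit

print(classify_bottleneck(0.1, 2.0, 0.5, 8))  # compute-bound
```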

### Subagent Output Schema

```json
{
  "module": "slope",
  "files_read": ["xrspatial/slope.py"],
  "findings": [
    {
      "severity": "HIGH",
      "category": "dask_materialization",
      "file": "slope.py",
      "line": 142,
      "description": ".values on dask input in _run_dask",
      "fix": "Use .data.compute() or restructure to stay lazy",
      "backends_affected": ["dask+numpy", "dask+cupy"]
    }
  ],
  "oom_verdict": {
    "dask_numpy": "SAFE",
    "dask_cupy": "SAFE",
    "reasoning": "map_overlap with depth=1, memory bounded by chunk size",
    "estimated_peak_per_chunk_mb": 0.5,
    "task_count": 3721,
    "graph_simulation_ran": true
  },
  "bottleneck": "compute-bound",
  "bottleneck_reasoning": "3x3 kernel with Numba JIT, no I/O, small overlap"
}
```

---

## Phase 1: Merged Report

After all subagents return, print a consolidated report.

### Module Risk Ranking Table

```
| Rank | Module   | Score | OOM Verdict     | Bottleneck   | HIGH | MED | LOW |
|------|----------|-------|-----------------|--------------|------|-----|-----|
| 1    | geotiff  | 31200 | WILL OOM (d+np) | IO-bound     | 3    | 1   | 0   |
| 2    | viewshed | 30050 | RISKY (d+np)    | memory-bound | 2    | 2   | 1   |
| ...  | ...      | ...   | ...             | ...          | ...  | ... | ... |
```

### 30TB / 16GB Verdict Summary

Grouped by verdict:

- **WILL OOM (fix required):** list modules with reasoning
- **RISKY (bounded but tight):** list modules with reasoning
- **SAFE (memory bounded by chunk size):** list modules

### Detailed Findings

A per-module table of all findings grouped by severity (file:line, pattern,
description, fix).

### Actionable Rockout Commands

For each HIGH-severity finding, a ready-to-paste `/rockout` command.

### State File Update

Write `.claude/performance-sweep-state.json`:

```json
{
  "last_triage": "2026-03-31T14:00:00Z",
  "modules": {
    "slope": {
      "last_inspected": "2026-03-31T14:00:00Z",
      "oom_verdict": "SAFE",
      "bottleneck": "compute-bound",
      "high_count": 0,
      "issue": null
    }
  }
}
```

---

## Phase 2: Ralph-Loop for HIGH Severity Fixes

Collect all modules with at least one HIGH-severity finding. Generate a
`/ralph-loop` command targeting them in priority order.

### Each Iteration

1. **Benchmark** the module on a moderate array (512x512 default) across all
   available backends. Measure four metrics per backend per function:
   - Wall time: `timeit.repeat(number=1, repeat=3)`, median
   - Python memory: `tracemalloc.get_traced_memory()` peak
   - Process memory: `resource.getrusage(RUSAGE_SELF).ru_maxrss` delta
   - GPU memory (if CuPy): `cupy.get_default_memory_pool().used_bytes()` delta

2. **Confirm the static finding** from Phase 1 is real. If the benchmark
   shows the issue does not manifest (a false positive), downgrade it to
   MEDIUM in the report and skip to the next module.

3. **Classify the bottleneck** with measured data:
   - IO-bound: wall time dominated by read/write, low CPU
   - Memory-bound: peak RSS much larger than expected for the chunk size
   - Compute-bound: CPU pegged, memory stable
   - Graph-bound: dask task count extremely high, scheduler overhead visible

4. **Run `/rockout`** to fix the confirmed issue (GitHub issue, worktree,
   implementation, tests, docs).

5. **Post-fix benchmark** — rerun the same benchmark and report the
   before/after delta.

6. **Update state** — record the fix in
   `.claude/performance-sweep-state.json` with the issue number.

7. Output `<promise>ITERATION DONE</promise>`.
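
The four metrics in step 1 can be collected with a small stdlib harness like
the sketch below (illustrative; `resource` is Unix-only, `ru_maxrss` is
kilobytes on Linux but bytes on macOS, and the CuPy pool delta is omitted):

```python
import timeit
import tracemalloc
import resource

def bench(fn, *args):
    """Median wall time of 3 runs, Python heap peak, and RSS high-water delta."""
    rss_before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    tracemalloc.start()
    times = sorted(timeit.repeat(lambda: fn(*args), number=1, repeat=3))
    _, heap_peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    rss_after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return {
        "wall_s": times[1],                   # median of the 3 runs
        "heap_peak_bytes": heap_peak,         # tracemalloc peak
        "rss_delta": rss_after - rss_before,  # process high-water growth
    }

stats = bench(lambda n: [i * i for i in range(n)], 100_000)
```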

### Generated Command Shape

```
/ralph-loop "Performance sweep Phase 2: benchmark and fix HIGH-severity findings.

**Target modules in priority order:**
1. geotiff (3 HIGH findings, WILL OOM) -- eager .values materialization
2. cost_distance (1 HIGH finding, WILL OOM) -- iterative solver unbounded memory

**For each module:**
1. Write and run a benchmark script measuring wall time, peak memory
   (tracemalloc + RSS + CuPy pool) across all available backends
2. Confirm the HIGH finding from Phase 1 triage is real
3. If confirmed: run /rockout to fix it end-to-end
4. After rockout: rerun benchmark, report before/after delta
5. Update .claude/performance-sweep-state.json
6. Output <promise>ITERATION DONE</promise>

If all targets addressed: <promise>ALL PERFORMANCE ISSUES FIXED</promise>."
--max-iterations {N+2} --completion-promise "ALL PERFORMANCE ISSUES FIXED"
```

### Reminder Text

```
Phase 1 triage complete. To proceed with fixes:
Copy the ralph-loop command above and paste it.

Other options:
  Fix one manually:   copy any /rockout command from the report above
  Rerun triage only:  /sweep-performance --report-only
  Skip Phase 1:       /sweep-performance --skip-phase1 (reuses last triage)
  Reset all tracking: /sweep-performance --reset-state
```

---

## Arguments

| Argument          | Effect                                                          |
|-------------------|-----------------------------------------------------------------|
| `--top N`         | Limit Phase 1 subagents to the top N scored modules (default: all) |
| `--exclude m1,m2` | Remove the named modules from scope                             |
| `--only-terrain`  | slope, aspect, curvature, terrain, terrain_metrics, hillshade, sky_view_factor |
| `--only-focal`    | focal, convolution, morphology, bilateral, edge_detection, glcm |
| `--only-hydro`    | flood, cost_distance, geodesic, surface_distance, viewshed, erosion, diffusion |
| `--only-io`       | geotiff, reproject, rasterize, polygonize                       |
| `--reset-state`   | Delete the state file and start fresh                           |
| `--skip-phase1`   | Reuse the last triage state; go straight to ralph-loop generation |
| `--report-only`   | Run Phase 1 only; no ralph-loop command                         |
| `--size small`    | Benchmark at 128x128                                            |
| `--size large`    | Benchmark at 2048x2048                                          |
| `--high-only`     | Only report HIGH-severity findings                              |

Default (no arguments): audit all modules, benchmark at 512x512, and generate
a ralph-loop for HIGH items.

---

## General Rules

- Phase 1 subagents do NOT modify source files; analysis is read-only.
- The Phase 2 ralph-loop modifies code only through `/rockout`.
- Temporary benchmark scripts go in `/tmp/` with unique names.
- Only flag patterns actually present in the code; no hypothetical issues.
- Include the exact file path and line number for every finding.
- False positives are worse than missed issues.
- The 30TB simulation constructs the dask graph only; it never calls `.compute()`.
- The state file (`.claude/performance-sweep-state.json`) is gitignored by convention.
