Commit 8d64434

feat(swarm): ADR-149 evaluation harness — GDOP, IQM+bootstrap CI, noise sweep (ruvnet#875)
Stage-1 kinematic evaluator per ADR-149 (peer-reviewed). Pure Rust, no new deps.

evals/:
- gdop.rs: 2D Geometric Dilution of Precision ((HᵀH)⁻¹ trace-sqrt); None for <2 observers or collinear/singular geometry
- stats.rs: IQM (Agarwal 2021) + 95% stratified-bootstrap CI (deterministic LCG) + probability_of_improvement
- metrics.rs: EpisodeMetrics + AggregateMetrics::from_strata (IQM±CI, seed-stratified)
- runner.rs: seeded kinematic rollout (FlightPattern-driven), seed×episode matrix, 3σ×3κ default noise sweep (Gaussian amplitude × von Mises phase)
- report.rs + eval_swarm bin: generates evals/RESULTS.md leaderboard

RESULTS.md surfaces the real coverage-vs-localization-precision trade-off via GDOP: partitioned wins coverage (100%) but single-drone sightings (GDOP 0 → 7.0m); pheromone gets multistatic fusion (GDOP 1.6 → 4.1m). Wi2SAR 5m paper-baseline row included.

Stage-2 (Gazebo/PX4 SITL false-alarm + collision on median seeds) is documented follow-on.

Tests: 116 default / 133 full+train (+13 eval tests), 0 failed. Clippy clean (-D warnings).
1 parent 0d3d835 commit 8d64434

12 files changed

Lines changed: 1368 additions & 0 deletions

Lines changed: 257 additions & 0 deletions
@@ -0,0 +1,257 @@
1+
# ADR-149: Drone Swarm Benchmarking & Evaluation Methodology — Metrics, Leaderboards, and Statistical Rigor
2+
3+
| Field | Value |
4+
|------------|-----------------------------------------------------------------------------------------|
5+
| Status | Accepted (peer-reviewed 2026-05-30) |
6+
| Date | 2026-05-30 |
7+
| Deciders | ruv |
8+
| Relates to | ADR-148 (ruview-swarm), ADR-147 (OccWorld), ADR-146 (RF encoder), ADR-028 (witness) |
9+
10+
> Companion to ADR-148. ADR-148 shipped the swarm and 5 criterion micro-benchmarks
11+
> plus a `SotaComparison` against Wi2SAR. This ADR defines **how we evaluate the swarm
12+
> rigorously** — what metrics, what statistics, what baselines, and an honest account
13+
> of which external leaderboards do and do not apply.
14+
15+
---
16+
17+
## 1. Context
18+
19+
ADR-148's `ruview-swarm` reports performance via five `criterion` micro-benchmarks and a
20+
single `SotaComparison` (localization 1.732 m vs Wi2SAR 5 m; coverage ~223 s vs 810 s).
21+
These numbers are **internally valid but insufficient as scientific claims**:
22+
23+
- The criterion figures (3.3 µs MARL inference, 43 µs RRT-APF, 54 ns fusion, 248 µs PPO
24+
step) measure **wall-clock latency**, not policy quality or coverage/localization quality.
25+
- The 1.732 m localization comes from a **single synthetic geometry** (3 drones at 120°
26+
around a known point), not a distribution of victim positions under realistic noise.
27+
- The 223 s coverage is an **analytic estimate** (`estimate_coverage_time_secs()`), not an
28+
episode rollout.
29+
- All numbers are **single-run point estimates**. The MARL reproducibility literature
30+
(Henderson 2018; Agarwal 2021; Gorsane 2022) shows single/few-seed point estimates
31+
routinely flip algorithm rankings and overstate gains.
32+
33+
We need a defined, reproducible evaluation methodology before any "beats SOTA" claim can
34+
survive external review, and an honest position on external leaderboards.
35+
36+
---
37+
38+
## 2. Decision
39+
40+
Adopt a two-tier evaluation methodology:
41+
42+
1. **Micro-benchmarks (criterion)** — keep for compute-latency regression gating only.
43+
Explicitly labeled as latency, never as quality.
44+
2. **Domain evaluation harness** — a seeded, multi-run, statistically-reported harness
45+
producing SAR metrics (localization CEP, coverage, detection rate) and MARL metrics
46+
(IQM return, probability-of-improvement) over **≥10 seeds with 95% stratified-bootstrap
47+
confidence intervals**, against **≥3 baselines**, following the Agarwal/Gorsane standard.
48+
49+
Do **not** claim leaderboard standing — no public leaderboard accepts drone-swarm CSI-SAR
50+
submissions. Comparisons to Wi2SAR are **paper-to-paper**, labeled as such, acknowledging
51+
the sensing-modality difference (RSS bearing vs CSI multi-view fusion).
52+
53+
---
54+
55+
## 3. External Leaderboard Landscape — Honest Assessment
56+
57+
**There is no public, externally-administered leaderboard that accepts a drone-swarm,
58+
CSI-based, multi-view SAR system.** This is a research niche; comparison is paper-to-paper.
59+
The adjacent options and their fit:
60+
61+
| Benchmark / Leaderboard | Domain | Live submission? | Fit for ruview-swarm |
62+
|-------------------------|--------|------------------|----------------------|
63+
| **Wi2SAR** (arxiv 2604.09115) | Drone WiFi SAR | No (paper) | **Direct baseline** — paper-to-paper only; RSS bearing ≠ CSI fusion |
64+
| **MARL4DRP** (Springer 2023) | Drone routing MARL | No | Closest drone-MARL benchmark; would need a routing→coverage adapter |
65+
| **CSI-Bench** (NeurIPS 2025) | Static WiFi sensing | Splits + paper baselines | Adjacent (localization task) but no moving-sensor/multi-view fusion |
66+
| **SMAC / SMACv2** | StarCraft cooperative MARL | No live LB | Structural analogy (CTDE) only; combat task, not coverage |
67+
| **PettingZoo MPE** (Simple Spread) | 2D cooperative particles | No | Cheap MARL **correctness check**, no physics/CSI |
68+
| **Melting Pot** | Social-dynamics MARL | Closed (NeurIPS '24) | Not applicable |
69+
| **MAMuJoCo / Hanabi / GRF / Overcooked** | Various cooperative MARL | No live LB | Not applicable |
70+
| **OmniDrones / gym-pybullet-drones / Pegasus** | Drone-control sim platforms | No (platforms) | **Training infrastructure**, not leaderboards; no CSI layer |
71+
72+
**Conclusion:** We will (a) keep Wi2SAR as the cited paper baseline, (b) optionally build a
73+
MARL4DRP/MPE adapter to post a recognized cooperative-MARL number (tangential to SAR), and
74+
(c) **not** represent any internal number as a leaderboard placement.
75+
76+
---
77+
78+
## 4. Evaluation Metrics
79+
80+
### 4.1 SAR Domain Metrics (primary — comparable to Wi2SAR)
81+
82+
| Metric | Definition | Reporting |
83+
|--------|-----------|-----------|
84+
| Localization CEP50 | Median horizontal error, fused victim position vs ground truth | m, 95% CI |
85+
| Localization CEP95 | 95th-percentile horizontal error | m |
86+
| **GDOP** | Geometric Dilution of Precision of the contributing-drone constellation at detection time | dimensionless (tracked per detection) |
87+
| Coverage rate @ T | Fraction of area scanned ≥1× within T=240 s | %, 95% CI |
88+
| Coverage time to 95% | Time to scan 95% of bounded area | s, mean ± CI |
89+
| Time-to-first-detection | Mission start → first confident detection (conf > 0.85) | s, 95% CI |
90+
| Detection rate | P(detected \| victim present) per mission | %, 95% CI |
91+
| False-alarm rate | P(confident detection \| no victim) | %, 95% CI |
92+
| Collision rate | Collisions (d < 1.5 m) per mission | count/mission |
93+
| Overlap ratio | Fraction of path re-covering scanned cells | % |
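The two CEP rows in the table above reduce to percentiles of the per-episode horizontal error. A minimal sketch, assuming errors are pooled into a flat slice and that percentiles use linear interpolation between order statistics; the function names are hypothetical and are not taken from the harness:

```rust
/// Horizontal (2-D) distance between a fused estimate and ground truth, in metres.
pub fn horizontal_error(estimate: (f64, f64), truth: (f64, f64)) -> f64 {
    (estimate.0 - truth.0).hypot(estimate.1 - truth.1)
}

/// Percentile with linear interpolation between order statistics; `q` in [0, 1].
/// Returns None for an empty slice or an out-of-range `q`.
pub fn percentile(errors: &[f64], q: f64) -> Option<f64> {
    if errors.is_empty() || !(0.0..=1.0).contains(&q) {
        return None;
    }
    let mut sorted = errors.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let pos = q * (sorted.len() - 1) as f64;
    let (lo, hi) = (pos.floor() as usize, pos.ceil() as usize);
    Some(sorted[lo] + (pos - lo as f64) * (sorted[hi] - sorted[lo]))
}

/// CEP50 as defined in the table: median horizontal error.
pub fn cep50(errors: &[f64]) -> Option<f64> {
    percentile(errors, 0.50)
}

/// CEP95 as defined in the table: 95th-percentile horizontal error.
pub fn cep95(errors: &[f64]) -> Option<f64> {
    percentile(errors, 0.95)
}
```

Other percentile conventions (for example nearest-rank) give slightly different values on small samples, so the convention should be fixed in the versioned config alongside the seeds.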
94+
95+
### 4.2 MARL Policy-Quality Metrics
96+
97+
| Metric | Definition |
98+
|--------|-----------|
99+
| IQM episodic return | Interquartile mean over 10 seeds × 50 eval episodes (Agarwal 2021) |
100+
| Probability of improvement | P(MAPPO return > IPPO return) on a random episode |
101+
| Optimality gap | Expected gap to a defined reference performance |
102+
| Performance profile | Fraction of (seed, episode) with localization error < τ, plotted vs τ ∈ [0,10] m |
103+
| Sample efficiency | Return vs training steps (curve, not point) |
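Probability of improvement, as defined in the table above, can be estimated by comparing every pair of returns from the two algorithms. A minimal sketch, assuming ties count as half a win (the Mann-Whitney convention); the commit message names a `probability_of_improvement` in `stats.rs`, but the signature below is an assumption and not the committed code:

```rust
/// Estimate P(X > Y) from two samples of episodic returns.
/// Ties contribute 0.5, so two identical samples score exactly 0.5.
pub fn probability_of_improvement(x: &[f64], y: &[f64]) -> Option<f64> {
    if x.is_empty() || y.is_empty() {
        return None;
    }
    let mut wins = 0.0;
    for &a in x {
        for &b in y {
            if a > b {
                wins += 1.0;
            } else if a == b {
                wins += 0.5;
            }
        }
    }
    Some(wins / (x.len() * y.len()) as f64)
}
```

A value near 0.5 means the two policies are indistinguishable on this metric; the value should be reported with a bootstrap CI like the other aggregates.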
104+
105+
### 4.3 Micro-benchmarks (criterion — latency only)
106+
107+
Retained from ADR-148, **labeled as compute latency, not quality**:
108+
`marl_actor_inference` 3.3 µs · `rrt_apf_100iter` 43 µs · `multiview_fusion_3drones` 54 ns ·
109+
`demo_coverage_estimate` 100 ps · `ppo_update_64transitions` 248 µs. Purpose: prove the
110+
control loop has no compute bottleneck (all ≪ the 10 ms / 100 Hz budget) and gate
111+
performance regressions. They are **not** evidence of policy or localization quality.
112+
113+
---
114+
115+
## 5. Statistical Protocol (Agarwal 2021 / Gorsane 2022)
116+
117+
| Requirement | Standard adopted |
118+
|-------------|------------------|
119+
| Seeds per condition | **≥10** training runs from distinct seeds |
120+
| Evaluation episodes | 50 fixed, versioned episodes per trained policy (10 victim layouts × 5 CSI-noise levels) |
121+
| Aggregate metric | **IQM** (not mean, not median) + performance profiles |
122+
| Confidence intervals | **95% stratified bootstrap**, 1,000 resamples |
123+
| Baselines (≥3) | Random walk (lower bound), Boustrophedon+manual-triangulation (heuristic), IPPO (no shared critic) |
124+
| Reproducibility | Versioned YAML config (drone count, area, victims, CSI σ amplitude / κ phase, wind, packet loss) + all seeds committed with results |
125+
126+
Rationale: Henderson et al. (2018) found ≤5-seed point estimates flip rankings; Agarwal et
127+
al. (2021, NeurIPS Outstanding Paper) show IQM needs ~10 runs for the statistical power that
128+
the median needs ~200 runs for; Gorsane et al. (2022) made ≥10 seeds + IQM + stratified CIs
129+
the cooperative-MARL standard. `rliable` (google-research/rliable) is the reference impl.
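The aggregate and its interval from the protocol table can be sketched in dependency-free Rust. This is an illustration under stated assumptions, not the committed `stats.rs`: the IQM trims floor(n/4) values from each end, each stratum is one seed's episode scores, the interval is the percentile bootstrap, and the `Lcg` type with its constants (Knuth's MMIX multiplier and increment) is a stand-in for whatever deterministic generator the harness uses.

```rust
/// Interquartile mean: drop the lowest and highest 25% of values, average the rest.
pub fn iqm(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let mut v = values.to_vec();
    v.sort_by(|a, b| a.total_cmp(b));
    let k = v.len() / 4; // number of values trimmed from each end
    let mid = &v[k..v.len() - k];
    Some(mid.iter().sum::<f64>() / mid.len() as f64)
}

/// Small deterministic linear congruential generator (hypothetical helper).
pub struct Lcg(u64);

impl Lcg {
    pub fn new(seed: u64) -> Self {
        Lcg(seed)
    }

    /// Index in 0..n; the slight modulo bias is acceptable for a sketch.
    fn next_index(&mut self, n: usize) -> usize {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        ((self.0 >> 33) as usize) % n
    }
}

/// 95% percentile-bootstrap CI of the IQM, resampling with replacement
/// inside each stratum so every seed keeps its original weight.
pub fn stratified_bootstrap_ci(
    strata: &[Vec<f64>],
    resamples: usize,
    seed: u64,
) -> Option<(f64, f64)> {
    if resamples == 0 || strata.is_empty() || strata.iter().any(|s| s.is_empty()) {
        return None;
    }
    let mut rng = Lcg::new(seed);
    let mut stats = Vec::with_capacity(resamples);
    for _ in 0..resamples {
        let mut pooled = Vec::new();
        for s in strata {
            for _ in 0..s.len() {
                pooled.push(s[rng.next_index(s.len())]);
            }
        }
        stats.push(iqm(&pooled)?);
    }
    stats.sort_by(|a, b| a.total_cmp(b));
    let at = |q: f64| stats[((stats.len() - 1) as f64 * q).round() as usize];
    Some((at(0.025), at(0.975)))
}
```

Because the generator is seeded explicitly, calling `stratified_bootstrap_ci(&strata, 1000, seed)` twice with the same inputs returns the same interval, which is what lets the interval be committed next to the seeds.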
130+
131+
---
132+
133+
## 6. Reproducibility Harness (`evals/`)
134+
135+
A new evaluation harness (separate from criterion micro-benchmarks):
136+
137+
1. **Seeded episodes** — every episode, noise perturbation, and training run seeded from a
138+
versioned config; seeds committed with results (no `Date.now()`/unseeded RNG).
139+
2. **Per-episode logging** — coverage %, localization error, GDOP, time-to-first-detection,
140+
collisions, detection binary → JSONL (reuses the ADR-148 telemetry schema).
141+
3. **Aggregation** — IQM ± 95% stratified-bootstrap CI across the 10-seed × 50-episode matrix.
142+
4. **Baseline sweep** — random / boustrophedon-heuristic / IPPO / MAPPO, so
143+
probability-of-improvement and performance profiles are computable.
144+
5. **Output** — committed `evals/RESULTS.md`: a reproducible internal leaderboard ranking
145+
our 6 flight patterns × learning patterns on the SAR metrics, plus the Wi2SAR paper row.
146+
147+
This `RESULTS.md` is the **real, defensible "leaderboard" for this system** — patterns ranked
148+
against each other and the cited baseline, reproducibly, with CIs.
149+
150+
### 6.1 Dual-stage pipeline (compute-cost mitigation)
151+
152+
The full matrix is **10 seeds × 50 episodes × ≥4 conditions = ≥2,000 rollouts per policy**.
153+
Running each rollout against the OccWorld 3D prior (ADR-147, ~375 ms/inference) would melt
154+
the L4 / RTX 5080 budget. Split evaluation into two stages:
155+
156+
- **Stage 1 — Kinematic (fast, full matrix).** Stripped vector environment; OccWorld paths
157+
pre-cached or treated as static analytical volumes. Produces episodic **return, IQM,
158+
sample-efficiency curves, coverage %, GDOP, localization error** over the full 10-seed matrix.
159+
- **Stage 2 — High-fidelity physics (sub-sampled).** Take the **3 median seeds** (by Stage-1
160+
IQM) into Gazebo + PX4 SITL with full CSI phase/amplitude noise. Extracts **false-alarm
161+
rate** and **collision rate** under realistic dynamics (heading-rate limits, APF repulsion,
162+
motor response) that the kinematic sim omits.
163+
164+
Stage 1 is CI-runnable today; Stage 2 requires the Gazebo/PX4 SITL bring-up (follow-on).
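Selecting the Stage-2 seeds can be sketched as follows. The ADR does not say how "median seeds" are chosen when the seed count is even, so this sketch assumes the `k` seeds centred on the middle of the Stage-1 IQM ranking; `median_seeds` is a hypothetical helper, not part of the harness:

```rust
/// Return the `k` seeds whose Stage-1 IQM scores sit in the middle of the ranking.
/// `scores` holds (seed, iqm) pairs; output is ordered by ascending IQM.
pub fn median_seeds(mut scores: Vec<(u64, f64)>, k: usize) -> Vec<u64> {
    scores.sort_by(|a, b| a.1.total_cmp(&b.1));
    let k = k.min(scores.len());
    // Centre a window of width k on the sorted list (biased low when uneven).
    let start = (scores.len() - k) / 2;
    scores[start..start + k].iter().map(|&(seed, _)| seed).collect()
}
```

With 10 seeds and `k = 3` this picks ranks 4 to 6 (1-based), which avoids spending simulator time on the best and worst outliers.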
165+
166+
### 6.2 Noise sweep (coherence-gate threshold)
167+
168+
The config generator systematically varies the two CSI noise parameters:
169+
- **σ** — Gaussian amplitude noise (CSI magnitude)
170+
- **κ** — von Mises phase concentration (lower κ = noisier phase)
171+
172+
Sweeping (σ, κ) isolates the exact environmental threshold where `CrossViewpointAttention`
173+
(ADR-016) drops out of its coherence gate (`coherence_gate.rs` Accept → PredictOnly/Reject,
174+
ADR-135). This finds the operating envelope, not just a single-point accuracy.
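The sweep itself is a Cartesian product over the two parameters. A minimal sketch of such a grid generator; the commit message mentions a 3σ×3κ default, but the `NoiseLevel` type and any concrete values passed in are placeholders and do not reflect the harness defaults:

```rust
/// One (σ, κ) operating point of the CSI noise model.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct NoiseLevel {
    /// Gaussian amplitude-noise standard deviation.
    pub sigma: f64,
    /// von Mises phase concentration (lower = noisier phase).
    pub kappa: f64,
}

/// Cartesian product of the σ and κ axes, σ-major order.
pub fn noise_grid(sigmas: &[f64], kappas: &[f64]) -> Vec<NoiseLevel> {
    sigmas
        .iter()
        .flat_map(|&sigma| kappas.iter().map(move |&kappa| NoiseLevel { sigma, kappa }))
        .collect()
}
```

For example, `noise_grid(&[0.05, 0.1, 0.2], &[8.0, 4.0, 2.0])` (illustrative values only) yields nine operating points; recording the coherence-gate decision at each one traces the boundary where Accept gives way to PredictOnly or Reject.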
175+
176+
### 6.3 GDOP tracking
177+
178+
Localization accuracy is meaningless without the constellation geometry that produced it.
179+
The harness records **GDOP** per detection: 3 drones in a ~120° constellation give the
180+
√3 ≈ 1.73× CRLB improvement; 3 **collinear** drones degrade toward the single-view
181+
Cramer-Rao limit (~2.9 m). Reporting localization error **stratified by GDOP band** prevents
182+
the headline number from being a best-case geometric artifact.
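The geometry dependence can be made concrete with a small 2-D GDOP routine. The commit message describes `gdop.rs` as the square root of the trace of (HᵀH)⁻¹, returning `None` for fewer than 2 observers or singular geometry; the sketch below follows that description but is not the committed code, and it assumes each row of H is the unit line-of-sight vector from an observer to the target:

```rust
/// 2-D GDOP of `observers` around `target`: sqrt(trace((HᵀH)⁻¹)).
/// Returns None for fewer than 2 observers or collinear/singular geometry.
pub fn gdop_2d(target: (f64, f64), observers: &[(f64, f64)]) -> Option<f64> {
    if observers.len() < 2 {
        return None;
    }
    // Accumulate the symmetric 2x2 matrix HᵀH = [[a, b], [b, c]].
    let (mut a, mut b, mut c) = (0.0, 0.0, 0.0);
    for &(ox, oy) in observers {
        let (dx, dy) = (target.0 - ox, target.1 - oy);
        let r = (dx * dx + dy * dy).sqrt();
        if r < 1e-9 {
            continue; // observer on top of the target: direction undefined, skip the row
        }
        let (ux, uy) = (dx / r, dy / r);
        a += ux * ux;
        b += ux * uy;
        c += uy * uy;
    }
    let det = a * c - b * b;
    if det < 1e-9 {
        return None; // all lines of sight parallel: HᵀH is singular
    }
    // For a 2x2 matrix, trace of the inverse is (a + c) / det.
    Some(((a + c) / det).sqrt())
}
```

Under this model, three observers spaced 120° apart give HᵀH = 1.5·I and a GDOP of 2/√3 ≈ 1.155, while three collinear observers make HᵀH singular and return `None`, which is the degenerate case the GDOP banding is meant to expose.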
183+
184+
---
185+
186+
## 7. Evidence Grading of Current ADR-148 Numbers
187+
188+
| Claim | Grade | Why |
189+
|-------|-------|-----|
190+
| criterion latencies (3.3 µs / 43 µs / 54 ns / 248 µs) | **High** | Deterministic compute, hardware-specific, reproducible |
191+
| Wi2SAR baseline (5 m, 160k m²/13.5 min) | **High** | Published field trial, open source |
192+
| 1.732 m 3-view localization | **Low–Medium** | Single synthetic geometry; no noise distribution; CRLB predicts ~2.9 m for N=3 |
193+
| 223 s 4-drone coverage | **Low** | Analytic estimate, not an episode rollout |
194+
| "beats SOTA" | **Directional only** | Valid as paper-to-paper direction; not leaderboard, not multi-seed |
195+
196+
The √N multi-view scaling claim is theoretically sound (CRLB: σ ∝ 1/√(N·SNR); N=3 → √3 ≈
197+
1.73× improvement), but the measured 1.732 m must be reproduced over a victim-position and
198+
noise distribution before it is defensible.
199+
200+
---
201+
202+
## 8. Consequences
203+
204+
### Positive
205+
- Converts scattered numbers into a reproducible, statistically-honest evaluation.
206+
- The `RESULTS.md` internal leaderboard ranks the 6 flight × 4 learning patterns fairly.
207+
- Aligns with the recognized MARL evaluation standard (IQM + stratified CIs + ≥10 seeds).
208+
- Honest external-leaderboard position avoids overclaiming.
209+
210+
### Costs / Risks
211+
- ≥10 seeds × 50 episodes × N patterns × N baselines is a real compute cost — this is where
212+
the ADR-148 GCP L4 / local RTX 5080 training budget is actually spent.
213+
- Requires the MARL policy to be **trained to convergence** first (the ADR-148 5-episode CPU
214+
run shows decreasing value_loss, not convergence).
215+
- Coverage/localization must move from analytic estimate / synthetic geometry to **episode
216+
rollouts under realistic CSI noise** before headline numbers are republished.
217+
218+
### Open issues → follow-on work
219+
1. Train MAPPO/IPPO to convergence (M4 follow-on) before running the eval harness.
220+
2. Build the seeded `evals/` harness + `RESULTS.md` generator.
221+
3. Optional: MARL4DRP or MPE Simple-Spread adapter for a recognized cooperative-MARL number.
222+
4. Re-state ADR-148 §14 headline numbers with CIs once the harness has run.
223+
224+
---
225+
226+
## 9. Research Notes & References
227+
228+
Compiled by `ruflo-goals:deep-researcher` (2026-05-30). Full landscape in the agent record.
229+
230+
**MARL evaluation rigor**
231+
- Henderson et al., "Deep Reinforcement Learning that Matters", arXiv 1709.06560 (≤5-seed estimates flip rankings)
232+
- Agarwal et al., "Deep Reinforcement Learning at the Edge of the Statistical Precipice", NeurIPS 2021, arXiv 2108.13264 (IQM, performance profiles, stratified bootstrap; `rliable`)
233+
- Gorsane et al., "Towards a Standardised Performance Evaluation Protocol for Cooperative MARL", NeurIPS 2022, arXiv 2209.10485 (≥10 seeds + IQM standard)
234+
- BenchMARL, arxiv 2312.01472 — operationalizes the above
235+
236+
**Cooperative-MARL benchmarks**
237+
- SMACv2, arxiv 2212.07489 · PettingZoo MPE (Farama) · Melting Pot (DeepMind, NeurIPS 2024 contest) · MAMuJoCo (Gymnasium-Robotics) · MARL4DRP, Springer 2023 (closest drone-MARL)
238+
239+
**Drone-sim platforms**
240+
- gym-pybullet-drones, arxiv 2103.02142 · OmniDrones, IEEE RA-L 2024 · Pegasus, arxiv 2307.05263 · Flightmare (IROS 2021) · AirSim (discontinued 2022) · Crazyswarm2
241+
242+
**SAR / coverage / CSI sensing**
243+
- Wi2SAR, arxiv 2604.09115 (direct baseline: 5 m, 160k m²/13.5 min, 18.4° median AoA)
244+
- CSI-Bench, NeurIPS 2025, arxiv 2505.21866 (461 h WiFi sensing, localization task)
245+
- Coverage path planning, PMC9571681 (boustrophedon ~5% faster than spiral)
246+
- Bio-inspired SAR, Nature s41598-025-33223-z (PSO > Levy/ACO on exploration score)
247+
- CRLB for CSI localization, IEEE 8110647 (σ ∝ 1/√(N·SNR))
248+
249+
**Tooling**
250+
- criterion.rs known limitations — wall-clock only, not algorithmic quality
251+
- rliable, github.com/google-research/rliable
252+
253+
---
254+
255+
*ADR authored with research support from `ruflo-goals:deep-researcher` (2026-05-30).
256+
Companion to ADR-148. Defines the evaluation methodology that the ADR-148 headline
257+
numbers must satisfy before being republished as defensible claims.*

v2/crates/ruview-swarm/Cargo.toml

Lines changed: 4 additions & 0 deletions
@@ -78,3 +78,7 @@ harness = false
78 78
[[bin]]
79 79
name = "train_marl"
80 80
required-features = ["train"]
81+
82+
# ADR-149 Stage-1 evaluation CLI — pure Rust, no special feature needed.
83+
[[bin]]
84+
name = "eval_swarm"
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
1+
# ADR-149 evaluation outputs
2+
RESULTS.md is generated by the `eval_swarm` binary.
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
1+
# ruview-swarm Evaluation Results (ADR-149 Stage 1, kinematic)
2+
3+
Statistically-rigorous evaluation harness: seeded multi-run rollouts with IQM + 95% stratified-bootstrap confidence intervals (Agarwal et al., NeurIPS 2021).
4+
5+
## Run configuration
6+
7+
- **Stage**: 1 (kinematic, self-contained, deterministic per seed)
8+
- **Episodes per pattern**: 100 (seed × episode matrix)
9+
- **CI method**: 95% stratified bootstrap of the IQM, stratified by seed
10+
- **GDOP**: 2-D geometric dilution of precision at first detection
11+
12+
> **Stage 2 pending**: high-fidelity Gazebo/PX4 SITL evaluation (false-alarm rate, real collision rate on the median seeds) is a follow-on — see ADR-149 §6.1. The collision figures below are a kinematic min-separation proxy, not SITL physics.
13+
14+
## Flight-pattern leaderboard
15+
16+
| Flight pattern | Coverage IQM [95% CI] | Localization (m) IQM [95% CI] | Detection rate | Mean GDOP |
17+
|----------------|-----------------------|-------------------------------|----------------|-----------|
18+
| partitioned_lawnmower | 1.000 [1.000, 1.000] | 7.022 [5.669, 8.379] | 100.0% | 0.000 |
19+
| pheromone | 0.662 [0.652, 0.671] | 4.110 [3.346, 5.141] | 95.0% | 1.598 |
20+
| levy_flight | 0.490 [0.489, 0.491] | 3.523 [2.897, 4.160] | 100.0% | 0.000 |
21+
| boustrophedon | 0.370 [0.370, 0.370] | 2.740 [2.357, 3.207] | 100.0% | 0.000 |
22+
| spiral | 0.336 [0.336, 0.336] | 3.082 [2.678, 3.568] | 100.0% | 0.000 |
23+
| potential_field | 0.254 [0.252, 0.256] | 4.343 [3.489, 5.265] | 100.0% | 0.000 |
24+
| _Wi2SAR (paper baseline)_ | _n/a_ | _5.0 (paper)_ | _n/a_ | _n/a_ |
25+
26+
_Wi2SAR row is the published single-drone localization figure (arxiv 2604.09115), shown paper-to-paper for reference only — it was not re-run through this kinematic harness._
_Reading the Mean GDOP column: per the commit message, `gdop.rs` returns `None` for fewer than 2 observers, and the partitioned pattern's 0.000 corresponds to single-drone sightings. A Mean GDOP of 0.000 therefore most likely means no multi-observer geometry was available at detection, not ideal geometry._
