| 1 | +# ADR-149: Drone Swarm Benchmarking & Evaluation Methodology — Metrics, Leaderboards, and Statistical Rigor
| 2 | +
| 3 | +| Field | Value |
| 4 | +|------------|-----------------------------------------------------------------------------------------|
| 5 | +| Status | Accepted (peer-reviewed 2026-05-30) |
| 6 | +| Date | 2026-05-30 |
| 7 | +| Deciders | ruv |
| 8 | +| Relates to | ADR-148 (ruview-swarm), ADR-147 (OccWorld), ADR-146 (RF encoder), ADR-028 (witness) |
| 9 | +
| 10 | +> Companion to ADR-148. ADR-148 shipped the swarm and 5 criterion micro-benchmarks |
| 11 | +> plus a `SotaComparison` against Wi2SAR. This ADR defines **how we evaluate the swarm |
| 12 | +> rigorously** — what metrics, what statistics, what baselines, and an honest account |
| 13 | +> of which external leaderboards do and do not apply. |
| 14 | +
| 15 | +--- |
| 16 | + |
| 17 | +## 1. Context |
| 18 | + |
| 19 | +ADR-148's `ruview-swarm` reports performance via five `criterion` micro-benchmarks and a |
| 20 | +single `SotaComparison` (localization 1.732 m vs Wi2SAR 5 m; coverage ~223 s vs 810 s). |
| 21 | +These numbers are **internally valid but insufficient as scientific claims**: |
| 22 | + |
| 23 | +- The criterion figures (3.3 µs MARL inference, 43 µs RRT-APF, 54 ns fusion, 248 µs PPO |
| 24 | + step) measure **wall-clock latency**, not policy quality or coverage/localization quality. |
| 25 | +- The 1.732 m localization comes from a **single synthetic geometry** (3 drones at 120° |
| 26 | + around a known point), not a distribution of victim positions under realistic noise. |
| 27 | +- The 223 s coverage is an **analytic estimate** (`estimate_coverage_time_secs()`), not an |
| 28 | + episode rollout. |
| 29 | +- All numbers are **single-run point estimates**. The MARL reproducibility literature
| 30 | +  (Henderson 2018; Agarwal 2021; Gorsane 2022) shows that single- or few-seed point estimates
| 31 | +  routinely flip algorithm rankings and overstate gains.
| 32 | + |
| 33 | +We need a defined, reproducible evaluation methodology before any "beats SOTA" claim can |
| 34 | +survive external review, and an honest position on external leaderboards. |
| 35 | + |
| 36 | +--- |
| 37 | + |
| 38 | +## 2. Decision |
| 39 | + |
| 40 | +Adopt a two-tier evaluation methodology: |
| 41 | + |
| 42 | +1. **Micro-benchmarks (criterion)** — keep for compute-latency regression gating only. |
| 43 | + Explicitly labeled as latency, never as quality. |
| 44 | +2. **Domain evaluation harness** — a seeded, multi-run, statistically-reported harness |
| 45 | + producing SAR metrics (localization CEP, coverage, detection rate) and MARL metrics |
| 46 | + (IQM return, probability-of-improvement) over **≥10 seeds with 95% stratified-bootstrap |
| 47 | + confidence intervals**, against **≥3 baselines**, following the Agarwal/Gorsane standard. |
| 48 | + |
| 49 | +Do **not** claim leaderboard standing — no public leaderboard accepts drone-swarm CSI-SAR |
| 50 | +submissions. Comparisons to Wi2SAR are **paper-to-paper**, labeled as such, acknowledging |
| 51 | +the sensing-modality difference (RSS bearing vs CSI multi-view fusion). |
| 52 | + |
| 53 | +--- |
| 54 | + |
| 55 | +## 3. External Leaderboard Landscape — Honest Assessment |
| 56 | + |
| 57 | +**There is no public, externally-administered leaderboard that accepts a drone-swarm, |
| 58 | +CSI-based, multi-view SAR system.** This is a research niche; comparison is paper-to-paper. |
| 59 | +The adjacent options and their fit: |
| 60 | + |
| 61 | +| Benchmark / Leaderboard | Domain | Live submission? | Fit for ruview-swarm | |
| 62 | +|-------------------------|--------|------------------|----------------------| |
| 63 | +| **Wi2SAR** (arxiv 2604.09115) | Drone WiFi SAR | No (paper) | **Direct baseline** — paper-to-paper only; RSS bearing ≠ CSI fusion | |
| 64 | +| **MARL4DRP** (Springer 2023) | Drone routing MARL | No | Closest drone-MARL benchmark; would need a routing→coverage adapter | |
| 65 | +| **CSI-Bench** (NeurIPS 2025) | Static WiFi sensing | Splits + paper baselines | Adjacent (localization task) but no moving-sensor/multi-view fusion | |
| 66 | +| **SMAC / SMACv2** | StarCraft cooperative MARL | No live LB | Structural analogy (CTDE) only; combat task, not coverage | |
| 67 | +| **PettingZoo MPE** (Simple Spread) | 2D cooperative particles | No | Cheap MARL **correctness check**, no physics/CSI | |
| 68 | +| **Melting Pot** | Social-dynamics MARL | Closed (NeurIPS '24) | Not applicable | |
| 69 | +| **MAMuJoCo / Hanabi / GRF / Overcooked** | Various cooperative MARL | No live LB | Not applicable | |
| 70 | +| **OmniDrones / gym-pybullet-drones / Pegasus** | Drone-control sim platforms | No (platforms) | **Training infrastructure**, not leaderboards; no CSI layer | |
| 71 | + |
| 72 | +**Conclusion:** We will (a) keep Wi2SAR as the cited paper baseline, (b) optionally build a |
| 73 | +MARL4DRP/MPE adapter to post a recognized cooperative-MARL number (tangential to SAR), and |
| 74 | +(c) **not** represent any internal number as a leaderboard placement. |
| 75 | + |
| 76 | +--- |
| 77 | + |
| 78 | +## 4. Evaluation Metrics |
| 79 | + |
| 80 | +### 4.1 SAR Domain Metrics (primary — comparable to Wi2SAR) |
| 81 | + |
| 82 | +| Metric | Definition | Reporting | |
| 83 | +|--------|-----------|-----------| |
| 84 | +| Localization CEP50 | Median horizontal error, fused victim position vs ground truth | m, 95% CI | |
| 85 | +| Localization CEP95 | 95th-percentile horizontal error | m | |
| 86 | +| **GDOP** | Geometric Dilution of Precision of the contributing-drone constellation at detection time | dimensionless (tracked per detection) | |
| 87 | +| Coverage rate @ T | Fraction of area scanned ≥1× within T=240 s | %, 95% CI | |
| 88 | +| Coverage time to 95% | Time to scan 95% of bounded area | s, mean ± CI | |
| 89 | +| Time-to-first-detection | Mission start → first confident detection (conf > 0.85) | s, 95% CI | |
| 90 | +| Detection rate | P(detected \| victim present) per mission | %, 95% CI | |
| 91 | +| False-alarm rate | P(confident detection \| no victim) | %, 95% CI | |
| 92 | +| Collision rate | Collisions (d < 1.5 m) per mission | count/mission | |
| 93 | +| Overlap ratio | Fraction of path re-covering scanned cells | % | |
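
A minimal sketch of how CEP50 and CEP95 could be computed from per-episode localization errors, following the definitions in the table above. The function names and the linear-interpolation percentile rule are assumptions made for illustration; the ADR does not specify an interpolation method.

```python
import math

def horizontal_error(estimate, truth):
    """Horizontal (x, y) distance in metres between a fused estimate and ground truth."""
    return math.hypot(estimate[0] - truth[0], estimate[1] - truth[1])

def percentile(values, q):
    """Percentile with linear interpolation between order statistics, q in [0, 100]."""
    if not values:
        raise ValueError("percentile of an empty sample is undefined")
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def cep_summary(errors_m):
    """CEP50 is the median horizontal error; CEP95 is the 95th percentile."""
    return {"cep50_m": percentile(errors_m, 50), "cep95_m": percentile(errors_m, 95)}
```

Feeding `cep_summary` the errors from every (seed, episode) pair gives the point estimates; the confidence interval on CEP50 would come from the bootstrap in §5.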
| 94 | + |
| 95 | +### 4.2 MARL Policy-Quality Metrics |
| 96 | + |
| 97 | +| Metric | Definition | |
| 98 | +|--------|-----------| |
| 99 | +| IQM episodic return | Interquartile mean over 10 seeds × 50 eval episodes (Agarwal 2021) | |
| 100 | +| Probability of improvement | P(MAPPO return > IPPO return) on a random episode | |
| 101 | +| Optimality gap | Expected gap to a defined reference performance | |
| 102 | +| Performance profile | Fraction of (seed, episode) with localization error < τ, plotted vs τ ∈ [0,10] m | |
| 103 | +| Sample efficiency | Return vs training steps (curve, not point) | |
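
As a hedged illustration of two rows in the table above, the sketch below estimates probability of improvement by comparing every pair of episode returns, and builds a performance profile as the fraction of runs with localization error below each threshold τ. Counting ties as half a win is an assumption, and both function names are hypothetical.

```python
def probability_of_improvement(returns_a, returns_b):
    """Estimate P(A > B) over all (a, b) pairs; ties count as half a win (assumption)."""
    if not returns_a or not returns_b:
        raise ValueError("both samples must be non-empty")
    wins = 0.0
    for a in returns_a:
        for b in returns_b:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(returns_a) * len(returns_b))

def performance_profile(errors_m, taus):
    """Fraction of (seed, episode) runs whose localization error is below each tau."""
    if not errors_m:
        raise ValueError("errors_m must be non-empty")
    n = len(errors_m)
    return [sum(1 for e in errors_m if e < tau) / n for tau in taus]
```

For the MAPPO vs IPPO comparison, `returns_a` would hold MAPPO episode returns and `returns_b` the IPPO returns on the same versioned episodes.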
| 104 | + |
| 105 | +### 4.3 Micro-benchmarks (criterion — latency only) |
| 106 | + |
| 107 | +Retained from ADR-148, **labeled as compute latency, not quality**: |
| 108 | +`marl_actor_inference` 3.3 µs · `rrt_apf_100iter` 43 µs · `multiview_fusion_3drones` 54 ns · |
| 109 | +`demo_coverage_estimate` 100 ps · `ppo_update_64transitions` 248 µs. Purpose: prove the |
| 110 | +control loop has no compute bottleneck (all ≪ the 10 ms / 100 Hz budget) and gate |
| 111 | +performance regressions. They are **not** evidence of policy or localization quality. |
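
The budget arithmetic can be checked with a short sketch. Summing all five figures into one control tick is a deliberately pessimistic assumption (the ADR does not say the PPO update runs inside the control loop), and `budget_report` is a hypothetical helper, not part of the criterion suite.

```python
BUDGET_US = 10_000.0  # 10 ms control period at 100 Hz

# Latencies quoted above, converted to microseconds.
LATENCY_US = {
    "marl_actor_inference": 3.3,
    "rrt_apf_100iter": 43.0,
    "multiview_fusion_3drones": 0.054,   # 54 ns
    "demo_coverage_estimate": 0.0001,    # 100 ps
    "ppo_update_64transitions": 248.0,
}

def budget_report(latency_us, budget_us=BUDGET_US):
    """Worst-case view: every benchmark runs serially in a single control tick."""
    total = sum(latency_us.values())
    return {
        "total_us": total,
        "fraction_of_budget": total / budget_us,
        "over_budget": [name for name, v in latency_us.items() if v >= budget_us],
    }
```

Even under this worst case the total is about 294 µs, roughly 3% of the 10 ms budget.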
| 112 | + |
| 113 | +--- |
| 114 | + |
| 115 | +## 5. Statistical Protocol (Agarwal 2021 / Gorsane 2022) |
| 116 | + |
| 117 | +| Requirement | Standard adopted | |
| 118 | +|-------------|------------------| |
| 119 | +| Seeds per condition | **≥10** training runs from distinct seeds | |
| 120 | +| Evaluation episodes | 50 fixed, versioned episodes per trained policy (10 victim layouts × 5 CSI-noise levels) | |
| 121 | +| Aggregate metric | **IQM** (not mean, not median) + performance profiles | |
| 122 | +| Confidence intervals | **95% stratified bootstrap**, 1,000 resamples | |
| 123 | +| Baselines (≥3) | Random walk (lower bound), Boustrophedon+manual-triangulation (heuristic), IPPO (no shared critic) | |
| 124 | +| Reproducibility | Versioned YAML config (drone count, area, victims, CSI σ amplitude / κ phase, wind, packet loss) + all seeds committed with results | |
| 125 | + |
| 126 | +Rationale: Henderson et al. (2018) found that ≤5-seed point estimates flip rankings; Agarwal
| 127 | +et al. (2021, NeurIPS Outstanding Paper) show that IQM reaches, with ~10 runs, the statistical power
| 128 | +that the median needs ~200 runs to reach; Gorsane et al. (2022) made ≥10 seeds + IQM + stratified
| 129 | +CIs the cooperative-MARL standard. `rliable` (google-research/rliable) is the reference implementation.
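
The protocol above can be sketched in plain Python to make the two core computations concrete. This is an illustration under stated assumptions, not a replacement for `rliable`: IQM is taken as the 25% trimmed mean, strata are the fixed evaluation episodes (training seeds are resampled with replacement within each episode), and the interval is a simple percentile interval.

```python
import random

def iqm(values):
    """Interquartile mean: drop the lowest and highest 25% of values, average the rest."""
    if not values:
        raise ValueError("iqm of an empty sample is undefined")
    xs = sorted(values)
    k = int(0.25 * len(xs))
    middle = xs[k:len(xs) - k]
    return sum(middle) / len(middle)

def stratified_bootstrap_ci(scores, reps=1000, alpha=0.05, seed=0):
    """Percentile CI for the IQM; scores[s][e] is training seed s on eval episode e."""
    rng = random.Random(seed)  # the bootstrap itself is seeded for reproducibility
    n_seeds, n_episodes = len(scores), len(scores[0])
    stats = []
    for _rep in range(reps):
        sample = []
        for e in range(n_episodes):      # one stratum per evaluation episode
            for _draw in range(n_seeds): # resample training seeds with replacement
                sample.append(scores[rng.randrange(n_seeds)][e])
        stats.append(iqm(sample))
    stats.sort()
    lo = stats[int((alpha / 2) * reps)]
    hi = stats[min(reps - 1, int((1 - alpha / 2) * reps))]
    return lo, hi
```

With the 10-seed × 50-episode matrix, `scores` is a 10 × 50 list of lists and the defaults match the 95% interval and 1,000 resamples in the table.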
| 130 | + |
| 131 | +--- |
| 132 | + |
| 133 | +## 6. Reproducibility Harness (`evals/`) |
| 134 | + |
| 135 | +A new evaluation harness (separate from criterion micro-benchmarks): |
| 136 | + |
| 137 | +1. **Seeded episodes** — every episode, noise perturbation, and training run seeded from a |
| 138 | + versioned config; seeds committed with results (no `Date.now()`/unseeded RNG). |
| 139 | +2. **Per-episode logging** — coverage %, localization error, GDOP, time-to-first-detection, |
| 140 | + collisions, detection binary → JSONL (reuses the ADR-148 telemetry schema). |
| 141 | +3. **Aggregation** — IQM ± 95% stratified-bootstrap CI across the 10-seed × 50-episode matrix. |
| 142 | +4. **Baseline sweep** — random / boustrophedon-heuristic / IPPO / MAPPO, so |
| 143 | + probability-of-improvement and performance profiles are computable. |
| 144 | +5. **Output** — committed `evals/RESULTS.md`: a reproducible internal leaderboard ranking |
| 145 | +   our 6 flight patterns × 4 learning patterns on the SAR metrics, plus the Wi2SAR paper row.
| 146 | + |
| 147 | +This `RESULTS.md` is the **real, defensible "leaderboard" for this system** — patterns ranked |
| 148 | +against each other and the cited baseline, reproducibly, with CIs. |
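
Steps 1 and 2 could look roughly like the sketch below: each episode's RNG seed is derived deterministically from a versioned master seed, and each episode emits one JSONL record. The field names and `stub_rollout` are hypothetical stand-ins; the real records would follow the ADR-148 telemetry schema, which is not reproduced here.

```python
import hashlib
import json
import random

def derive_seed(master_seed, *labels):
    """Stable 64-bit sub-seed from a master seed and a label path (no wall-clock input)."""
    key = "/".join(str(part) for part in (master_seed, *labels)).encode("utf-8")
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")

def stub_rollout(rng):
    """Stand-in for a real episode rollout; values are placeholders, not results."""
    return {
        "coverage_pct": round(rng.uniform(0.0, 100.0), 3),
        "localization_error_m": round(rng.uniform(0.0, 10.0), 3),
        "gdop": round(rng.uniform(1.0, 5.0), 3),
        "time_to_first_detection_s": round(rng.uniform(0.0, 240.0), 3),
        "collisions": 0,
        "detected": rng.random() < 0.5,
    }

def episode_line(master_seed, train_seed, episode):
    """One JSONL line for a (training seed, eval episode) pair, seed included."""
    seed = derive_seed(master_seed, "eval", train_seed, episode)
    record = {"train_seed": train_seed, "episode": episode, "episode_seed": seed}
    record.update(stub_rollout(random.Random(seed)))
    return json.dumps(record, sort_keys=True)
```

Because the seed is written into each record, any single episode can be replayed without rerunning the whole matrix.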
| 149 | + |
| 150 | +### 6.1 Dual-stage pipeline (compute-cost mitigation) |
| 151 | + |
| 152 | +The full matrix is **10 seeds × 50 episodes × ≥4 conditions = ≥2,000 rollouts per policy**. |
| 153 | +Running each rollout against the OccWorld 3D prior (ADR-147, ~375 ms/inference) would melt |
| 154 | +the L4 / RTX 5080 budget. Split evaluation into two stages: |
| 155 | + |
| 156 | +- **Stage 1 — Kinematic (fast, full matrix).** Stripped vector environment; OccWorld paths |
| 157 | + pre-cached or treated as static analytical volumes. Produces episodic **return, IQM, |
| 158 | + sample-efficiency curves, coverage %, GDOP, localization error** over the full 10-seed matrix. |
| 159 | +- **Stage 2 — High-fidelity physics (sub-sampled).** Take the **3 median seeds** (by Stage-1 |
| 160 | + IQM) into Gazebo + PX4 SITL with full CSI phase/amplitude noise. Extracts **false-alarm |
| 161 | + rate** and **collision rate** under realistic dynamics (heading-rate limits, APF repulsion, |
| 162 | + motor response) that the kinematic sim omits. |
| 163 | + |
| 164 | +Stage 1 is CI-runnable today; Stage 2 requires the Gazebo/PX4 SITL bring-up (follow-on). |
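
A rough sketch of the arithmetic behind the split and of the Stage-2 seed selection. The number of OccWorld inferences per rollout is not stated in this ADR, so it is a free parameter here, and picking the three seeds centred on the median of the Stage-1 IQM ranking is one possible reading of "3 median seeds".

```python
def rollout_count(seeds=10, episodes=50, conditions=4):
    """Size of the full evaluation matrix."""
    return seeds * episodes * conditions

def occworld_hours(rollouts, inferences_per_rollout, secs_per_inference=0.375):
    """Serial OccWorld inference time in hours; inferences_per_rollout is an assumption."""
    return rollouts * inferences_per_rollout * secs_per_inference / 3600.0

def median_seeds(iqm_by_seed, k=3):
    """Pick the k seeds centred on the median of the Stage-1 IQM ranking."""
    if k > len(iqm_by_seed):
        raise ValueError("k exceeds the number of seeds")
    ranked = sorted(iqm_by_seed, key=lambda s: (iqm_by_seed[s], s))
    start = (len(ranked) - k) // 2
    return ranked[start:start + k]
```

For example, if a rollout needed 100 OccWorld inferences (a purely hypothetical figure), the 2,000-rollout matrix would cost about 20.8 hours of serial inference, which is the cost Stage 1 avoids by caching.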
| 165 | + |
| 166 | +### 6.2 Noise sweep (coherence-gate threshold) |
| 167 | + |
| 168 | +The config generator systematically varies the two CSI noise parameters: |
| 169 | +- **σ** — Gaussian amplitude noise (CSI magnitude) |
| 170 | +- **κ** — von Mises phase concentration (lower κ = noisier phase) |
| 171 | + |
| 172 | +Sweeping (σ, κ) isolates the exact environmental threshold where `CrossViewpointAttention` |
| 173 | +(ADR-016) drops out of its coherence gate (`coherence_gate.rs` Accept → PredictOnly/Reject, |
| 174 | +ADR-135). This finds the operating envelope, not just a single-point accuracy. |
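
A minimal sketch of the sweep generator and the noise model it implies, assuming the noise is applied per complex CSI sample: additive Gaussian noise on the magnitude and a von Mises phase offset centred on zero. The function names and grid values are illustrative; the coherence gate itself is not modelled.

```python
import cmath
import itertools
import random

def noise_grid(sigmas, kappas):
    """Cartesian product of amplitude-noise and phase-concentration settings."""
    return [{"sigma": s, "kappa": k} for s, k in itertools.product(sigmas, kappas)]

def perturb_csi(sample, sigma, kappa, rng):
    """Apply Gaussian amplitude noise and von Mises phase noise to one complex CSI value."""
    amplitude = max(0.0, abs(sample) + rng.gauss(0.0, sigma))  # magnitude cannot go negative
    phase_offset = rng.vonmisesvariate(0.0, kappa)             # lower kappa = noisier phase
    return cmath.rect(amplitude, cmath.phase(sample) + phase_offset)
```

Running the Stage-1 episodes once per grid cell and recording the gate decision for each cell would trace out the Accept → PredictOnly/Reject boundary in the (σ, κ) plane.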
| 175 | + |
| 176 | +### 6.3 GDOP tracking |
| 177 | + |
| 178 | +Localization accuracy is meaningless without the constellation geometry that produced it. |
| 179 | +The harness records **GDOP** per detection: 3 drones in a ~120° constellation give the |
| 180 | +√3 ≈ 1.73× CRLB improvement; 3 **collinear** drones degrade toward the single-view |
| 181 | +Cramér-Rao limit (~2.9 m). Reporting localization error **stratified by GDOP band** prevents
| 182 | +the headline number from being a best-case geometric artifact. |
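
To illustrate how constellation geometry enters, the sketch below computes a 2D GDOP from unit line-of-sight vectors between the victim and each contributing drone. This is the textbook range-style formulation with equal weighting, used here as an assumption; the ADR does not state which measurement model the harness will use, and `gdop_2d` is a hypothetical name.

```python
import math

def gdop_2d(victim, drones):
    """sqrt(trace((H^T H)^-1)) with one unit line-of-sight row per drone (2D)."""
    a = b = c = 0.0  # accumulate H^T H = [[a, b], [b, c]]
    for x, y in drones:
        dx, dy = x - victim[0], y - victim[1]
        r = math.hypot(dx, dy)
        if r == 0.0:
            raise ValueError("drone coincides with the victim position")
        ux, uy = dx / r, dy / r
        a += ux * ux
        b += ux * uy
        c += uy * uy
    det = a * c - b * b
    if det <= 1e-12:  # collinear lines of sight: geometry is degenerate
        return math.inf
    return math.sqrt((a + c) / det)
```

Under this formulation three drones spaced 120° apart give a GDOP of about 1.15, while three drones on a single line through the victim give an unbounded value, which is the degradation the stratified report is meant to expose.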
| 183 | + |
| 184 | +--- |
| 185 | + |
| 186 | +## 7. Evidence Grading of Current ADR-148 Numbers |
| 187 | + |
| 188 | +| Claim | Grade | Why | |
| 189 | +|-------|-------|-----| |
| 190 | +| criterion latencies (3.3 µs / 43 µs / 54 ns / 248 µs) | **High** | Deterministic compute, hardware-specific, reproducible | |
| 191 | +| Wi2SAR baseline (5 m, 160k m²/13.5 min) | **High** | Published field trial, open source | |
| 192 | +| 1.732 m 3-view localization | **Low–Medium** | Single synthetic geometry; no noise distribution; CRLB predicts ~2.9 m for N=3 | |
| 193 | +| 223 s 4-drone coverage | **Low** | Analytic estimate, not an episode rollout | |
| 194 | +| "beats SOTA" | **Directional only** | Valid as paper-to-paper direction; not leaderboard, not multi-seed | |
| 195 | + |
| 196 | +The √N multi-view scaling claim is theoretically sound (CRLB: σ ∝ 1/√(N·SNR); N=3 → √3 ≈ |
| 197 | +1.73× improvement), but the measured 1.732 m must be reproduced over a victim-position and |
| 198 | +noise distribution before it is defensible. |
| 199 | + |
| 200 | +--- |
| 201 | + |
| 202 | +## 8. Consequences |
| 203 | + |
| 204 | +### Positive |
| 205 | +- Converts scattered numbers into a reproducible, statistically-honest evaluation. |
| 206 | +- The `RESULTS.md` internal leaderboard ranks the 6 flight × 4 learning patterns fairly. |
| 207 | +- Aligns with the recognized MARL evaluation standard (IQM + stratified CIs + ≥10 seeds). |
| 208 | +- Honest external-leaderboard position avoids overclaiming. |
| 209 | + |
| 210 | +### Costs / Risks |
| 211 | +- ≥10 seeds × 50 episodes × N patterns × N baselines is a real compute cost — this is where |
| 212 | + the ADR-148 GCP L4 / local RTX 5080 training budget is actually spent. |
| 213 | +- Requires the MARL policy to be **trained to convergence** first (the ADR-148 5-episode CPU |
| 214 | + run shows decreasing value_loss, not convergence). |
| 215 | +- Coverage/localization must move from analytic estimate / synthetic geometry to **episode |
| 216 | + rollouts under realistic CSI noise** before headline numbers are republished. |
| 217 | + |
| 218 | +### Open issues → follow-on work |
| 219 | +1. Train MAPPO/IPPO to convergence (M4 follow-on) before running the eval harness. |
| 220 | +2. Build the seeded `evals/` harness + `RESULTS.md` generator. |
| 221 | +3. Optional: MARL4DRP or MPE Simple-Spread adapter for a recognized cooperative-MARL number. |
| 222 | +4. Re-state ADR-148 §14 headline numbers with CIs once the harness has run. |
| 223 | + |
| 224 | +--- |
| 225 | + |
| 226 | +## 9. Research Notes & References |
| 227 | + |
| 228 | +Compiled by `ruflo-goals:deep-researcher` (2026-05-30). Full landscape in the agent record. |
| 229 | + |
| 230 | +**MARL evaluation rigor** |
| 231 | +- Henderson et al., "Deep RL That Matters", arxiv 1709.06560 — ≤5-seed estimates flip rankings |
| 232 | +- Agarwal et al., "Deep RL at the Edge of the Statistical Precipice", NeurIPS 2021, arxiv 2108.13264 — IQM, performance profiles, stratified bootstrap; `rliable` |
| 233 | +- Gorsane et al., "Standardised Evaluation Protocol for Cooperative MARL", NeurIPS 2022, arxiv 2209.10485 — ≥10 seeds + IQM standard |
| 234 | +- BenchMARL, arxiv 2312.01472 — operationalizes the above |
| 235 | + |
| 236 | +**Cooperative-MARL benchmarks** |
| 237 | +- SMACv2, arxiv 2212.07489 · PettingZoo MPE (Farama) · Melting Pot (DeepMind, NeurIPS 2024 contest) · MAMuJoCo (Gymnasium-Robotics) · MARL4DRP, Springer 2023 (closest drone-MARL) |
| 238 | + |
| 239 | +**Drone-sim platforms** |
| 240 | +- gym-pybullet-drones, arxiv 2103.02142 · OmniDrones, IEEE RA-L 2024 · Pegasus, arxiv 2307.05263 · Flightmare (CoRL 2020) · AirSim (discontinued 2022) · Crazyswarm2
| 241 | + |
| 242 | +**SAR / coverage / CSI sensing** |
| 243 | +- Wi2SAR, arxiv 2604.09115 (direct baseline: 5 m, 160k m²/13.5 min, 18.4° median AoA) |
| 244 | +- CSI-Bench, NeurIPS 2025, arxiv 2505.21866 (461 h WiFi sensing, localization task) |
| 245 | +- Coverage path planning, PMC9571681 (boustrophedon ~5% faster than spiral) |
| 246 | +- Bio-inspired SAR, Nature s41598-025-33223-z (PSO > Levy/ACO on exploration score) |
| 247 | +- CRLB for CSI localization, IEEE 8110647 (σ ∝ 1/√(N·SNR)) |
| 248 | + |
| 249 | +**Tooling** |
| 250 | +- criterion.rs known limitations — wall-clock only, not algorithmic quality |
| 251 | +- rliable, github.com/google-research/rliable |
| 252 | + |
| 253 | +--- |
| 254 | + |
| 255 | +*ADR authored with research support from `ruflo-goals:deep-researcher` (2026-05-30). |
| 256 | + Companion to ADR-148. Defines the evaluation methodology that the ADR-148 headline |
| 257 | + numbers must satisfy before being republished as defensible claims.* |