# researchReport — methodology

This document is the methodological brief for `researchReport` (exported from
`@tangle-network/agent-eval` and `@tangle-network/agent-eval/reporting`). It
exists so a launch reviewer, peer reviewer, or auditor can quickly verify that
the verdict embedded in any rendered report is defensible, reproducible, and
appropriate to the data.

The companion code is `src/summary-report.ts`. Each item below names the
corresponding function or option so the doc and the code don't drift.

## Inputs

- `runs: RunRecord[]` — every record carries `runId`, `candidateId`, `seed`,
  `experimentId`, `splitTag`, and an `outcome` with the configured score.
- `comparator: string` — the candidate id treated as the null reference. Must
  be selected before data inspection; `preregistrationHash` should pin this.
- `split: 'search' | 'holdout'` — defaults to `holdout`. Decisions on `search`
  are descriptive only; promotion calls require the holdout.
- `rope: { low, high }` — Region of Practical Equivalence on the paired delta,
  in score units. Must come from the domain owner — there is no statistically
  defensible default.
- `minPairs` (soft floor, default 20) and `RESEARCH_REPORT_HARD_PAIR_FLOOR`
  (hard floor, 6). Below the soft floor, the verdict is `needs_more_data` and
  the report carries the MDE at the current N.
- `fdr` (default 0.05), `confidence` (default 0.95), `mdePower` (default 0.8),
  `mdeAlpha` (default = `fdr`).

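
A minimal call sketch, assuming `researchReport` accepts these options as a
single object (check `src/summary-report.ts` for the exact option type; the
comparator id, ROPE bounds, and manifest hash below are illustrative):

```ts
import { researchReport } from '@tangle-network/agent-eval/reporting';
import type { RunRecord } from '@tangle-network/agent-eval';

// How the run records and the signed manifest hash reach the caller is out of
// scope here; both are assumed to exist already.
declare const runs: RunRecord[];
declare const manifestHash: string; // hash of the signed HypothesisManifest

const report = researchReport({
  runs,
  comparator: 'baseline-v1',        // pinned before data inspection
  split: 'holdout',                 // promotion calls require the holdout
  rope: { low: -0.02, high: 0.02 }, // from the domain owner, in score units
  minPairs: 20,                     // soft floor; below it -> needs_more_data
  fdr: 0.05,
  confidence: 0.95,
  mdePower: 0.8,
  preregistrationHash: manifestHash,
});
```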
## Pairing

Pairs are joined by `(experimentId, seed)` so the comparator and candidate
share scenario *and* seed. This is the same join `gainHistogram` uses; see
`pairScoresByKey` in `src/summary-report.ts`. Records on the wrong split or
with non-finite scores are dropped before pairing.

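
A sketch of that join, not the real `pairScoresByKey` (the flattened record
shape and the `score` field are assumptions; the doc only specifies the key
fields):

```ts
// Join comparator and candidate scores on (experimentId, seed). Records on the
// wrong split or with non-finite scores are assumed to have been dropped already.
interface ScoredRun {
  experimentId: string;
  seed: number;
  score: number;
}

function pairByExperimentAndSeed(
  comparatorRuns: ScoredRun[],
  candidateRuns: ScoredRun[],
): Array<{ comparator: number; candidate: number; delta: number }> {
  const key = (r: ScoredRun) => `${r.experimentId}::${r.seed}`;
  const baseline = new Map<string, number>();
  for (const r of comparatorRuns) baseline.set(key(r), r.score);

  const pairs: Array<{ comparator: number; candidate: number; delta: number }> = [];
  for (const r of candidateRuns) {
    const base = baseline.get(key(r));
    if (base !== undefined) {
      pairs.push({ comparator: base, candidate: r.score, delta: r.score - base });
    }
  }
  return pairs;
}
```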
## Decision rule

In order — first match wins (a sketch of the cascade follows the list):

1. `comparator` itself → `hold` (baseline).
2. No comparator → `hold` if on the cost/quality Pareto frontier, else
   `needs_more_data`. The verdict is descriptive, not causal.
3. Held-out gate verdict ≠ `promote` → `reject`. The gate is *necessary but
   not sufficient*; even a `promote` gate must clear the paired test below.
4. Paired N < `RESEARCH_REPORT_HARD_PAIR_FLOOR` → `needs_more_data` with a
   "below hard floor" reason. Bootstrap CIs degenerate at this size.
5. ROPE configured AND paired-delta CI ⊂ ROPE → `equivalent`.
6. Paired-delta CI upper bound < 0 → `reject` (the CI excludes any
   non-negative effect). Note: this uses **paired delta only** — not the
   marginal mean.
7. Paired N < `minPairs` (soft floor) → `needs_more_data` with the MDE at
   current N attached so the verdict is actionable.
8. BH-adjusted q ≤ `fdr` AND CI lower bound > 0 → `promote`. The BH q-value
   controls FDR across all candidates in the same sweep; the bootstrap CI
   provides an effect-size guarantee independent of the test.
9. Otherwise → `hold`.

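
The cascade as a TypeScript sketch. The `PairedSummary` shape and its field
names are stand-ins for whatever `summary-report.ts` actually computes; only
the ordering and the short-circuiting are the point:

```ts
type Verdict = 'promote' | 'hold' | 'reject' | 'equivalent' | 'needs_more_data';

// Assumed per-candidate summary of the paired statistics against the comparator.
interface PairedSummary {
  isComparator: boolean;
  hasComparator: boolean;
  onParetoFrontier: boolean;
  gatePromoted: boolean;             // held-out gate verdict === 'promote'
  nPairs: number;
  ci: { low: number; high: number }; // paired bootstrap CI on the median delta
  rope?: { low: number; high: number };
  qValue: number;                    // BH-adjusted
}

const HARD_PAIR_FLOOR = 6;

function verdictFor(s: PairedSummary, minPairs: number, fdr: number): Verdict {
  if (s.isComparator) return 'hold';                                            // 1. baseline
  if (!s.hasComparator) return s.onParetoFrontier ? 'hold' : 'needs_more_data'; // 2. descriptive
  if (!s.gatePromoted) return 'reject';                                         // 3. gate is necessary
  if (s.nPairs < HARD_PAIR_FLOOR) return 'needs_more_data';                     // 4. hard floor
  if (s.rope && s.ci.low >= s.rope.low && s.ci.high <= s.rope.high) {
    return 'equivalent';                                                        // 5. CI inside ROPE
  }
  if (s.ci.high < 0) return 'reject';                                           // 6. CI excludes gains
  if (s.nPairs < minPairs) return 'needs_more_data';                            // 7. soft floor
  if (s.qValue <= fdr && s.ci.low > 0) return 'promote';                        // 8. FDR + effect size
  return 'hold';                                                                // 9. default
}
```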
## Statistical primitives used

| Quantity | Function | Source file |
|---|---|---|
| Marginal CI on score mean | `confidenceInterval` | `statistics.ts` |
| Cohen's d vs comparator | `cohensD` | `statistics.ts` |
| Wilcoxon signed-rank (paired) | `wilcoxonSignedRank` | `statistics.ts` |
| BH-FDR q-values | `benjaminiHochberg` | `power-analysis.ts` |
| Paired bootstrap CI on median delta | `pairedBootstrap` | `paired-stats.ts` |
| Bayesian-bootstrap-style Pr(Δ>0), Pr(Δ∈ROPE) | `bootstrapMeanSamples` | `summary-report.ts` (private) |
| Minimum detectable paired effect | `pairedMde` | `power-analysis.ts` |
| Run fingerprint | `hashJson(canonicalize(...))` | `pre-registration.ts` |

The Pr(Δ>0) and Pr(Δ∈ROPE) summaries use the bootstrap-prior duality of
[Rubin 1981]: under a non-informative Dirichlet prior over the observed
deltas, the Bayesian-bootstrap distribution of a sample statistic is its
posterior, and the ordinary bootstrap approximates it closely. We report the
posterior-style probabilities on the **mean** delta and the bootstrap CI on
the **median** delta: the median is more robust to the heavy-tailed score
distributions seen in agent benchmarks, while the mean lets us read off the
Bayesian-style probability of superiority in a single number.

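
A sketch of those summaries using a literal Bayesian bootstrap on the paired
deltas; the private `bootstrapMeanSamples` may well resample rather than draw
Dirichlet weights, so treat this as the idea, not the implementation:

```ts
// Bayesian bootstrap (Rubin 1981): each replicate draws Dirichlet(1, ..., 1)
// weights over the observed paired deltas and records the weighted mean.
function bayesianBootstrapMeans(deltas: number[], replicates = 4000): number[] {
  const means: number[] = [];
  for (let b = 0; b < replicates; b++) {
    // Dirichlet(1, ..., 1) via normalised Exp(1) draws.
    const gammas = deltas.map(() => -Math.log(1 - Math.random()));
    const total = gammas.reduce((acc, g) => acc + g, 0);
    let mean = 0;
    for (let i = 0; i < deltas.length; i++) mean += (gammas[i] / total) * deltas[i];
    means.push(mean);
  }
  return means;
}

// Posterior-style summaries read straight off the replicate distribution.
function posteriorSummaries(deltas: number[], rope: { low: number; high: number }) {
  const means = bayesianBootstrapMeans(deltas);
  const prPositive = means.filter((m) => m > 0).length / means.length;
  const prInRope = means.filter((m) => m > rope.low && m < rope.high).length / means.length;
  return { prPositive, prInRope };
}
```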
## MDE

The minimum detectable paired effect at N pairs, two-sided significance α, and
power 1 − β:

$$d_\text{min} = \frac{z_{1-\alpha/2} + z_{1-\beta}}{\sqrt{n}}$$

reported on the standardised scale, then multiplied by the observed
paired-delta SD to get the MDE in score units. Consumers reading a
`needs_more_data` verdict can use the MDE to budget the next round of runs
(the sketch after the example reproduces the arithmetic):

- Observed paired SD = 0.10 score units, paired N = 20, α = 0.05,
  power = 0.8 → d_min ≈ 0.63 standardised → MDE ≈ 0.063 score units. If the
  smallest effect that would change a launch decision is below this, run more
  pairs.

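
A minimal sketch of that computation, with the two z-quantiles hard-coded for
α = 0.05 and power = 0.8 so it stays self-contained; `pairedMde` in
`power-analysis.ts` is the real implementation:

```ts
// Minimum detectable standardised paired effect, then scaled to score units.
const Z_ALPHA = 1.96;   // z_{1 - α/2} for two-sided α = 0.05
const Z_POWER = 0.8416; // z_{1 - β} for power = 0.8

function mdeInScoreUnits(nPairs: number, pairedDeltaSd: number): number {
  const dMin = (Z_ALPHA + Z_POWER) / Math.sqrt(nPairs); // standardised scale
  return dMin * pairedDeltaSd;                          // score units
}

// Reproduces the worked example: mdeInScoreUnits(20, 0.10) ≈ 0.063.
```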
## Provenance

Every report carries:

- `runFingerprint`: SHA-256 over the canonicalised list of
  `(runId, candidateId, splitTag)` triples (sorted by runId), plus the
  comparator id and split. The same `(runs, comparator, split)` produces the
  same fingerprint regardless of input order (see the sketch after this list).
- `preregistrationHash`: the caller passes the hash of a signed
  `HypothesisManifest` (see `pre-registration.ts`). The fingerprint and the
  preregistration hash together let a reader verify both *what data the
  report saw* and *what protocol it was supposed to run.*

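
A sketch of the fingerprint using Node's `crypto` in place of the
`hashJson(canonicalize(...))` helpers from `pre-registration.ts`; the real
canonicalisation may differ, but sorting by `runId` before hashing is what
makes the result order-insensitive:

```ts
import { createHash } from 'node:crypto';

// Order-insensitive fingerprint over what the report actually saw.
function runFingerprint(
  runs: Array<{ runId: string; candidateId: string; splitTag: string }>,
  comparator: string,
  split: 'search' | 'holdout',
): string {
  const triples = runs
    .map((r) => ({ runId: r.runId, candidateId: r.candidateId, splitTag: r.splitTag }))
    .sort((a, b) => a.runId.localeCompare(b.runId)); // input order is irrelevant
  return createHash('sha256')
    .update(JSON.stringify({ triples, comparator, split }))
    .digest('hex');
}
```
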
Reports without a `preregistrationHash` carry a "post-hoc" warning in the
risks list and the executive summary. Treat them as descriptive only.

## Alternatives considered

- **Paired t-test instead of Wilcoxon + bootstrap.** Rejected: agent score
  distributions are heavy-tailed (judges saturate near 0 and 1) and the t
  approximation breaks down with the small N typical of holdouts.
- **Unpaired Mann–Whitney.** Rejected: matched scenarios make pairing free,
  and unpaired tests throw away the variance reduction. Use the paired test
  by default.
- **Sequential / always-valid inference (e-values, mSPRT, alpha-spending).**
  Out of scope for a single-look report. If users iterate, wrap this report
  in an alpha-spending schedule, or commit to one preregistered look.
- **Hierarchical Bayesian shrinkage across many candidates.** Future work.
  The current ranking is on raw paired statistics and over-credits the top
  candidate when many are tested.
- **Calibration / coverage simulation on the bootstrap CI.** Future work; we
  rely on the asymptotic guarantee plus the hard pair floor to keep coverage
  reasonable.

## When NOT to apply

- Paired N below the hard floor (6) on any candidate.
- Comparator chosen by inspecting the data (post-hoc selection inflates
  false-discovery rates beyond the BH guarantee).
- Mid-run distribution shift: judge model swap, rubric change, infrastructure
  outage. Pair exchangeability is violated and the bootstrap is not valid.
- Scenarios drawn non-randomly from a stream the candidate can influence
  (data leaks across runs). The pairing is no longer ignorable.
- Highly skewed cost distributions: the Pareto frontier still works, but the
  marginal CI on cost may be misleading.

## Citations

- Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate:
  a practical and powerful approach to multiple testing. *JRSS B*,
  57(1), 289–300.
- Wilcoxon, F. (1945). Individual comparisons by ranking methods.
  *Biometrics Bulletin*, 1(6), 80–83.
- Efron, B. (1979). Bootstrap methods: another look at the jackknife.
  *Annals of Statistics*, 7(1), 1–26.
- Rubin, D. B. (1981). The Bayesian bootstrap.
  *Annals of Statistics*, 9(1), 130–134.
- Kruschke, J. K. (2018). Rejecting or accepting parameter values in
  Bayesian estimation. *Advances in Methods and Practices in
  Psychological Science*, 1(2), 270–280. (ROPE.)
- Howard, S. R., Ramdas, A., McAuliffe, J., & Sekhon, J. (2021).
  Time-uniform, nonparametric, nonasymptotic confidence sequences.
  *Annals of Statistics*, 49(2), 1055–1080. (Background reading on
  always-valid inference for sequential extensions.)