# researchReport — methodology

This document is the methodological brief for `researchReport` (exported from
`@tangle-network/agent-eval` and `@tangle-network/agent-eval/reporting`). It
exists so a launch reviewer, peer reviewer, or auditor can quickly verify that
the verdict embedded in any rendered report is defensible, reproducible, and
appropriate to the data.

The companion code is `src/summary-report.ts`. Each item below names the
corresponding function or option so the doc and the code don't drift.

## Inputs

- `runs: RunRecord[]` — every record carries `runId`, `candidateId`, `seed`,
  `experimentId`, `splitTag`, and an `outcome` with the configured score.
- `comparator: string` — the candidate id treated as the null reference. Must
  be selected before data inspection; `preregistrationHash` should pin this.
- `split: 'search' | 'holdout'` — defaults to `holdout`. Decisions on `search`
  are descriptive only; promotion calls require the holdout.
- `rope: { low, high }` — Region of Practical Equivalence on the paired delta,
  in score units. Must come from the domain owner — there is no statistically
  defensible default.
- `minPairs` (soft floor, default 20) and `RESEARCH_REPORT_HARD_PAIR_FLOOR`
  (hard floor, 6). Below the soft floor, the verdict is `needs_more_data` and
  the report carries the MDE at the current N.
- `fdr` (default 0.05), `confidence` (default 0.95), `mdePower` (default 0.8),
  `mdeAlpha` (default = `fdr`).

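
A minimal call sketch, assuming `researchReport` accepts these options as a
single object (check `src/summary-report.ts` for the exact option type; the
comparator id, ROPE bounds, and manifest hash below are illustrative):

```ts
import { researchReport } from '@tangle-network/agent-eval/reporting';
import type { RunRecord } from '@tangle-network/agent-eval';

// How the run records and the signed manifest hash reach the caller is out of
// scope here; both are assumed to exist already.
declare const runs: RunRecord[];
declare const manifestHash: string; // hash of the signed HypothesisManifest

const report = researchReport({
  runs,
  comparator: 'baseline-v1',        // pinned before data inspection
  split: 'holdout',                 // promotion calls require the holdout
  rope: { low: -0.02, high: 0.02 }, // from the domain owner, in score units
  minPairs: 20,                     // soft floor; below it -> needs_more_data
  fdr: 0.05,
  confidence: 0.95,
  mdePower: 0.8,
  preregistrationHash: manifestHash,
});
```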
## Pairing

Pairs are joined by `(experimentId, seed)` so the comparator and candidate
share scenario *and* seed. This is the same join `gainHistogram` uses; see
`pairScoresByKey` in `src/summary-report.ts`. Records on the wrong split or
with non-finite scores are dropped before pairing.

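
A sketch of that join, not the real `pairScoresByKey` (the flattened record
shape and the `score` field are assumptions; the doc only specifies the key
fields):

```ts
// Join comparator and candidate scores on (experimentId, seed). Records on the
// wrong split or with non-finite scores are assumed to have been dropped already.
interface ScoredRun {
  experimentId: string;
  seed: number;
  score: number;
}

function pairByExperimentAndSeed(
  comparatorRuns: ScoredRun[],
  candidateRuns: ScoredRun[],
): Array<{ comparator: number; candidate: number; delta: number }> {
  const key = (r: ScoredRun) => `${r.experimentId}::${r.seed}`;
  const baseline = new Map<string, number>();
  for (const r of comparatorRuns) baseline.set(key(r), r.score);

  const pairs: Array<{ comparator: number; candidate: number; delta: number }> = [];
  for (const r of candidateRuns) {
    const base = baseline.get(key(r));
    if (base !== undefined) {
      pairs.push({ comparator: base, candidate: r.score, delta: r.score - base });
    }
  }
  return pairs;
}
```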
## Decision rule

In order — first match wins (a sketch of the cascade follows the list):

1. `comparator` itself → `hold` (baseline).
2. No comparator → `hold` if on the cost/quality Pareto frontier, else
   `needs_more_data`. The verdict is descriptive, not causal.
3. Held-out gate verdict ≠ `promote` → `reject`. The gate is *necessary but
   not sufficient*; even a `promote` gate must clear the paired test below.
4. Paired N < `RESEARCH_REPORT_HARD_PAIR_FLOOR` → `needs_more_data` with a
   "below hard floor" reason. Bootstrap CIs degenerate at this size.
5. ROPE configured AND paired-delta CI ⊂ ROPE → `equivalent`.
6. Paired-delta CI upper bound < 0 → `reject` (the CI excludes any
   non-negative effect). Note: this uses **paired delta only** — not the
   marginal mean.
7. Paired N < `minPairs` (soft floor) → `needs_more_data` with the MDE at
   current N attached so the verdict is actionable.
8. BH-adjusted q ≤ `fdr` AND CI lower bound > 0 → `promote`. The BH q-value
   controls FDR across all candidates in the same sweep; the bootstrap CI
   provides an effect-size guarantee independent of the test.
9. Otherwise → `hold`.

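
The cascade as a TypeScript sketch. The `PairedSummary` shape and its field
names are stand-ins for whatever `summary-report.ts` actually computes; only
the ordering and the short-circuiting are the point:

```ts
type Verdict = 'promote' | 'hold' | 'reject' | 'equivalent' | 'needs_more_data';

// Assumed per-candidate summary of the paired statistics against the comparator.
interface PairedSummary {
  isComparator: boolean;
  hasComparator: boolean;
  onParetoFrontier: boolean;
  gatePromoted: boolean;             // held-out gate verdict === 'promote'
  nPairs: number;
  ci: { low: number; high: number }; // paired bootstrap CI on the median delta
  rope?: { low: number; high: number };
  qValue: number;                    // BH-adjusted
}

const HARD_PAIR_FLOOR = 6;

function verdictFor(s: PairedSummary, minPairs: number, fdr: number): Verdict {
  if (s.isComparator) return 'hold';                                            // 1. baseline
  if (!s.hasComparator) return s.onParetoFrontier ? 'hold' : 'needs_more_data'; // 2. descriptive
  if (!s.gatePromoted) return 'reject';                                         // 3. gate is necessary
  if (s.nPairs < HARD_PAIR_FLOOR) return 'needs_more_data';                     // 4. hard floor
  if (s.rope && s.ci.low >= s.rope.low && s.ci.high <= s.rope.high) {
    return 'equivalent';                                                        // 5. CI inside ROPE
  }
  if (s.ci.high < 0) return 'reject';                                           // 6. CI excludes gains
  if (s.nPairs < minPairs) return 'needs_more_data';                            // 7. soft floor
  if (s.qValue <= fdr && s.ci.low > 0) return 'promote';                        // 8. FDR + effect size
  return 'hold';                                                                // 9. default
}
```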
## Statistical primitives used

| Quantity | Function | Source file |
|---|---|---|
| Marginal CI on score mean | `confidenceInterval` | `statistics.ts` |
| Cohen's d vs comparator | `cohensD` | `statistics.ts` |
| Wilcoxon signed-rank (paired) | `wilcoxonSignedRank` | `statistics.ts` |
| BH-FDR q-values | `benjaminiHochberg` | `power-analysis.ts` |
| Paired bootstrap CI on median delta | `pairedBootstrap` | `paired-stats.ts` |
| Bayesian-bootstrap-style Pr(Δ>0), Pr(Δ∈ROPE) | `bootstrapMeanSamples` | `summary-report.ts` (private) |
| Minimum detectable paired effect | `pairedMde` | `power-analysis.ts` |
| Run fingerprint | `hashJson(canonicalize(...))` | `pre-registration.ts` |

The Pr(Δ>0) and Pr(Δ∈ROPE) summaries use the bootstrap-prior duality of
[Rubin 1981]: under a non-informative Dirichlet prior over the observed
deltas, the Bayesian-bootstrap distribution of a sample statistic is its
posterior, and the ordinary bootstrap approximates it closely. We report the
posterior-style probabilities on the **mean** delta and the bootstrap CI on
the **median** delta: the median is more robust to the heavy-tailed score
distributions seen in agent benchmarks, while the mean lets us read off the
Bayesian-style probability of superiority in a single number.

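
A sketch of those summaries using a literal Bayesian bootstrap on the paired
deltas; the private `bootstrapMeanSamples` may well resample rather than draw
Dirichlet weights, so treat this as the idea, not the implementation:

```ts
// Bayesian bootstrap (Rubin 1981): each replicate draws Dirichlet(1, ..., 1)
// weights over the observed paired deltas and records the weighted mean.
function bayesianBootstrapMeans(deltas: number[], replicates = 4000): number[] {
  const means: number[] = [];
  for (let b = 0; b < replicates; b++) {
    // Dirichlet(1, ..., 1) via normalised Exp(1) draws.
    const gammas = deltas.map(() => -Math.log(1 - Math.random()));
    const total = gammas.reduce((acc, g) => acc + g, 0);
    let mean = 0;
    for (let i = 0; i < deltas.length; i++) mean += (gammas[i] / total) * deltas[i];
    means.push(mean);
  }
  return means;
}

// Posterior-style summaries read straight off the replicate distribution.
function posteriorSummaries(deltas: number[], rope: { low: number; high: number }) {
  const means = bayesianBootstrapMeans(deltas);
  const prPositive = means.filter((m) => m > 0).length / means.length;
  const prInRope = means.filter((m) => m > rope.low && m < rope.high).length / means.length;
  return { prPositive, prInRope };
}
```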
## MDE

The minimum detectable paired effect at N pairs, two-sided significance α, and
power 1 − β:

$$d_\text{min} = \frac{z_{1-\alpha/2} + z_{1-\beta}}{\sqrt{n}}$$

reported on the standardised scale, then multiplied by the observed
paired-delta SD to get the MDE in score units. Consumers reading a
`needs_more_data` verdict can use the MDE to budget the next round of runs
(the sketch after the example reproduces the arithmetic):

- Observed paired SD = 0.10 score units, paired N = 20, α = 0.05,
  power = 0.8 → d_min ≈ 0.63 standardised → MDE ≈ 0.063 score units. If the
  smallest effect that would change a launch decision is below this, run more
  pairs.

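
A minimal sketch of that computation, with the two z-quantiles hard-coded for
α = 0.05 and power = 0.8 so it stays self-contained; `pairedMde` in
`power-analysis.ts` is the real implementation:

```ts
// Minimum detectable standardised paired effect, then scaled to score units.
const Z_ALPHA = 1.96;   // z_{1 - α/2} for two-sided α = 0.05
const Z_POWER = 0.8416; // z_{1 - β} for power = 0.8

function mdeInScoreUnits(nPairs: number, pairedDeltaSd: number): number {
  const dMin = (Z_ALPHA + Z_POWER) / Math.sqrt(nPairs); // standardised scale
  return dMin * pairedDeltaSd;                          // score units
}

// Reproduces the worked example: mdeInScoreUnits(20, 0.10) ≈ 0.063.
```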
## Provenance

Every report carries:

- `runFingerprint`: SHA-256 over the canonicalised list of
  `(runId, candidateId, splitTag)` triples (sorted by runId), plus the
  comparator id and split. The same `(runs, comparator, split)` produces the
  same fingerprint regardless of input order (see the sketch after this list).
- `preregistrationHash`: the caller passes the hash of a signed
  `HypothesisManifest` (see `pre-registration.ts`). The fingerprint and the
  preregistration hash together let a reader verify both *what data the
  report saw* and *what protocol it was supposed to run.*

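
A sketch of the fingerprint using Node's `crypto` in place of the
`hashJson(canonicalize(...))` helpers from `pre-registration.ts`; the real
canonicalisation may differ, but sorting by `runId` before hashing is what
makes the result order-insensitive:

```ts
import { createHash } from 'node:crypto';

// Order-insensitive fingerprint over what the report actually saw.
function runFingerprint(
  runs: Array<{ runId: string; candidateId: string; splitTag: string }>,
  comparator: string,
  split: 'search' | 'holdout',
): string {
  const triples = runs
    .map((r) => ({ runId: r.runId, candidateId: r.candidateId, splitTag: r.splitTag }))
    .sort((a, b) => a.runId.localeCompare(b.runId)); // input order is irrelevant
  return createHash('sha256')
    .update(JSON.stringify({ triples, comparator, split }))
    .digest('hex');
}
```
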
Reports without a `preregistrationHash` carry a "post-hoc" warning in the
risks list and the executive summary. Treat them as descriptive only.

## Alternatives considered

- **Paired t-test instead of Wilcoxon + bootstrap.** Rejected: agent score
  distributions are heavy-tailed (judges saturate near 0 and 1) and the t
  approximation breaks down with the small N typical of holdouts.
- **Unpaired Mann–Whitney.** Rejected: matched scenarios make pairing free,
  and unpaired tests throw away the variance reduction. Use the paired test
  by default.
- **Sequential / always-valid inference (e-values, mSPRT, alpha-spending).**
  Out of scope for a single-look report. If users iterate, wrap this report
  in an alpha-spending schedule, or commit to one preregistered look.
- **Hierarchical Bayesian shrinkage across many candidates.** Future work.
  The current ranking is on raw paired statistics and over-credits the top
  candidate when many are tested.
- **Calibration / coverage simulation on the bootstrap CI.** Future work; we
  rely on the asymptotic guarantee plus the hard pair floor to keep coverage
  reasonable.

## When NOT to apply

- Paired N below the hard floor (6) on any candidate.
- Comparator chosen by inspecting the data (post-hoc selection inflates
  false-discovery rates beyond the BH guarantee).
- Mid-run distribution shift: judge model swap, rubric change, infrastructure
  outage. Pair exchangeability is violated and the bootstrap is not valid.
- Scenarios drawn non-randomly from a stream the candidate can influence
  (data leaks across runs). The pairing is no longer ignorable.
- Highly skewed cost distributions: the Pareto frontier still works, but the
  marginal CI on cost may be misleading.

## Citations

- Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate:
  a practical and powerful approach to multiple testing. *JRSS B*,
  57(1), 289–300.
- Wilcoxon, F. (1945). Individual comparisons by ranking methods.
  *Biometrics Bulletin*, 1(6), 80–83.
- Efron, B. (1979). Bootstrap methods: another look at the jackknife.
  *Annals of Statistics*, 7(1), 1–26.
- Rubin, D. B. (1981). The Bayesian bootstrap.
  *Annals of Statistics*, 9(1), 130–134.
- Kruschke, J. K. (2018). Rejecting or accepting parameter values in
  Bayesian estimation. *Advances in Methods and Practices in
  Psychological Science*, 1(2), 270–280. (ROPE.)
- Howard, S. R., Ramdas, A., McAuliffe, J., & Sekhon, J. (2021).
  Time-uniform, nonparametric, nonasymptotic confidence sequences.
  *Annals of Statistics*, 49(2), 1055–1080. (Background reading on
  always-valid inference for sequential extensions.)