
Commit c8f03bd

feat(reporting): elevate researchReport to launch-review grade (#35)
Upgrades the researchReport that landed in #34 from a senior-applied-DS deliverable to one a launch reviewer or peer reviewer can sign off on. Decisions are now made on paired evidence consistently; the report carries its own methodology, provenance, and actionable next-step inference.

Decision rule (paired-delta-only — never marginal means inside the paired framework):

1. Comparator → hold (baseline)
2. No comparator → hold if on Pareto frontier else needs_more_data
3. Held-out gate ≠ promote → reject (gate is necessary, not sufficient)
4. Paired N < 6 (RESEARCH_REPORT_HARD_PAIR_FLOOR) → needs_more_data
5. ROPE configured AND paired CI ⊂ ROPE → equivalent
6. Paired CI upper < 0 → reject
7. Paired N < minPairs (default 20) → needs_more_data + MDE attached
8. q ≤ fdr AND CI lower > 0 → promote
9. Otherwise → hold

Per-candidate additions:
- Bayesian-bootstrap-style Pr(Δ>0), Pr(Δ∈ROPE) on mean paired delta (Rubin 1981 bootstrap-prior duality)
- Minimum detectable paired effect at the configured power / α via the new pairedMde primitive in power-analysis
- meanGain alongside the existing medianGain

Per-report additions:
- runFingerprint: SHA-256 over the canonicalised input run set
- preregistrationHash: optional, links a signed HypothesisManifest
- methodology: assumptions, methods, alternatives, when-not-to-apply, citations (Benjamini & Hochberg 1995; Wilcoxon 1945; Efron 1979; Rubin 1981; Kruschke 2018) — embedded in markdown and as structured data

Standalone methodology companion at docs/research-report-methodology.md.

Behavioural changes:
- Default minPairs raised from 6 to 20 (soft floor); hard floor held at 6.
- Reject rule no longer mixes paired delta with marginal means.
- Held-out gate ≠ promote now reliably overrides paired stats.
- researchReport is async (Web Crypto via hashJson for the fingerprint).

Adds a CLAUDE.md authorship directive (no AI-attribution trailers).

837/837 tests passing — 11 dedicated researchReport cases including: hard floor below 6 pairs, ROPE → equivalent, gate override, fingerprint determinism + preregistration passthrough, MDE in the needs_more_data reason, plus the existing summaryTable / paretoChart / gainHistogram coverage.
1 parent 214b0f0 commit c8f03bd

8 files changed

Lines changed: 792 additions & 85 deletions


CHANGELOG.md

Lines changed: 30 additions & 2 deletions
@@ -7,8 +7,36 @@
  - `researchReport`, an executive research-report layer for coding-vertical
  benchmark runs. Composes `summaryTable`, `paretoChart`, `gainHistogram`,
  held-out gate decisions, and optional `failureClusterView` output into
- promote/hold/reject guidance, risks, next actions, markdown, HTML, and JSON
- chart specs.
+ promote / hold / equivalent / reject / needs-more-data guidance with
+ rationale, risks, next actions, markdown, HTML, and JSON chart specs.
+ - Decisions are made on paired evidence — never on marginal means alone.
+ - ROPE (Region of Practical Equivalence) supported via the `rope` option;
+ candidates whose paired-delta CI is fully inside the ROPE are returned
+ as `equivalent` rather than `hold`.
+ - Bayesian-bootstrap-style Pr(Δ>0) and Pr(Δ∈ROPE) summaries on the mean
+ paired delta (Rubin 1981 bootstrap-prior duality), reported per
+ candidate alongside the bootstrap CI on the median.
+ - Per-candidate minimum detectable paired effect at the configured power
+ and α via the new `pairedMde` primitive in `power-analysis`, so a
+ `needs_more_data` verdict is actionable.
+ - SHA-256 `runFingerprint` over the canonicalised input run set + an
+ optional `preregistrationHash` field so the report can cite a signed
+ `HypothesisManifest`.
+ - Soft floor `minPairs` (default 20) and a hard floor of 6 pairs
+ (`RESEARCH_REPORT_HARD_PAIR_FLOOR`) below which any paired call returns
+ `needs_more_data` regardless of the option.
+ - Embedded methodology section in the rendered markdown plus a standalone
+ [`docs/research-report-methodology.md`](./docs/research-report-methodology.md)
+ with assumptions, alternatives, when-not-to-apply, and citations
+ (Benjamini & Hochberg 1995; Wilcoxon 1945; Efron 1979; Rubin 1981;
+ Kruschke 2018).
+ - `pairedMde` in `power-analysis`: closed-form minimum detectable paired
+ effect inverse to the paired-t / signed-rank power formula.
+
+ ### Changed
+
+ - `researchReport` is now async (uses Web Crypto via `hashJson` for the run
+ fingerprint).

  ## 0.20.10 — hardening audit follow-up

CLAUDE.md

Lines changed: 4 additions & 0 deletions
@@ -22,3 +22,7 @@ pnpm build # tsup
  pnpm test # vitest
  pnpm typecheck # tsc --noEmit
  ```
+
+ ## Authorship
+
+ Do not add `Co-Authored-By:` trailers (or any other AI-attribution lines) to commits, PR descriptions, or other artifacts in this repo. Author = the human running the session. This applies even when the default Claude Code template suggests it.

docs/research-report-methodology.md

Lines changed: 155 additions & 0 deletions
@@ -0,0 +1,155 @@
# researchReport — methodology

This document is the methodological brief for `researchReport` (exported from
`@tangle-network/agent-eval` and `@tangle-network/agent-eval/reporting`). It
exists so a launch reviewer, peer reviewer, or auditor can quickly verify that
the verdict embedded in any rendered report is defensible, reproducible, and
appropriate to the data.

The companion code is `src/summary-report.ts`. Each item below names the
corresponding function or option so the doc and the code don't drift.

## Inputs

- `runs: RunRecord[]` — every record carries `runId`, `candidateId`, `seed`,
  `experimentId`, `splitTag`, and an `outcome` with the configured score.
- `comparator: string` — the candidate id treated as the null reference. Must
  be selected before data inspection; `preregistrationHash` should pin this.
- `split: 'search' | 'holdout'` — defaults to `holdout`. Decisions on `search`
  are descriptive only; promotion calls require the holdout.
- `rope: { low, high }` — Region of Practical Equivalence on the paired delta,
  in score units. Must come from the domain owner — there is no
  statistically-defensible default.
- `minPairs` (soft floor, default 20) and `RESEARCH_REPORT_HARD_PAIR_FLOOR`
  (hard floor, 6). Below the soft floor, the verdict is `needs_more_data` and
  the report carries the MDE at the current N.
- `fdr` (default 0.05), `confidence` (default 0.95), `mdePower` (default 0.8),
  `mdeAlpha` (default = `fdr`).

## Pairing

Pairs are joined by `(experimentId, seed)` so the comparator and candidate
share scenario *and* seed. This is the same join `gainHistogram` uses; see
`pairScoresByKey` in `src/summary-report.ts`. Records on the wrong split or
with non-finite scores are dropped before pairing.
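
The join described above can be sketched as follows. This is a minimal illustration, not the code in `src/summary-report.ts`: the `RunLike` shape (with a flat `score` field) and the `pairedDeltas` helper are hypothetical names introduced here, and the sketch assumes at most one comparator run per `(experimentId, seed)` key.

```typescript
// Hypothetical, simplified record shape; the real RunRecord carries more fields
// and nests the score under `outcome`.
interface RunLike {
  candidateId: string
  experimentId: string
  seed: number
  splitTag: 'search' | 'holdout'
  score: number
}

/** Join candidate and comparator runs on (experimentId, seed); return candidate minus comparator. */
function pairedDeltas(
  runs: RunLike[],
  candidateId: string,
  comparatorId: string,
  split: 'search' | 'holdout' = 'holdout',
): number[] {
  const key = (r: RunLike): string => `${r.experimentId}::${r.seed}`
  // Drop records on the wrong split or with non-finite scores before pairing.
  const usable = runs.filter((r) => r.splitTag === split && Number.isFinite(r.score))
  // Index comparator scores by key (assumption: one comparator run per key; later runs overwrite).
  const baseline = new Map<string, number>()
  for (const r of usable) {
    if (r.candidateId === comparatorId) baseline.set(key(r), r.score)
  }
  // Keep only candidate runs that have a comparator partner.
  const deltas: number[] = []
  for (const r of usable) {
    if (r.candidateId !== candidateId) continue
    const base = baseline.get(key(r))
    if (base !== undefined) deltas.push(r.score - base)
  }
  return deltas
}

const exampleRuns: RunLike[] = [
  { candidateId: 'base', experimentId: 'e1', seed: 1, splitTag: 'holdout', score: 0.5 },
  { candidateId: 'cand', experimentId: 'e1', seed: 1, splitTag: 'holdout', score: 0.75 },
  { candidateId: 'base', experimentId: 'e2', seed: 1, splitTag: 'holdout', score: 0.25 },
  { candidateId: 'cand', experimentId: 'e2', seed: 1, splitTag: 'search', score: 1.0 }, // wrong split
  { candidateId: 'cand', experimentId: 'e3', seed: 2, splitTag: 'holdout', score: Number.NaN }, // non-finite
]
const exampleDeltas = pairedDeltas(exampleRuns, 'cand', 'base')
```

Only the `e1`/seed 1 pair survives in this example; the other candidate runs are dropped for being on the wrong split or non-finite.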

## Decision rule

In order — first match wins:

1. `comparator` itself → `hold` (baseline).
2. No comparator → `hold` if on the cost/quality Pareto frontier, else
   `needs_more_data`. The verdict is descriptive, not causal.
3. Held-out gate verdict ≠ `promote` → `reject`. The gate is *necessary but
   not sufficient*; even a `promote` gate must clear the paired test below.
4. Paired N < `RESEARCH_REPORT_HARD_PAIR_FLOOR` → `needs_more_data` with a
   "below hard floor" reason. Bootstrap CIs degenerate at this size.
5. ROPE configured AND paired-delta CI ⊂ ROPE → `equivalent`.
6. Paired-delta CI upper bound < 0 → `reject` (CI excludes a non-negative
   effect). Note: this uses **paired delta only** — not the marginal mean.
7. Paired N < `minPairs` (soft floor) → `needs_more_data` with the MDE at
   current N attached so the verdict is actionable.
8. BH-adjusted q ≤ `fdr` AND CI lower bound > 0 → `promote`. The BH q-value
   controls FDR across all candidates in the same sweep; the bootstrap CI
   provides an effect-size guarantee independent of the test.
9. Otherwise → `hold`.
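
The first-match-wins ordering above can be read as a simple cascade. The sketch below is an illustration only: the `Evidence` and `RuleOptions` shapes and the `decide` function are hypothetical names, and it assumes that "CI ⊂ ROPE" means both CI bounds lie inside the ROPE with inclusive endpoints.

```typescript
type Verdict = 'promote' | 'hold' | 'equivalent' | 'reject' | 'needs_more_data'

// Hypothetical per-candidate evidence summary (not the library's types).
interface Evidence {
  isComparator: boolean
  hasComparator: boolean
  onParetoFrontier: boolean
  gateVerdict: string // held-out gate verdict; only 'promote' lets the candidate continue
  pairedN: number
  ciLow: number // lower bound of the paired-delta CI
  ciHigh: number // upper bound of the paired-delta CI
  q: number // BH-adjusted q-value
}

interface RuleOptions {
  minPairs: number
  fdr: number
  rope?: { low: number; high: number }
}

const HARD_PAIR_FLOOR = 6 // mirrors RESEARCH_REPORT_HARD_PAIR_FLOOR

function decide(e: Evidence, opts: RuleOptions): Verdict {
  if (e.isComparator) return 'hold' // 1. baseline
  if (!e.hasComparator) return e.onParetoFrontier ? 'hold' : 'needs_more_data' // 2.
  if (e.gateVerdict !== 'promote') return 'reject' // 3. gate is necessary
  if (e.pairedN < HARD_PAIR_FLOOR) return 'needs_more_data' // 4. hard floor
  if (opts.rope && e.ciLow >= opts.rope.low && e.ciHigh <= opts.rope.high) return 'equivalent' // 5.
  if (e.ciHigh < 0) return 'reject' // 6. paired delta only
  if (e.pairedN < opts.minPairs) return 'needs_more_data' // 7. soft floor
  if (e.q <= opts.fdr && e.ciLow > 0) return 'promote' // 8.
  return 'hold' // 9.
}

const exampleEvidence: Evidence = {
  isComparator: false,
  hasComparator: true,
  onParetoFrontier: false,
  gateVerdict: 'promote',
  pairedN: 30,
  ciLow: 0.02,
  ciHigh: 0.08,
  q: 0.01,
}
const exampleVerdict = decide(exampleEvidence, { minPairs: 20, fdr: 0.05 })
```

Because rule 6 sits above rule 7, a candidate with a clearly negative paired CI is rejected even when it has fewer than `minPairs` pairs, as long as it clears the hard floor.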

## Statistical primitives used

| Quantity | Function | Source file |
|---|---|---|
| Marginal CI on score mean | `confidenceInterval` | `statistics.ts` |
| Cohen's d vs comparator | `cohensD` | `statistics.ts` |
| Wilcoxon signed-rank (paired) | `wilcoxonSignedRank` | `statistics.ts` |
| BH-FDR q-values | `benjaminiHochberg` | `power-analysis.ts` |
| Paired bootstrap CI on median delta | `pairedBootstrap` | `paired-stats.ts` |
| Bayesian-bootstrap-style Pr(Δ>0), Pr(Δ∈ROPE) | `bootstrapMeanSamples` | `summary-report.ts` (private) |
| Minimum detectable paired effect | `pairedMde` | `power-analysis.ts` |
| Run fingerprint | `hashJson(canonicalize(...))` | `pre-registration.ts` |

The Pr(Δ>0) and Pr(Δ∈ROPE) summaries use the bootstrap-prior duality of
[Rubin 1981]: under a non-informative Dirichlet prior, the bootstrap
distribution of a sample statistic is its posterior. We expose these as
posterior summaries on the **mean** delta and the bootstrap CI on the
**median** delta — the median is more robust to the heavy-tailed score
distributions seen in agent benchmarks; the mean lets us read off the
Bayesian-style probability of superiority in a single number.
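
To make the posterior summaries concrete, here is a minimal sketch of a Bayesian bootstrap on the mean paired delta in the sense of Rubin (1981). It is not the private `bootstrapMeanSamples` helper, whose resampling scheme is not shown in this commit and may differ; `bayesianBootstrapSummary`, its options, and the `mulberry32` seeded generator are assumptions made for this example. Each draw weights the deltas with Dirichlet(1, …, 1) weights, obtained by normalising independent Exp(1) variates.

```typescript
/** Small seeded PRNG (mulberry32) so the sketch is reproducible; returns values in [0, 1). */
function mulberry32(seed: number): () => number {
  let a = seed >>> 0
  return () => {
    a = (a + 0x6d2b79f5) >>> 0
    let t = a
    t = Math.imul(t ^ (t >>> 15), t | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

interface PosteriorSummary {
  prPositive: number // share of draws with weighted mean delta > 0
  prInRope?: number // share of draws with weighted mean delta inside the ROPE
}

/** Hypothetical helper: Bayesian-bootstrap summaries of the mean paired delta. */
function bayesianBootstrapSummary(
  deltas: number[],
  opts: { draws?: number; rope?: { low: number; high: number }; seed?: number } = {},
): PosteriorSummary {
  const draws = opts.draws ?? 2000
  const rand = mulberry32(opts.seed ?? 1)
  if (deltas.length === 0) return { prPositive: Number.NaN }
  let positive = 0
  let inRope = 0
  for (let b = 0; b < draws; b++) {
    let wSum = 0
    let wxSum = 0
    for (const d of deltas) {
      const w = -Math.log(1 - rand()) // Exp(1) variate; normalising gives Dirichlet(1,…,1) weights
      wSum += w
      wxSum += w * d
    }
    const mean = wxSum / wSum // one posterior draw of the mean delta
    if (mean > 0) positive++
    if (opts.rope && mean >= opts.rope.low && mean <= opts.rope.high) inRope++
  }
  return {
    prPositive: positive / draws,
    prInRope: opts.rope ? inRope / draws : undefined,
  }
}

const exampleSummary = bayesianBootstrapSummary(
  [0.1, 0.2, 0.05, 0.3, 0.15, 0.25],
  { draws: 500, rope: { low: -1, high: 1 }, seed: 7 },
)
```

With every delta positive, each weighted mean is positive too, so the estimated Pr(Δ>0) is 1; mixed-sign deltas give a value strictly between the extremes that depends on the seed and the number of draws.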

## MDE

The minimum detectable paired effect at N pairs, two-sided α, and power β:

$$d_\text{min} = \frac{z_{1-\alpha/2} + z_\beta}{\sqrt{n}}$$

reported on the standardised scale, then multiplied by the observed
paired-delta SD to get the MDE in score units. Consumers reading a
`needs_more_data` verdict can use the MDE to budget the next round of runs:

- Observed paired SD = 0.10 score units, paired N = 20, α = 0.05, β = 0.8 →
  d_min ≈ 0.63 standardised → MDE ≈ 0.063 score units. If the smallest
  effect that would change a launch decision is below this, run more pairs.
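
The arithmetic in the worked example can be checked with a few lines. This sketch hard-codes the two standard normal quantiles instead of calling the library's `zQuantile`, and the helper names `mdeInScoreUnits` and `pairsForTargetMde` are hypothetical; the second helper simply inverts the same formula to budget pairs for a target effect.

```typescript
// Standard normal quantiles, hard-coded for this example only.
const Z_0975 = 1.959964 // z at 0.975, i.e. two-sided alpha = 0.05
const Z_080 = 0.841621 // z at 0.80, i.e. power = 0.8

/** Standardised d_min and the MDE in score units for a given number of pairs. */
function mdeInScoreUnits(nPaired: number, pairedSd: number): { dMin: number; mde: number } {
  const dMin = (Z_0975 + Z_080) / Math.sqrt(nPaired)
  return { dMin, mde: dMin * pairedSd }
}

/** Inverse view: pairs needed so that the MDE drops to `targetMde` score units. */
function pairsForTargetMde(targetMde: number, pairedSd: number): number {
  const n = (((Z_0975 + Z_080) * pairedSd) / targetMde) ** 2
  return Math.ceil(n)
}

const worked = mdeInScoreUnits(20, 0.1)
console.log(worked.dMin.toFixed(2), worked.mde.toFixed(3)) // prints "0.63 0.063"
console.log(pairsForTargetMde(0.03, 0.1)) // prints 88
```

Under these assumptions, halving the MDE requires roughly four times as many pairs, since the MDE scales with 1/√n.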

## Provenance

Every report carries:

- `runFingerprint`: SHA-256 over the canonicalised list of
  `(runId, candidateId, splitTag)` triples (sorted by runId), plus the
  comparator id and split. Same `(runs, comparator, split)` produces the same
  fingerprint regardless of input order.
- `preregistrationHash`: the caller passes the hash of a signed
  `HypothesisManifest` (see `pre-registration.ts`). The fingerprint and the
  preregistration hash together let a reader verify both *what data the
  report saw* and *what protocol it was supposed to run.*
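
The order-independence of the fingerprint follows from sorting before hashing. The sketch below illustrates the idea only: it uses Node's synchronous `createHash` rather than the Web Crypto path behind `hashJson`, and the `RunKey` shape, the `runFingerprintSketch` name, and the exact JSON layout are assumptions, so its digests will not match the library's.

```typescript
import { createHash } from 'node:crypto'

// Hypothetical minimal shape: only the fields that enter the fingerprint.
interface RunKey {
  runId: string
  candidateId: string
  splitTag: string
}

/** SHA-256 over the sorted (runId, candidateId, splitTag) triples plus comparator and split. */
function runFingerprintSketch(runs: RunKey[], comparator: string, split: string): string {
  const triples = runs
    .map((r) => [r.runId, r.candidateId, r.splitTag])
    // Sort by runId with a plain code-unit comparison so the order is locale-independent.
    .sort((a, b) => (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0))
  // Fixed key order in the literal keeps the serialisation deterministic.
  const canonical = JSON.stringify({ comparator, split, triples })
  return createHash('sha256').update(canonical).digest('hex')
}

const keys: RunKey[] = [
  { runId: 'r2', candidateId: 'cand', splitTag: 'holdout' },
  { runId: 'r1', candidateId: 'base', splitTag: 'holdout' },
]
const fingerprintA = runFingerprintSketch(keys, 'base', 'holdout')
const fingerprintB = runFingerprintSketch([...keys].reverse(), 'base', 'holdout')
```

Here `fingerprintA` and `fingerprintB` are equal because the input order is normalised away, while changing the comparator id or the split changes the digest.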

Reports without a `preregistrationHash` carry a "post-hoc" warning in the
risks list and the executive summary. Treat them as descriptive only.

## Alternatives considered

- **Paired t-test instead of Wilcoxon + bootstrap.** Rejected: agent score
  distributions are heavy-tailed (judges saturate near 0 and 1) and the t
  approximation breaks down with the small N typical of holdouts.
- **Unpaired Mann–Whitney.** Rejected: matched scenarios make pairing free,
  and unpaired tests throw away the variance reduction. Use the paired test
  by default.
- **Sequential / always-valid inference (e-values, mSPRT, alpha-spending).**
  Out of scope for a single-look report. If users iterate, wrap this report
  in an alpha-spending schedule, or commit to one preregistered look.
- **Hierarchical Bayesian shrinkage across many candidates.** Future work.
  The current ranking is on raw paired statistics and over-credits the top
  candidate when many are tested.
- **Calibration / coverage simulation on the bootstrap CI.** Future work; we
  rely on the asymptotic guarantee plus the hard pair floor to keep coverage
  reasonable.

## When NOT to apply

- Paired N below the hard floor (6) on any candidate.
- Comparator chosen by inspecting the data (post-hoc selection inflates
  false-discovery rates beyond the BH guarantee).
- Mid-run distribution shift: judge model swap, rubric change, infrastructure
  outage. Pair exchangeability is violated and the bootstrap is not valid.
- Scenarios drawn non-randomly from a stream the candidate can influence
  (data-leak across runs). The pairing is no longer ignorable.
- Highly skewed cost distributions: the Pareto frontier still works but the
  marginal CI on cost may be misleading.

## Citations

- Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate:
  a practical and powerful approach to multiple testing. *JRSS B*,
  57(1), 289–300.
- Wilcoxon, F. (1945). Individual comparisons by ranking methods.
  *Biometrics Bulletin*, 1(6), 80–83.
- Efron, B. (1979). Bootstrap methods: another look at the jackknife.
  *Annals of Statistics*, 7(1), 1–26.
- Rubin, D. B. (1981). The Bayesian bootstrap.
  *Annals of Statistics*, 9(1), 130–134.
- Kruschke, J. K. (2018). Rejecting or accepting parameter values in
  Bayesian estimation. *Advances in Methods and Practices in
  Psychological Science*, 1(2), 270–280. (ROPE.)
- Howard, S. R., Ramdas, A., McAuliffe, J., Sekhon, J. (2021).
  Time-uniform, nonparametric, nonasymptotic confidence sequences.
  *Annals of Statistics*, 49(2), 1055–1080. (Background reading on
  always-valid inference for sequential extensions.)

src/index.ts

Lines changed: 2 additions & 0 deletions
@@ -736,6 +736,7 @@ export {
    paretoChart,
    gainHistogram,
    researchReport,
+   RESEARCH_REPORT_HARD_PAIR_FLOOR,
  } from './summary-report'
  export type {
    SummaryTable,
@@ -750,6 +751,7 @@ export type {
    ResearchReportOptions,
    ResearchReportCandidate,
    ResearchReportDecision,
+   ResearchReportMethodology,
    ResearchReportRecommendation,
  } from './summary-report'


src/power-analysis.ts

Lines changed: 22 additions & 0 deletions
@@ -33,6 +33,28 @@ export function requiredSampleSize(opts: { effect: number; alpha?: number; power
    return Math.ceil(n)
  }

+ /**
+  * Minimum detectable paired effect (in standardised units) given a target
+  * paired sample size. Closed-form inverse of the paired-t / signed-rank power
+  * formula under the normal approximation:
+  *
+  *   d_min = (z_{1-α/2} + z_β) / sqrt(n_paired)
+  *
+  * Multiply by `sd(deltas)` to convert to score units. Treat as approximate:
+  * the Wilcoxon signed-rank test has asymptotic relative efficiency of about
+  * 0.955 against the t-test under normality (and can exceed 1 on heavy tails),
+  * so the achievable MDE depends on the shape of the delta distribution.
+  */
+ export function pairedMde(opts: { nPaired: number; alpha?: number; power?: number; twoSided?: boolean }): number {
+   if (!Number.isFinite(opts.nPaired) || opts.nPaired <= 0) return Infinity
+   const alpha = opts.alpha ?? 0.05
+   const power = opts.power ?? 0.8
+   const twoSided = opts.twoSided ?? true
+   const zAlpha = zQuantile(twoSided ? 1 - alpha / 2 : 1 - alpha)
+   const zBeta = zQuantile(power)
+   return (zAlpha + zBeta) / Math.sqrt(opts.nPaired)
+ }
+
  /** Bonferroni adjustment: multiply every p-value by the number of tests, clamp at 1. */
  export function bonferroni(pValues: number[], alpha = 0.05): { adjusted: number[]; significant: boolean[] } {
    const k = pValues.length

src/reporting.ts

Lines changed: 2 additions & 0 deletions
@@ -24,6 +24,7 @@ export {
    researchReport,
    summaryTable,
  } from './summary-report'
+ export { RESEARCH_REPORT_HARD_PAIR_FLOOR } from './summary-report'
  export type {
    GainDistributionBin,
    GainDistributionFigureSpec,
@@ -33,6 +34,7 @@ export type {
    ResearchReport,
    ResearchReportCandidate,
    ResearchReportDecision,
+   ResearchReportMethodology,
    ResearchReportOptions,
    ResearchReportRecommendation,
    SummaryTable,
