Commit 9ea36c3

docs: add deterministic multi-family admissibility benchmark documentation
1 parent 764aaa4 commit 9ea36c3

1 file changed

Lines changed: 98 additions & 0 deletions

@@ -0,0 +1,98 @@
# Deterministic Multi-Family Admissibility Benchmark

## Purpose

The deterministic multi-family admissibility benchmark tracks operational admissibility degradation across fixture families registered in the manifest.

Each manifest-registered fixture family contributes one deterministic degradation curve using the same standard levels, so contributors can compare progression behavior across families without changing scoring rules or artifact shape.

## Pipeline

```mermaid
flowchart LR
A[fixtures/manifest.json]
B["DegradationCurveGenerator.fixtures_for_manifest_family(...)"]
C[AdmissibilityScorer]
D[artifacts/multi_family_admissibility_results.json]
E[Reproducibility and progression tests]

A --> B --> C --> D --> E
```

### Pipeline notes

1. `fixtures/manifest.json` is the source of truth for which fixture families participate.
2. `DegradationCurveGenerator.fixtures_for_manifest_family(...)` resolves the fixtures for each family from its manifest registration.
3. `AdmissibilityScorer` computes exact admissibility component outcomes for each level.
4. Results are written to `artifacts/multi_family_admissibility_results.json` in a stable, deterministic JSON layout.
5. Reproducibility and progression tests validate that the committed artifact matches regenerated output and that each family's curve degrades in the expected order.
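
The steps above can be sketched end to end as follows. This is a minimal illustration and not the repository's implementation: the manifest shape, the `fixtures_for_family` and `score` helpers, the artifact layout, and the placeholder component outcomes are all assumptions standing in for `DegradationCurveGenerator` and `AdmissibilityScorer`.

```python
import json
from fractions import Fraction

LEVELS = ("baseline", "mild", "moderate", "severe")

# Hypothetical manifest shape; the real fixtures/manifest.json schema may differ.
MANIFEST = {
    "families": {
        "coding_workflow_pr_review": {},
        "incident_response_page_triage": {},
    }
}


def fixtures_for_family(manifest, family):
    """Stand-in for DegradationCurveGenerator.fixtures_for_manifest_family(...)."""
    if family not in manifest["families"]:
        raise KeyError(f"family not registered in manifest: {family}")
    # Placeholder outcomes (not benchmark data): level k fails k of 3 checks.
    return {
        level: {"components": [i < 3 - k for i in range(3)]}
        for k, level in enumerate(LEVELS)
    }


def score(fixture):
    """Stand-in for AdmissibilityScorer: exact fraction of passing components."""
    components = fixture["components"]
    return Fraction(sum(components), len(components))


def build_artifact(manifest):
    """Score every registered family at every level, in a fixed order."""
    families = []
    for family in sorted(manifest["families"]):
        fixtures = fixtures_for_family(manifest, family)
        levels = [{"level": lv, "score": str(score(fixtures[lv]))} for lv in LEVELS]
        families.append({"family": family, "levels": levels})
    return {"families": families}


# Lists keep family and level order explicit; no timestamps are included.
artifact_text = json.dumps(build_artifact(MANIFEST), indent=2) + "\n"
```

Only manifest-registered families are scored, and both the family order and the level order are fixed in code rather than left to filesystem or dictionary iteration order.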
29+
30+
## Current fixture families
31+
32+
The current multi-family benchmark includes these manifest-registered families:
33+
34+
- `coding_workflow_pr_review`
35+
- `incident_response_page_triage`

## Standard degradation levels

Every included family is evaluated at exactly four standard levels in explicit order:

1. `baseline`
2. `mild`
3. `moderate`
4. `severe`

## Determinism guarantees

The benchmark is designed to remain deterministic across local runs and CI runs:

- manifest-driven family selection
- explicit level order (`baseline`, `mild`, `moderate`, `severe`)
- exact rational score aggregation
- stable JSON output structure and ordering
- no timestamps or environment-dependent fields
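
Two of these guarantees, exact rational aggregation and stable JSON output, can be illustrated with Python's standard library. This is a sketch under assumptions: the `aggregate` and `serialize` helpers and the use of `sort_keys` are hypothetical choices, and the actual scorer and writer may work differently.

```python
import json
from fractions import Fraction

# Binary floats carry rounding error, so float aggregates are not exact.
assert 0.1 + 0.2 != 0.3
# Exact rationals have no rounding error.
assert Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10)


def aggregate(component_scores):
    """Return the mean of exact component scores as a Fraction."""
    scores = list(component_scores)
    return sum(scores, Fraction(0)) / len(scores)


def serialize(results):
    """Render results with sorted keys and a fixed layout."""
    return json.dumps(results, indent=2, sort_keys=True) + "\n"


mean = aggregate([Fraction(1), Fraction(1, 2), Fraction(0)])
payload = {"score": str(mean), "level": "mild"}
```

Storing the score as a string such as `"1/2"` keeps the exact value in JSON, which has no rational number type. With `sort_keys=True`, the serialized text does not depend on the order in which keys were inserted.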

## Regeneration commands

Use either command to regenerate the deterministic multi-family artifact:

```bash
python scripts/generate_multi_family_admissibility_artifact.py
```

```bash
npm run generate:multi-family-admissibility
```
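
Because the artifact is deterministic, regenerating it should leave the committed file unchanged. A minimal sketch of such a comparison is shown below; the `artifact_is_current` helper is hypothetical and is not taken from the repository's tests.

```python
from pathlib import Path


def artifact_is_current(committed_path, regenerated_text):
    """Return True when the committed file is byte-identical to regenerated output."""
    committed = Path(committed_path).read_bytes()
    return committed == regenerated_text.encode("utf-8")
```

Comparing bytes rather than parsed JSON also catches changes in key order, indentation, and trailing newlines.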

## Validation commands

Run the targeted protections plus the repository-wide check entrypoint:

```bash
pytest tests/test_multi_family_admissibility_artifact.py -q
pytest tests/test_artifact_reproducibility.py -q
pytest tests/test_manifest_fixture_families.py -q
npm run check
```

## Regression protections

The benchmark is protected by deterministic regression checks that enforce:

- committed artifact must match regenerated output
- every family must expose all four standard levels
- baseline and severe behavior is explicitly checked
- mild and moderate behavior must be distinct
- degradation must be progressive:
  - `baseline > mild >= moderate > severe`
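
The level-coverage and progression rules can be expressed as a small check. This is a sketch assuming each family's curve is available as a mapping from level name to an exact score; the `check_progression` name and the input shape are hypothetical. The separate requirement that mild and moderate be distinct is not modeled here, since the ordering itself allows their scores to tie.

```python
from fractions import Fraction

LEVELS = ("baseline", "mild", "moderate", "severe")


def check_progression(scores):
    """Raise ValueError unless all levels exist and baseline > mild >= moderate > severe."""
    missing = [level for level in LEVELS if level not in scores]
    if missing:
        raise ValueError(f"missing levels: {missing}")
    baseline, mild, moderate, severe = (scores[level] for level in LEVELS)
    if not (baseline > mild >= moderate > severe):
        raise ValueError(f"non-progressive curve: {scores}")


# Placeholder curve for illustration only, not benchmark output.
check_progression({
    "baseline": Fraction(1),
    "mild": Fraction(2, 3),
    "moderate": Fraction(1, 3),
    "severe": Fraction(0),
})
```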

## Non-goals

This benchmark intentionally excludes:

- LLM judging
- embeddings
- fuzzy semantic similarity
- runtime orchestration
- deployment/showcase dependencies
