| 1 | +# Deterministic Multi-Family Admissibility Benchmark |
| 2 | + |
| 3 | +## Purpose |
| 4 | + |
| 5 | +The deterministic multi-family admissibility benchmark tracks operational admissibility degradation across fixture families registered in the manifest. |
| 6 | + |
| 7 | +Each manifest-registered fixture family contributes one deterministic degradation curve using the same standard levels, so contributors can compare progression behavior across families without changing scoring rules or artifact shape. |
| 8 | + |
| 9 | +## Pipeline |
| 10 | + |
| 11 | +```mermaid |
| 12 | +flowchart LR |
| 13 | + A[fixtures/manifest.json] |
| 14 | + B["DegradationCurveGenerator.fixtures_for_manifest_family(...)"] |
| 15 | + C[AdmissibilityScorer] |
| 16 | + D[artifacts/multi_family_admissibility_results.json] |
| 17 | + E[Reproducibility and progression tests] |
| 18 | +
|
| 19 | + A --> B --> C --> D --> E |
| 20 | +``` |
| 21 | + |
| 22 | +### Pipeline notes |
| 23 | + |
| 24 | +1. `fixtures/manifest.json` is the source of truth for which fixture families participate. |
| 25 | +2. `DegradationCurveGenerator.fixtures_for_manifest_family(...)` resolves fixtures for each family from manifest registration. |
| 26 | +3. `AdmissibilityScorer` computes exact admissibility component outcomes for each level. |
| 27 | +4. Results are written to `artifacts/multi_family_admissibility_results.json` in a stable deterministic JSON layout. |
| 28 | +5. Reproducibility and progression tests verify that the committed artifact matches regenerated output and that scores degrade in the expected order. |
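The steps above can be sketched as a loop over families and levels. This is a minimal illustration, not the repository's implementation: the `families` manifest key and the `resolve_fixture` and `score` callables are hypothetical stand-ins for the manifest schema, `DegradationCurveGenerator`, and `AdmissibilityScorer`, whose real signatures may differ.

```python
import json

# Fixed level order, taken from the benchmark's standard levels.
LEVELS = ("baseline", "mild", "moderate", "severe")


def run_pipeline(manifest, resolve_fixture, score):
    """Score every manifest family at every standard level.

    `resolve_fixture(family, level)` and `score(fixture)` are hypothetical
    stand-ins for the generator and the scorer.
    """
    results = {}
    # Sorting the family names keeps the output order independent of
    # how the manifest happens to list them ("families" is an assumed key).
    for family in sorted(manifest["families"]):
        results[family] = {
            level: score(resolve_fixture(family, level)) for level in LEVELS
        }
    return results


if __name__ == "__main__":
    manifest = {
        "families": ["incident_response_page_triage", "coding_workflow_pr_review"]
    }
    # Toy scores for illustration only; they are not benchmark results.
    toy_scores = {"baseline": "1", "mild": "3/4", "moderate": "1/2", "severe": "0"}
    results = run_pipeline(
        manifest,
        resolve_fixture=lambda family, level: (family, level),
        score=lambda fixture: toy_scores[fixture[1]],
    )
    print(json.dumps(results, indent=2))
```

Passing the generator and scorer in as callables keeps the sketch independent of the real classes, so it only illustrates the data flow from manifest to results.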
| 29 | + |
| 30 | +## Current fixture families |
| 31 | + |
| 32 | +The current multi-family benchmark includes these manifest-registered families: |
| 33 | + |
| 34 | +- `coding_workflow_pr_review` |
| 35 | +- `incident_response_page_triage` |
| 36 | + |
| 37 | +## Standard degradation levels |
| 38 | + |
| 39 | +Every included family is evaluated at exactly four standard levels in explicit order: |
| 40 | + |
| 41 | +1. `baseline` |
| 42 | +2. `mild` |
| 43 | +3. `moderate` |
| 44 | +4. `severe` |
| 45 | + |
| 46 | +## Determinism guarantees |
| 47 | + |
| 48 | +The benchmark is designed to remain deterministic across local runs and CI runs: |
| 49 | + |
| 50 | +- manifest-driven family selection |
| 51 | +- explicit level order (`baseline`, `mild`, `moderate`, `severe`) |
| 52 | +- exact rational score aggregation |
| 53 | +- stable JSON output structure and ordering |
| 54 | +- no timestamps or environment-dependent fields |
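Two of these guarantees, exact rational aggregation and stable JSON output, can be illustrated with the Python standard library. The sketch below assumes scores are averaged and stored as strings; the function names and the payload shape are hypothetical and may not match the real scorer or artifact schema.

```python
import json
from fractions import Fraction


def aggregate(component_scores):
    """Average component scores exactly, with no floating-point rounding."""
    total = sum(component_scores, Fraction(0))
    return total / len(component_scores)


def to_stable_json(results):
    """Serialize with sorted keys and fixed formatting so repeated runs
    produce byte-identical text."""
    return json.dumps(results, indent=2, sort_keys=True) + "\n"


if __name__ == "__main__":
    score = aggregate([Fraction(1, 10), Fraction(2, 10)])
    # Levels are stored as a list so their explicit order survives
    # key sorting; the field names here are assumptions.
    payload = {
        "example_family": [
            {"level": "baseline", "score": "1"},
            {"level": "mild", "score": str(score)},
        ]
    }
    print(to_stable_json(payload))
```

Using `Fraction` avoids cases such as `0.1 + 0.2 != 0.3` in floating point, which could otherwise make results differ between platforms or library versions.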
| 55 | + |
| 56 | +## Regeneration commands |
| 57 | + |
| 58 | +Use either command to regenerate the deterministic multi-family artifact: |
| 59 | + |
| 60 | +```bash |
| 61 | +python scripts/generate_multi_family_admissibility_artifact.py |
| 62 | +``` |
| 63 | + |
| 64 | +```bash |
| 65 | +npm run generate:multi-family-admissibility |
| 66 | +``` |
| 67 | + |
| 68 | +## Validation commands |
| 69 | + |
| 70 | +Run the targeted tests plus the repository-wide check: |
| 71 | + |
| 72 | +```bash |
| 73 | +pytest tests/test_multi_family_admissibility_artifact.py -q |
| 74 | +pytest tests/test_artifact_reproducibility.py -q |
| 75 | +pytest tests/test_manifest_fixture_families.py -q |
| 76 | +npm run check |
| 77 | +``` |
| 78 | + |
| 79 | +## Regression protections |
| 80 | + |
| 81 | +The benchmark is protected by deterministic regression checks that enforce: |
| 82 | + |
| 83 | +- the committed artifact matches regenerated output |
| 84 | +- every family exposes all four standard levels |
| 85 | +- baseline and severe behavior is checked explicitly |
| 86 | +- mild and moderate behavior is distinct |
| 87 | +- degradation is progressive: |
| 88 | + - `baseline > mild >= moderate > severe` |
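The level-coverage and progression rules above could be checked along these lines. This is a sketch under the assumption that each family's scores are available as a mapping from level name to a comparable value such as `Fraction`; `check_family` is a hypothetical helper, not a function from the test suite.

```python
from fractions import Fraction

LEVELS = ("baseline", "mild", "moderate", "severe")


def check_family(scores):
    """Raise ValueError unless all four levels exist and scores satisfy
    baseline > mild >= moderate > severe."""
    missing = [level for level in LEVELS if level not in scores]
    if missing:
        raise ValueError(f"missing levels: {missing}")
    baseline, mild, moderate, severe = (scores[level] for level in LEVELS)
    if not (baseline > mild >= moderate > severe):
        raise ValueError("degradation is not progressive")


if __name__ == "__main__":
    # Toy values for illustration only; they are not benchmark results.
    check_family(
        {
            "baseline": Fraction(1),
            "mild": Fraction(3, 4),
            "moderate": Fraction(1, 2),
            "severe": Fraction(0),
        }
    )
    print("ok")
```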
| 89 | + |
| 90 | +## Non-goals |
| 91 | + |
| 92 | +This benchmark intentionally excludes: |
| 93 | + |
| 94 | +- LLM judging |
| 95 | +- embeddings |
| 96 | +- fuzzy semantic similarity |
| 97 | +- runtime orchestration |
| 98 | +- deployment/showcase dependencies |