docs: publish sanitized benchmark reproducibility pack

DeveshParagiri · DeveshParagiri · commit 5bc736dd6221 · 2026-02-13T02:37:52.000-05:00
diff --git a/docs/benchmark/MANIFEST.sha256 b/docs/benchmark/MANIFEST.sha256
@@ -0,0 +1,5 @@
+e17df98f0186eaead80d126d4e972b80ab6c79542a2bdc2856300f2fb0820b21  docs/benchmark/artifacts/mini-ready12-seed42-p3-20260212-235715/blocked_studies.txt
+0dd9f126da13b5abf552d5ad5cb929451f02dd4d32b9cbd33482eea0f09422f2  docs/benchmark/artifacts/mini-ready12-seed42-p3-20260212-235715/leakage-readiness.md
+92e81d23ce31880e6bf3ed8ddfc2284d7ae8d3d928d3fd350a03a3e539de2a8a  docs/benchmark/artifacts/mini-ready12-seed42-p3-20260212-235715/mini-ready12-seed42-p3-20260212-235715.groundtruth-12table.md
+21e73dbf9724fcfffdc1d9b62b6e92d9bc2e6f37f3dd8b0e5c8f8db2027de997  docs/benchmark/artifacts/mini-ready12-seed42-p3-20260212-235715/mini-ready12-seed42-p3-20260212-235715.groundtruth-analysis.md
+cb3def4b49327daface4486b278ec760b5a0f8347f991e399219cc19cfb8ebf5  docs/benchmark/artifacts/mini-ready12-seed42-p3-20260212-235715/mini-ready12-seed42-p3-20260212-235715.summary.txt
diff --git a/docs/benchmark/README.md b/docs/benchmark/README.md
@@ -0,0 +1,43 @@
+# Benchmark Reproducibility Pack
+
+This folder publishes a **sanitized, frozen benchmark pack** for Extropy validation without exposing private `extropy-ds` study internals.
+
+## Frozen Run
+
+- Run tag: `mini-ready12-seed42-p3-20260212-235715`
+- Scope: 12-study mini benchmark
+- Provider/model profile: Azure OpenAI with `gpt-5-mini` (pivotal + routine)
+- Fixture profile: mini (`N=120` agents per study), seed `42`, max timesteps `12`
+
+## Included Artifacts
+
+- `artifacts/mini-ready12-seed42-p3-20260212-235715/mini-ready12-seed42-p3-20260212-235715.groundtruth-12table.md`
+  - Final 12-row scored table (study, metric, target, prediction, error, status, mapping type).
+- `artifacts/mini-ready12-seed42-p3-20260212-235715/mini-ready12-seed42-p3-20260212-235715.groundtruth-analysis.md`
+  - Scored subset analysis, coverage diagnostics, miss triage.
+- `artifacts/mini-ready12-seed42-p3-20260212-235715/mini-ready12-seed42-p3-20260212-235715.summary.txt`
+  - Per-study summary snapshot for the frozen run (top outcome, top share, timesteps, reasoning calls, stop reason).
+- `artifacts/mini-ready12-seed42-p3-20260212-235715/blocked_studies.txt`
+  - Studies excluded from this benchmark due to leakage risk.
+- `artifacts/mini-ready12-seed42-p3-20260212-235715/leakage-readiness.md`
+  - Readiness and leakage triage matrix used for inclusion/exclusion.
+
+## Mapping and Scoring Policy
+
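+- **Direct mapping**: the simulation outcome aligns 1:1 with the reported real-world metric.
+- **Proxy mapping** (labeled `assumption` in the scored table): a predefined, documented conversion aligns the simulation output with the form in which the external figure is published.
+- Pass/fail is scored against each study's predeclared target band or rule. Error is the distance in percentage points from the prediction to the nearest band boundary, and is 0.0pp when the prediction falls inside the band.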
+
+## Verification
+
+Validate artifact integrity from the repo root, so that the manifest's relative paths resolve:
+
+```bash
+shasum -a 256 -c docs/benchmark/MANIFEST.sha256
+```
+
+## What Is Not Included
+
+- Private study configs and raw generated configs from `extropy-ds`
+- API keys, private prompts, and private run logs
+- Any restricted source material not suitable for public release
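Where `shasum` is unavailable, the manifest check described in the README's Verification section can be approximated in Python. The sketch below is not part of this commit. It assumes the `<hex digest>  <path>` layout seen in `MANIFEST.sha256`, with paths relative to the repo root, and `verify_manifest` is a hypothetical helper name.

```python
import hashlib
from pathlib import Path


def verify_manifest(manifest_path, root="."):
    """Return {relative_path: bool} for each entry in a SHA-256 manifest.

    Assumes one `<hex digest>  <path>` entry per line, with paths
    relative to `root` (the repo root for MANIFEST.sha256).
    """
    results = {}
    for raw_line in Path(manifest_path).read_text().splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        expected, rel_path = line.split(maxsplit=1)
        rel_path = rel_path.lstrip("*")  # tolerate a binary-mode marker
        target = Path(root) / rel_path
        if not target.is_file():
            results[rel_path] = False  # a missing file counts as a failure
            continue
        actual = hashlib.sha256(target.read_bytes()).hexdigest()
        results[rel_path] = actual == expected.lower()
    return results
```

Calling `verify_manifest("docs/benchmark/MANIFEST.sha256")` from the repo root would return one entry per artifact, and any `False` value signals a mismatch or a missing file.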
diff --git a/docs/benchmark/artifacts/mini-ready12-seed42-p3-20260212-235715/blocked_studies.txt b/docs/benchmark/artifacts/mini-ready12-seed42-p3-20260212-235715/blocked_studies.txt
@@ -0,0 +1,2 @@
+ma-mobile-sports-betting-launch-2023
+ny-mobile-sports-betting-launch-2022
diff --git a/docs/benchmark/artifacts/mini-ready12-seed42-p3-20260212-235715/leakage-readiness.md b/docs/benchmark/artifacts/mini-ready12-seed42-p3-20260212-235715/leakage-readiness.md
@@ -0,0 +1,16 @@
+| Study | Config | Files | Validate(pop/scn) | Ground truth doc | Leakage triage | Status |
+|---|---|---|---|---|---|---|
+| apple-att-privacy | 00-baseline | ok | PASS/PASS | yes (ground-truth.md,GROUND-TRUTH.md; registry=-) | clean | READY |
+| bud-light-boycott | 00-baseline | ok | PASS/PASS | yes (ground-truth.md,GROUND-TRUTH.md; registry=-) | clean | READY |
+| london-ulez-expansion-2023 | 00-baseline | ok | PASS/PASS | yes (ground-truth.md,GROUND-TRUTH.md,README.md; registry=True) | clean | READY |
+| ma-mobile-sports-betting-launch-2023 | 00-baseline | ok | PASS/PASS | yes (ground-truth.md,GROUND-TRUTH.md,README.md; registry=True) | flagged | BLOCKED |
+| netflix-ad-tier-launch | 00-baseline | ok | PASS/PASS | yes (ground-truth.md,GROUND-TRUTH.md; registry=-) | clean | READY |
+| netflix-password-sharing | 00-baseline | ok | PASS/PASS | yes (ground-truth.md,GROUND-TRUTH.md,README.md; registry=False) | clean | READY |
+| ny-mobile-sports-betting-launch-2022 | 00-baseline | ok | PASS/PASS | yes (ground-truth.md,GROUND-TRUTH.md,README.md; registry=True) | flagged | BLOCKED |
+| nyc-congestion-pricing | 00-baseline | ok | PASS/PASS | yes (ground-truth.md,GROUND-TRUTH.md; registry=-) | clean | READY |
+| plant-based-meat | 00-baseline | ok | PASS/PASS | yes (ground-truth.md,GROUND-TRUTH.md; registry=-) | clean | READY |
+| reddit-api-protest | 00-baseline | ok | PASS/PASS | yes (ground-truth.md,GROUND-TRUTH.md; registry=-) | clean | READY |
+| snapchat-plus-launch | 00-baseline | ok | PASS/PASS | yes (ground-truth.md,GROUND-TRUTH.md; registry=-) | clean | READY |
+| spotify-price-hike | 00-baseline | ok | PASS/PASS | yes (ground-truth.md,GROUND-TRUTH.md; registry=-) | clean | READY |
+| threads-launch | 00-baseline | ok | PASS/PASS | yes (ground-truth.md,GROUND-TRUTH.md; registry=-) | clean | READY |
+| x-premium-adoption | 01-revised-options | ok | PASS/PASS | yes (ground-truth.md,GROUND-TRUTH.md,README.md; registry=False) | clean | READY |
diff --git a/docs/benchmark/artifacts/mini-ready12-seed42-p3-20260212-235715/mini-ready12-seed42-p3-20260212-235715.groundtruth-12table.md b/docs/benchmark/artifacts/mini-ready12-seed42-p3-20260212-235715/mini-ready12-seed42-p3-20260212-235715.groundtruth-12table.md
@@ -0,0 +1,14 @@
+| Study | Metric | Target | Pred | Error | Status | Mapping |
+|---|---|---:|---:|---:|---|---|
+| apple-att-privacy | deny_tracking | 75-80% | 76.7% | 0.0pp | PASS | direct |
+| bud-light-boycott | maintain_bud_light | 80-90% (~85%) | 85.8% | 0.0pp | PASS | direct |
+| netflix-password-sharing | maintain_relationship (comply) | >80% | 94.2% | 0.0pp | PASS | direct |
+| spotify-price-hike | continue_same_plan | 95-98% | 95.8% | 0.0pp | PASS | direct |
+| plant-based-meat | regular_adoption_mapped | 4-8% | 5.8% | 0.0pp | PASS | direct |
+| threads-launch | active_use_mapped | 10-15% | 24.2% | 9.2pp | MISS | direct |
+| x-premium-adoption | subscribe_to_premium | 0.5-1.5% | 0.8% | 0.0pp | PASS | direct |
+| nyc-congestion-pricing | behavior_change_mapped | 15-20% | 48.3% | 28.3pp | MISS | direct |
+| london-ulez-expansion-2023 | compliance_proxy (=1-pay_daily_charge) | 95-96% (TfL) | 99.2% | 3.2pp | MISS | assumption |
+| netflix-ad-tier-launch | switch_to_ad_tier (short-horizon proxy) | 10-30% (proxy band) | 18.3% | 0.0pp | PASS | assumption |
+| reddit-api-protest | protest_participation_proxy | 40-70% (proxy band) | 57.5% | 0.0pp | PASS | assumption |
+| snapchat-plus-launch | subscribe_immediately (short-horizon proxy) | 2-8% (proxy band) | 5.0% | 0.0pp | PASS | assumption |
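The Error column in the table above is consistent with a distance-to-boundary rule: 0.0pp inside the target band, otherwise the gap to the nearest bound. The following is a minimal sketch of that rule, not the project's scoring code. The function name `error_to_boundary` is hypothetical, and treating the bounds as inclusive is an assumption.

```python
def error_to_boundary(pred, lower=None, upper=None):
    """Distance in percentage points from `pred` to the violated bound.

    Returns 0.0 when `pred` lies inside the band. `None` leaves the band
    open on that side (a `>80%` target has no upper bound).
    Bounds are treated as inclusive, which is an assumption.
    """
    if lower is not None and pred < lower:
        return round(lower - pred, 1)
    if upper is not None and pred > upper:
        return round(pred - upper, 1)
    return 0.0


# (study, prediction, lower bound, upper bound) copied from the scored table.
rows = [
    ("threads-launch", 24.2, 10.0, 15.0),
    ("nyc-congestion-pricing", 48.3, 15.0, 20.0),
    ("london-ulez-expansion-2023", 99.2, 95.0, 96.0),
    ("apple-att-privacy", 76.7, 75.0, 80.0),
    ("netflix-password-sharing", 94.2, 80.0, None),
]

for study, pred, lower, upper in rows:
    err = error_to_boundary(pred, lower, upper)
    status = "PASS" if err == 0.0 else "MISS"
    print(f"{study}: {err:.1f}pp {status}")
```

For the three misses this reproduces the table's 9.2pp, 28.3pp, and 3.2pp, each measured from the upper bound of the band.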
diff --git a/docs/benchmark/artifacts/mini-ready12-seed42-p3-20260212-235715/mini-ready12-seed42-p3-20260212-235715.groundtruth-analysis.md b/docs/benchmark/artifacts/mini-ready12-seed42-p3-20260212-235715/mini-ready12-seed42-p3-20260212-235715.groundtruth-analysis.md
@@ -0,0 +1,61 @@
+# Mini Bench Ground-Truth Analysis
+
+Run tag: `mini-ready12-seed42-p3-20260212-235715`
+Execution mode: 12 studies, parallel scheduler 3-at-a-time, Azure OpenAI, `gpt-5-mini` for pivotal + routine, per-study RPM 250.
+
+## Scored Metrics (Numeric Targets)
+
+| Study | Metric | Target | Pred | Error | Status |
+|---|---|---:|---:|---:|---|
+| apple-att-privacy | deny_tracking | 75-80% | 76.7% | 0.0pp | PASS |
+| bud-light-boycott | maintain_bud_light | ~85% (80-90%) | 85.8% | 0.0pp | PASS |
+| netflix-password-sharing | maintain_relationship (comply) | >80% | 94.2% | 0.0pp | PASS |
+| spotify-price-hike | continue_same_plan | 95-98% | 95.8% | 0.0pp | PASS |
+| plant-based-meat | regular_adoption_mapped | 4-8% | 5.8% | 0.0pp | PASS |
+| threads-launch | active_use_mapped | 10-15% | 24.2% | 9.2pp | MISS |
+| x-premium-adoption | subscribe_to_premium | 0.5-1.5% | 0.8% | 0.0pp | PASS |
+| nyc-congestion-pricing | behavior_change_mapped | 15-20% | 48.3% | 28.3pp | MISS |
+
+Scored pass rate: **6/8**
+
+## Directional Studies (No Single Clean Numeric Target)
+
+| Study | Key outputs |
+|---|---|
+| london-ulez-expansion-2023 | reduce_zone_trips 58.3%, switch_to_compliant_vehicle 35.8%, shift_to_alternative_transport 4.2%, pay_daily_charge 0.8% |
+| netflix-ad-tier-launch | maintain_current_plan 76.7%, switch_to_ad_tier 18.3%, reduce_or_cancel 5.0% |
+| reddit-api-protest | actively_protest 40.8%, continue_unaffected 40.0%, migrate_to_alternative 13.3%, reduce_engagement 3.3% |
+| snapchat-plus-launch | continue_free_satisfied 93.3%, subscribe_immediately 5.0% |
+
+## Coverage / Reliability Diagnostics
+
+| Study | Timesteps | Primary coverage |
+|---|---:|---:|
+| apple-att-privacy | 8 | 90.0% |
+| bud-light-boycott | 10 | 98.3% |
+| london-ulez-expansion-2023 | 6 | 99.2% |
+| netflix-ad-tier-launch | 6 | 100.0% |
+| netflix-password-sharing | 11 | 100.0% |
+| nyc-congestion-pricing | 6 | 99.2% |
+| plant-based-meat | 7 | 93.3% |
+| reddit-api-protest | 12 | 97.5% |
+| snapchat-plus-launch | 10 | 98.3% |
+| spotify-price-hike | 10 | 100.0% |
+| threads-launch | 10 | 92.5% |
+| x-premium-adoption | 12 | 97.5% |
+
+Interpretation:
+- Misses on `threads` and `nyc` are mostly behavioral (coverage is high enough that denominator alone does not explain 9.2pp and 28.3pp misses).
+- Lower coverage studies (`apple`, `plant`, `threads`) should still be stabilized in 5-seed reporting, but only `threads` is currently an accuracy miss.
+
+## Miss Triage
+
+1. `nyc-congestion-pricing` (largest miss): model over-shifts behavior away from driving.
+2. `threads-launch`: model over-predicts sustained active use and under-predicts trial-and-abandon.
+
+## Recommendation Before 5-Seed
+
+Proceed with 5-seed on all 12 (to separate noise from signal), but treat these as priority watch studies:
+1. `nyc-congestion-pricing`
+2. `threads-launch`
+
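The scored pass rate of 6/8 in the analysis covers only the eight direct-mapping studies. As a cross-check that is not part of this commit, the sketch below tallies the statuses copied from the 12-row scored table, first for the direct rows and then for all twelve. `pass_rate` is a hypothetical helper name.

```python
# (study, status, mapping) copied from the 12-row scored table.
rows = [
    ("apple-att-privacy", "PASS", "direct"),
    ("bud-light-boycott", "PASS", "direct"),
    ("netflix-password-sharing", "PASS", "direct"),
    ("spotify-price-hike", "PASS", "direct"),
    ("plant-based-meat", "PASS", "direct"),
    ("threads-launch", "MISS", "direct"),
    ("x-premium-adoption", "PASS", "direct"),
    ("nyc-congestion-pricing", "MISS", "direct"),
    ("london-ulez-expansion-2023", "MISS", "assumption"),
    ("netflix-ad-tier-launch", "PASS", "assumption"),
    ("reddit-api-protest", "PASS", "assumption"),
    ("snapchat-plus-launch", "PASS", "assumption"),
]


def pass_rate(entries):
    """Return (number of PASS rows, total rows)."""
    passed = sum(1 for _, status, _ in entries if status == "PASS")
    return passed, len(entries)


direct = [row for row in rows if row[2] == "direct"]
print(pass_rate(direct))  # (6, 8): the scored subset
print(pass_rate(rows))    # (9, 12): including the proxy-band rows
```

The proxy-band rows contribute three passes and one miss (`london-ulez-expansion-2023`), so the two figures should be read separately rather than merged into a single headline rate.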
diff --git a/docs/benchmark/artifacts/mini-ready12-seed42-p3-20260212-235715/mini-ready12-seed42-p3-20260212-235715.summary.txt b/docs/benchmark/artifacts/mini-ready12-seed42-p3-20260212-235715/mini-ready12-seed42-p3-20260212-235715.summary.txt
@@ -0,0 +1,14 @@
+study                      | top_outcome                                                             | top_share | timesteps | reasoning_calls | stop_reason
+---------------------------+-------------------------------------------------------------------------+-----------+-----------+-----------------+------------
+apple-att-privacy          | tracking_decision:deny_tracking                                         | 0.767     | 8         | n/a             | n/a        
+bud-light-boycott          | beer_purchase_behavior:maintain_bud_light                               | 0.858     | 10        | n/a             | n/a        
+london-ulez-expansion-2023 | ulez_response_strategy:reduce_zone_trips                                | 0.583     | 6         | n/a             | n/a        
+netflix-ad-tier-launch     | subscription_response:maintain_current_plan                             | 0.767     | 6         | n/a             | n/a        
+netflix-password-sharing   | password_sharing_response:remove_shared_access                          | 0.558     | 11        | n/a             | n/a        
+nyc-congestion-pricing     | commute_response:continue_driving_and_pay_toll                          | 0.475     | 6         | n/a             | n/a        
+plant-based-meat           | plant_based_purchase_strategy:avoid_and_continue_with_conventional_meat | 0.858     | 7         | n/a             | n/a        
+reddit-api-protest         | reddit_response_action:actively_protest                                 | 0.408     | 12        | n/a             | n/a        
+snapchat-plus-launch       | subscription_response:continue_free_satisfied                           | 0.933     | 10        | n/a             | n/a        
+spotify-price-hike         | subscription_response:accept_price_increase_continue_same_plan          | 0.958     | 10        | n/a             | n/a        
+threads-launch             | threads_adoption_action:sign_up_but_consume_only                        | 0.342     | 10        | n/a             | n/a        
+x-premium-adoption         | premium_verification_response:continue_using_without_premium            | 0.625     | 12        | n/a             | n/a        
