Commit 5bc736d

docs: publish sanitized benchmark reproducibility pack
1 parent 018d62b commit 5bc736d

7 files changed

Lines changed: 155 additions & 0 deletions

docs/benchmark/MANIFEST.sha256

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
e17df98f0186eaead80d126d4e972b80ab6c79542a2bdc2856300f2fb0820b21 docs/benchmark/artifacts/mini-ready12-seed42-p3-20260212-235715/blocked_studies.txt
0dd9f126da13b5abf552d5ad5cb929451f02dd4d32b9cbd33482eea0f09422f2 docs/benchmark/artifacts/mini-ready12-seed42-p3-20260212-235715/leakage-readiness.md
92e81d23ce31880e6bf3ed8ddfc2284d7ae8d3d928d3fd350a03a3e539de2a8a docs/benchmark/artifacts/mini-ready12-seed42-p3-20260212-235715/mini-ready12-seed42-p3-20260212-235715.groundtruth-12table.md
21e73dbf9724fcfffdc1d9b62b6e92d9bc2e6f37f3dd8b0e5c8f8db2027de997 docs/benchmark/artifacts/mini-ready12-seed42-p3-20260212-235715/mini-ready12-seed42-p3-20260212-235715.groundtruth-analysis.md
cb3def4b49327daface4486b278ec760b5a0f8347f991e399219cc19cfb8ebf5 docs/benchmark/artifacts/mini-ready12-seed42-p3-20260212-235715/mini-ready12-seed42-p3-20260212-235715.summary.txt

docs/benchmark/README.md

Lines changed: 43 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,43 @@
# Benchmark Reproducibility Pack

This folder publishes a **sanitized, frozen benchmark pack** for Extropy validation without exposing private `extropy-ds` study internals.

## Frozen Run

- Run tag: `mini-ready12-seed42-p3-20260212-235715`
- Scope: 12-study mini benchmark
- Provider/model profile: Azure OpenAI with `gpt-5-mini` (pivotal + routine)
- Fixture profile: mini (`N=120` agents per study), seed `42`, max timesteps `12`

## Included Artifacts

- `artifacts/mini-ready12-seed42-p3-20260212-235715/mini-ready12-seed42-p3-20260212-235715.groundtruth-12table.md`
  - Final 12-row scored table (study, target, prediction, error, status, mapping type).
- `artifacts/mini-ready12-seed42-p3-20260212-235715/mini-ready12-seed42-p3-20260212-235715.groundtruth-analysis.md`
  - Scored subset analysis, coverage diagnostics, miss triage.
- `artifacts/mini-ready12-seed42-p3-20260212-235715/mini-ready12-seed42-p3-20260212-235715.summary.txt`
  - Summary snapshot for the frozen run.
- `artifacts/mini-ready12-seed42-p3-20260212-235715/blocked_studies.txt`
  - Studies excluded from this benchmark due to leakage risk.
- `artifacts/mini-ready12-seed42-p3-20260212-235715/leakage-readiness.md`
  - Readiness and leakage triage matrix used for inclusion/exclusion.

## Mapping and Scoring Policy

- **Direct mapping**: the simulation outcome aligns 1:1 with the reported real-world metric.
- **Proxy mapping**: a predefined, documented conversion aligns the simulation output with the way the external figure is published. These rows are labeled `assumption` in the scored table.
- Pass/fail is scored against each study's predeclared target band or rule. Error is the distance in percentage points (pp) from the prediction to the nearest band boundary, so a prediction inside the band scores 0.0pp.

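The scoring rule above can be illustrated with a minimal sketch. The function names `error_to_boundary` and `score` are hypothetical and are not taken from the Extropy codebase; the sketch assumes a target band is given as an optional lower and upper bound in percent, with an open-ended rule such as `>80%` expressed by leaving one bound unset.

```python
from typing import Optional, Tuple


def error_to_boundary(pred: float,
                      low: Optional[float] = None,
                      high: Optional[float] = None) -> float:
    """Distance in percentage points from pred to the violated band edge.

    Returns 0.0 when pred lies inside the band (hypothetical helper).
    """
    if low is not None and pred < low:
        return low - pred
    if high is not None and pred > high:
        return pred - high
    return 0.0


def score(pred: float,
          low: Optional[float] = None,
          high: Optional[float] = None) -> Tuple[str, float]:
    """Return (status, error rounded to one decimal) for one study."""
    err = error_to_boundary(pred, low, high)
    # Status is decided on the unrounded error so a tiny miss is not hidden.
    return ("PASS" if err == 0.0 else "MISS", round(err, 1))


# Values taken from the scored table in this pack.
print(score(76.7, 75.0, 80.0))   # apple-att-privacy
print(score(24.2, 10.0, 15.0))   # threads-launch
print(score(94.2, low=80.0))     # netflix-password-sharing (>80% rule)
```

With the published numbers, `threads-launch` comes out as a 9.2pp miss and the two in-band studies score 0.0pp, which matches the table.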
## Verification

Validate artifact integrity from repo root:

```bash
shasum -a 256 -c docs/benchmark/MANIFEST.sha256
```

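Where `shasum` is not available, the same check can be approximated in Python with the standard library. This is a sketch under stated assumptions, not part of the pack: `sha256_of` and `verify_manifest` are hypothetical helper names, and each manifest line is assumed to hold a hex digest followed by whitespace and a path relative to the repo root, as in `MANIFEST.sha256`.

```python
import hashlib
from pathlib import Path
from typing import Dict


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest: Path, root: Path) -> Dict[str, bool]:
    """Map each listed path to True if its digest matches the manifest."""
    results: Dict[str, bool] = {}
    for raw in manifest.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line:
            continue
        expected, rel_path = line.split(maxsplit=1)
        # shasum may prefix the path with '*' to mark binary mode.
        rel_path = rel_path.lstrip("*")
        target = root / rel_path
        results[rel_path] = (
            target.is_file() and sha256_of(target) == expected.lower()
        )
    return results
```

From the repo root, `verify_manifest(Path("docs/benchmark/MANIFEST.sha256"), Path("."))` would return one entry per artifact; any `False` value means the file is missing or has changed since the manifest was written.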
## What Is Not Included

- Private study configs and raw generated configs from `extropy-ds`
- API keys, private prompts, and private run logs
- Any restricted source material not suitable for public release

docs/benchmark/artifacts/mini-ready12-seed42-p3-20260212-235715/blocked_studies.txt

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
ma-mobile-sports-betting-launch-2023
ny-mobile-sports-betting-launch-2022

docs/benchmark/artifacts/mini-ready12-seed42-p3-20260212-235715/leakage-readiness.md

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
| Study | Config | Files | Validate(pop/scn) | Ground truth doc | Leakage triage | Status |
|---|---|---|---|---|---|---|
| apple-att-privacy | 00-baseline | ok | PASS/PASS | yes (ground-truth.md,GROUND-TRUTH.md; registry=-) | clean | READY |
| bud-light-boycott | 00-baseline | ok | PASS/PASS | yes (ground-truth.md,GROUND-TRUTH.md; registry=-) | clean | READY |
| london-ulez-expansion-2023 | 00-baseline | ok | PASS/PASS | yes (ground-truth.md,GROUND-TRUTH.md,README.md; registry=True) | clean | READY |
| ma-mobile-sports-betting-launch-2023 | 00-baseline | ok | PASS/PASS | yes (ground-truth.md,GROUND-TRUTH.md,README.md; registry=True) | flagged | BLOCKED |
| netflix-ad-tier-launch | 00-baseline | ok | PASS/PASS | yes (ground-truth.md,GROUND-TRUTH.md; registry=-) | clean | READY |
| netflix-password-sharing | 00-baseline | ok | PASS/PASS | yes (ground-truth.md,GROUND-TRUTH.md,README.md; registry=False) | clean | READY |
| ny-mobile-sports-betting-launch-2022 | 00-baseline | ok | PASS/PASS | yes (ground-truth.md,GROUND-TRUTH.md,README.md; registry=True) | flagged | BLOCKED |
| nyc-congestion-pricing | 00-baseline | ok | PASS/PASS | yes (ground-truth.md,GROUND-TRUTH.md; registry=-) | clean | READY |
| plant-based-meat | 00-baseline | ok | PASS/PASS | yes (ground-truth.md,GROUND-TRUTH.md; registry=-) | clean | READY |
| reddit-api-protest | 00-baseline | ok | PASS/PASS | yes (ground-truth.md,GROUND-TRUTH.md; registry=-) | clean | READY |
| snapchat-plus-launch | 00-baseline | ok | PASS/PASS | yes (ground-truth.md,GROUND-TRUTH.md; registry=-) | clean | READY |
| spotify-price-hike | 00-baseline | ok | PASS/PASS | yes (ground-truth.md,GROUND-TRUTH.md; registry=-) | clean | READY |
| threads-launch | 00-baseline | ok | PASS/PASS | yes (ground-truth.md,GROUND-TRUTH.md; registry=-) | clean | READY |
| x-premium-adoption | 01-revised-options | ok | PASS/PASS | yes (ground-truth.md,GROUND-TRUTH.md,README.md; registry=False) | clean | READY |

docs/benchmark/artifacts/mini-ready12-seed42-p3-20260212-235715/mini-ready12-seed42-p3-20260212-235715.groundtruth-12table.md

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
| Study | Metric | Target | Pred | Error | Status | Mapping |
|---|---|---:|---:|---:|---|---|
| apple-att-privacy | deny_tracking | 75-80% | 76.7% | 0.0pp | PASS | direct |
| bud-light-boycott | maintain_bud_light | 80-90% (~85%) | 85.8% | 0.0pp | PASS | direct |
| netflix-password-sharing | maintain_relationship (comply) | >80% | 94.2% | 0.0pp | PASS | direct |
| spotify-price-hike | continue_same_plan | 95-98% | 95.8% | 0.0pp | PASS | direct |
| plant-based-meat | regular_adoption_mapped | 4-8% | 5.8% | 0.0pp | PASS | direct |
| threads-launch | active_use_mapped | 10-15% | 24.2% | 9.2pp | MISS | direct |
| x-premium-adoption | subscribe_to_premium | 0.5-1.5% | 0.8% | 0.0pp | PASS | direct |
| nyc-congestion-pricing | behavior_change_mapped | 15-20% | 48.3% | 28.3pp | MISS | direct |
| london-ulez-expansion-2023 | compliance_proxy (=1-pay_daily_charge) | 95-96% (TfL) | 99.2% | 3.2pp | MISS | assumption |
| netflix-ad-tier-launch | switch_to_ad_tier (short-horizon proxy) | 10-30% (proxy band) | 18.3% | 0.0pp | PASS | assumption |
| reddit-api-protest | protest_participation_proxy | 40-70% (proxy band) | 57.5% | 0.0pp | PASS | assumption |
| snapchat-plus-launch | subscribe_immediately (short-horizon proxy) | 2-8% (proxy band) | 5.0% | 0.0pp | PASS | assumption |

docs/benchmark/artifacts/mini-ready12-seed42-p3-20260212-235715/mini-ready12-seed42-p3-20260212-235715.groundtruth-analysis.md

Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
# Mini Bench Ground-Truth Analysis

Run tag: `mini-ready12-seed42-p3-20260212-235715`
Execution mode: 12 studies, parallel scheduler 3-at-a-time, Azure OpenAI, `gpt-5-mini` for pivotal + routine, per-study RPM 250.

## Scored Metrics (Numeric Targets)

| Study | Metric | Target | Pred | Error | Status |
|---|---|---:|---:|---:|---|
| apple-att-privacy | deny_tracking | 75-80% | 76.7% | 0.0pp | PASS |
| bud-light-boycott | maintain_bud_light | ~85% (80-90%) | 85.8% | 0.0pp | PASS |
| netflix-password-sharing | maintain_relationship (comply) | >80% | 94.2% | 0.0pp | PASS |
| spotify-price-hike | continue_same_plan | 95-98% | 95.8% | 0.0pp | PASS |
| plant-based-meat | regular_adoption_mapped | 4-8% | 5.8% | 0.0pp | PASS |
| threads-launch | active_use_mapped | 10-15% | 24.2% | 9.2pp | MISS |
| x-premium-adoption | subscribe_to_premium | 0.5-1.5% | 0.8% | 0.0pp | PASS |
| nyc-congestion-pricing | behavior_change_mapped | 15-20% | 48.3% | 28.3pp | MISS |

Scored pass rate: **6/8**

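The pass rate can be recomputed directly from the scored table. The sketch below is illustrative only: `parse_rows` is a hypothetical helper, and it assumes a simple markdown table in which no cell contains a `|` character, which holds for the rows above.

```python
SCORED_TABLE = """\
| Study | Metric | Target | Pred | Error | Status |
|---|---|---:|---:|---:|---|
| apple-att-privacy | deny_tracking | 75-80% | 76.7% | 0.0pp | PASS |
| bud-light-boycott | maintain_bud_light | ~85% (80-90%) | 85.8% | 0.0pp | PASS |
| netflix-password-sharing | maintain_relationship (comply) | >80% | 94.2% | 0.0pp | PASS |
| spotify-price-hike | continue_same_plan | 95-98% | 95.8% | 0.0pp | PASS |
| plant-based-meat | regular_adoption_mapped | 4-8% | 5.8% | 0.0pp | PASS |
| threads-launch | active_use_mapped | 10-15% | 24.2% | 9.2pp | MISS |
| x-premium-adoption | subscribe_to_premium | 0.5-1.5% | 0.8% | 0.0pp | PASS |
| nyc-congestion-pricing | behavior_change_mapped | 15-20% | 48.3% | 28.3pp | MISS |
"""


def parse_rows(table: str) -> list:
    """Turn a simple markdown table into a list of dicts keyed by header."""
    lines = [ln.strip() for ln in table.strip().splitlines()]
    header = [cell.strip() for cell in lines[0].strip("|").split("|")]
    rows = []
    for line in lines[2:]:  # skip the header and the alignment row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


rows = parse_rows(SCORED_TABLE)
passed = sum(row["Status"] == "PASS" for row in rows)
# Misses ordered by error size, largest first.
misses = sorted(
    (row for row in rows if row["Status"] == "MISS"),
    key=lambda row: float(row["Error"][:-2]),  # drop the trailing "pp"
    reverse=True,
)
print(f"Scored pass rate: {passed}/{len(rows)}")
print([row["Study"] for row in misses])
```

Sorting the misses by error also reproduces the ordering used later in the miss triage, with `nyc-congestion-pricing` first.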
## Directional Studies (No Single Clean Numeric Target)

| Study | Key outputs |
|---|---|
| london-ulez-expansion-2023 | reduce_zone_trips 58.3%, switch_to_compliant_vehicle 35.8%, shift_to_alternative_transport 4.2%, pay_daily_charge 0.8% |
| netflix-ad-tier-launch | maintain_current_plan 76.7%, switch_to_ad_tier 18.3%, reduce_or_cancel 5.0% |
| reddit-api-protest | actively_protest 40.8%, continue_unaffected 40.0%, migrate_to_alternative 13.3%, reduce_engagement 3.3% |
| snapchat-plus-launch | continue_free_satisfied 93.3%, subscribe_immediately 5.0% |

## Coverage / Reliability Diagnostics

| Study | Timesteps | Primary coverage |
|---|---:|---:|
| apple-att-privacy | 8 | 90.0% |
| bud-light-boycott | 10 | 98.3% |
| london-ulez-expansion-2023 | 6 | 99.2% |
| netflix-ad-tier-launch | 6 | 100.0% |
| netflix-password-sharing | 11 | 100.0% |
| nyc-congestion-pricing | 6 | 99.2% |
| plant-based-meat | 7 | 93.3% |
| reddit-api-protest | 12 | 97.5% |
| snapchat-plus-launch | 10 | 98.3% |
| spotify-price-hike | 10 | 100.0% |
| threads-launch | 10 | 92.5% |
| x-premium-adoption | 12 | 97.5% |

Interpretation:
- The misses on `threads-launch` and `nyc-congestion-pricing` are mostly behavioral: primary coverage is 92.5% and 99.2%, so the denominator alone cannot explain the 9.2pp and 28.3pp errors.
- The lower-coverage studies (`apple-att-privacy` at 90.0%, `plant-based-meat` at 93.3%, `threads-launch` at 92.5%) should still be stabilized in 5-seed reporting, but only `threads-launch` is currently an accuracy miss.

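The denominator argument can be checked with a rough bound. The analysis does not state how `Pred` is normalized, so the sketch below rests on an explicit assumption: the prediction is a share of the agents counted in primary coverage, and every uncovered agent is reassigned in the most favourable way for the target band. The function names are hypothetical.

```python
def share_bounds(pred_pct: float, coverage_pct: float) -> tuple:
    """Bounds on the share among all agents, in percent.

    Assumes pred_pct is measured over covered agents only. The lower bound
    treats every uncovered agent as not choosing the outcome, the upper
    bound treats every uncovered agent as choosing it.
    """
    covered = coverage_pct / 100.0
    low = pred_pct * covered
    high = low + (1.0 - covered) * 100.0
    return low, high


def residual_miss(pred_pct: float, coverage_pct: float,
                  band_high_pct: float) -> float:
    """Overshoot above the band that remains in the most favourable case."""
    low, _ = share_bounds(pred_pct, coverage_pct)
    return max(0.0, low - band_high_pct)


# Published values: prediction, primary coverage, upper edge of target band.
for study, pred, cov, band_high in [
    ("threads-launch", 24.2, 92.5, 15.0),
    ("nyc-congestion-pricing", 48.3, 99.2, 20.0),
]:
    print(study, round(residual_miss(pred, cov, band_high), 1))
```

Under this assumption both studies stay well above their bands (roughly 7.4pp and 27.9pp of overshoot remain), which is consistent with reading the misses as behavioral rather than as a coverage artifact.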
## Miss Triage

1. `nyc-congestion-pricing` (largest miss): the model over-shifts behavior away from driving, predicting 48.3% behavior change against a 15-20% target.
2. `threads-launch`: the model over-predicts sustained active use and under-predicts trial-and-abandon.

## Recommendation Before 5-Seed

Proceed with the 5-seed run on all 12 studies (to separate noise from signal), but treat these two as priority watch studies:
1. `nyc-congestion-pricing`
2. `threads-launch`


docs/benchmark/artifacts/mini-ready12-seed42-p3-20260212-235715/mini-ready12-seed42-p3-20260212-235715.summary.txt

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
study | top_outcome | top_share | timesteps | reasoning_calls | stop_reason
---------------------------+-------------------------------------------------------------------------+-----------+-----------+-----------------+------------
apple-att-privacy | tracking_decision:deny_tracking | 0.767 | 8 | n/a | n/a
bud-light-boycott | beer_purchase_behavior:maintain_bud_light | 0.858 | 10 | n/a | n/a
london-ulez-expansion-2023 | ulez_response_strategy:reduce_zone_trips | 0.583 | 6 | n/a | n/a
netflix-ad-tier-launch | subscription_response:maintain_current_plan | 0.767 | 6 | n/a | n/a
netflix-password-sharing | password_sharing_response:remove_shared_access | 0.558 | 11 | n/a | n/a
nyc-congestion-pricing | commute_response:continue_driving_and_pay_toll | 0.475 | 6 | n/a | n/a
plant-based-meat | plant_based_purchase_strategy:avoid_and_continue_with_conventional_meat | 0.858 | 7 | n/a | n/a
reddit-api-protest | reddit_response_action:actively_protest | 0.408 | 12 | n/a | n/a
snapchat-plus-launch | subscription_response:continue_free_satisfied | 0.933 | 10 | n/a | n/a
spotify-price-hike | subscription_response:accept_price_increase_continue_same_plan | 0.958 | 10 | n/a | n/a
threads-launch | threads_adoption_action:sign_up_but_consume_only | 0.342 | 10 | n/a | n/a
x-premium-adoption | premium_verification_response:continue_using_without_premium | 0.625 | 12 | n/a | n/a
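For downstream checks, the summary snapshot can be loaded with a few lines of Python. This is a sketch, not a documented format: `parse_summary` is a hypothetical helper, the file is assumed to be pipe-delimited with one ruler line of `-` and `+`, and the split of `top_outcome` into `attribute` and `option` at the first colon is an interpretation of the values shown above.

```python
SAMPLE = """\
study | top_outcome | top_share | timesteps | reasoning_calls | stop_reason
---------------------------+---------------------------------+-----------+-----------+-----------------+------------
apple-att-privacy | tracking_decision:deny_tracking | 0.767 | 8 | n/a | n/a
threads-launch | threads_adoption_action:sign_up_but_consume_only | 0.342 | 10 | n/a | n/a
"""


def parse_summary(text: str) -> list:
    """Parse the pipe-delimited summary into a list of typed records."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = [name.strip() for name in lines[0].split("|")]
    records = []
    for line in lines[1:]:
        if set(line.strip()) <= set("-+"):
            continue  # ruler line between header and data
        cells = [cell.strip() for cell in line.split("|")]
        record = dict(zip(header, cells))
        # Assumed convention: "<attribute>:<option>" in top_outcome.
        attribute, _, option = record["top_outcome"].partition(":")
        record["attribute"] = attribute
        record["option"] = option
        record["top_share"] = float(record["top_share"])
        record["timesteps"] = int(record["timesteps"])
        records.append(record)
    return records


for rec in parse_summary(SAMPLE):
    print(rec["study"], rec["option"], f"{rec['top_share'] * 100:.1f}%")
```

Converting `top_share` to a percentage makes it easy to cross-check against the scored table where the scored metric is also the top outcome, for example 0.767 against the 76.7% reported for `apple-att-privacy`.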
