| 1 | +# Baselines (Tracked) |
| 2 | + |
| 3 | +This note tracks baseline evidence currently available for the frozen mini benchmark tag `mini-ready12-seed42-p3-20260212-235715`. |
| 4 | + |
| 5 | +## 1) Direct LLM Baseline (Measured) |
| 6 | + |
| 7 | +Source artifact (private run): `extropy-ds/minibench/direct-dual12-20260211-184557.json`. |
| 8 | + |
| 9 | +Setup: |
| 10 | +- Model/provider: `gpt-5-mini` on `azure_openai` |
| 11 | +- Sample size: `n=12` agents per study |
| 12 | +- Prompting mode used for baseline comparison: `current` (single-shot direct response) |
| 13 | + |
| 14 | +| Study | Direct LLM baseline pred | Ground-truth target | Extropy pred (frozen run) | Direct LLM status | Extropy status | |
| 15 | +|---|---:|---:|---:|---|---| |
| 16 | +| apple-att-privacy | 58.3% deny_tracking | 75-80% | 76.7% | MISS | PASS | |
| 17 | +| bud-light-boycott | 41.7% maintain_bud_light | 80-90% (~85%) | 85.8% | MISS | PASS | |
| 18 | +| netflix-password-sharing | 83.3% maintain_relationship (comply) | >80% | 94.2% | PASS | PASS | |
| 19 | +| x-premium-adoption | 41.7% subscribe_to_premium | 0.5-1.5% | 0.8% | MISS | PASS | |
| 20 | + |
| 21 | +Interpretation: |
| 22 | +- Direct LLM baseline is currently measured on **4 studies**, not all 12. |
| 23 | +- In this measured subset, Extropy passes all 4 studies, while the direct LLM passes 1 (`netflix-password-sharing`) and misses 3. |
| 24 | +- This baseline should be treated as **preliminary** due to small `n=12` and partial study coverage. |
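
The PASS/MISS statuses in the table above can be reproduced with a small range check. This is a minimal sketch, not the project's scoring code: the note does not define its pass rule, so the sketch assumes a prediction passes when it falls inside the ground-truth target range (bounds inclusive, with `>80%` read as 80 to 100). The names `status`, `summarize`, `ROWS`, and `RESULTS` are hypothetical; the numbers are copied from the table.

```python
# Hypothetical helper: the pass rule is an assumption, not taken from the
# benchmark code. A prediction is treated as PASS when it lies inside the
# target range, with both bounds inclusive.
def status(pred_pct, low, high):
    """Return 'PASS' if pred_pct is within [low, high], else 'MISS'."""
    return "PASS" if low <= pred_pct <= high else "MISS"


# (study, direct LLM pred %, Extropy pred %, target low %, target high %)
ROWS = [
    ("apple-att-privacy", 58.3, 76.7, 75.0, 80.0),
    ("bud-light-boycott", 41.7, 85.8, 80.0, 90.0),
    ("netflix-password-sharing", 83.3, 94.2, 80.0, 100.0),  # target ">80%"
    ("x-premium-adoption", 41.7, 0.8, 0.5, 1.5),
]


def summarize(rows):
    """Map each study to its (direct LLM status, Extropy status) pair."""
    return {
        name: (status(direct, low, high), status(extropy, low, high))
        for name, direct, extropy, low, high in rows
    }


RESULTS = summarize(ROWS)
for name, (direct, extropy) in RESULTS.items():
    print(f"{name}: direct={direct} extropy={extropy}")
```

Under that assumed rule, the sketch yields the same status pairs as the table; a stricter or looser tolerance would need to be stated alongside any public claim.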
| 25 | + |
| 26 | +## 2) Survey Baseline (Availability) |
| 27 | + |
| 28 | +Survey-style baseline context exists in many `ground-truth.md` files, but quality varies by study. The table below tracks whether a usable survey anchor is present. |
| 29 | + |
| 30 | +| Study | Survey baseline availability | Notes | |
| 31 | +|---|---|---| |
| 32 | +| apple-att-privacy | YES | Explicit survey/industry opt-in expectations present | |
| 33 | +| bud-light-boycott | YES | Stated boycott intent and polling context present | |
| 34 | +| netflix-password-sharing | YES | Borrower intent polling context present | |
| 35 | +| spotify-price-hike | YES | Stated cancellation-intent survey ranges present | |
| 36 | +| plant-based-meat | YES | Stated willingness/try rates present | |
| 37 | +| threads-launch | YES | Stated interest-to-try polling present | |
| 38 | +| nyc-congestion-pricing | YES | Polling opposition and self-reported behavior-change intent present | |
| 39 | +| london-ulez-expansion-2023 | PARTIAL | Polling context present; behavior target not fully survey-native | |
| 40 | +| reddit-api-protest | LIMITED | Mostly organizer commitments/public actions, limited formal survey basis | |
| 41 | +| snapchat-plus-launch | LIMITED | Mostly platform disclosures/market reporting, weak survey anchor | |
| 42 | +| netflix-ad-tier-launch | LIMITED | Primarily earnings/industry reporting, weak explicit survey baseline | |
| 43 | +| x-premium-adoption | PARTIAL | Mixed survey-style interest context and market estimates | |
| 44 | + |
| 45 | +## Fairness Constraints for Baseline Claims |
| 46 | + |
| 47 | +Use these constraints in public writeups: |
| 48 | +- Do not claim a “full 12-study direct-LLM baseline win” yet; the direct-LLM baseline is currently measured on only 4 of the 12 studies. |
| 49 | +- Label survey comparisons as **contextual anchors** unless metric definitions are fully normalized to simulation outcomes. |
| 50 | +- Base the benchmark headline on the frozen Extropy-vs-ground-truth table; report any baseline delta together with its study coverage (currently 4 of 12 for the direct LLM). |