Commit 82f44fb

Merge main: add DeepWiki README badge
2 parents 5bc736d + d17d592 commit 82f44fb

3 files changed

Lines changed: 52 additions & 0 deletions

README.md

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@
 [![PyPI version](https://img.shields.io/pypi/v/extropy-run.svg?cacheSeconds=300)](https://pypi.org/project/extropy-run/)
 [![Python](https://img.shields.io/pypi/pyversions/extropy-run.svg?cacheSeconds=300)](https://pypi.org/project/extropy-run/)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+[![DeepWiki](https://deepwiki.com/badge-maker?url=https%3A%2F%2Fdeepwiki.com%2Fexaforge%2Fextropy)](https://deepwiki.com/exaforge/extropy)

 Predictive intelligence through agent-based population simulation. Create synthetic populations grounded in real-world data, simulate how they respond to events, and watch opinions emerge through social networks.

docs/benchmark/MANIFEST.sha256

Lines changed: 1 addition & 0 deletions
@@ -1,3 +1,4 @@
+21634f124af1bbbc0083c17f8621c80e7f9865087681ae19b9ab75e1b775a470 docs/benchmark/artifacts/mini-ready12-seed42-p3-20260212-235715/baselines.md
 e17df98f0186eaead80d126d4e972b80ab6c79542a2bdc2856300f2fb0820b21 docs/benchmark/artifacts/mini-ready12-seed42-p3-20260212-235715/blocked_studies.txt
 0dd9f126da13b5abf552d5ad5cb929451f02dd4d32b9cbd33482eea0f09422f2 docs/benchmark/artifacts/mini-ready12-seed42-p3-20260212-235715/leakage-readiness.md
 92e81d23ce31880e6bf3ed8ddfc2284d7ae8d3d928d3fd350a03a3e539de2a8a docs/benchmark/artifacts/mini-ready12-seed42-p3-20260212-235715/mini-ready12-seed42-p3-20260212-235715.groundtruth-12table.md
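
Each manifest line pairs a SHA-256 hex digest with a file path, the same layout that `sha256sum` writes. A minimal sketch of checking such a manifest with Python's standard library follows. The helper names (`parse_manifest`, `sha256_of`, `verify`) are hypothetical and not part of the repository, and resolving paths against a caller-supplied root directory is an assumption.

```python
import hashlib
from pathlib import Path


def parse_manifest(text: str) -> dict[str, str]:
    """Map each listed path to its expected lowercase hex digest."""
    entries: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Accept one or two spaces between digest and path.
        digest, path = line.split(maxsplit=1)
        # sha256sum marks binary-mode entries with a leading "*".
        entries[path.lstrip("*")] = digest.lower()
    return entries


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large artifacts are not read into memory at once."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def verify(manifest_path: Path, root: Path) -> list[str]:
    """Return the listed paths that are missing or whose digest does not match."""
    entries = parse_manifest(manifest_path.read_text())
    failures = []
    for rel_path, expected in entries.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(rel_path)
    return failures
```

Calling `verify(Path("docs/benchmark/MANIFEST.sha256"), Path("."))` from the repository root would return an empty list if every listed artifact matches its digest, assuming the manifest paths are relative to that root.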
Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
# Baselines (Tracked)

This note tracks baseline evidence currently available for the frozen mini benchmark tag `mini-ready12-seed42-p3-20260212-235715`.

## 1) Direct LLM Baseline (Measured)

Source artifact (private run): `extropy-ds/minibench/direct-dual12-20260211-184557.json`.

Setup:
- Model/provider: `gpt-5-mini` on `azure_openai`
- Sample size: `n=12` agents per study
- Prompting mode used for baseline comparison: `current` (single-shot direct response)

| Study | Direct LLM baseline pred | Ground-truth target | Extropy pred (frozen run) | Direct LLM status | Extropy status |
|---|---:|---:|---:|---|---|
| apple-att-privacy | 58.3% deny_tracking | 75-80% | 76.7% | MISS | PASS |
| bud-light-boycott | 41.7% maintain_bud_light | 80-90% (~85%) | 85.8% | MISS | PASS |
| netflix-password-sharing | 83.3% maintain_relationship (comply) | >80% | 94.2% | PASS | PASS |
| x-premium-adoption | 41.7% subscribe_to_premium | 0.5-1.5% | 0.8% | MISS | PASS |
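
The PASS/MISS columns are consistent with a simple range check of each prediction against its ground-truth target. The sketch below reproduces them and is not the project's scoring code: the `Target` type, the `status` helper, inclusive bounds, and reading `>80%` as the range 80 to 100 are all assumptions made for illustration. Only the numbers come from the table above.

```python
from typing import NamedTuple


class Target(NamedTuple):
    low: float   # lower bound of the ground-truth range, in percent
    high: float  # upper bound of the ground-truth range, in percent


def status(pred: float, target: Target) -> str:
    """PASS if the prediction falls inside the target range (bounds assumed inclusive)."""
    return "PASS" if target.low <= pred <= target.high else "MISS"


TARGETS = {
    "apple-att-privacy": Target(75.0, 80.0),
    "bud-light-boycott": Target(80.0, 90.0),
    "netflix-password-sharing": Target(80.0, 100.0),  # ">80%" read as 80 to 100
    "x-premium-adoption": Target(0.5, 1.5),
}

DIRECT_LLM = {
    "apple-att-privacy": 58.3,
    "bud-light-boycott": 41.7,
    "netflix-password-sharing": 83.3,
    "x-premium-adoption": 41.7,
}

EXTROPY = {
    "apple-att-privacy": 76.7,
    "bud-light-boycott": 85.8,
    "netflix-password-sharing": 94.2,
    "x-premium-adoption": 0.8,
}

# (direct LLM status, Extropy status) per study
ROWS = {
    study: (status(DIRECT_LLM[study], target), status(EXTROPY[study], target))
    for study, target in TARGETS.items()
}

for study, (direct, extropy) in ROWS.items():
    print(f"{study}: direct={direct} extropy={extropy}")
```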

Interpretation:
- Direct LLM baseline is currently measured on **4 studies**, not all 12.
- In this measured subset, Extropy passes in 3 studies where the direct LLM baseline misses, and both pass in 1 (netflix-password-sharing).
- This baseline should be treated as **preliminary** because of the small sample (`n=12`) and partial study coverage.
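
To make the small-sample caveat concrete, the sketch below computes a Wilson score interval for a proportion observed on 12 agents. It is an illustration, not part of the benchmark tooling: the function name `wilson_interval` is hypothetical, and the per-study counts are inferred from the table's percentages at `n=12` (for example, 58.3% read as 7 of 12), which is an assumption.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Counts inferred from the direct LLM percentages at n=12 (assumption).
COUNTS = {
    "apple-att-privacy": 7,          # 58.3%
    "bud-light-boycott": 5,          # 41.7%
    "netflix-password-sharing": 10,  # 83.3%
    "x-premium-adoption": 5,         # 41.7%
}

for study, k in COUNTS.items():
    low, high = wilson_interval(k, 12)
    print(f"{study}: {k}/12 -> {100 * low:.1f}% to {100 * high:.1f}%")
```

For 7 of 12, the interval spans roughly 32% to 81%, which shows how little a single `n=12` estimate pins down the underlying rate.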

## 2) Survey Baseline (Availability)

Survey-style baseline context exists in many `ground-truth.md` files, but quality varies by study. The table below tracks whether a usable survey anchor is present.

| Study | Survey baseline availability | Notes |
|---|---|---|
| apple-att-privacy | YES | Explicit survey/industry opt-in expectations present |
| bud-light-boycott | YES | Stated boycott intent and polling context present |
| netflix-password-sharing | YES | Borrower intent polling context present |
| spotify-price-hike | YES | Stated cancellation-intent survey ranges present |
| plant-based-meat | YES | Stated willingness/try rates present |
| threads-launch | YES | Stated interest-to-try polling present |
| nyc-congestion-pricing | YES | Polling opposition and self-reported behavior-change intent present |
| london-ulez-expansion-2023 | PARTIAL | Polling context present; behavior target not fully survey-native |
| reddit-api-protest | LIMITED | Mostly organizer commitments/public actions, limited formal survey basis |
| snapchat-plus-launch | LIMITED | Mostly platform disclosures/market reporting, weak survey anchor |
| netflix-ad-tier-launch | LIMITED | Primarily earnings/industry reporting, weak explicit survey baseline |
| x-premium-adoption | PARTIAL | Mixed survey-style interest context and market estimates |

## Fairness Constraints for Baseline Claims

Use these constraints in public writeups:
- Do not claim “full 12-study direct-LLM baseline win” yet; the measured direct-LLM baseline currently covers only 4 of the 12 studies.
- Label survey comparisons as **contextual anchors** unless metric definitions are fully normalized to simulation outcomes.
- Base the benchmark headline on the frozen Extropy-vs-ground-truth table, and mark every baseline delta with its study coverage (for example, 4 of 12 studies).
