Commit e64a48b

Update report and tests
1 parent 82867a6 commit e64a48b

11 files changed

Lines changed: 1394 additions & 1293 deletions

README.md

Lines changed: 60 additions & 10 deletions
@@ -2,7 +2,7 @@
22

33
**miniBen** evaluates Large Language Models on small benchmarks through [OpenRouter](https://openrouter.ai/). Each benchmark runs several prompt variants, sends them to a model, parses replies, and aggregates a score.
44

5-
> **Status:** `run_benchmark()` and `AIModel.ask()` work end-to-end. Chess outputs are parsed from JSON sentinels (`JSON_START` / `JSON_END`) and both benchmarks have implemented parsers and scorers.
5+
> **Status:** `run_benchmark()` and `AIModel.ask()` work end-to-end. Both benchmarks have parsers and scorers. Chess replies are parsed from JSON sentinels (`JSON_START` / `JSON_END`) when present, with a fallback to a valid 6×8 Python list literal in the reply text.
66
77
---
88

@@ -32,12 +32,12 @@ prompts.py → runner.run_benchmark() → model.AIModel.ask() → OpenRout
3232
scorers.score_*() ← parsers.parse_*()
3333
```
3434

35-
For `**cognitive flexibility**` and `**creativity**`, `run_benchmark()`:
35+
For **cognitive flexibility** and **creativity**, `run_benchmark()`:
3636

3737
1. Builds **5 prompts** (one per variant — opening or word set).
3838
2. Calls the API **5 times** via `AIModel.ask()`.
3939
3. Parses each reply (errors are caught; failed parses become `None`).
40-
4. Scores each parses with the benchmark scorer.
40+
4. Scores each parse with the benchmark scorer.
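The four steps above can be sketched as a small loop. This is an illustration only, not the code in `runner.py`: `run_variants`, its callable arguments, and the keys of each output dict are hypothetical names chosen for the example, and the real `run_benchmark()` may differ in signature and return shape.

```python
import logging

logger = logging.getLogger("miniben_sketch")


def run_variants(prompts, ask, parse, score):
    """Hypothetical sketch: ask once per prompt, parse defensively, score."""
    outputs = []
    for prompt in prompts:
        raw = ask(prompt)  # one API call per variant
        try:
            parsed = parse(raw)
        except Exception as exc:  # a failed parse must not stop the run
            logger.warning("parse failed: %s", exc)
            parsed = None
        # Key names here are illustrative, not the project's schema.
        outputs.append({"prompt": prompt, "raw": raw, "parsed": parsed})
    # The scorer receives every parse result, including the None entries.
    return {"outputs": outputs, "score": score([o["parsed"] for o in outputs])}
```

Passing the `None` entries through to the scorer, rather than dropping them, lets the scorer report how many variants failed to parse.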
4141

4242
---
4343

@@ -76,9 +76,9 @@ python -c "from miniBen import BENCHMARKS; print(list(BENCHMARKS))"
7676

7777
## API key
7878

79-
Set `**OPENROUTER_API_KEY**` (loaded from `.env` via `python-dotenv` on import).
79+
Set **`OPENROUTER_API_KEY`** (loaded from `.env` via `python-dotenv` on import).
8080

81-
`**.env` in project root:**
81+
**`.env` in project root:**
8282

8383
```env
8484
OPENROUTER_API_KEY=sk-or-v1-your-key-here
@@ -168,7 +168,7 @@ Defined in `COG_FLEX_OPENINGS` (`prompts.py`):
168168
| 5 | d4 | c6 |
169169

170170

171-
Each call uses `build_cog_flex_prompt(white, black)`. The model is asked for one Meta-Chess game as JSON between `<<<JSON_START>>>` / `<<<JSON_END>>>` (see `JSON_START`, `JSON_END` in `prompts.py`).
171+
Each call uses `build_cog_flex_prompt(white, black)`. The model is asked for one Meta-Chess game as JSON between `<<<JSON_START>>>` / `<<<JSON_END>>>` (see `JSON_START`, `JSON_END` in `prompts.py`). `parse_chess_output()` prefers that JSON block; if sentinels are missing but the reply still contains a valid 6×8 list-of-lists of SAN strings, the parser accepts it as a fallback.
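A minimal sketch of the sentinel-first, list-literal-fallback order described above. It is not the project's `parse_chess_output()`: the function name `parse_chess_sketch`, the helper `_is_6x8`, and the way the fallback locates the list (first `[[` to last `]]`) are assumptions made for illustration. The `{"trials": [...]}` wrapper follows the shape shown in the troubleshooting table.

```python
import ast
import json

JSON_START = "<<<JSON_START>>>"
JSON_END = "<<<JSON_END>>>"


def _is_6x8(matrix):
    """True if matrix is a list of 6 rows, each a list of 8 strings."""
    return (
        isinstance(matrix, list)
        and len(matrix) == 6
        and all(
            isinstance(row, list)
            and len(row) == 8
            and all(isinstance(move, str) for move in row)
            for row in matrix
        )
    )


def parse_chess_sketch(raw):
    """Hypothetical parser: JSON between sentinels first, then a list literal."""
    # 1) Preferred path: JSON between the sentinels.
    start = raw.find(JSON_START)
    end = raw.find(JSON_END, start + len(JSON_START)) if start != -1 else -1
    if start != -1 and end != -1:
        try:
            payload = json.loads(raw[start + len(JSON_START):end])
        except json.JSONDecodeError:
            payload = None
        if isinstance(payload, dict) and _is_6x8(payload.get("trials")):
            return payload["trials"]

    # 2) Fallback (assumed strategy): evaluate the outermost [[...]] span.
    first, last = raw.find("[["), raw.rfind("]]")
    if first != -1 and last > first:
        try:
            candidate = ast.literal_eval(raw[first:last + 2])
        except (ValueError, SyntaxError):
            return None
        if _is_6x8(candidate):
            return candidate
    return None
```

`ast.literal_eval` is used for the fallback because it only evaluates Python literals, so untrusted model output is never executed.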
172172

173173
### Creativity — word triplets
174174

@@ -184,7 +184,7 @@ Defined in `CREATIVITY_WORD_TRIPLETS`:
184184
| 5 | study, laptop, desk |
185185

186186

187-
Each call uses `build_creativity_prompt(word1, word2, word3)` for one five-sentence story.
187+
Each call uses `build_creativity_prompt(word1, word2, word3)` for one five-sentence story. `score_creativity()` computes **surprise** per story from sentence-level semantic shifts. **Novelty** (distinctiveness vs. the other stories in the run) is only defined when at least two stories parse successfully; the aggregate score then includes `novelty_scores` and `avg_novelty`.
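The two-story requirement for novelty can be illustrated with a toy example. The distance below is a simple Jaccard distance over lowercase tokens, swapped in purely for illustration; it is not the semantic measure `score_creativity()` uses, and `toy_distance` and `novelty_summary` are hypothetical names. Only the gating behavior (`None` when fewer than two stories parse) mirrors the description above.

```python
def toy_distance(a, b):
    """Jaccard distance over lowercase tokens (stand-in, not the real metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    if not union:
        return 0.0
    return 1.0 - len(ta & tb) / len(union)


def novelty_summary(stories):
    """Hypothetical sketch: mean distance of each story to the others in the run."""
    parsed = [s for s in stories if s]  # drop failed parses (None or empty)
    if len(parsed) < 2:
        # Novelty compares a story with its peers, so one story is not enough.
        return {"novelty_scores": None, "avg_novelty": None}
    scores = []
    for i, story in enumerate(parsed):
        dists = [toy_distance(story, other) for j, other in enumerate(parsed) if j != i]
        scores.append(sum(dists) / len(dists))
    return {"novelty_scores": scores, "avg_novelty": sum(scores) / len(scores)}
```

With a single parsed story there is nothing to compare against, which is why both keys come back as `None` instead of `0.0`.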
188188

189189
---
190190

@@ -224,6 +224,24 @@ Raises `KeyError` for unknown benchmarks. Raises `ValueError` if prompt and vari
224224
| `JSON_START`, `JSON_END` | Sentinel strings for chess JSON output |
225225

226226

227+
### Parsers (`parsers.py`)
228+
229+
230+
| Function | Returns |
231+
| --------------------------- | ----------------------------------------------------------------------- |
232+
| `parse_chess_output(raw)` | 6×8 SAN matrix, or `None` (JSON sentinels first, then list literal) |
233+
| `parse_creativity_output(raw)` | `{"story": str, "sentence_count": int}`, or `None` if empty after strip |
234+
235+
236+
### Scorers (`scorers.py`)
237+
238+
239+
| Function | Aggregate score keys (high level) |
240+
| ----------------------------- | ------------------------------------------------------------------------------------------------- |
241+
| `score_chess(parsed_list)` | `compliance_rate`, `total_violations`, `total_moves`, `failed_games`, `scored_games`, `games`, … |
242+
| `score_creativity(parsed_list)` | `stories_scored`, `surprise_scores`, `avg_surprise`, `novelty_scores`, `avg_novelty` |
243+
244+
227245
### Auth (`auth.py`)
228246

229247

@@ -265,7 +283,7 @@ from miniBen import BENCHMARKS
265283
}
266284
```
267285

268-
Each `**outputs**` item:
286+
Each **`outputs`** item:
269287

270288
```python
271289
{
@@ -280,6 +298,34 @@ Each `**outputs**` item:
280298

281299
Parse failures are logged as warnings; that call’s `parsed` is `None` and the run continues.
282300

301+
### Aggregate `score` (benchmark-specific)
302+
303+
**Cognitive flexibility** (`score_chess`):
304+
305+
```python
306+
{
307+
"compliance_rate": float, # 0.0–1.0; parse-failed games excluded
308+
"total_violations": int,
309+
"total_moves": int,
310+
"failed_games": int,
311+
"scored_games": int,
312+
"per_trial_violations": list[int],
313+
"games": list[dict], # per-game detail
314+
}
315+
```
316+
317+
**Creativity** (`score_creativity`):
318+
319+
```python
320+
{
321+
"stories_scored": int,
322+
"surprise_scores": list[float], # one per successfully parsed story
323+
"avg_surprise": float,
324+
"novelty_scores": list[float] | None, # None if fewer than 2 stories
325+
"avg_novelty": float | None,
326+
}
327+
```
328+
283329
---
284330

285331
## Project layout
@@ -289,6 +335,9 @@ mini_benchmarks/
289335
├── .miniben/
290336
│ └── run_history.json
291337
├── docs/
338+
│ ├── vignette.ipynb
339+
│ ├── vignette.md
340+
│ └── workflow.md
292341
├── src/
293342
│ ├── miniBen/
294343
│ │ ├── __init__.py
@@ -354,7 +403,8 @@ Use slugs from [OpenRouter models](https://openrouter.ai/models). If reasoning b
354403
| 401 / missing key | Set `OPENROUTER_API_KEY` in `.env` or environment |
355404
| `KeyError: Unknown benchmark` | Use `cognitive flexibility` or `creativity` exactly |
356405
| `ValueError: None content block` | Try another model or `reasoning=False` |
357-
| `parsed` is often `None` | Ensure model outputs JSON between `JSON_START` / `JSON_END` and matches schema |
406+
| `parsed` is often `None` (chess) | Prefer JSON between `JSON_START` / `JSON_END` with `{"trials": [[...], ...]}` (6×8 strings); or a bare 6×8 list literal of SAN moves |
407+
| `avg_novelty` is `n/a` (creativity) | Need at least two successfully parsed stories in the same run |
358408
| `[E050] Can't find model 'en_core_web_sm'` | Run `python -m spacy download en_core_web_sm` or reinstall: `pip install -e .` |
359409
| `ImportError` | `pip install -e .` from repo root |
360410
| Rate limits | Wait or switch model on OpenRouter |
@@ -381,7 +431,7 @@ Use slugs from [OpenRouter models](https://openrouter.ai/models). If reasoning b
381431
| Task | Credit |
382432
| ---------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
383433
| Meta-Chess / cognitive flexibility | Original to this project |
384-
| Creativity | Adapted from [creative-story-gen](https://github.com/mismayil/creative-story-gen)[arXiv:2411.02316](https://arxiv.org/abs/2411.02316) | |
434+
| Creativity | Adapted from [creative-story-gen](https://github.com/mismayil/creative-story-gen)[arXiv:2411.02316](https://arxiv.org/abs/2411.02316) |
385435

386436

387437
```bibtex