`README.md`: 60 additions & 10 deletions
@@ -2,7 +2,7 @@
2
2
3
3
**miniBen** evaluates Large Language Models on small benchmarks through [OpenRouter](https://openrouter.ai/). Each benchmark runs several prompt variants, sends them to a model, parses replies, and aggregates a score.
4
4
5
-
> **Status:** `run_benchmark()` and `AIModel.ask()` work end-to-end. Chess outputs are parsed from JSON sentinels (`JSON_START` / `JSON_END`) and both benchmarks have implemented parsers and scorers.
5
+
> **Status:** `run_benchmark()` and `AIModel.ask()` work end-to-end. Both benchmarks have parsers and scorers. Chess replies are parsed from JSON sentinels (`JSON_START` / `JSON_END`) when present, with a fallback to a valid 6×8 Python list literal in the reply text.
Set `**OPENROUTER_API_KEY**` (loaded from `.env` via `python-dotenv` on import).
79
+
Set **`OPENROUTER_API_KEY`** (loaded from `.env` via `python-dotenv` on import).
80
80
81
-
`**.env` in project root:**
81
+
**`.env` in project root:**
82
82
83
83
```env
84
84
OPENROUTER_API_KEY=sk-or-v1-your-key-here
@@ -168,7 +168,7 @@ Defined in `COG_FLEX_OPENINGS` (`prompts.py`):
168
168
| 5 | d4 | c6 |
169
169
170
170
171
-
Each call uses `build_cog_flex_prompt(white, black)`. The model is asked for one Meta-Chess game as JSON between `<<<JSON_START>>>` / `<<<JSON_END>>>` (see `JSON_START`, `JSON_END` in `prompts.py`).
171
+
Each call uses `build_cog_flex_prompt(white, black)`. The model is asked for one Meta-Chess game as JSON between `<<<JSON_START>>>` / `<<<JSON_END>>>` (see `JSON_START`, `JSON_END` in `prompts.py`). `parse_chess_output()` prefers that JSON block; if sentinels are missing but the reply still contains a valid 6×8 list-of-lists of SAN strings, the parser accepts it as a fallback.
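The sentinel-first, literal-fallback order described above could look roughly like the sketch below. It is not the code in `parse_chess_output()`: `parse_chess_reply` and `is_valid_grid` are hypothetical names, the sentinel values are copied from the text above, and the `{"trials": ...}` key and the reading of "6×8" as 6 rows of 8 strings are assumptions.

```python
import ast
import json
import re

# Sentinel values as quoted in this README; the real constants live in prompts.py.
JSON_START = "<<<JSON_START>>>"
JSON_END = "<<<JSON_END>>>"


def is_valid_grid(value, rows=6, cols=8):
    """Return True if value is a rows x cols list-of-lists of strings (assumed shape)."""
    return (
        isinstance(value, list)
        and len(value) == rows
        and all(
            isinstance(row, list)
            and len(row) == cols
            and all(isinstance(move, str) for move in row)
            for row in value
        )
    )


def parse_chess_reply(reply):
    """Hypothetical stand-in for parse_chess_output(); returns the grid or None."""
    # 1. Preferred path: JSON between the sentinels.
    start = reply.find(JSON_START)
    end = reply.find(JSON_END, start + len(JSON_START)) if start != -1 else -1
    if start != -1 and end != -1:
        try:
            payload = json.loads(reply[start + len(JSON_START):end])
        except json.JSONDecodeError:
            payload = None
        if isinstance(payload, dict) and is_valid_grid(payload.get("trials")):
            return payload["trials"]
    # 2. Fallback: first bracketed span that evaluates to a valid grid.
    for match in re.finditer(r"\[\s*\[.*?\]\s*\]", reply, flags=re.DOTALL):
        try:
            candidate = ast.literal_eval(match.group(0))
        except (ValueError, SyntaxError):
            continue
        if is_valid_grid(candidate):
            return candidate
    return None
```

`ast.literal_eval` is used for the fallback because it only evaluates Python literals, so an arbitrary model reply is never executed as code.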
172
172
173
173
### Creativity — word triplets
174
174
@@ -184,7 +184,7 @@ Defined in `CREATIVITY_WORD_TRIPLETS`:
184
184
| 5 | study, laptop, desk |
185
185
186
186
187
-
Each call uses `build_creativity_prompt(word1, word2, word3)` for one five-sentence story.
187
+
Each call uses `build_creativity_prompt(word1, word2, word3)` for one five-sentence story. `score_creativity()` computes **surprise** per story from sentence-level semantic shifts. **Novelty** (distinctiveness vs. the other stories in the run) is only defined when at least two stories parse successfully; the aggregate score then includes `novelty_scores` and `avg_novelty`.
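To illustrate why novelty needs at least two stories, here is a minimal sketch that scores each story by its mean distance to the others. It is an assumption-laden stand-in, not the logic of `score_creativity()`: it compares plain word counts rather than semantic embeddings, and `cosine_distance` and `novelty_scores` are hypothetical helper names.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """1 - cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 if norm == 0 else 1.0 - dot / norm


def novelty_scores(stories):
    """Mean distance of each story to the others; None with fewer than two stories."""
    if len(stories) < 2:
        return None  # nothing to compare against
    # Word counts are a crude proxy for the semantic vectors a real scorer would use.
    vectors = [Counter(story.lower().split()) for story in stories]
    scores = []
    for i, vec in enumerate(vectors):
        others = [
            cosine_distance(vec, other)
            for j, other in enumerate(vectors)
            if j != i
        ]
        scores.append(sum(others) / len(others))
    return scores
```

With a single parsed story the list of "others" is empty, so the function returns `None`, matching the `None` values documented for `novelty_scores` and `avg_novelty`.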
188
188
189
189
---
190
190
@@ -224,6 +224,24 @@ Raises `KeyError` for unknown benchmarks. Raises `ValueError` if prompt and vari
224
224
| `JSON_START`, `JSON_END` | Sentinel strings for chess JSON output |
@@ -265,7 +283,7 @@ from miniBen import BENCHMARKS
265
283
}
266
284
```
267
285
268
-
Each `**outputs**` item:
286
+
Each **`outputs`** item:
269
287
270
288
```python
271
289
{
@@ -280,6 +298,34 @@ Each `**outputs**` item:
280
298
281
299
Parse failures are logged as warnings; that call’s `parsed` is `None` and the run continues.
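The warn-and-continue behaviour can be sketched as follows. This is an illustration only: `parse_all`, the logger name, and the `raw` key are hypothetical, and the assumption that a parser signals failure with `ValueError` is not confirmed by the README; only the `parsed` key is taken from the text above.

```python
import logging

logger = logging.getLogger("miniben.sketch")  # hypothetical logger name


def parse_all(replies, parser):
    """Apply parser to each reply; a failure yields parsed=None instead of aborting."""
    outputs = []
    for index, reply in enumerate(replies):
        try:
            parsed = parser(reply)
        except ValueError as exc:
            # Log and keep going so one bad reply does not end the run.
            logger.warning("parse failed for call %d: %s", index, exc)
            parsed = None
        outputs.append({"raw": reply, "parsed": parsed})
    return outputs
```

For example, `parse_all(["1", "x"], int)` keeps both entries and sets `parsed` to `None` for the second one.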
282
300
301
+
### Aggregate `score` (benchmark-specific)
302
+
303
+
**Cognitive flexibility** (`score_chess`):
304
+
305
+
```python
306
+
{
307
+
"compliance_rate": float, # 0.0–1.0; parse-failed games excluded
308
+
"total_violations": int,
309
+
"total_moves": int,
310
+
"failed_games": int,
311
+
"scored_games": int,
312
+
"per_trial_violations": list[int],
313
+
"games": list[dict], # per-game detail
314
+
}
315
+
```
316
+
317
+
**Creativity** (`score_creativity`):
318
+
319
+
```python
320
+
{
321
+
"stories_scored": int,
322
+
"surprise_scores": list[float], # one per successfully parsed story
323
+
"avg_surprise": float,
324
+
"novelty_scores": list[float] | None, # None if fewer than 2 stories
325
+
"avg_novelty": float | None,
326
+
}
327
+
```
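Because `avg_novelty` can be `None`, code that reports the creativity score should handle that case explicitly. A minimal sketch, assuming a score dict shaped like the one above (`format_creativity_score` is a hypothetical helper, not part of miniBen):

```python
def format_creativity_score(score):
    """Render a one-line summary; novelty shows as n/a when it is undefined."""
    novelty = score.get("avg_novelty")
    novelty_text = "n/a" if novelty is None else f"{novelty:.3f}"
    return (
        f"stories={score['stories_scored']} "
        f"surprise={score['avg_surprise']:.3f} "
        f"novelty={novelty_text}"
    )


# Illustrative values only, not real benchmark results.
example = {
    "stories_scored": 1,
    "surprise_scores": [0.5],
    "avg_surprise": 0.5,
    "novelty_scores": None,
    "avg_novelty": None,
}
print(format_creativity_score(example))
```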
328
+
283
329
---
284
330
285
331
## Project layout
@@ -289,6 +335,9 @@ mini_benchmarks/
289
335
├── .miniben/
290
336
│ └── run_history.json
291
337
├── docs/
338
+
│ ├── vignette.ipynb
339
+
│ ├── vignette.md
340
+
│ └── workflow.md
292
341
├── src/
293
342
│ ├── miniBen/
294
343
│ │ ├── __init__.py
@@ -354,7 +403,8 @@ Use slugs from [OpenRouter models](https://openrouter.ai/models). If reasoning b
354
403
| 401 / missing key | Set `OPENROUTER_API_KEY` in `.env` or environment |
355
404
| `KeyError: Unknown benchmark` | Use `cognitive flexibility` or `creativity` exactly |
356
405
| `ValueError: None content block` | Try another model or `reasoning=False` |
357
-
| `parsed` is often `None` | Ensure model outputs JSON between `JSON_START` / `JSON_END` and matches schema |
406
+
| `parsed` is often `None` (chess) | Prefer JSON between `JSON_START` / `JSON_END` with `{"trials": [[...], ...]}` (6×8 strings); or a bare 6×8 list literal of SAN moves |
407
+
| `avg_novelty` is `n/a` (creativity) | Need at least two successfully parsed stories in the same run |
358
408
| `[E050] Can't find model 'en_core_web_sm'` | Run `python -m spacy download en_core_web_sm` or reinstall: `pip install -e .` |
359
409
| `ImportError` | `pip install -e .` from repo root |
360
410
| Rate limits | Wait or switch model on OpenRouter |
@@ -381,7 +431,7 @@ Use slugs from [OpenRouter models](https://openrouter.ai/models). If reasoning b