Programming-The-Next-Step-2026
diff --git a/docs/figures/streamlit_app_compare.png b/docs/figures/streamlit_app_compare.png
258 KB
diff --git a/docs/figures/streamlit_app_history.png b/docs/figures/streamlit_app_history.png
379 KB
diff --git a/docs/figures/streamlit_app_percalldetails.png b/docs/figures/streamlit_app_percalldetails.png
223 KB
diff --git a/docs/figures/streamlit_app_runbench.png b/docs/figures/streamlit_app_runbench.png
259 KB
diff --git a/docs/figures/streamlit_app_summaryhist.png b/docs/figures/streamlit_app_summaryhist.png
282 KB
diff --git a/README.md b/README.md
Lines changed: 60 additions & 10 deletions
@@ -2,7 +2,7 @@
 
 **miniBen** evaluates Large Language Models on small benchmarks through [OpenRouter](https://openrouter.ai/). Each benchmark runs several prompt variants, sends them to a model, parses replies, and aggregates a score.
 
-> **Status:** `run_benchmark()` and `AIModel.ask()` work end-to-end. Chess outputs are parsed from JSON sentinels (`JSON_START` / `JSON_END`) and both benchmarks have implemented parsers and scorers.
+> **Status:** `run_benchmark()` and `AIModel.ask()` work end-to-end. Both benchmarks have parsers and scorers. Chess replies are parsed from JSON sentinels (`JSON_START` / `JSON_END`) when present, with a fallback to a valid 6×8 Python list literal in the reply text.
 
 ---
 
@@ -32,12 +32,12 @@ prompts.py  →  runner.run_benchmark()  →  model.AIModel.ask()  →  OpenRout
                                           scorers.score_*()  ←  parsers.parse_*()  
 ```
 
-For `**cognitive flexibility**` and `**creativity**`, `run_benchmark()`:
+For **cognitive flexibility** and **creativity**, `run_benchmark()`:
 
 1. Builds **5 prompts** (one per variant — opening or word set).
 2. Calls the API **5 times** via `AIModel.ask()`.
 3. Parses each reply (errors are caught; failed parses become `None`).
-4. Scores each parses with the benchmark scorer.
+4. Scores each parse with the benchmark scorer.
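
The four steps above can be sketched as a small loop. This is only an illustration of the described behaviour, not the actual `run_benchmark()` code: `run_variants`, `fake_ask`, and the toy parser/scorer are hypothetical names, and passing the whole parsed list to the scorer is an assumption based on the `score_*(parsed_list)` signatures.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("miniBen.sketch")


def run_variants(
    prompts: list[str],
    ask: Callable[[str], str],
    parse: Callable[[str], Any],
    score: Callable[[list[Optional[Any]]], dict],
) -> dict:
    """Hypothetical stand-in for the loop described above (not the real runner)."""
    parsed_list: list[Optional[Any]] = []
    for prompt in prompts:
        raw = ask(prompt)  # one call per prompt variant
        try:
            parsed = parse(raw)
        except Exception as exc:  # a failed parse must not stop the run
            logger.warning("parse failed: %s", exc)
            parsed = None
        parsed_list.append(parsed)
    return {"parsed": parsed_list, "score": score(parsed_list)}


# Fake model so the sketch runs offline: variant 3 returns an unparseable reply.
def fake_ask(prompt: str) -> str:
    return "not a number" if "3" in prompt else prompt[-1]


demo = run_variants(
    prompts=[f"variant {i}" for i in range(1, 6)],
    ask=fake_ask,
    parse=int,  # toy parser: raises ValueError on bad input
    score=lambda items: {"parsed_ok": sum(p is not None for p in items)},
)
```

The point of the `try`/`except` is that one bad reply costs a single `None` entry rather than the whole run.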
 
 ---
 
@@ -76,9 +76,9 @@ python -c "from miniBen import BENCHMARKS; print(list(BENCHMARKS))"
 
 ## API key
 
-Set `**OPENROUTER_API_KEY**` (loaded from `.env` via `python-dotenv` on import).
+Set **`OPENROUTER_API_KEY`** (loaded from `.env` via `python-dotenv` on import).
 
-`**.env` in project root:**
+**`.env` in project root:**
 
 ```env
 OPENROUTER_API_KEY=sk-or-v1-your-key-here
@@ -168,7 +168,7 @@ Defined in `COG_FLEX_OPENINGS` (`prompts.py`):
 | 5   | d4    | c6    |
 
 
-Each call uses `build_cog_flex_prompt(white, black)`. The model is asked for one Meta-Chess game as JSON between `<<<JSON_START>>>` / `<<<JSON_END>>>` (see `JSON_START`, `JSON_END` in `prompts.py`).
+Each call uses `build_cog_flex_prompt(white, black)`. The model is asked for one Meta-Chess game as JSON between `<<<JSON_START>>>` / `<<<JSON_END>>>` (see `JSON_START`, `JSON_END` in `prompts.py`). `parse_chess_output()` prefers that JSON block; if sentinels are missing but the reply still contains a valid 6×8 list-of-lists of SAN strings, the parser accepts it as a fallback.
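
A minimal sketch of the sentinel-first, literal-fallback idea described above. It is not the code in `parsers.py`: `parse_chess_sketch` and its helpers are hypothetical, the `{"trials": ...}` wrapper is taken from the troubleshooting table, and "6×8" is assumed to mean 6 rows of 8 strings. SAN validity of the individual moves is not checked here.

```python
import ast
import json
import re
from typing import Optional

JSON_START = "<<<JSON_START>>>"
JSON_END = "<<<JSON_END>>>"


def _is_6x8(matrix: object) -> bool:
    """True for a list of 6 rows, each a list of 8 strings (assumed shape)."""
    return (
        isinstance(matrix, list)
        and len(matrix) == 6
        and all(
            isinstance(row, list)
            and len(row) == 8
            and all(isinstance(move, str) for move in row)
            for row in matrix
        )
    )


def _balanced_span(text: str, start: int) -> Optional[str]:
    """Return the bracket-balanced substring that begins at text[start]."""
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "[":
            depth += 1
        elif text[i] == "]":
            depth -= 1
            if depth == 0:
                return text[start : i + 1]
    return None


def parse_chess_sketch(raw: str) -> Optional[list[list[str]]]:
    # 1) Preferred path: JSON between the sentinels.
    start = raw.find(JSON_START)
    end = raw.find(JSON_END, start + len(JSON_START)) if start != -1 else -1
    if start != -1 and end != -1:
        try:
            payload = json.loads(raw[start + len(JSON_START) : end])
        except json.JSONDecodeError:
            payload = None
        trials = payload.get("trials") if isinstance(payload, dict) else payload
        if _is_6x8(trials):
            return trials
    # 2) Fallback: the first "[[ ... ]]" span that is a valid 6x8 list literal.
    for match in re.finditer(r"\[\s*\[", raw):
        span = _balanced_span(raw, match.start())
        if span is None:
            continue
        try:
            candidate = ast.literal_eval(span)  # literals only, no code execution
        except (ValueError, SyntaxError):
            continue
        if _is_6x8(candidate):
            return candidate
    return None


grid = [["e4"] * 8 for _ in range(6)]
with_sentinels = f"noise {JSON_START}{json.dumps({'trials': grid})}{JSON_END} tail"
bare_literal = "Here is my game: " + repr(grid)
```

`ast.literal_eval` is used for the fallback because it only accepts Python literals, so model output is never executed.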
 
 ### Creativity — word triplets
 
@@ -184,7 +184,7 @@ Defined in `CREATIVITY_WORD_TRIPLETS`:
 | 5   | study, laptop, desk |
 
 
-Each call uses `build_creativity_prompt(word1, word2, word3)` for one five-sentence story.
+Each call uses `build_creativity_prompt(word1, word2, word3)` for one five-sentence story. `score_creativity()` computes **surprise** per story from sentence-level semantic shifts. **Novelty** (distinctiveness vs. the other stories in the run) is only defined when at least two stories parse successfully; the aggregate score then includes `novelty_scores` and `avg_novelty`.
 
 ---
 
@@ -224,6 +224,24 @@ Raises `KeyError` for unknown benchmarks. Raises `ValueError` if prompt and vari
 | `JSON_START`, `JSON_END`              | Sentinel strings for chess JSON output |
 
 
+### Parsers (`parsers.py`)
+
+
+| Function                    | Returns                                                                 |
+| --------------------------- | ----------------------------------------------------------------------- |
+| `parse_chess_output(raw)`   | 6×8 SAN matrix, or `None` (JSON sentinels first, then list literal)     |
+| `parse_creativity_output(raw)` | `{"story": str, "sentence_count": int}`, or `None` if empty after strip |
+
+
+### Scorers (`scorers.py`)
+
+
+| Function                      | Aggregate score keys (high level)                                                                 |
+| ----------------------------- | ------------------------------------------------------------------------------------------------- |
+| `score_chess(parsed_list)`    | `compliance_rate`, `total_violations`, `total_moves`, `failed_games`, `scored_games`, `games`, … |
+| `score_creativity(parsed_list)` | `stories_scored`, `surprise_scores`, `avg_surprise`, `novelty_scores`, `avg_novelty`          |
+
+
 ### Auth (`auth.py`)
 
 
@@ -265,7 +283,7 @@ from miniBen import BENCHMARKS
 }
 ```
 
-Each `**outputs**` item:
+Each **`outputs`** item:
 
 ```python
 {
@@ -280,6 +298,34 @@ Each `**outputs**` item:
 
 Parse failures are logged as warnings; that call’s `parsed` is `None` and the run continues.
 
+### Aggregate `score` (benchmark-specific)
+
+**Cognitive flexibility** (`score_chess`):
+
+```python
+{
+    "compliance_rate": float,      # 0.0–1.0; parse-failed games excluded
+    "total_violations": int,
+    "total_moves": int,
+    "failed_games": int,
+    "scored_games": int,
+    "per_trial_violations": list[int],
+    "games": list[dict],           # per-game detail
+}
+```
+
+**Creativity** (`score_creativity`):
+
+```python
+{
+    "stories_scored": int,
+    "surprise_scores": list[float],   # one per successfully parsed story
+    "avg_surprise": float,
+    "novelty_scores": list[float] | None,  # None if fewer than 2 stories
+    "avg_novelty": float | None,
+}
+```
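
Because `novelty_scores` and `avg_novelty` can be `None`, code that reads the creativity score should guard before formatting. The helper names below are hypothetical and the numbers are made-up placeholders, not benchmark results:

```python
from typing import Optional


def format_metric(value: Optional[float], digits: int = 3) -> str:
    """Render a metric, showing 'n/a' when it is undefined (None)."""
    return "n/a" if value is None else f"{value:.{digits}f}"


def summarize_creativity(score: dict) -> str:
    """One-line summary of the aggregate creativity score dict."""
    return (
        f"{score['stories_scored']} stories | "
        f"avg_surprise={format_metric(score['avg_surprise'])} | "
        f"avg_novelty={format_metric(score['avg_novelty'])}"
    )


# Placeholder values: only one story parsed, so novelty is undefined.
example_score = {
    "stories_scored": 1,
    "surprise_scores": [0.42],
    "avg_surprise": 0.42,
    "novelty_scores": None,
    "avg_novelty": None,
}
summary = summarize_creativity(example_score)
```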
+
 ---
 
 ## Project layout
@@ -289,6 +335,9 @@ mini_benchmarks/
 ├── .miniben/
 │   └── run_history.json
 ├── docs/
+│   ├── vignette.ipynb
+│   ├── vignette.md
+│   └── workflow.md
 ├── src/
 │   ├── miniBen/
 │   │   ├── __init__.py
@@ -354,7 +403,8 @@ Use slugs from [OpenRouter models](https://openrouter.ai/models). If reasoning b
 | 401 / missing key                          | Set `OPENROUTER_API_KEY` in `.env` or environment                              |
 | `KeyError: Unknown benchmark`              | Use `cognitive flexibility` or `creativity` exactly                            |
 | `ValueError: None content block`           | Try another model or `reasoning=False`                                         |
-| `parsed` is often `None`                   | Ensure model outputs JSON between `JSON_START` / `JSON_END` and matches schema |
+| `parsed` is often `None` (chess)           | Have the model emit JSON between `JSON_START` / `JSON_END` shaped as `{"trials": [[...], ...]}` (6×8 strings); a bare 6×8 list literal of SAN moves is also accepted as a fallback |
+| `avg_novelty` is `None` or `n/a` (creativity) | Novelty needs at least two successfully parsed stories in the same run |
 | `[E050] Can't find model 'en_core_web_sm'` | Run `python -m spacy download en_core_web_sm` or reinstall: `pip install -e .` |
 | `ImportError`                              | `pip install -e .` from repo root                                              |
 | Rate limits                                | Wait or switch model on OpenRouter                                             |
@@ -381,7 +431,7 @@ Use slugs from [OpenRouter models](https://openrouter.ai/models). If reasoning b
 | Task                               | Credit                                                                                                                                   |
 | ---------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
 | Meta-Chess / cognitive flexibility | Original to this project                                                                                                                 |
-| Creativity                         | Adapted from [creative-story-gen](https://github.com/mismayil/creative-story-gen) — [arXiv:2411.02316](https://arxiv.org/abs/2411.02316) |                                                   |
+| Creativity                         | Adapted from [creative-story-gen](https://github.com/mismayil/creative-story-gen) — [arXiv:2411.02316](https://arxiv.org/abs/2411.02316) |
 
 
 ```bibtex