Programming-The-Next-Step-2026
diff --git a/‎.DS_Store‎
2 KB b/‎.DS_Store‎
2 KB
diff --git a/‎.gitignore‎
Lines changed: 16 additions & 0 deletions b/‎.gitignore‎
Lines changed: 16 additions & 0 deletions
diff --git a/‎.idea/mini_benchmarks.iml‎
Lines changed: 2 additions & 1 deletion b/‎.idea/mini_benchmarks.iml‎
Lines changed: 2 additions & 1 deletion
diff --git a/‎.idea/misc.xml‎
Lines changed: 1 addition & 1 deletion b/‎.idea/misc.xml‎
Lines changed: 1 addition & 1 deletion
diff --git a/‎README.md‎
Lines changed: 329 additions & 2 deletions b/‎README.md‎
Lines changed: 329 additions & 2 deletions
@@ -0,0 +1,16 @@
+# Environment variables - never commit
+.env
+
+# Virtual environment
+.venvnew/
+
+# Python cache
+__pycache__/
+*.pyc
+*.pyo
+
+# PyCharm
+.idea/
+
+# macOS
+.DS_Store
@@ -1,3 +1,330 @@
-# Mini benchmarks Package
+# miniBen — Mini LLM Benchmarks
 
-Mini LLM Benchmarks is designed to evaluate the performance of Large Language Models through a collection of interesting benchmark tests (e.g., cognitive flexibility, creativity, humor...).
+**miniBen** is a Python package for running small, task-specific benchmarks against Large Language Models (LLMs) via [OpenRouter](https://openrouter.ai/). Each benchmark sends a structured prompt to a model, parses the reply, and scores the result so you can compare models on skills like cognitive flexibility and creativity.
+
+> **Status:** The package can call models and run the full benchmark pipeline. Parsers and scorers for some benchmarks are still stubs (`pass`); scores may be `None` until those are implemented.
+
+---
+
+## Table of contents
+
+- [Features](#features)
+- [Requirements](#requirements)
+- [Installation](#installation)
+- [API key setup](#api-key-setup)
+- [Quick start](#quick-start)
+- [Available benchmarks](#available-benchmarks)
+- [Usage guide](#usage-guide)
+- [Return values](#return-values)
+- [Project layout](#project-layout)
+- [Development](#development)
+- [Troubleshooting](#troubleshooting)
+- [Citations](#citations)
+- [License](#license)
+
+---
+
+## Features
+
+- **OpenRouter integration** — Use any model listed on OpenRouter with a single model ID string.
+- **Optional reasoning mode** — Enable chain-of-thought style reasoning on supported models.
+- **Built-in benchmarks** — Predefined prompts for Meta-Chess (cognitive flexibility) and creative writing.
+- **Extensible pipeline** — Each benchmark wires together a prompt, parser, and scorer; you can add new ones in `runner.py`.
+- **Simple Python API** — `AIModel.ask()` for one-off prompts, `run_benchmark()` for the full flow.
+
+---
+
+## Requirements
+
+- **Python** 3.9 or newer
+- An **[OpenRouter](https://openrouter.ai/)** account and API key
+- Internet access when calling models (requests go to `https://openrouter.ai/api/v1`)
+
+---
+
+## Installation
+
+### From a local clone (recommended for development)
+
+```bash
+git clone https://github.com/Programming-The-Next-Step-2026/mini_benchmarks.git
+cd mini_benchmarks
+python -m venv .venv
+source .venv/bin/activate   # Windows: .venv\Scripts\activate
+pip install -e .
+```
+
+### With development tools (pytest)
+
+```bash
+pip install -e ".[dev]"
+```
+
+### Verify the install
+
+```bash
+python -c "from miniBen import AIModel, run_benchmark, BENCHMARKS; print(list(BENCHMARKS))"
+```
+
+You should see the benchmark keys printed without errors.
+
+---
+
+## API key setup
+
+miniBen reads your key from the environment variable **`OPENROUTER_API_KEY`**.
+
+### Option 1 — `.env` file (recommended)
+
+Create a file named `.env` in the project root (it is gitignored by default):
+
+```env
+OPENROUTER_API_KEY=sk-or-v1-your-key-here
+```
+
+`python-dotenv` loads this automatically when you import `miniBen`.
+
+### Option 2 — export in the shell
+
+```bash
+export OPENROUTER_API_KEY="sk-or-v1-your-key-here"
+```
+
+### Option 3 — interactive prompt
+
+```python
+from miniBen import put_openrouter_api_key_into_env
+
+put_openrouter_api_key_into_env()  # prompts only if the key is missing
+```
+
+### Check that the key is set
+
+```python
+from miniBen import check_openrouter_api_key_exist
+
+print(check_openrouter_api_key_exist())  # True if OPENROUTER_API_KEY is set
+```
+
+Get a key from the [OpenRouter keys page](https://openrouter.ai/keys). Never commit your key to git.
+
+---
+
+## Quick start
+
+### 1. Ask a model a single question
+
+```python
+from miniBen import AIModel
+
+model = AIModel("openrouter/free")  # replace with any OpenRouter model ID
+content, reasoning = model.ask("Say hello in one sentence.", reasoning=False)
+
+print("Answer:", content)
+if reasoning:
+    print("Reasoning:", reasoning)
+```
+
+### 2. Run a full benchmark
+
+```python
+from miniBen import run_benchmark
+
+results = run_benchmark(
+    model_name="openrouter/free",
+    benchmark_name="creativity",
+    reasoning=True,
+)
+
+print(results["score"])
+```
+
+`run_benchmark` prints progress to the terminal and returns a dictionary (see [Return values](#return-values)).
+
+---
+
+## Available benchmarks
+
+Use these exact strings as `benchmark_name` in `run_benchmark()`:
+
+| Key | Display name | What it tests |
+|-----|----------------|---------------|
+| `cognitive flexibility` | Meta-Chess Game (Cognitive Flexibility) | Model must follow Meta-Chess rules and output structured move lists |
+| `creativity` | Creativity in story writing | Short story using the words *stamp*, *letter*, *send* |
+
+
+List keys programmatically:
+
+```python
+from miniBen import BENCHMARKS
+
+for key, meta in BENCHMARKS.items():
+    print(key, "→", meta["name"])
+```
+
+Additional prompt placeholders exist in `prompts.py` (`humor`, `bullshit`) but are not registered in `BENCHMARKS` yet.
+
+---
+
+## Usage guide
+
+### `AIModel`
+
+Wraps the OpenRouter chat API.
+
+```python
+from miniBen import AIModel
+
+model = AIModel(model="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free")
+
+content, reasoning_text = model.ask(
+    prompt="How many r's are in the word strawberry?",
+    reasoning=True,   # default: True
+)
+```
+
+| Parameter | Type | Description |
+|-----------|------|-------------|
+| `model` (constructor) | `str` | OpenRouter model ID, e.g. `openrouter/free` |
+| `prompt` | `str` | User message sent to the model |
+| `reasoning` | `bool` | Passes `extra_body={"reasoning": {"enabled": ...}}` to the API |
+
+**Returns:** `(content, reasoning_text)` — answer text and optional reasoning trace (may be `None`).
+
+### `run_benchmark`
+
+Runs prompt → model → parse → score for one benchmark.
+
+```python
+from miniBen import run_benchmark
+
+results = run_benchmark(
+    model_name="deepseek/deepseek-chat",
+    benchmark_name="cognitive flexibility",
+    reasoning=True,
+)
+```
+
+| Parameter | Type | Description |
+|-----------|------|-------------|
+| `model_name` | `str` | OpenRouter model ID |
+| `benchmark_name` | `str` | Key from `BENCHMARKS` (see table above) |
+| `reasoning` | `bool` | Enable reasoning tokens when supported (default `True`) |
+
+### Using prompts directly
+
+```python
+from miniBen import AIModel, cog_flex, creativity
+
+model = AIModel("openrouter/free")
+content, _ = model.ask(cog_flex, reasoning=True)
+```
+
+### Auth helpers
+
+| Function | Purpose |
+|----------|---------|
+| `check_openrouter_api_key_exist()` | Returns `True` if `OPENROUTER_API_KEY` is set |
+| `put_openrouter_api_key_into_env()` | Prompts for a key if missing and sets the env var |
+
+---
+
+## Return values
+
+### `AIModel.ask()`
+
+```python
+(content: str, reasoning_text: str | None)
+```
+
+### `run_benchmark()`
+
+```python
+{
+    "model": str,           # model_name you passed in
+    "benchmark": str,       # benchmark_name you passed in
+    "raw_response": tuple,  # (content, reasoning_text) from ask()
+    "parsed": ...,          # output of the benchmark parser (stub → None for now)
+    "score": ...,           # output of the benchmark scorer (stub → None for now)
+}
+```
+
+---
+
+## Project layout
+
+```
+mini_benchmarks/
+├── src/miniBen/
+│   ├── __init__.py    # Public exports
+│   ├── auth.py        # API key helpers, load_dotenv
+│   ├── model.py       # AIModel, OpenRouter client
+│   ├── prompts.py     # Benchmark prompt strings
+│   ├── parsers.py     # Parse raw model text
+│   ├── scorers.py     # Score parsed output
+│   ├── runner.py      # BENCHMARKS registry, run_benchmark()
+│   └── example.py     # Re-exports (backward compatibility)
+├── tests/
+│   ├── test_model.py    # tests for miniBen.model (AIModel.ask)
+│   └── test_runner.py   # tests for miniBen.runner (run_benchmark)
+├── pyproject.toml
+├── README.md
+└── .env               # You create this; not in git
+```
+
+**Import style:** prefer `from miniBen import AIModel, run_benchmark`. Imports from `miniBen.example` still work but are equivalent to the package root.
+
+---
+
+## Development
+
+### Run tests
+
+From the repository root:
+
+```bash
+pip install -e ".[dev]"
+pytest
+```
+
+Tests mock the OpenRouter client so they do not use your API key or network.
+
+### Add a new benchmark
+
+1. Add a prompt string in `src/miniBen/prompts.py`.
+2. Implement `parse_*` and `score_*` in `parsers.py` and `scorers.py`.
+3. Register an entry in `BENCHMARKS` inside `src/miniBen/runner.py`.
+4. Export new symbols in `src/miniBen/__init__.py` if they should be public.
+
+### Choosing a model
+
+Browse models on [OpenRouter](https://openrouter.ai/models). Use the model slug exactly as shown (e.g. `openrouter/free`, `anthropic/claude-3.5-sonnet`). Free and reasoning-capable models vary; if reasoning fails, try `reasoning=False` or another model.
+
+---
+
+## Troubleshooting
+
+| Problem | What to try |
+|---------|-------------|
+| `OPENROUTER_API_KEY` errors / 401 | Set the key in `.env` or the shell; run `check_openrouter_api_key_exist()` |
+| `KeyError` on `benchmark_name` | Use keys from [Available benchmarks](#available-benchmarks) exactly |
+| `ValueError: returned a None content block` | Model refused or returned empty output; try another model or shorter prompt |
+| `parsed` / `score` are always `None` | Expected until parsers and scorers are implemented |
+| Import errors after clone | Run `pip install -e .` from the repo root |
+| Rate limits | OpenRouter quota; wait or use a different model tier |
+
+---
+
+## License
+
+This project is licensed under the **MIT License** — see [LICENSE](LICENSE).
+
+When using upstream benchmarks, also follow the licenses of [creative-story-gen](https://github.com/mismayil/creative-story-gen) and [bullshit-benchmark](https://github.com/petergpt/bullshit-benchmark) (both are open source; check their repositories for the exact terms).
+
+---
+
+## Links
+
+- **Repository:** [Programming-The-Next-Step-2026/mini_benchmarks](https://github.com/Programming-The-Next-Step-2026/mini_benchmarks)
+- **Issues:** [GitHub Issues](https://github.com/Programming-The-Next-Step-2026/mini_benchmarks/issues)
+- **OpenRouter:** [openrouter.ai](https://openrouter.ai/)