|
1 | | -# Mini benchmarks Package |
| 1 | +# miniBen — Mini LLM Benchmarks |
2 | 2 |
|
3 | | -Mini LLM Benchmarks is designed to evaluate the performance of Large Language Models through a collection of interesting benchmark tests (e.g., cognitive flexibility, creativity, humor...). |
| 3 | +**miniBen** is a Python package for running small, task-specific benchmarks against Large Language Models (LLMs) via [OpenRouter](https://openrouter.ai/). Each benchmark sends a structured prompt to a model, parses the reply, and scores the result so you can compare models on skills like cognitive flexibility and creativity. |
| 4 | + |
| 5 | +> **Status:** The package can call models and run the full benchmark pipeline. Parsers and scorers for some benchmarks are still stubs (`pass`); scores may be `None` until those are implemented. |
| 6 | +
|
| 7 | +--- |
| 8 | + |
| 9 | +## Table of contents |
| 10 | + |
| 11 | +- [Features](#features) |
| 12 | +- [Requirements](#requirements) |
| 13 | +- [Installation](#installation) |
| 14 | +- [API key setup](#api-key-setup) |
| 15 | +- [Quick start](#quick-start) |
| 16 | +- [Available benchmarks](#available-benchmarks) |
| 17 | +- [Usage guide](#usage-guide) |
| 18 | +- [Return values](#return-values) |
| 19 | +- [Project layout](#project-layout) |
| 20 | +- [Development](#development) |
| 21 | +- [Troubleshooting](#troubleshooting) |
| 22 | +- [Citations](#citations) |
| 23 | +- [License](#license) |
| 24 | + |
| 25 | +--- |
| 26 | + |
| 27 | +## Features |
| 28 | + |
| 29 | +- **OpenRouter integration** — Use any model listed on OpenRouter with a single model ID string. |
| 30 | +- **Optional reasoning mode** — Enable chain-of-thought style reasoning on supported models. |
| 31 | +- **Built-in benchmarks** — Predefined prompts for Meta-Chess (cognitive flexibility) and creative writing. |
| 32 | +- **Extensible pipeline** — Each benchmark wires together a prompt, parser, and scorer; you can add new ones in `runner.py`. |
| 33 | +- **Simple Python API** — `AIModel.ask()` for one-off prompts, `run_benchmark()` for the full flow. |
| 34 | + |
| 35 | +--- |
| 36 | + |
| 37 | +## Requirements |
| 38 | + |
| 39 | +- **Python** 3.9 or newer |
| 40 | +- An **[OpenRouter](https://openrouter.ai/)** account and API key |
| 41 | +- Internet access when calling models (requests go to `https://openrouter.ai/api/v1`) |
| 42 | + |
| 43 | +--- |
| 44 | + |
| 45 | +## Installation |
| 46 | + |
| 47 | +### From a local clone (recommended for development) |
| 48 | + |
| 49 | +```bash |
| 50 | +git clone https://github.com/Programming-The-Next-Step-2026/mini_benchmarks.git |
| 51 | +cd mini_benchmarks |
| 52 | +python -m venv .venv |
| 53 | +source .venv/bin/activate # Windows: .venv\Scripts\activate |
| 54 | +pip install -e . |
| 55 | +``` |
| 56 | + |
| 57 | +### With development tools (pytest) |
| 58 | + |
| 59 | +```bash |
| 60 | +pip install -e ".[dev]" |
| 61 | +``` |
| 62 | + |
| 63 | +### Verify the install |
| 64 | + |
| 65 | +```bash |
| 66 | +python -c "from miniBen import AIModel, run_benchmark, BENCHMARKS; print(list(BENCHMARKS))" |
| 67 | +``` |
| 68 | + |
| 69 | +You should see the benchmark keys printed without errors. |
| 70 | + |
| 71 | +--- |
| 72 | + |
| 73 | +## API key setup |
| 74 | + |
| 75 | +miniBen reads your key from the environment variable **`OPENROUTER_API_KEY`**. |
| 76 | + |
| 77 | +### Option 1 — `.env` file (recommended) |
| 78 | + |
| 79 | +Create a file named `.env` in the project root (it is gitignored by default): |
| 80 | + |
| 81 | +```env |
| 82 | +OPENROUTER_API_KEY=sk-or-v1-your-key-here |
| 83 | +``` |
| 84 | + |
| 85 | +`python-dotenv` loads this automatically when you import `miniBen`. |
| 86 | + |
| 87 | +### Option 2 — export in the shell |
| 88 | + |
| 89 | +```bash |
| 90 | +export OPENROUTER_API_KEY="sk-or-v1-your-key-here" |
| 91 | +``` |
| 92 | + |
| 93 | +### Option 3 — interactive prompt |
| 94 | + |
| 95 | +```python |
| 96 | +from miniBen import put_openrouter_api_key_into_env |
| 97 | + |
| 98 | +put_openrouter_api_key_into_env() # prompts only if the key is missing |
| 99 | +``` |
| 100 | + |
| 101 | +### Check that the key is set |
| 102 | + |
| 103 | +```python |
| 104 | +from miniBen import check_openrouter_api_key_exist |
| 105 | + |
| 106 | +print(check_openrouter_api_key_exist()) # True if OPENROUTER_API_KEY is set |
| 107 | +``` |
| 108 | + |
| 109 | +Get a key from the [OpenRouter keys page](https://openrouter.ai/keys). Never commit your key to git. |
| 110 | + |
| 111 | +--- |
| 112 | + |
| 113 | +## Quick start |
| 114 | + |
| 115 | +### 1. Ask a model a single question |
| 116 | + |
| 117 | +```python |
| 118 | +from miniBen import AIModel |
| 119 | + |
| 120 | +model = AIModel("openrouter/free") # replace with any OpenRouter model ID |
| 121 | +content, reasoning = model.ask("Say hello in one sentence.", reasoning=False) |
| 122 | + |
| 123 | +print("Answer:", content) |
| 124 | +if reasoning: |
| 125 | + print("Reasoning:", reasoning) |
| 126 | +``` |
| 127 | + |
| 128 | +### 2. Run a full benchmark |
| 129 | + |
| 130 | +```python |
| 131 | +from miniBen import run_benchmark |
| 132 | + |
| 133 | +results = run_benchmark( |
| 134 | + model_name="openrouter/free", |
| 135 | + benchmark_name="creativity", |
| 136 | + reasoning=True, |
| 137 | +) |
| 138 | + |
| 139 | +print(results["score"]) |
| 140 | +``` |
| 141 | + |
| 142 | +`run_benchmark` prints progress to the terminal and returns a dictionary (see [Return values](#return-values)). |
| 143 | + |
| 144 | +--- |
| 145 | + |
| 146 | +## Available benchmarks |
| 147 | + |
| 148 | +Use these exact strings as `benchmark_name` in `run_benchmark()`: |
| 149 | + |
| 150 | +| Key | Display name | What it tests | |
| 151 | +|-----|----------------|---------------| |
| 152 | +| `cognitive flexibility` | Meta-Chess Game (Cognitive Flexibility) | Model must follow Meta-Chess rules and output structured move lists | |
| 153 | +| `creativity` | Creativity in story writing | Short story using the words *stamp*, *letter*, *send* | |
| 154 | + |
| 155 | + |
| 156 | +List keys programmatically: |
| 157 | + |
| 158 | +```python |
| 159 | +from miniBen import BENCHMARKS |
| 160 | + |
| 161 | +for key, meta in BENCHMARKS.items(): |
| 162 | + print(key, "→", meta["name"]) |
| 163 | +``` |
| 164 | + |
| 165 | +Additional prompt placeholders exist in `prompts.py` (`humor`, `bullshit`) but are not registered in `BENCHMARKS` yet. |
| 166 | + |
| 167 | +--- |
| 168 | + |
| 169 | +## Usage guide |
| 170 | + |
| 171 | +### `AIModel` |
| 172 | + |
| 173 | +Wraps the OpenRouter chat API. |
| 174 | + |
| 175 | +```python |
| 176 | +from miniBen import AIModel |
| 177 | + |
| 178 | +model = AIModel(model="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free") |
| 179 | + |
| 180 | +content, reasoning_text = model.ask( |
| 181 | + prompt="How many r's are in the word strawberry?", |
| 182 | + reasoning=True, # default: True |
| 183 | +) |
| 184 | +``` |
| 185 | + |
| 186 | +| Parameter | Type | Description | |
| 187 | +|-----------|------|-------------| |
| 188 | +| `model` (constructor) | `str` | OpenRouter model ID, e.g. `openrouter/free` | |
| 189 | +| `prompt` | `str` | User message sent to the model | |
| 190 | +| `reasoning` | `bool` | Passes `extra_body={"reasoning": {"enabled": ...}}` to the API | |
| 191 | + |
| 192 | +**Returns:** `(content, reasoning_text)` — answer text and optional reasoning trace (may be `None`). |
| 193 | + |
| 194 | +### `run_benchmark` |
| 195 | + |
| 196 | +Runs prompt → model → parse → score for one benchmark. |
| 197 | + |
| 198 | +```python |
| 199 | +from miniBen import run_benchmark |
| 200 | + |
| 201 | +results = run_benchmark( |
| 202 | + model_name="deepseek/deepseek-chat", |
| 203 | + benchmark_name="cognitive flexibility", |
| 204 | + reasoning=True, |
| 205 | +) |
| 206 | +``` |
| 207 | + |
| 208 | +| Parameter | Type | Description | |
| 209 | +|-----------|------|-------------| |
| 210 | +| `model_name` | `str` | OpenRouter model ID | |
| 211 | +| `benchmark_name` | `str` | Key from `BENCHMARKS` (see table above) | |
| 212 | +| `reasoning` | `bool` | Enable reasoning tokens when supported (default `True`) | |
| 213 | + |
| 214 | +### Using prompts directly |
| 215 | + |
| 216 | +```python |
| 217 | +from miniBen import AIModel, cog_flex, creativity |
| 218 | + |
| 219 | +model = AIModel("openrouter/free") |
| 220 | +content, _ = model.ask(cog_flex, reasoning=True) |
| 221 | +``` |
| 222 | + |
| 223 | +### Auth helpers |
| 224 | + |
| 225 | +| Function | Purpose | |
| 226 | +|----------|---------| |
| 227 | +| `check_openrouter_api_key_exist()` | Returns `True` if `OPENROUTER_API_KEY` is set | |
| 228 | +| `put_openrouter_api_key_into_env()` | Prompts for a key if missing and sets the env var | |
| 229 | + |
| 230 | +--- |
| 231 | + |
| 232 | +## Return values |
| 233 | + |
| 234 | +### `AIModel.ask()` |
| 235 | + |
| 236 | +```python |
| 237 | +(content: str, reasoning_text: str | None) |
| 238 | +``` |
| 239 | + |
| 240 | +### `run_benchmark()` |
| 241 | + |
| 242 | +```python |
| 243 | +{ |
| 244 | + "model": str, # model_name you passed in |
| 245 | + "benchmark": str, # benchmark_name you passed in |
| 246 | + "raw_response": tuple, # (content, reasoning_text) from ask() |
| 247 | + "parsed": ..., # output of the benchmark parser (stub → None for now) |
| 248 | + "score": ..., # output of the benchmark scorer (stub → None for now) |
| 249 | +} |
| 250 | +``` |
| 251 | + |
| 252 | +--- |
| 253 | + |
| 254 | +## Project layout |
| 255 | + |
| 256 | +``` |
| 257 | +mini_benchmarks/ |
| 258 | +├── src/miniBen/ |
| 259 | +│ ├── __init__.py # Public exports |
| 260 | +│ ├── auth.py # API key helpers, load_dotenv |
| 261 | +│ ├── model.py # AIModel, OpenRouter client |
| 262 | +│ ├── prompts.py # Benchmark prompt strings |
| 263 | +│ ├── parsers.py # Parse raw model text |
| 264 | +│ ├── scorers.py # Score parsed output |
| 265 | +│ ├── runner.py # BENCHMARKS registry, run_benchmark() |
| 266 | +│ └── example.py # Re-exports (backward compatibility) |
| 267 | +├── tests/ |
| 268 | +│ ├── test_model.py # tests for miniBen.model (AIModel.ask) |
| 269 | +│ └── test_runner.py # tests for miniBen.runner (run_benchmark) |
| 270 | +├── pyproject.toml |
| 271 | +├── README.md |
| 272 | +└── .env # You create this; not in git |
| 273 | +``` |
| 274 | + |
| 275 | +**Import style:** prefer `from miniBen import AIModel, run_benchmark`. Imports from `miniBen.example` still work but are equivalent to the package root. |
| 276 | + |
| 277 | +--- |
| 278 | + |
| 279 | +## Development |
| 280 | + |
| 281 | +### Run tests |
| 282 | + |
| 283 | +From the repository root: |
| 284 | + |
| 285 | +```bash |
| 286 | +pip install -e ".[dev]" |
| 287 | +pytest |
| 288 | +``` |
| 289 | + |
| 290 | +Tests mock the OpenRouter client so they do not use your API key or network. |
| 291 | + |
| 292 | +### Add a new benchmark |
| 293 | + |
| 294 | +1. Add a prompt string in `src/miniBen/prompts.py`. |
| 295 | +2. Implement `parse_*` and `score_*` in `parsers.py` and `scorers.py`. |
| 296 | +3. Register an entry in `BENCHMARKS` inside `src/miniBen/runner.py`. |
| 297 | +4. Export new symbols in `src/miniBen/__init__.py` if they should be public. |
| 298 | + |
| 299 | +### Choosing a model |
| 300 | + |
| 301 | +Browse models on [OpenRouter](https://openrouter.ai/models). Use the model slug exactly as shown (e.g. `openrouter/free`, `anthropic/claude-3.5-sonnet`). Free and reasoning-capable models vary; if reasoning fails, try `reasoning=False` or another model. |
| 302 | + |
| 303 | +--- |
| 304 | + |
| 305 | +## Troubleshooting |
| 306 | + |
| 307 | +| Problem | What to try | |
| 308 | +|---------|-------------| |
| 309 | +| `OPENROUTER_API_KEY` errors / 401 | Set the key in `.env` or the shell; run `check_openrouter_api_key_exist()` | |
| 310 | +| `KeyError` on `benchmark_name` | Use keys from [Available benchmarks](#available-benchmarks) exactly | |
| 311 | +| `ValueError: returned a None content block` | Model refused or returned empty output; try another model or shorter prompt | |
| 312 | +| `parsed` / `score` are always `None` | Expected until parsers and scorers are implemented | |
| 313 | +| Import errors after clone | Run `pip install -e .` from the repo root | |
| 314 | +| Rate limits | OpenRouter quota; wait or use a different model tier | |
| 315 | + |
| 316 | +--- |
| 317 | + |
| 318 | +## License |
| 319 | + |
| 320 | +This project is licensed under the **MIT License** — see [LICENSE](LICENSE). |
| 321 | + |
| 322 | +When using upstream benchmarks, also follow the licenses of [creative-story-gen](https://github.com/mismayil/creative-story-gen) and [bullshit-benchmark](https://github.com/petergpt/bullshit-benchmark) (both are open source; check their repositories for the exact terms). |
| 323 | + |
| 324 | +--- |
| 325 | + |
| 326 | +## Links |
| 327 | + |
| 328 | +- **Repository:** [Programming-The-Next-Step-2026/mini_benchmarks](https://github.com/Programming-The-Next-Step-2026/mini_benchmarks) |
| 329 | +- **Issues:** [GitHub Issues](https://github.com/Programming-The-Next-Step-2026/mini_benchmarks/issues) |
| 330 | +- **OpenRouter:** [openrouter.ai](https://openrouter.ai/) |
0 commit comments