Skip to content

Commit 772ee32

Browse files
authored
Merge pull request #2 from Programming-The-Next-Step-2026/week-2
functions documentation and unitest
2 parents 82c3060 + 4abac1d commit 772ee32

19 files changed

Lines changed: 710 additions & 14 deletions

.DS_Store

2 KB
Binary file not shown.

.gitignore

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,16 @@
1+
# Environment variables - never commit
2+
.env
3+
4+
# Virtual environment
5+
.venvnew/
6+
7+
# Python cache
8+
__pycache__/
9+
*.pyc
10+
*.pyo
11+
12+
# PyCharm
13+
.idea/
14+
15+
# macOS
16+
.DS_Store

.idea/mini_benchmarks.iml

Lines changed: 2 additions & 1 deletion
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

.idea/misc.xml

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

README.md

Lines changed: 329 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,330 @@
1-
# Mini benchmarks Package
1+
# miniBen — Mini LLM Benchmarks
22

3-
Mini LLM Benchmarks is designed to evaluate the performance of Large Language Models through a collection of interesting benchmark tests (e.g., cognitive flexibility, creativity, humor...).
3+
**miniBen** is a Python package for running small, task-specific benchmarks against Large Language Models (LLMs) via [OpenRouter](https://openrouter.ai/). Each benchmark sends a structured prompt to a model, parses the reply, and scores the result so you can compare models on skills like cognitive flexibility and creativity.
4+
5+
> **Status:** The package can call models and run the full benchmark pipeline. Parsers and scorers for some benchmarks are still stubs (`pass`); scores may be `None` until those are implemented.
6+
7+
---
8+
9+
## Table of contents
10+
11+
- [Features](#features)
12+
- [Requirements](#requirements)
13+
- [Installation](#installation)
14+
- [API key setup](#api-key-setup)
15+
- [Quick start](#quick-start)
16+
- [Available benchmarks](#available-benchmarks)
17+
- [Usage guide](#usage-guide)
18+
- [Return values](#return-values)
19+
- [Project layout](#project-layout)
20+
- [Development](#development)
21+
- [Troubleshooting](#troubleshooting)
22+
- [Citations](#citations)
23+
- [License](#license)
24+
25+
---
26+
27+
## Features
28+
29+
- **OpenRouter integration** — Use any model listed on OpenRouter with a single model ID string.
30+
- **Optional reasoning mode** — Enable chain-of-thought style reasoning on supported models.
31+
- **Built-in benchmarks** — Predefined prompts for Meta-Chess (cognitive flexibility) and creative writing.
32+
- **Extensible pipeline** — Each benchmark wires together a prompt, parser, and scorer; you can add new ones in `runner.py`.
33+
- **Simple Python API**`AIModel.ask()` for one-off prompts, `run_benchmark()` for the full flow.
34+
35+
---
36+
37+
## Requirements
38+
39+
- **Python** 3.9 or newer
40+
- An **[OpenRouter](https://openrouter.ai/)** account and API key
41+
- Internet access when calling models (requests go to `https://openrouter.ai/api/v1`)
42+
43+
---
44+
45+
## Installation
46+
47+
### From a local clone (recommended for development)
48+
49+
```bash
50+
git clone https://github.com/Programming-The-Next-Step-2026/mini_benchmarks.git
51+
cd mini_benchmarks
52+
python -m venv .venv
53+
source .venv/bin/activate # Windows: .venv\Scripts\activate
54+
pip install -e .
55+
```
56+
57+
### With development tools (pytest)
58+
59+
```bash
60+
pip install -e ".[dev]"
61+
```
62+
63+
### Verify the install
64+
65+
```bash
66+
python -c "from miniBen import AIModel, run_benchmark, BENCHMARKS; print(list(BENCHMARKS))"
67+
```
68+
69+
You should see the benchmark keys printed without errors.
70+
71+
---
72+
73+
## API key setup
74+
75+
miniBen reads your key from the environment variable **`OPENROUTER_API_KEY`**.
76+
77+
### Option 1 — `.env` file (recommended)
78+
79+
Create a file named `.env` in the project root (it is gitignored by default):
80+
81+
```env
82+
OPENROUTER_API_KEY=sk-or-v1-your-key-here
83+
```
84+
85+
`python-dotenv` loads this automatically when you import `miniBen`.
86+
87+
### Option 2 — export in the shell
88+
89+
```bash
90+
export OPENROUTER_API_KEY="sk-or-v1-your-key-here"
91+
```
92+
93+
### Option 3 — interactive prompt
94+
95+
```python
96+
from miniBen import put_openrouter_api_key_into_env
97+
98+
put_openrouter_api_key_into_env() # prompts only if the key is missing
99+
```
100+
101+
### Check that the key is set
102+
103+
```python
104+
from miniBen import check_openrouter_api_key_exist
105+
106+
print(check_openrouter_api_key_exist()) # True if OPENROUTER_API_KEY is set
107+
```
108+
109+
Get a key from the [OpenRouter keys page](https://openrouter.ai/keys). Never commit your key to git.
110+
111+
---
112+
113+
## Quick start
114+
115+
### 1. Ask a model a single question
116+
117+
```python
118+
from miniBen import AIModel
119+
120+
model = AIModel("openrouter/free") # replace with any OpenRouter model ID
121+
content, reasoning = model.ask("Say hello in one sentence.", reasoning=False)
122+
123+
print("Answer:", content)
124+
if reasoning:
125+
print("Reasoning:", reasoning)
126+
```
127+
128+
### 2. Run a full benchmark
129+
130+
```python
131+
from miniBen import run_benchmark
132+
133+
results = run_benchmark(
134+
model_name="openrouter/free",
135+
benchmark_name="creativity",
136+
reasoning=True,
137+
)
138+
139+
print(results["score"])
140+
```
141+
142+
`run_benchmark` prints progress to the terminal and returns a dictionary (see [Return values](#return-values)).
143+
144+
---
145+
146+
## Available benchmarks
147+
148+
Use these exact strings as `benchmark_name` in `run_benchmark()`:
149+
150+
| Key | Display name | What it tests |
151+
|-----|----------------|---------------|
152+
| `cognitive flexibility` | Meta-Chess Game (Cognitive Flexibility) | Model must follow Meta-Chess rules and output structured move lists |
153+
| `creativity` | Creativity in story writing | Short story using the words *stamp*, *letter*, *send* |
154+
155+
156+
List keys programmatically:
157+
158+
```python
159+
from miniBen import BENCHMARKS
160+
161+
for key, meta in BENCHMARKS.items():
162+
print(key, "", meta["name"])
163+
```
164+
165+
Additional prompt placeholders exist in `prompts.py` (`humor`, `bullshit`) but are not registered in `BENCHMARKS` yet.
166+
167+
---
168+
169+
## Usage guide
170+
171+
### `AIModel`
172+
173+
Wraps the OpenRouter chat API.
174+
175+
```python
176+
from miniBen import AIModel
177+
178+
model = AIModel(model="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free")
179+
180+
content, reasoning_text = model.ask(
181+
prompt="How many r's are in the word strawberry?",
182+
reasoning=True, # default: True
183+
)
184+
```
185+
186+
| Parameter | Type | Description |
187+
|-----------|------|-------------|
188+
| `model` (constructor) | `str` | OpenRouter model ID, e.g. `openrouter/free` |
189+
| `prompt` | `str` | User message sent to the model |
190+
| `reasoning` | `bool` | Passes `extra_body={"reasoning": {"enabled": ...}}` to the API |
191+
192+
**Returns:** `(content, reasoning_text)` — answer text and optional reasoning trace (may be `None`).
193+
194+
### `run_benchmark`
195+
196+
Runs prompt → model → parse → score for one benchmark.
197+
198+
```python
199+
from miniBen import run_benchmark
200+
201+
results = run_benchmark(
202+
model_name="deepseek/deepseek-chat",
203+
benchmark_name="cognitive flexibility",
204+
reasoning=True,
205+
)
206+
```
207+
208+
| Parameter | Type | Description |
209+
|-----------|------|-------------|
210+
| `model_name` | `str` | OpenRouter model ID |
211+
| `benchmark_name` | `str` | Key from `BENCHMARKS` (see table above) |
212+
| `reasoning` | `bool` | Enable reasoning tokens when supported (default `True`) |
213+
214+
### Using prompts directly
215+
216+
```python
217+
from miniBen import AIModel, cog_flex, creativity
218+
219+
model = AIModel("openrouter/free")
220+
content, _ = model.ask(cog_flex, reasoning=True)
221+
```
222+
223+
### Auth helpers
224+
225+
| Function | Purpose |
226+
|----------|---------|
227+
| `check_openrouter_api_key_exist()` | Returns `True` if `OPENROUTER_API_KEY` is set |
228+
| `put_openrouter_api_key_into_env()` | Prompts for a key if missing and sets the env var |
229+
230+
---
231+
232+
## Return values
233+
234+
### `AIModel.ask()`
235+
236+
```python
237+
(content: str, reasoning_text: str | None)
238+
```
239+
240+
### `run_benchmark()`
241+
242+
```python
243+
{
244+
"model": str, # model_name you passed in
245+
"benchmark": str, # benchmark_name you passed in
246+
"raw_response": tuple, # (content, reasoning_text) from ask()
247+
"parsed": ..., # output of the benchmark parser (stub → None for now)
248+
"score": ..., # output of the benchmark scorer (stub → None for now)
249+
}
250+
```
251+
252+
---
253+
254+
## Project layout
255+
256+
```
257+
mini_benchmarks/
258+
├── src/miniBen/
259+
│ ├── __init__.py # Public exports
260+
│ ├── auth.py # API key helpers, load_dotenv
261+
│ ├── model.py # AIModel, OpenRouter client
262+
│ ├── prompts.py # Benchmark prompt strings
263+
│ ├── parsers.py # Parse raw model text
264+
│ ├── scorers.py # Score parsed output
265+
│ ├── runner.py # BENCHMARKS registry, run_benchmark()
266+
│ └── example.py # Re-exports (backward compatibility)
267+
├── tests/
268+
│ ├── test_model.py # tests for miniBen.model (AIModel.ask)
269+
│ └── test_runner.py # tests for miniBen.runner (run_benchmark)
270+
├── pyproject.toml
271+
├── README.md
272+
└── .env # You create this; not in git
273+
```
274+
275+
**Import style:** prefer `from miniBen import AIModel, run_benchmark`. Imports from `miniBen.example` still work but are equivalent to the package root.
276+
277+
---
278+
279+
## Development
280+
281+
### Run tests
282+
283+
From the repository root:
284+
285+
```bash
286+
pip install -e ".[dev]"
287+
pytest
288+
```
289+
290+
Tests mock the OpenRouter client so they do not use your API key or network.
291+
292+
### Add a new benchmark
293+
294+
1. Add a prompt string in `src/miniBen/prompts.py`.
295+
2. Implement `parse_*` and `score_*` in `parsers.py` and `scorers.py`.
296+
3. Register an entry in `BENCHMARKS` inside `src/miniBen/runner.py`.
297+
4. Export new symbols in `src/miniBen/__init__.py` if they should be public.
298+
299+
### Choosing a model
300+
301+
Browse models on [OpenRouter](https://openrouter.ai/models). Use the model slug exactly as shown (e.g. `openrouter/free`, `anthropic/claude-3.5-sonnet`). Free and reasoning-capable models vary; if reasoning fails, try `reasoning=False` or another model.
302+
303+
---
304+
305+
## Troubleshooting
306+
307+
| Problem | What to try |
308+
|---------|-------------|
309+
| `OPENROUTER_API_KEY` errors / 401 | Set the key in `.env` or the shell; run `check_openrouter_api_key_exist()` |
310+
| `KeyError` on `benchmark_name` | Use keys from [Available benchmarks](#available-benchmarks) exactly |
311+
| `ValueError: returned a None content block` | Model refused or returned empty output; try another model or shorter prompt |
312+
| `parsed` / `score` are always `None` | Expected until parsers and scorers are implemented |
313+
| Import errors after clone | Run `pip install -e .` from the repo root |
314+
| Rate limits | OpenRouter quota; wait or use a different model tier |
315+
316+
---
317+
318+
## License
319+
320+
This project is licensed under the **MIT License** — see [LICENSE](LICENSE).
321+
322+
When using upstream benchmarks, also follow the licenses of [creative-story-gen](https://github.com/mismayil/creative-story-gen) and [bullshit-benchmark](https://github.com/petergpt/bullshit-benchmark) (both are open source; check their repositories for the exact terms).
323+
324+
---
325+
326+
## Links
327+
328+
- **Repository:** [Programming-The-Next-Step-2026/mini_benchmarks](https://github.com/Programming-The-Next-Step-2026/mini_benchmarks)
329+
- **Issues:** [GitHub Issues](https://github.com/Programming-The-Next-Step-2026/mini_benchmarks/issues)
330+
- **OpenRouter:** [openrouter.ai](https://openrouter.ai/)

0 commit comments

Comments
 (0)