How To: Bring Your Own Dataset

Evaluate Model Router on your own prompts instead of (or in addition to) the included sample datasets. This is the single most important thing you can do to make results meaningful — Model Router's behaviour depends on the type of prompts you send it, so the most useful benchmark is one built from prompts that look like your real workload.

What's a dataset here? A collection of prompts, usually a plain-text file with one prompt per line (JSONL) or per row (CSV); a SQL table works too. Each prompt needs at least an id and a prompt field. Optional fields such as category and difficulty make the report richer but aren't required.

Quick recipe (most common case)

1. Copy `datasets/sample_custom.jsonl` as a template.
2. Replace its prompts with 50–500 real prompts from your workload.
3. Run: `python scripts/run_eval.py --dataset path/to/yours.jsonl`

The rest of this page is reference material for when you need more than that.

Supported formats: JSONL, CSV, and SQL databases (SQLite built-in, others via SQLAlchemy).

JSONL format (recommended)

One JSON object per line — easy to edit by hand, easy to generate from scripts:

{"id": "001", "prompt": "Explain quantum entanglement in simple terms."}
{"id": "002", "prompt": "Write a Python function to merge two sorted lists.", "category": "code_generation", "difficulty": "medium"}

CSV format

A spreadsheet-style file with a header row. Only id and prompt are required:

id,prompt,category,difficulty
001,Explain quantum entanglement in simple terms.,,
002,Write a Python function to merge two sorted lists.,code_generation,medium

Tip: If your prompts contain commas or newlines, quote them: "Hello, world". Excel and Google Sheets handle this automatically when you Save As CSV.
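
When generating the CSV from a script rather than a spreadsheet, Python's standard `csv` module applies the same quoting rule automatically. The sketch below assumes the four-column header shown above; the helper name `to_csv` and the sample rows are illustrative.

```python
import csv
import io

FIELDS = ["id", "prompt", "category", "difficulty"]

def to_csv(rows):
    """Return CSV text with a header row; missing optional fields become empty."""
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=FIELDS, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # fields containing commas or newlines are quoted
    return buf.getvalue()

# Illustrative rows: one prompt with a comma, one with an embedded newline
rows = [
    {"id": "001", "prompt": "Hello, world"},
    {"id": "002", "prompt": "Line one\nLine two",
     "category": "code_generation", "difficulty": "medium"},
]

csv_text = to_csv(rows)

# Round-trip check: reading the text back recovers the original prompts
parsed = list(csv.DictReader(io.StringIO(csv_text, newline="")))
print(parsed[0]["prompt"])
```

When writing to disk, open the file with `newline=""` and `encoding="utf-8"` so the `csv` module controls line endings itself.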

Database

Point --dataset at a SQL database with a connection string, naming the table with the ?table= query parameter as in the commands that follow. The table must have id and prompt columns.

# SQLite (built-in, no extra dependencies)
python scripts/run_eval.py --dataset "sqlite:///path/to/prompts.db?table=prompts"

# PostgreSQL, MySQL, etc. (requires: pip install -e ".[db]")
python scripts/run_eval.py --dataset "postgresql://user:pw@host/mydb?table=prompts"
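
For a quick local test, a SQLite table with the required columns could be created with Python's built-in `sqlite3` module. This is a sketch under assumptions: the document only states that `id` and `prompt` columns are required, so the `category` and `difficulty` columns, the column types, and the helper name `load_prompts` are illustrative guesses at a reasonable schema.

```python
import sqlite3

def load_prompts(conn, rows):
    """Create a `prompts` table (if needed) and insert (id, prompt, category, difficulty) rows."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS prompts (
            id TEXT PRIMARY KEY,      -- PRIMARY KEY rejects duplicate ids
            prompt TEXT NOT NULL,
            category TEXT,            -- optional; assumed column name
            difficulty TEXT           -- optional; assumed column name
        )
        """
    )
    conn.executemany(
        "INSERT INTO prompts (id, prompt, category, difficulty) VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()

# ":memory:" keeps this example self-contained; use a file path such as
# "prompts.db" to produce a database you can pass to --dataset.
conn = sqlite3.connect(":memory:")
load_prompts(conn, [
    ("001", "Explain quantum entanglement in simple terms.", None, None),
    ("002", "Write a Python function to merge two sorted lists.",
     "code_generation", "medium"),
])

count = conn.execute("SELECT COUNT(*) FROM prompts").fetchone()[0]
print(count)
```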

Field reference

| Field | Required | Type | Description |
|---|---|---|---|
| `id` | Yes | string | Unique identifier for each prompt; used to track results and resume runs |
| `prompt` | Yes | string | The text sent to both endpoints |
| `category` | No | string | Grouping label (e.g. `code_generation`, `summarization`); enables per-category charts |
| `difficulty` | No | string | `easy`, `medium`, or `hard`; used for difficulty breakdowns |
| `ground_truth` | No | string | Reference answer (reserved for future use) |
| `metadata` | No | object | Arbitrary key-value pairs for your own tracking |

Tips for good datasets

Use categories

Categories enable per-category breakdowns in the report — for example, you can see whether Model Router is faster on summarization than on code generation. Use consistent labels:

{"id": "c01", "prompt": "Write a binary search in Python.", "category": "code_generation"}
{"id": "c02", "prompt": "Explain TCP vs UDP.", "category": "technical_knowledge"}
{"id": "c03", "prompt": "Summarize this article: ...", "category": "summarization"}

Mix difficulty levels

Include easy, medium, and hard prompts. Model Router's value proposition ("send easy prompts to cheap models, hard ones to capable models") only shows up when your dataset has a mix.
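
To see whether a dataset actually has a mix of labels before running it, the counts per category and per difficulty can be tallied in a few lines. This is a sketch, not the tool's own report logic: the helper name `label_counts`, the `(unlabeled)` bucket, and the sample records are illustrative.

```python
import json
from collections import Counter

def label_counts(jsonl_text, field):
    """Count how many records carry each value of `field` in JSONL text."""
    counts = Counter()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        # Records without the optional field are grouped under "(unlabeled)"
        counts[record.get(field, "(unlabeled)")] += 1
    return counts

# Illustrative sample; the last record deliberately has no difficulty
sample = "\n".join([
    '{"id": "cs-001", "prompt": "How do I reset my password?", "category": "account", "difficulty": "easy"}',
    '{"id": "cs-002", "prompt": "My payment was charged twice. What should I do?", "category": "billing", "difficulty": "medium"}',
    '{"id": "cs-004", "prompt": "Where can I download my invoice?", "category": "billing"}',
])

category_counts = label_counts(sample, "category")
difficulty_counts = label_counts(sample, "difficulty")
print(dict(category_counts))
print(dict(difficulty_counts))
```

A dataset where one difficulty dominates, or where most records land in `(unlabeled)`, is a sign the difficulty breakdown in the report will not say much.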

Use realistic prompts

Prompts that match your production workload give the most useful answer. Toy examples like "What is 2 + 2?" tell you very little about how Model Router will behave on your real traffic.

Size recommendations

| Prompts | Use case |
|---|---|
| 10–50 | Quick smoke test; directional only, not statistically reliable |
| 100–500 | Meaningful comparison; can spot real differences |
| 1,000+ | Production-grade benchmark; use `configs/large_scale.yaml` |

Run with your dataset

# JSONL
python scripts/run_eval.py --dataset path/to/my_prompts.jsonl

# CSV
python scripts/run_eval.py --dataset path/to/my_prompts.csv

# SQLite database
python scripts/run_eval.py --dataset "sqlite:///prompts.db?table=prompts"

# Subset from any source (great for first runs)
python scripts/run_eval.py --dataset my_prompts.csv --sample-size 100

Validate before running

python scripts/run_eval.py --dataset my_prompts.jsonl --dry-run

This prints:

- Number of prompts loaded
- Category distribution
- Any validation errors (missing id, missing prompt, duplicate IDs)

No API calls are made — fix any errors here before paying for a real run.
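
If you generate datasets in a pipeline and want the same kinds of checks earlier, they can be approximated in plain Python. The sketch below is not the implementation behind `--dry-run`; it only mirrors the three error types listed above, and the helper name `validate_jsonl` and the message wording are assumptions.

```python
import json

def validate_jsonl(jsonl_text):
    """Return a list of human-readable problems found in JSONL text."""
    errors = []
    seen_ids = set()
    for lineno, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            errors.append(f"line {lineno}: expected a JSON object")
            continue
        record_id = record.get("id")
        if record_id is None or record_id == "":
            errors.append(f"line {lineno}: missing id")
        elif str(record_id) in seen_ids:
            errors.append(f"line {lineno}: duplicate id {str(record_id)!r}")
        else:
            seen_ids.add(str(record_id))
        if not record.get("prompt"):
            errors.append(f"line {lineno}: missing prompt")
    return errors

# Illustrative input containing one of each problem
sample = "\n".join([
    '{"id": "a1", "prompt": "How do I reset my password?"}',
    '{"id": "a1", "prompt": "Duplicate id on purpose."}',
    '{"id": "a2"}',
    '{"prompt": "No id here."}',
])

errors = validate_jsonl(sample)
for message in errors:
    print(message)
```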

Example: customer support dataset

{"id": "cs-001", "prompt": "How do I reset my password?", "category": "account", "difficulty": "easy"}
{"id": "cs-002", "prompt": "My payment was charged twice. What should I do?", "category": "billing", "difficulty": "medium"}
{"id": "cs-003", "prompt": "Explain the difference between your Enterprise and Pro plans including all feature comparisons, pricing tiers, and migration paths.", "category": "product", "difficulty": "hard"}

Example: code evaluation dataset

{"id": "code-001", "prompt": "Implement an LRU cache in Python with O(1) get and put.", "category": "algorithms", "difficulty": "hard"}
{"id": "code-002", "prompt": "Write a SQL query to find employees who earn more than their manager.", "category": "sql", "difficulty": "medium"}
{"id": "code-003", "prompt": "Convert this JavaScript callback to use async/await: ...", "category": "refactoring", "difficulty": "easy"}
