The eval harness

LLM-driven systems regress in ways unit tests don't catch: the prompt drifts, the tool schema changes upstream, a model upgrade subtly changes behaviour. The eval harness is the regression net — golden cases that exercise the agent end-to-end and report accuracy by category and difficulty.

Layout

src/eval/
├── models.py        # EvalCase, EvalResult (Pydantic)
├── runner.py        # EvalRunner — generic, takes a Callable[[str], str]
├── judge.py         # LLMClient Protocol + semantic-similarity judge
├── report.py        # Markdown report generator
└── __main__.py      # python -m src.eval

eval/
├── golden_qa.json   # The dataset (one trivial example case ships)
└── test_golden_qa.py  # Parametrised pytest runner

How it works

The runner loads eval/golden_qa.json into a list of EvalCases.
For each case, it calls the configured answer_fn(question) -> str.
It compares the actual answer to the expected one using one of three tolerance modes:
- exact_match — normalised string equality (lowercased, whitespace-collapsed).
- numeric_close — extracts numbers from both sides; passes if any extracted number is within 1% of the expected value. Year-like values (2020-2029) are filtered out so that a year mentioned in a question-style answer doesn't accidentally become the comparison target.
- semantic_similar — calls an LLM judge (src/eval/judge.py) that scores 0.0–1.0; passes at ≥ 0.8.
It returns a list of EvalResults; src/eval/report.py produces a markdown summary.
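
The two deterministic tolerance modes can be sketched roughly as follows. This is an illustration of the behaviour described above, not the code in src/eval/runner.py: the function names, the number-extracting regex, and the handling of thousands separators are all assumptions.

```python
import re

# Assumption: "year-like" means an integer from 2020 to 2029 inclusive.
YEAR_RANGE = range(2020, 2030)
# Assumption: plain decimal numbers with an optional leading minus sign.
NUMBER_RE = re.compile(r"(?<!\w)-?\d+(?:\.\d+)?")


def normalise(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def exact_match(expected: str, actual: str) -> bool:
    """Normalised string equality."""
    return normalise(expected) == normalise(actual)


def extract_numbers(text: str) -> list[float]:
    """Pull numbers out of free text, dropping year-like integers."""
    values = []
    # Strip thousands separators so "1,250.5" parses as one number.
    for token in NUMBER_RE.findall(text.replace(",", "")):
        value = float(token)
        if value.is_integer() and int(value) in YEAR_RANGE:
            continue
        values.append(value)
    return values


def numeric_close(expected: str, actual: str, rel_tol: float = 0.01) -> bool:
    """Pass if any number in the answer is within rel_tol of an expected number."""
    targets = extract_numbers(expected)
    candidates = extract_numbers(actual)
    return any(
        abs(candidate - target) <= rel_tol * abs(target)
        for target in targets
        for candidate in candidates
    )
```

With this sketch, an expected answer of "100" accepts "The total is 100.5 units" but rejects "about 102", and a stray "2024" in the answer is never compared against anything.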

Wiring your agent

The runner doesn't know about your agent loop. Pass any Callable[[str], str]:

from src.eval.runner import EvalRunner

def my_agent(question: str) -> str:
    # Hit your agent loop / LLM client here.
    return ...

runner = EvalRunner(answer_fn=my_agent)
results = runner.evaluate_all()

For the LLM judge (semantic_similar cases), implement the LLMClient Protocol from src/eval/judge.py:

class MyLLMAdapter:
    def complete_json(self, *, model: str, prompt: str) -> str:
        # Hit your provider, return raw JSON body.
        ...

runner = EvalRunner(
    answer_fn=my_agent,
    judge_client=MyLLMAdapter(),
    judge_model="gpt-4o-mini",
)

If judge_client=None (default), semantic_similar cases pass with score=None and reason "no LLM client configured" — inconclusive, not a failure. That keeps the harness usable without LLM credentials.
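
To make the judge path concrete, here is a minimal sketch of how a semantic_similar verdict might be computed, including the no-client fallback. Only the complete_json signature, the 0.8 threshold, and the fallback reason come from this document. The judge_semantic function, the prompt text, the FakeJudgeClient class, and the {"score": ..., "reason": ...} JSON shape are hypothetical; check src/eval/judge.py for the real contract.

```python
import json
from typing import Optional, Protocol, Tuple


class LLMClient(Protocol):
    def complete_json(self, *, model: str, prompt: str) -> str: ...


class FakeJudgeClient:
    """Hypothetical offline stand-in that always returns a fixed verdict."""

    def __init__(self, score: float) -> None:
        self.score = score

    def complete_json(self, *, model: str, prompt: str) -> str:
        # Assumed response shape: a JSON object with "score" and "reason".
        return json.dumps({"score": self.score, "reason": "canned verdict"})


def judge_semantic(
    expected: str,
    actual: str,
    client: Optional[LLMClient],
    model: str = "gpt-4o-mini",
    threshold: float = 0.8,
) -> Tuple[bool, Optional[float], str]:
    """Return (passed, score, reason) for one semantic_similar case."""
    if client is None:
        # Inconclusive rather than failed: no credentials, no verdict.
        return True, None, "no LLM client configured"
    prompt = (
        "Score from 0.0 to 1.0 how closely the actual answer matches the "
        f"expected one.\nExpected: {expected}\nActual: {actual}"
    )
    verdict = json.loads(client.complete_json(model=model, prompt=prompt))
    score = float(verdict["score"])
    return score >= threshold, score, str(verdict.get("reason", ""))
```

A fake client like this is also a cheap way to exercise the semantic_similar path in unit tests without spending on API calls.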

Adding a case

{
  "id": "unique-kebab-id",
  "question": "...",
  "category": "category-name",
  "expected_answer": "...",
  "tolerance": "exact_match" | "numeric_close" | "semantic_similar",
  "difficulty": "easy" | "medium" | "hard",
  "notes": "Why this case earns a slot."
}

category and difficulty default to "general" and "easy"; explicit values are recommended once you have more than a handful of cases so the report breaks down meaningfully.
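
The effect of the defaults on the report can be seen in a small sketch. It uses plain dicts instead of the Pydantic EvalCase model, and both helper names and the set of required fields are assumptions made for illustration.

```python
from collections import defaultdict

DEFAULTS = {"category": "general", "difficulty": "easy"}
TOLERANCES = {"exact_match", "numeric_close", "semantic_similar"}
# Assumption: these four fields are mandatory; the real model may differ.
REQUIRED = ("id", "question", "expected_answer", "tolerance")


def normalise_case(raw: dict) -> dict:
    """Validate a raw case and fill in the default category and difficulty."""
    missing = [key for key in REQUIRED if key not in raw]
    if missing:
        raise ValueError(f"case is missing fields: {missing}")
    if raw["tolerance"] not in TOLERANCES:
        raise ValueError(f"unknown tolerance: {raw['tolerance']!r}")
    return {**DEFAULTS, **raw}


def accuracy_by(cases: list, passed_ids: set, key: str) -> dict:
    """Group cases by `key` and return the pass rate of each group."""
    totals = defaultdict(lambda: [0, 0])  # group -> [passed, total]
    for case in cases:
        bucket = totals[case[key]]
        bucket[0] += case["id"] in passed_ids
        bucket[1] += 1
    return {name: hits / count for name, (hits, count) in totals.items()}
```

If every case falls back to "general" and "easy", each breakdown collapses into a single row, which is why explicit values pay off as the dataset grows.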

Running the harness

Locally:

uv run pytest eval/             # pytest runner with the marker
python -m src.eval               # CLI runner — prints the markdown report

The tests in eval/ carry the @pytest.mark.eval marker and live outside tests/, so the default pytest tests/ run does not pick them up.

Nightly opt-in

.github/workflows/eval-nightly.yml ships workflow_dispatch-only by default to avoid accidental LLM API spend. To turn on a real nightly:

Add the LLM secrets in repo settings: LLM_API_KEY (required), LLM_PROVIDER, LLM_BASE_URL, LLM_MODEL (optional, depending on adapter).

Replace the workflow's on: block with:

on:
  schedule:
    - cron: "0 6 * * *"   # daily 06:00 UTC
  workflow_dispatch:

Confirm eval-nightly.yml is still in EXEMPT_WORKFLOWS in .github/scripts/check_required_contexts.py (it should be — scheduled runs never gate PRs).

That's the full opt-in. Reverting is a one-line change back to workflow_dispatch: only.
