LLM-driven systems regress in ways unit tests don't catch: the prompt drifts, the tool schema changes upstream, a model upgrade subtly changes behaviour. The eval harness is the regression net — golden cases that exercise the agent end-to-end and report accuracy by category and difficulty.
src/eval/
├── models.py # EvalCase, EvalResult (Pydantic)
├── runner.py # EvalRunner — generic, takes a Callable[[str], str]
├── judge.py # LLMClient Protocol + semantic-similarity judge
├── report.py # Markdown report generator
└── __main__.py # python -m src.eval
eval/
├── golden_qa.json # The dataset (one trivial example case ships)
└── test_golden_qa.py # Parametrised pytest runner
- The runner loads eval/golden_qa.json into a list of EvalCases.
- For each case, it calls the configured answer_fn(question) -> str.
- It compares the actual answer to the expected one using one of three tolerance modes:
  - exact_match: normalised string equality (lowercased, whitespace-collapsed).
  - numeric_close: extracts numbers from both sides; passes if any extracted number is within 1% of the expected. Filters year-like values (2020-2029) so a question about a year doesn't accidentally provide the comparison target.
  - semantic_similar: calls an LLM judge (src/eval/judge.py) that scores 0.0–1.0; passes at ≥ 0.8.
- It returns a list of EvalResults; src/eval/report.py produces a markdown summary.
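The two deterministic tolerance modes can be sketched roughly as follows. This is an illustration of the rules described above, not the code in src/eval/runner.py: the function names, the regular expression, and the thousands-separator handling are all assumptions.

```python
import re

# Hypothetical helpers; the real runner may differ in names and details.
_THOUSANDS_SEP = re.compile(r"(?<=\d),(?=\d{3})")
# The lookbehind stops a range like "2020-2029" from yielding a negative number.
_NUMBER = re.compile(r"(?<![\d.])-?\d+(?:\.\d+)?")


def normalise(text: str) -> str:
    """Lowercase and collapse runs of whitespace to single spaces."""
    return " ".join(text.lower().split())


def exact_match(actual: str, expected: str) -> bool:
    return normalise(actual) == normalise(expected)


def extract_numbers(text: str) -> list[float]:
    """Pull numbers out of free text, dropping year-like values (2020-2029)."""
    cleaned = _THOUSANDS_SEP.sub("", text)
    values = [float(match) for match in _NUMBER.findall(cleaned)]
    return [v for v in values if not (v.is_integer() and 2020 <= v <= 2029)]


def numeric_close(actual: str, expected: str, rel_tol: float = 0.01) -> bool:
    """Pass if any number in the answer is within rel_tol of an expected number."""
    targets = extract_numbers(expected)
    candidates = extract_numbers(actual)
    return any(
        abs(candidate - target) <= rel_tol * abs(target)
        for target in targets
        for candidate in candidates
    )
```

With this sketch, an answer such as "Revenue in 2023 was 1,005 units" passes against an expected value of "1000", because 2023 is discarded as year-like and 1005 falls inside the 1% band.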
The runner doesn't know about your agent loop. Pass any Callable[[str], str]:
    from src.eval.runner import EvalRunner

    def my_agent(question: str) -> str:
        # Hit your agent loop / LLM client here.
        return ...

    runner = EvalRunner(answer_fn=my_agent)
    results = runner.evaluate_all()

For the LLM judge (semantic_similar cases), implement the LLMClient Protocol from src/eval/judge.py:
    class MyLLMAdapter:
        def complete_json(self, *, model: str, prompt: str) -> str:
            # Hit your provider, return raw JSON body.
            ...

    runner = EvalRunner(
        answer_fn=my_agent,
        judge_client=MyLLMAdapter(),
        judge_model="gpt-4o-mini",
    )

If judge_client=None (default), semantic_similar cases pass with score=None and reason "no LLM client configured": inconclusive, not a failure. That keeps the harness usable without LLM credentials.
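For offline tests of the judge path, a canned adapter that satisfies the same method signature may be enough. The sketch below redefines the Protocol locally so it runs on its own. The JSON shape ({"score": ..., "reason": ...}) and the names CannedJudge and judge_passes are assumptions made for illustration; check src/eval/judge.py for the fields the real judge expects.

```python
import json
from typing import Protocol


class LLMClient(Protocol):
    """Local mirror of the Protocol signature shown above."""

    def complete_json(self, *, model: str, prompt: str) -> str: ...


class CannedJudge:
    """Hypothetical stand-in that returns a fixed verdict without any network call."""

    def __init__(self, score: float, reason: str = "canned") -> None:
        self._score = score
        self._reason = reason

    def complete_json(self, *, model: str, prompt: str) -> str:
        # Assumed response shape; the real judge may use different keys.
        return json.dumps({"score": self._score, "reason": self._reason})


def judge_passes(
    client: LLMClient, *, model: str, prompt: str, threshold: float = 0.8
) -> bool:
    """Parse the raw JSON body and apply the 0.8 pass threshold."""
    payload = json.loads(client.complete_json(model=model, prompt=prompt))
    return float(payload["score"]) >= threshold
```

Because Protocol matching is structural, CannedJudge does not need to inherit from LLMClient; having a complete_json method with the right signature is sufficient.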
    {
      "id": "unique-kebab-id",
      "question": "...",
      "category": "category-name",
      "expected_answer": "...",
      "tolerance": "exact_match" | "numeric_close" | "semantic_similar",
      "difficulty": "easy" | "medium" | "hard",
      "notes": "Why this case earns a slot."
    }

category and difficulty default to "general" and "easy"; explicit values are recommended once you have more than a handful of cases so the report breaks down meaningfully.
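To make the defaults concrete, here is a stdlib-only stand-in for the case schema. The real EvalCase is a Pydantic model in src/eval/models.py; the class name CaseSketch, the validation in __post_init__, the empty default for notes, and the assumption that the dataset file is a JSON array are all illustrative.

```python
import json
from dataclasses import dataclass

TOLERANCES = {"exact_match", "numeric_close", "semantic_similar"}
DIFFICULTIES = {"easy", "medium", "hard"}


@dataclass(frozen=True)
class CaseSketch:
    """Hypothetical dataclass mirroring the fields of the schema above."""

    id: str
    question: str
    expected_answer: str
    tolerance: str
    category: str = "general"  # default stated in the text above
    difficulty: str = "easy"   # default stated in the text above
    notes: str = ""            # assumed optional

    def __post_init__(self) -> None:
        if self.tolerance not in TOLERANCES:
            raise ValueError(f"unknown tolerance: {self.tolerance!r}")
        if self.difficulty not in DIFFICULTIES:
            raise ValueError(f"unknown difficulty: {self.difficulty!r}")


def load_cases(raw_json: str) -> list[CaseSketch]:
    """Parse a JSON array of case objects (assumed top-level shape)."""
    return [CaseSketch(**item) for item in json.loads(raw_json)]
```

A case that omits category and difficulty loads with "general" and "easy", while an unrecognised tolerance raises ValueError at load time rather than surfacing later during evaluation.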
Locally:
    uv run pytest eval/     # pytest runner with the marker
    python -m src.eval      # CLI runner; prints the markdown report

The pytest invocation is marked @pytest.mark.eval, so the default pytest tests/ skips it.
.github/workflows/eval-nightly.yml ships workflow_dispatch-only by default to avoid accidental LLM API spend. To turn on a real nightly:
- Add the LLM secrets in repo settings: LLM_API_KEY (required), LLM_PROVIDER, LLM_BASE_URL, LLM_MODEL (optional, depending on adapter).
- Replace the workflow's on: block with:

      on:
        schedule:
          - cron: "0 6 * * *"  # daily 06:00 UTC
        workflow_dispatch:

- Confirm eval-nightly.yml is still in EXEMPT_WORKFLOWS in .github/scripts/check_required_contexts.py (it should be, since scheduled runs never gate PRs).
That's the full opt-in. Reverting is a one-line change back to workflow_dispatch: only.