The eval harness

LLM-driven systems regress in ways unit tests don't catch: the prompt drifts, the tool schema changes upstream, a model upgrade subtly changes behaviour. The eval harness is the regression net — golden cases that exercise the agent end-to-end and report accuracy by category and difficulty.

Layout

src/eval/
├── models.py        # EvalCase, EvalResult (Pydantic)
├── runner.py        # EvalRunner — generic, takes a Callable[[str], str]
├── judge.py         # LLMClient Protocol + semantic-similarity judge
├── report.py        # Markdown report generator
└── __main__.py      # python -m src.eval

eval/
├── golden_qa.json   # The dataset (one trivial example case ships)
└── test_golden_qa.py  # Parametrised pytest runner

How it works

  1. The runner loads eval/golden_qa.json into a list of EvalCases.
  2. For each case, it calls the configured answer_fn(question) -> str.
  3. It compares the actual answer to the expected one using one of three tolerance modes:
    • exact_match — normalised string equality (lowercased, whitespace-collapsed).
    • numeric_close — extracts numbers from both sides; passes if any extracted number is within 1 % of the expected. Filters year-like values (2020-2029) so a question about a year doesn't accidentally provide the comparison target.
    • semantic_similar — calls an LLM judge (src/eval/judge.py) that scores 0.0–1.0; passes at ≥ 0.8.
  4. It returns a list of EvalResults; src/eval/report.py produces a markdown summary.
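The two deterministic tolerance modes described above can be sketched as follows. This is a minimal illustration, not the code in src/eval/runner.py: the helper names (normalise, extract_numbers, and so on) are hypothetical, and the sketch assumes non-negative numbers, strips thousands separators naively, and treats the first number in the expected answer as the comparison target.

```python
import re

# Year-like values (2020-2029) are dropped before comparing, per the description above.
YEAR_RANGE = range(2020, 2030)
# Assumption: only non-negative integers and decimals are of interest.
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def normalise(text: str) -> str:
    """Lowercase and collapse runs of whitespace into single spaces."""
    return " ".join(text.lower().split())


def exact_match(actual: str, expected: str) -> bool:
    """Normalised string equality."""
    return normalise(actual) == normalise(expected)


def extract_numbers(text: str) -> list[float]:
    """Pull numbers out of free text, skipping year-like integers."""
    values = []
    # Naive simplification: remove commas so "1,005" parses as 1005.
    for token in NUMBER_RE.findall(text.replace(",", "")):
        value = float(token)
        if value.is_integer() and int(value) in YEAR_RANGE:
            continue
        values.append(value)
    return values


def numeric_close(actual: str, expected: str, rel_tol: float = 0.01) -> bool:
    """Pass if any number in `actual` is within rel_tol (1%) of the target."""
    expected_numbers = extract_numbers(expected)
    if not expected_numbers:
        return False
    target = expected_numbers[0]  # assumption: first number is the target
    return any(
        abs(number - target) <= rel_tol * abs(target)
        for number in extract_numbers(actual)
    )
```

With this sketch, an answer such as "Revenue in 2023 was about 1,005 units" passes against an expected answer of "1000": 2023 is filtered out as year-like, and 1005 is within 1% of 1000.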

Wiring your agent

The runner doesn't know about your agent loop. Pass any Callable[[str], str]:

from src.eval.runner import EvalRunner

def my_agent(question: str) -> str:
    # Hit your agent loop / LLM client here.
    return ...

runner = EvalRunner(answer_fn=my_agent)
results = runner.evaluate_all()

For the LLM judge (semantic_similar cases), implement the LLMClient Protocol from src/eval/judge.py:

class MyLLMAdapter:
    def complete_json(self, *, model: str, prompt: str) -> str:
        # Hit your provider, return raw JSON body.
        ...

runner = EvalRunner(
    answer_fn=my_agent,
    judge_client=MyLLMAdapter(),
    judge_model="gpt-4o-mini",
)

If judge_client is None (the default), semantic_similar cases pass with score=None and the reason "no LLM client configured". The result is treated as inconclusive rather than as a failure, which keeps the harness usable without LLM credentials.
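For offline experiments, a canned client that satisfies the Protocol can stand in for a real provider. The sketch below is illustrative only: the Protocol is restated from the signature shown above, while CannedJudge, judge_semantic, the prompt wording, and the "score" and "reason" JSON keys are assumptions rather than the actual contract in src/eval/judge.py.

```python
import json
from typing import Optional, Protocol

PASS_THRESHOLD = 0.8  # semantic_similar passes at >= 0.8, per the text above


class LLMClient(Protocol):
    """Structural type: any object with this method qualifies."""

    def complete_json(self, *, model: str, prompt: str) -> str: ...


class CannedJudge:
    """Hypothetical offline stand-in that always returns the same verdict."""

    def __init__(self, score: float, reason: str) -> None:
        # Assumed response shape: {"score": float, "reason": str}
        self._body = json.dumps({"score": score, "reason": reason})

    def complete_json(self, *, model: str, prompt: str) -> str:
        return self._body


def judge_semantic(
    client: Optional[LLMClient], model: str, actual: str, expected: str
) -> tuple[bool, Optional[float], str]:
    """Return (passed, score, reason) for one semantic_similar comparison."""
    if client is None:
        # Inconclusive, not a failure: pass with no score.
        return True, None, "no LLM client configured"
    prompt = (
        "Score from 0.0 to 1.0 how closely the actual answer matches the "
        f"expected one.\nExpected: {expected}\nActual: {actual}"
    )
    verdict = json.loads(client.complete_json(model=model, prompt=prompt))
    score = float(verdict["score"])
    return score >= PASS_THRESHOLD, score, str(verdict.get("reason", ""))
```

Because LLMClient is a Protocol, CannedJudge does not need to inherit from it; matching the method signature is enough for a type checker to accept it.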

Adding a case

{
  "id": "unique-kebab-id",
  "question": "...",
  "category": "category-name",
  "expected_answer": "...",
  "tolerance": "exact_match" | "numeric_close" | "semantic_similar",
  "difficulty": "easy" | "medium" | "hard",
  "notes": "Why this case earns a slot."
}

category and difficulty default to "general" and "easy"; explicit values are recommended once you have more than a handful of cases so the report breaks down meaningfully.
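To show how those defaults feed the per-category breakdown, here is a small stdlib sketch. It is not the Pydantic EvalCase from src/eval/models.py: the Case dataclass, load_cases, and accuracy_by are hypothetical names, and the sketch assumes the dataset is a JSON array of case objects and that tolerance is a required field.

```python
import json
from collections import defaultdict
from dataclasses import dataclass

TOLERANCES = {"exact_match", "numeric_close", "semantic_similar"}
DIFFICULTIES = {"easy", "medium", "hard"}


@dataclass(frozen=True)
class Case:
    """Stand-in for EvalCase with the documented defaults."""

    id: str
    question: str
    expected_answer: str
    tolerance: str
    category: str = "general"
    difficulty: str = "easy"
    notes: str = ""

    def __post_init__(self) -> None:
        # Reject values outside the documented choices.
        if self.tolerance not in TOLERANCES:
            raise ValueError(f"unknown tolerance: {self.tolerance!r}")
        if self.difficulty not in DIFFICULTIES:
            raise ValueError(f"unknown difficulty: {self.difficulty!r}")


def load_cases(raw_json: str) -> list[Case]:
    """Parse a JSON array of case objects (assumed file shape)."""
    return [Case(**item) for item in json.loads(raw_json)]


def accuracy_by(
    field: str, cases: list[Case], passed: dict[str, bool]
) -> dict[str, float]:
    """Fraction of passing cases, grouped by 'category' or 'difficulty'."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for case in cases:
        key = getattr(case, field)
        totals[key] += 1
        hits[key] += int(passed[case.id])
    return {key: hits[key] / totals[key] for key in totals}
```

If every case is left on the defaults, accuracy_by("category", ...) returns a single "general" bucket, which is why explicit values pay off as the dataset grows.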

Running the harness

Locally:

uv run pytest eval/             # pytest runner with the marker
python -m src.eval               # CLI runner — prints the markdown report

The eval tests are marked @pytest.mark.eval, so the default pytest tests/ run skips them.

Nightly opt-in

.github/workflows/eval-nightly.yml ships workflow_dispatch-only by default to avoid accidental LLM API spend. To turn on a real nightly:

  1. Add the LLM secrets in repo settings: LLM_API_KEY (required), LLM_PROVIDER, LLM_BASE_URL, LLM_MODEL (optional, depending on adapter).

  2. Replace the workflow's on: block with:

    on:
      schedule:
        - cron: "0 6 * * *"   # daily 06:00 UTC
      workflow_dispatch:

  3. Confirm eval-nightly.yml is still in EXEMPT_WORKFLOWS in .github/scripts/check_required_contexts.py (it should be, since scheduled runs never gate PRs).

That's the full opt-in. Reverting is a one-line change back to workflow_dispatch: only.