refactor(gatekeeper): async eval runner with structured classes and stats tracking#479
Conversation
owtaylor
commented
May 24, 2026
- Make check_run_script async (acompletion) and add check_run_script_with_stats returning GatekeeperStats (tokens, cost, latency) alongside results
- Introduce EvalSuite and FileEval classes in run-eval.py, replacing global variables and eliminating duplicate single-file/multi-file code paths
- Load all test cases up front so progress output shows global numbering (e.g. [3/121]) as individual evals complete
- Add StatsAggregator dataclass for cleaner inference statistics reporting
- Add GatekeeperException with optional stats for timeout, output limit, and parse failures
- Add gatekeeper.cost config option for custom per-token cost accounting
- Switch summary/stats tables from plain text to rich.Table
|
For team members: test commit |
51e6a98 to
35adcb9
Compare
|
For team members: test commit |
35adcb9 to
914b391
Compare
|
For team members: test commit |
Codecov Report✅ All modified and coverable lines are covered by tests.
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
914b391 to
f5f6da6
Compare
|
For team members: test commit |
Jazzcort
left a comment
There was a problem hiding this comment.
LGTM! Comments are just some of my thoughts! 😁
|
|
||
|
|
||
| # Maximum number of completion tokens (including reasoning) | ||
| GATEKEEPER_MAX_TOKENS = 8000 |
There was a problem hiding this comment.
Not sure if it makes sense that our users would like modify these values. These numbers are pretty reasonable and shouldn't be hit quite often. 😁
There was a problem hiding this comment.
These are basically just there as safety fallbacks to avoid the model running on forever - I don't immediately a use case for modifying them:
- There's no reason to make them bigger.
- If a user is actually hitting them enough to care, they will be having a bad day.
But we can always add configuration later if it seems useful.
| stats: list[GatekeeperStats] | ||
| _values: dict[str, list[float]] = field(init=False, default_factory=dict) | ||
|
|
||
| def values(self, name: FieldName): |
There was a problem hiding this comment.
I like this lazy processing concept!
|
|
||
| expected_status = expected.get("status") | ||
| actual_status = actual.get("status") | ||
| totals = [0 for i in range(len(columns))] |
There was a problem hiding this comment.
Find a perfect spot to use totals = [0] * len(colums)! 😁
There was a problem hiding this comment.
Where intuition fails - my intuition was that is clearer, but would be a bit slower, though not enough to matter. ("It's going to to allocate a bunch of lists, then append them").
>>> import timeit
>>> timeit.timeit(lambda: [0] * 1000, number=10000)
0.02545745200040983
>>> timeit.timeit(lambda: [0 for i in range(1000)], number=10000)
0.6684908530005487
But it actually seems well optimized, and avoiding Python in the inner loop is a huge win. (Not that it matters.)
There was a problem hiding this comment.
Thanks for even testing the performance of it! I'm surprised a performance difference exists—I thought it was just a different way to do the same thing 😆 The usage is actually pretty limited since this only works when setting up a default list with immutable objects, so this is the perfect place to keep it. 😁
| raise GatekeeperException("Failed to parse gatekeeper model output", stats=stats) from e | ||
|
|
||
|
|
||
| async def check_run_script(description: str, script_type: str, script: str, *, readonly: bool) -> GatekeeperResult: |
There was a problem hiding this comment.
It just bugs me a little bit that when users are using our mcp server and these stats are generated and never got used with this wrapper solution. But actually, it does not harm the performance since getting these stats does not introduce any heavy computation burden.
If we later on want to avoid getting these stats in every check_run_script call, we can either have a function overload
@overload
async def check_run_script(
description: str, script_type: str, script: str, *, readonly: bool, with_stats: Literal[False]
) -> GatekeeperResult: ...
@overload
async def check_run_script(
description: str, script_type: str, script: str, *, readonly: bool, with_stats: Literal[True]
) -> tuple[GatekeeperResult, GatekeeperStats]: ...or have a hidden private attribute in GatekeeperResult that does not get serialized.
class GatekeeperResult(BaseModel):
status: GatekeeperStatus
detail: str = ""
_stats: Optional["GatekeeperStats"] = PrivateAttr(default=None)
...😁
There was a problem hiding this comment.
Definitely possible alternate structures ... might even be better! Will stick to the current setup for now.
f5f6da6 to
02c1c99
Compare
|
Pushed a new version with |
…tats tracking - Make check_run_script async (acompletion) and add check_run_script_with_stats returning GatekeeperStats (tokens, cost, latency) alongside results - Introduce EvalSuite and FileEval classes in run-eval.py, replacing global variables and eliminating duplicate single-file/multi-file code paths - Load all test cases up front so progress output shows global numbering (e.g. [3/121]) as individual evals complete - Add StatsAggregator dataclass for cleaner inference statistics reporting - Add GatekeeperException with optional stats for timeout, output limit, and parse failures - Add gatekeeper.cost config option for custom per-token cost accounting - Switch summary/stats tables from plain text to rich.Table. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
02c1c99 to
507296f
Compare