Skip to content

refactor(gatekeeper): async eval runner with structured classes and stats tracking#479

Merged
owtaylor merged 1 commit into
rhel-lightspeed:mainfrom
owtaylor:async-gatekeeper-eval
May 29, 2026
Merged

refactor(gatekeeper): async eval runner with structured classes and stats tracking#479
owtaylor merged 1 commit into
rhel-lightspeed:mainfrom
owtaylor:async-gatekeeper-eval

Conversation

@owtaylor

Copy link
Copy Markdown
Contributor
  • Make check_run_script async (acompletion) and add check_run_script_with_stats returning GatekeeperStats (tokens, cost, latency) alongside results
  • Introduce EvalSuite and FileEval classes in run-eval.py, replacing global variables and eliminating duplicate single-file/multi-file code paths
  • Load all test cases up front so progress output shows global numbering (e.g. [3/121]) as individual evals complete
  • Add StatsAggregator dataclass for cleaner inference statistics reporting
  • Add GatekeeperException with optional stats for timeout, output limit, and parse failures
  • Add gatekeeper.cost config option for custom per-token cost accounting
  • Switch summary/stats tables from plain text to rich.Table

@owtaylor owtaylor requested a review from a team as a code owner May 24, 2026 21:00
@github-actions

Copy link
Copy Markdown

For team members: test commit 51e6a98 in internal GitLab

@owtaylor owtaylor force-pushed the async-gatekeeper-eval branch from 51e6a98 to 35adcb9 Compare May 24, 2026 21:01
@github-actions

Copy link
Copy Markdown

For team members: test commit 35adcb9 in internal GitLab

@owtaylor owtaylor force-pushed the async-gatekeeper-eval branch from 35adcb9 to 914b391 Compare May 25, 2026 14:47
@github-actions

Copy link
Copy Markdown

For team members: test commit 914b391 in internal GitLab

@codecov

codecov Bot commented May 25, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

Flag Coverage Δ
unittests 97.29% <100.00%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
src/linux_mcp_server/config.py 99.19% <100.00%> (+0.09%) ⬆️
...rc/linux_mcp_server/gatekeeper/check_run_script.py 100.00% <100.00%> (ø)
src/linux_mcp_server/tools/run_script.py 96.31% <100.00%> (ø)
tests/gatekeeper/test_check_run_script.py 100.00% <100.00%> (ø)
tests/test_config.py 100.00% <100.00%> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@owtaylor owtaylor force-pushed the async-gatekeeper-eval branch from 914b391 to f5f6da6 Compare May 25, 2026 15:23
@github-actions

Copy link
Copy Markdown

For team members: test commit f5f6da6 in internal GitLab

Jazzcort
Jazzcort previously approved these changes May 28, 2026

@Jazzcort Jazzcort left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Comments are just some of my thoughts! 😁



# Maximum number of completion tokens (including reasoning)
GATEKEEPER_MAX_TOKENS = 8000

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if it makes sense that our users would like modify these values. These numbers are pretty reasonable and shouldn't be hit quite often. 😁

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are basically just there as safety fallbacks to avoid the model running on forever - I don't immediately a use case for modifying them:

  • There's no reason to make them bigger.
  • If a user is actually hitting them enough to care, they will be having a bad day.

But we can always add configuration later if it seems useful.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally agree! 😁

stats: list[GatekeeperStats]
_values: dict[str, list[float]] = field(init=False, default_factory=dict)

def values(self, name: FieldName):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this lazy processing concept!

Comment thread eval/gatekeeper/run-eval.py Outdated

expected_status = expected.get("status")
actual_status = actual.get("status")
totals = [0 for i in range(len(columns))]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Find a perfect spot to use totals = [0] * len(colums)! 😁

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where intuition fails - my intuition was that is clearer, but would be a bit slower, though not enough to matter. ("It's going to to allocate a bunch of lists, then append them").

>>> import timeit
>>> timeit.timeit(lambda: [0] * 1000, number=10000)
0.02545745200040983
>>> timeit.timeit(lambda: [0 for i in range(1000)], number=10000)
0.6684908530005487

But it actually seems well optimized, and avoiding Python in the inner loop is a huge win. (Not that it matters.)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for even testing the performance of it! I'm surprised a performance difference exists—I thought it was just a different way to do the same thing 😆 The usage is actually pretty limited since this only works when setting up a default list with immutable objects, so this is the perfect place to keep it. 😁

raise GatekeeperException("Failed to parse gatekeeper model output", stats=stats) from e


async def check_run_script(description: str, script_type: str, script: str, *, readonly: bool) -> GatekeeperResult:

@Jazzcort Jazzcort May 28, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It just bugs me a little bit that when users are using our mcp server and these stats are generated and never got used with this wrapper solution. But actually, it does not harm the performance since getting these stats does not introduce any heavy computation burden.

If we later on want to avoid getting these stats in every check_run_script call, we can either have a function overload

@overload
async def check_run_script(
    description: str, script_type: str, script: str, *, readonly: bool, with_stats: Literal[False]
) -> GatekeeperResult: ...


@overload
async def check_run_script(
    description: str, script_type: str, script: str, *, readonly: bool, with_stats: Literal[True]
) -> tuple[GatekeeperResult, GatekeeperStats]: ...

or have a hidden private attribute in GatekeeperResult that does not get serialized.

class GatekeeperResult(BaseModel):
    status: GatekeeperStatus
    detail: str = ""

    _stats: Optional["GatekeeperStats"] = PrivateAttr(default=None)
    ...

😁

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely possible alternate structures ... might even be better! Will stick to the current setup for now.

@owtaylor

Copy link
Copy Markdown
Contributor Author

Pushed a new version with [0] * len(columns) ...

@owtaylor owtaylor requested a review from Jazzcort May 29, 2026 17:05
…tats tracking

- Make check_run_script async (acompletion) and add check_run_script_with_stats
  returning GatekeeperStats (tokens, cost, latency) alongside results
- Introduce EvalSuite and FileEval classes in run-eval.py, replacing global
  variables and eliminating duplicate single-file/multi-file code paths
- Load all test cases up front so progress output shows global numbering
  (e.g. [3/121]) as individual evals complete
- Add StatsAggregator dataclass for cleaner inference statistics reporting
- Add GatekeeperException with optional stats for timeout, output limit,
  and parse failures
- Add gatekeeper.cost config option for custom per-token cost accounting
- Switch summary/stats tables from plain text to rich.Table.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@owtaylor owtaylor force-pushed the async-gatekeeper-eval branch from 02c1c99 to 507296f Compare May 29, 2026 17:09

@Jazzcort Jazzcort left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@owtaylor owtaylor merged commit 6c1763f into rhel-lightspeed:main May 29, 2026
35 of 36 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants