refactor(gatekeeper): async eval runner with structured classes and stats tracking by owtaylor · Pull Request #479 · rhel-lightspeed/linux-mcp-server

owtaylor · 2026-05-24T21:00:22Z

Make check_run_script async (acompletion) and add check_run_script_with_stats returning GatekeeperStats (tokens, cost, latency) alongside results
Introduce EvalSuite and FileEval classes in run-eval.py, replacing global variables and eliminating duplicate single-file/multi-file code paths
Load all test cases up front so progress output shows global numbering (e.g. [3/121]) as individual evals complete
Add StatsAggregator dataclass for cleaner inference statistics reporting
Add GatekeeperException with optional stats for timeout, output limit, and parse failures
Add gatekeeper.cost config option for custom per-token cost accounting
Switch summary/stats tables from plain text to rich.Table
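
The global progress numbering described above can be sketched as follows. This is only an illustration, assuming the evals run as asyncio tasks and are reported in completion order. `run_one_eval` and `run_all` are hypothetical names, not functions from run-eval.py, and the sleep stands in for the real gatekeeper call.

```python
import asyncio
import random


async def run_one_eval(name: str) -> tuple[str, bool]:
    # Hypothetical stand-in for one gatekeeper eval: sleeps briefly
    # instead of calling a model, then reports success.
    await asyncio.sleep(random.uniform(0, 0.01))
    return name, True


async def run_all(cases: list[str]) -> list[str]:
    # The total is known before anything runs because every case
    # is loaded up front.
    total = len(cases)
    tasks = [asyncio.create_task(run_one_eval(c)) for c in cases]
    lines = []
    # as_completed yields in completion order, so the counter shows how
    # many evals have finished, not their position in the input list.
    for done, fut in enumerate(asyncio.as_completed(tasks), start=1):
        name, passed = await fut
        line = f"[{done}/{total}] {name}: {'PASS' if passed else 'FAIL'}"
        print(line)
        lines.append(line)
    return lines
```

Calling `asyncio.run(run_all(["a", "b", "c"]))` prints one line per eval, numbered `[1/3]` through `[3/3]` in whatever order the evals finish.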

github-actions · 2026-05-24T21:00:30Z

For team members: test commit 51e6a98 in internal GitLab

github-actions · 2026-05-24T21:01:33Z

For team members: test commit 35adcb9 in internal GitLab

github-actions · 2026-05-25T14:47:10Z

For team members: test commit 914b391 in internal GitLab

codecov · 2026-05-25T14:50:08Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

Flag	Coverage Δ
unittests	`97.29% <100.00%> (+0.05%)`	⬆️


Files with missing lines	Coverage Δ
src/linux_mcp_server/config.py	`99.19% <100.00%> (+0.09%)`	⬆️
...rc/linux_mcp_server/gatekeeper/check_run_script.py	`100.00% <100.00%> (ø)`
src/linux_mcp_server/tools/run_script.py	`96.31% <100.00%> (ø)`
tests/gatekeeper/test_check_run_script.py	`100.00% <100.00%> (ø)`
tests/test_config.py	`100.00% <100.00%> (ø)`


github-actions · 2026-05-25T15:23:32Z

For team members: test commit f5f6da6 in internal GitLab

Jazzcort

LGTM! Comments are just some of my thoughts! 😁

Jazzcort · 2026-05-28T21:52:08Z



+# Maximum number of completion tokens (including reasoning)
+GATEKEEPER_MAX_TOKENS = 8000


Not sure it makes sense for our users to modify these values. These numbers are pretty reasonable and shouldn't be hit very often. 😁

These are basically just there as safety fallbacks to avoid the model running on forever - I don't immediately see a use case for modifying them:

There's no reason to make them bigger.

If a user is actually hitting them enough to care, they will be having a bad day.

But we can always add configuration later if it seems useful.

Totally agree! 😁

Jazzcort · 2026-05-28T21:54:14Z

+    stats: list[GatekeeperStats]
+    _values: dict[str, list[float]] = field(init=False, default_factory=dict)
+
+    def values(self, name: FieldName):


I like this lazy processing concept!
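
For readers who have not seen the full diff, the lazy pattern being praised might look roughly like this. It is a sketch under assumptions: `Stats` is a hypothetical stand-in for GatekeeperStats with made-up field names, `LazyAggregator` is not the real StatsAggregator, and only the `stats` and `_values` fields and the `values` method name come from the quoted diff.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Stats:
    # Hypothetical stand-in for GatekeeperStats; field names are assumptions.
    prompt_tokens: int
    completion_tokens: int
    latency: float


@dataclass
class LazyAggregator:
    stats: list[Stats]
    # Cache of per-field value lists, excluded from __init__ and
    # filled only when a field is first requested.
    _values: dict[str, list[float]] = field(init=False, default_factory=dict)

    def values(self, name: str) -> list[float]:
        # Extract the column once, then serve later calls from the cache.
        if name not in self._values:
            self._values[name] = [float(getattr(s, name)) for s in self.stats]
        return self._values[name]

    def average(self, name: str) -> float:
        return mean(self.values(name))
```

The point of the design is that a report which only prints latency never pays for extracting the token columns, and repeated lookups of the same field reuse the cached list.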

Jazzcort · 2026-05-28T22:04:58Z


-        expected_status = expected.get("status")
-        actual_status = actual.get("status")
+        totals = [0 for i in range(len(columns))]


Found a perfect spot to use totals = [0] * len(columns)! 😁

Where intuition fails - my intuition was that it is clearer, but would be a bit slower, though not enough to matter. ("It's going to allocate a bunch of lists, then append them").

>>> import timeit
>>> timeit.timeit(lambda: [0] * 1000, number=10000)
0.02545745200040983
>>> timeit.timeit(lambda: [0 for i in range(1000)], number=10000)
0.6684908530005487

But it actually seems well optimized, and avoiding Python in the inner loop is a huge win. (Not that it matters.)

Thanks for even testing the performance of it! I'm surprised a performance difference exists—I thought it was just a different way to do the same thing 😆 The usage is actually pretty limited since this only works when setting up a default list with immutable objects, so this is the perfect place to keep it. 😁
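
The caveat about immutable objects is worth a small illustration. List multiplication copies references, not objects, so it is safe for values like `0` but surprising for mutable elements. The variable names below are illustrative only.

```python
# Safe: ints are immutable, so += rebinds slot 0 to a new int
# and leaves the other slots alone.
zeros = [0] * 3
zeros[0] += 1

# Surprising: all three slots reference the same inner list,
# so appending through one slot is visible through all of them.
shared = [[]] * 3
shared[0].append("x")

# A comprehension evaluates [] once per iteration,
# so each slot gets its own list.
independent = [[] for _ in range(3)]
independent[0].append("x")

print(zeros)        # [1, 0, 0]
print(shared)       # [['x'], ['x'], ['x']]
print(independent)  # [['x'], [], []]
```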

Jazzcort · 2026-05-28T22:20:06Z

+        raise GatekeeperException("Failed to parse gatekeeper model output", stats=stats) from e
+
+
+async def check_run_script(description: str, script_type: str, script: str, *, readonly: bool) -> GatekeeperResult:


It bugs me a little that when users run our MCP server, these stats are generated but never used with this wrapper solution. That said, it doesn't hurt performance, since collecting the stats adds no heavy computation.

If we later on want to avoid getting these stats in every check_run_script call, we can either have a function overload

@overload
async def check_run_script(
    description: str, script_type: str, script: str, *, readonly: bool, with_stats: Literal[False]
) -> GatekeeperResult: ...

@overload
async def check_run_script(
    description: str, script_type: str, script: str, *, readonly: bool, with_stats: Literal[True]
) -> tuple[GatekeeperResult, GatekeeperStats]: ...

or have a hidden private attribute in GatekeeperResult that does not get serialized.

class GatekeeperResult(BaseModel):
    status: GatekeeperStatus
    detail: str = ""
    _stats: Optional["GatekeeperStats"] = PrivateAttr(default=None)
    ...

😁

Definitely possible alternate structures ... might even be better! Will stick to the current setup for now.
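
To make the overload alternative concrete, here is a self-contained sketch of how a single implementation could sit behind two typed signatures. It is not the code in this PR: `Result`, `CallStats`, and `check` are hypothetical stand-ins for GatekeeperResult, GatekeeperStats, and check_run_script, the parameter list is shortened, and the `asyncio.sleep(0)` replaces the real model call.

```python
import asyncio
import time
from dataclasses import dataclass
from typing import Literal, overload


@dataclass
class Result:
    # Hypothetical stand-in for GatekeeperResult.
    status: str


@dataclass
class CallStats:
    # Hypothetical stand-in for GatekeeperStats.
    latency: float


@overload
async def check(script: str, *, with_stats: Literal[False] = False) -> Result: ...
@overload
async def check(script: str, *, with_stats: Literal[True]) -> tuple[Result, CallStats]: ...


async def check(script: str, *, with_stats: bool = False):
    # Single runtime implementation; the overloads above only guide
    # type checkers toward the right return type per call site.
    start = time.monotonic()
    await asyncio.sleep(0)  # placeholder for the model call
    result = Result(status="OK" if script else "BAD")
    if with_stats:
        return result, CallStats(latency=time.monotonic() - start)
    return result
```

With this shape, `await check(script)` is typed as returning a bare `Result`, while `await check(script, with_stats=True)` is typed as a `(Result, CallStats)` tuple, so the eval runner and the MCP tool could share one entry point.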

owtaylor · 2026-05-29T17:05:06Z

Pushed a new version with [0] * len(columns) ...

…tats tracking

- Make check_run_script async (acompletion) and add check_run_script_with_stats returning GatekeeperStats (tokens, cost, latency) alongside results
- Introduce EvalSuite and FileEval classes in run-eval.py, replacing global variables and eliminating duplicate single-file/multi-file code paths
- Load all test cases up front so progress output shows global numbering (e.g. [3/121]) as individual evals complete
- Add StatsAggregator dataclass for cleaner inference statistics reporting
- Add GatekeeperException with optional stats for timeout, output limit, and parse failures
- Add gatekeeper.cost config option for custom per-token cost accounting
- Switch summary/stats tables from plain text to rich.Table

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Jazzcort

LGTM!

owtaylor requested a review from a team as a code owner May 24, 2026 21:00

owtaylor force-pushed the async-gatekeeper-eval branch from 51e6a98 to 35adcb9 Compare May 24, 2026 21:01

owtaylor force-pushed the async-gatekeeper-eval branch from 35adcb9 to 914b391 Compare May 25, 2026 14:47

owtaylor force-pushed the async-gatekeeper-eval branch from 914b391 to f5f6da6 Compare May 25, 2026 15:23

Jazzcort mentioned this pull request May 27, 2026

refactor(gatekeeper): restructure prompt for better caching and clarity #484

Draft

owtaylor requested a review from Jazzcort May 28, 2026 18:11

Jazzcort previously approved these changes May 28, 2026


owtaylor dismissed Jazzcort’s stale review via 02c1c99 May 29, 2026 17:04

owtaylor force-pushed the async-gatekeeper-eval branch from f5f6da6 to 02c1c99 Compare May 29, 2026 17:04

owtaylor requested a review from Jazzcort May 29, 2026 17:05

owtaylor force-pushed the async-gatekeeper-eval branch from 02c1c99 to 507296f Compare May 29, 2026 17:09

Jazzcort approved these changes May 29, 2026


owtaylor merged commit 6c1763f into rhel-lightspeed:main May 29, 2026
35 of 36 checks passed


