Refactor: Rename CodeEvaluator to SystemEvaluator#121
Conversation
caohy1988
left a comment
There was a problem hiding this comment.
Review of PR #121 head 914119a
This is the first PR in a three-PR stack (#121 → #122 → #123). Review focuses on this PR's stated scope: "Rename CodeEvaluator to SystemEvaluator, kept as backward-compat alias."
Blockers
B1 — Stale base, 24 commits behind origin/main
Branch base is 292320b. origin/main is at 8824582. The PR's mergeStateStatus: UNKNOWN is consistent with that. Several of the 32 commits since base touch evaluators.py (e.g. PR #126 added retry-on-gate-failure orchestrator). Rebase needed before merge regardless of the review outcome.
git fetch origin main
git rebase origin/main
git push --force-with-leaseFindings on the rename itself
M1 — Incomplete rename across first-party code
After the PR, 18 files still contain CodeEvaluator. Classifying:
| Category | Files | Status |
|---|---|---|
| Backward-compat alias path (intentional) | evaluators.py (line 505: CodeEvaluator = SystemEvaluator), __init__.py (re-exports both) |
✅ Correct |
| Stale docstring references inside renamed files | client.py:880,885 ("CodeEvaluator metrics are computed..."), cli.py:409, udf_kernels.py:28 |
SystemEvaluator |
| Dead imports (imported but unused) | client.py:76, grader_pipeline.py:51, cli.py:42 all do from .evaluators import CodeEvaluator, SystemEvaluator but only use SystemEvaluator |
CodeEvaluator import is dead weight; pyflakes would flag |
| First-party callers NOT renamed | deploy/remote_function/dispatch.py (14 references — every factory entry still calls CodeEvaluator.latency(...) etc.); examples/e2e_demo.py (5 refs); examples/agent_improvement_cycle/eval/operational_metrics.py (5 refs); 3 notebooks (e2e_notebook_demo.ipynb, nba_agent_trace_analysis_notebook.ipynb, context_graph_adcp_demo.ipynb); docs/implementation_plan_remote_function.md (7 refs) |
❌ These are first-party; they should use the new name |
| Tests | tests/test_udf_kernels.py (10 refs), tests/test_sdk_evaluators.py, tests/test_pr16_fixes.py, tests/test_pr17_fixes.py, tests/test_sdk_client.py |
"A CodeEvaluator alias is kept in evaluators.py for backward-compatibility" justifies the alias for external callers, but the standard for first-party code, examples, tests, and docs is to use the new name. The PR leaves these inconsistent.
Specifically deploy/remote_function/dispatch.py is the deployed-runtime façade — keeping it on CodeEvaluator is awkward because the deploy code is owned by this repo, not by external callers.
L1 — Dead imports
In client.py:76, grader_pipeline.py:51, cli.py:42, both CodeEvaluator and SystemEvaluator are imported but only SystemEvaluator is used in the bodies. Either drop the CodeEvaluator import from these files or document why it's needed (it isn't — these are SDK internals, not the alias's customers).
What looks correct
- ✅
SystemEvaluatoris the consistent name throughoutevaluators.py(164+ references). - ✅
__init__.pyre-exports bothCodeEvaluatorandSystemEvaluator, preserving the public surface for external callers (from bigquery_agent_analytics import CodeEvaluator). - ✅ The alias is correctly defined:
CodeEvaluator = SystemEvaluator(class identity preserved — verified bypython -c "from bigquery_agent_analytics import CodeEvaluator, SystemEvaluator; assert CodeEvaluator is SystemEvaluator"after a rebase test). - ✅ Test count on PR 121 head: same set of evaluator tests pass.
Verdict
Request changes, blockers being B1 (rebase) and M1 (incomplete first-party rename). Suggested approach:
- Rebase on
origin/main. - Finish the rename in
deploy/,examples/(including notebooks),docs/implementation_plan_remote_function.md, and the relevant test files. Leave the alias only for external callers (who can keep usingCodeEvaluator). - Drop the dead
CodeEvaluatorimports fromclient.py,grader_pipeline.py,cli.py.
After those changes the alias stays a true backward-compat path rather than an internal inconsistency.
|
Fresh follow-up after the existing REQUEST_CHANGES review: One additional polish item I don't see captured yet: Low / release-note gap — the public rename needs an This PR introduces Suggested addition under
That matters because users reading release notes need to know whether to migrate imports immediately or whether existing |
70b4fac to
8e3569d
Compare
|
Thanks Haiyuan! Addressed all comments on this PR. |
gigistark-google
left a comment
There was a problem hiding this comment.
Thanks Haiyuan! Addressed all comments on this PR.
caohy1988
left a comment
There was a problem hiding this comment.
Fresh review of PR #121 head 8e3569d against current origin/main.
I still see merge-blocking issues.
Blockers
B1 — PR is still conflicting against current main
GitHub reports mergeable=CONFLICTING, mergeStateStatus=DIRTY. I also reproduced the conflict locally with git merge-tree; the remaining hard conflict is in CHANGELOG.md because current main added the --events-bq-query-file entry while this PR adds the SystemEvaluator rename entry in the same section.
Please rebase on current origin/main and keep both changelog entries. Until this is resolved, GitHub cannot merge the PR.
B2 — CodeEvaluator().name changed, so the alias is not fully backward compatible
The PR keeps CodeEvaluator = SystemEvaluator, which preserves imports and isinstance behavior, but it also changes the constructor default from name="code_evaluator" to name="system_evaluator".
Current main:
from bigquery_agent_analytics import CodeEvaluator
assert CodeEvaluator().name == "code_evaluator"PR head:
from bigquery_agent_analytics import CodeEvaluator, SystemEvaluator
print(CodeEvaluator is SystemEvaluator) # True
print(CodeEvaluator().name) # "system_evaluator"That is observable API behavior because Client._evaluate_code() serializes evaluator.name into EvaluationReport.evaluator_name, and remote-function / dashboard consumers read that field. A caller using the documented backward-compatible CodeEvaluator() alias now emits a different evaluator_name than before.
Please either:
- preserve the legacy default when users instantiate
CodeEvaluator()directly, or - explicitly document this as a breaking serialized-output change and add a regression test covering the intended behavior.
Given the PR says "CodeEvaluator as a backward-compatible alias (deprecated but supported)", I think preserving the old default is the right fix.
B3 — format checks are not clean
Local checks:
uv run pyink --check \
src/bigquery_agent_analytics/evaluators.py \
src/bigquery_agent_analytics/__init__.py \
src/bigquery_agent_analytics/client.py \
src/bigquery_agent_analytics/grader_pipeline.py \
src/bigquery_agent_analytics/cli.py \
deploy/remote_function/dispatch.py \
tests/test_sdk_evaluators.py \
tests/test_sdk_client.py \
tests/test_grader_pipeline.py \
tests/test_pr16_fixes.py \
tests/test_pr17_fixes.py \
tests/test_udf_kernels.pypasses.
But:
uv run isort --check-only ...fails on deploy/remote_function/dispatch.py. The required import order is:
from bigquery_agent_analytics import Client
-from bigquery_agent_analytics import SystemEvaluator
from bigquery_agent_analytics import LLMAsJudge
from bigquery_agent_analytics import serialize
+from bigquery_agent_analytics import SystemEvaluator
from bigquery_agent_analytics import TraceFilterAnd:
git diff --check origin/main...HEADfails with trailing whitespace in examples/agent_improvement_cycle/DEMO_NARRATION.md.
Both need to be clean before merge.
Verified
uv run pytest tests/test_sdk_evaluators.py tests/test_sdk_client.py tests/test_grader_pipeline.py tests/test_pr16_fixes.py tests/test_pr17_fixes.py tests/test_udf_kernels.py -qpasses:337 passed.uv run pytest tests/test_cli.py tests/test_remote_function.py tests/test_streaming_evaluation.py -qpasses:187 passed.uv run python -m py_compile ...over the changed Python entrypoints passes.- Package-root alias smoke check passes structurally:
CodeEvaluator is SystemEvaluatorandfrom bigquery_agent_analytics import SystemEvaluatorworks.
The rename is much cleaner than the prior review pass, but the PR should not merge until the conflict, format failures, and CodeEvaluator().name compatibility regression are resolved.
caohy1988
left a comment
There was a problem hiding this comment.
Follow-up to my earlier REQUEST_CHANGES — additional blockers found on a second pass. None of these are caught by the test suite (the tests import SystemEvaluator directly, so they don't exercise the docs/notebook/CHANGELOG drift).
Additional blockers
1. SDK.md misrepresents evaluator behavior
SDK.md:120-127 changes the pre-built descriptions from "fails when X exceeds the budget" to "score degrades linearly as X approaches threshold" / "penalizes". The code at src/bigquery_agent_analytics/evaluators.py:325-369 still returns binary 1.0 if observed <= threshold else 0.0. Readers will expect partial credit and get hard cutoffs.
Fix: Either revert SDK.md to the binary pass/fail wording, or implement linear degradation in code. This PR should not silently flip the documented behavior contract during a rename.
2. docs/implementation_plan_remote_function.md mixes a design change with the rename
Lines 220-249 change EVALUATOR_FACTORIES from a single-lambda dict to a (with_threshold, without_threshold) tuple-of-lambdas dict, and silently drops the \"llm-judge\": None entry. That's a design-document mutation unrelated to renaming.
Fix: Revert the structural change to the doc, or split it into its own PR. A rename PR should not be editing an implementation plan's design.
3. Notebook reformat noise is uninspectable
examples/context_graph_adcp_demo.ipynbwas re-saved with 2-space JSON indent.examples/e2e_notebook_demo.ipynbandexamples/nba_agent_trace_analysis_notebook.ipynbstayed at 1-space.- All three notebooks also had every cell's
\"source\"field rewritten from a string to an array-of-strings (\"source\": \"code\\nmore\"→\"source\": [\"code\\n\", \"more\"]).
Result: 6014 / 818 / 1041-line diffs nobody can audit for accidental content changes. Inconsistent treatment within the same PR is itself a smell.
Fix: Either re-save all three with consistent settings (one Jupyter version, one indent), or revert the notebooks (git checkout origin/main -- examples/*.ipynb) and apply a targeted sed rename for the four CodeEvaluator → SystemEvaluator occurrences. Either way, the notebook diff after this fix should be small enough to read end-to-end.
4. CHANGELOG.md promises a deprecation the code doesn't implement
CHANGELOG.md:12 says CodeEvaluator is "deprecated but supported". The code at src/bigquery_agent_analytics/evaluators.py:633 is a bare CodeEvaluator = SystemEvaluator assignment — no warnings.warn(DeprecationWarning, …).
Fix: Pick one. Either drop "deprecated" from CHANGELOG (alias-only, no warning), or wire an actual DeprecationWarning — e.g. wrap with a thin subclass whose __init_subclass__ / __new__ raises the warning. Silent passthrough is fine; lying about it in CHANGELOG isn't.
Soft blockers
5. docs/design.md:211 ASCII art damaged
One row of box-drawing characters got shortened from
└──────────────────────┘ └──────────────────────┘ └──────────────────┘
to
└──────────────────┘ └──────────────────┘ └──────────────────┘
…breaking the visual alignment with the boxes above. Stray edit from the search-and-replace pass; revert.
6. The CodeEvaluator().name == \"system_evaluator\" blocker has a wider blast radius
The rename also changes EvaluationReport.evaluator_name, which is populated from evaluator.name at src/bigquery_agent_analytics/client.py:952, 993, 1136, 1208, 1242. Any downstream consumer that does report.evaluator_name == \"code_evaluator\" silently misses now.
Fix: Address the constructor default AND every report-emission site in one pass — don't iterate.
Nits (non-blocking but worth catching here)
7. GraderPipeline.add_code_grader name vs docstring mismatch
src/bigquery_agent_analytics/grader_pipeline.py:280 keeps the method name add_code_grader but the docstring now says "Adds a SystemEvaluator grader to the pipeline." Either alias an add_system_grader method (and keep add_code_grader for back-compat) or use neutral language in the docstring ("Adds a code-based grader").
8. Internal CLI constant _CODE_EVALUATORS not renamed
src/bigquery_agent_analytics/cli.py:217 still uses _CODE_EVALUATORS as the dict name even though every value now constructs a SystemEvaluator. Internal scaffolding only, but inconsistent with the stated goal of a thorough rename.
Summary
My existing four blockers cover the merge-conflict, name-emission backward-compat, and lint failures. The four new blockers above are: SDK.md behavior misrepresentation, smuggled-in design plan change, notebook reformat that makes the PR un-auditable, and CHANGELOG making a deprecation claim the code doesn't deliver. The author should split notebook reformatting into its own PR (or revert it entirely) and pick a deprecation story for CodeEvaluator and stick to it.
ee0ffbc to
d442663
Compare
caohy1988
left a comment
There was a problem hiding this comment.
Fresh follow-up on latest PR head d442663.
One remaining non-doc blocker:
P1 / backward-compat regression — CodeEvaluator is no longer a true alias.
The PR description says CodeEvaluator is kept as a backward-compatible alias, but the current implementation is a subclass:
class CodeEvaluator(SystemEvaluator):
...That changes observable behavior for legacy callers. In current PR head:
from bigquery_agent_analytics import CodeEvaluator, SystemEvaluator
print(CodeEvaluator is SystemEvaluator) # False
print(CodeEvaluator().name) # "code_evaluator"
print(type(CodeEvaluator.latency()).__name__) # "SystemEvaluator"
print(isinstance(CodeEvaluator.latency(), CodeEvaluator)) # FalseBefore the rename, CodeEvaluator.latency() returned a CodeEvaluator. After this PR, a legacy caller using CodeEvaluator.latency() gets a SystemEvaluator, so any downstream code that checks identity/type against CodeEvaluator can break even though it is using the documented compatibility path.
Suggested fix: make the compatibility path a real alias again:
CodeEvaluator = SystemEvaluatorSystemEvaluator.__init__ already defaults to name="code_evaluator", so this still preserves the previous serialized evaluator name for direct construction. Please also add regression coverage for:
CodeEvaluator is SystemEvaluatorCodeEvaluator().name == "code_evaluator"isinstance(CodeEvaluator.latency(), CodeEvaluator)
The first-party rename is otherwise much cleaner, but this alias behavior should be locked before merge.
caohy1988
left a comment
There was a problem hiding this comment.
Re-review of PR #121 head d442663 against current origin/main
I checked the latest head out locally and ran it. Most prior findings are fixed — clean work. Two NEW issues no one has flagged yet, plus two already-known blockers still open.
Already-known blockers still open
CHANGELOG.md merge conflict
gh pr view reports mergeStateStatus: DIRTY. Confirmed locally with git merge origin/main — hard conflict in CHANGELOG.md (current main added a different ### Added entry in the same [Unreleased] section). Same finding as the 2026-05-12 review on head 8e3569d. Please rebase on current origin/main and keep both entries.
CodeEvaluator is a subclass, not a true alias
From my 2026-05-18 review. src/bigquery_agent_analytics/evaluators.py:631 declares class CodeEvaluator(SystemEvaluator), so:
CodeEvaluator is SystemEvaluator→FalseCodeEvaluator.latency()returns aSystemEvaluator(the factory methods at lines 322, 353, 381, 421, 451, 481, 521 are all@staticmethodhardcoded toSystemEvaluator(...))isinstance(CodeEvaluator.latency(), CodeEvaluator)→False
Still unaddressed. Suggested fix is still CodeEvaluator = SystemEvaluator plus the N1 change below.
NEW findings (not in any prior review)
N1 — Medium — SystemEvaluator() default name is "code_evaluator", leaking the legacy name into the new "preferred" symbol
src/bigquery_agent_analytics/evaluators.py:190 sets SystemEvaluator.__init__'s default to name="code_evaluator". To preserve the legacy default for CodeEvaluator(), the author pushed the legacy default down into the base class. Reproducible:
>>> from bigquery_agent_analytics import SystemEvaluator
>>> SystemEvaluator().name
'code_evaluator'Any user who follows the post-rename docs and constructs SystemEvaluator() (no name=) gets evaluator_name == "code_evaluator" serialized into every EvaluationReport (emitted at client.py:952, 993, 1136, 1208, 1242). The "preferred" name pollutes its own reports with the legacy label. Factory methods (latency, error_rate, etc.) are unaffected because they override the default, and explicit SystemEvaluator(name="my_eval") is fine — but the bare-constructor case is a footgun that contradicts the PR's narrative.
Fix. Set SystemEvaluator.__init__ default back to "system_evaluator", and only override in CodeEvaluator.__init__:
class SystemEvaluator:
def __init__(self, name: str = "system_evaluator", metrics=None) -> None:
...
class CodeEvaluator(SystemEvaluator):
def __init__(self, name: str = "code_evaluator", metrics=None) -> None:
super().__init__(name=name, metrics=metrics)This is the natural shape if CodeEvaluator stays a subclass — the legacy default belongs only on the legacy class. It also composes cleanly with the blocker above: whether the author goes back to a pure CodeEvaluator = SystemEvaluator alias or keeps a subclass, the legacy default should not live on SystemEvaluator.
N2 — Low (scope leak) — deploy/streaming_evaluation/main.py removes a stray blank line unrelated to the rename
deploy/streaming_evaluation/main.py:21 is deleted — a blank line between the flask imports and from worker import ...:
from flask import Flask
from flask import jsonify
from flask import request
-
from worker import handle_scheduled_runThere is no CodeEvaluator reference in this file at all (it was never an evaluator-touching file), so this hunk has nothing to do with the rename. The earlier review specifically called out scope leak (the docs/implementation_plan_remote_function.md design mutation); please revert this hunk for the same reason.
Verified (NO blocker, recording for transparency)
- Tests:
pytest tests/test_sdk_evaluators.py tests/test_sdk_client.py tests/test_grader_pipeline.py tests/test_pr16_fixes.py tests/test_pr17_fixes.py tests/test_udf_kernels.py -q→337 passed in 9.05s. - isort +
git diff --check: clean on all 20 PR-touched files. - pyink: clean on all 18 of 20 PR-touched files.
dashboard/app.pyandscripts/quality_report.pyfail pyink, but these failures are pre-existing onorigin/main(verified by runningpyink --checkon the unmodified files fromorigin/main— both fail identically). The PR's own edits to these files are tiny isort fixups; the rest of each file was already non-conforming before this PR. Not a blocker introduced by PR #121, but a future global lint pass will eventually trip on these. - First-party rename completeness: all
deploy/,examples/(notebooks reverted tomain-identical line counts),docs/, anddashboard/files now useSystemEvaluator. Only remainingCodeEvaluatorreferences are the alias class itself, the__init__.pyre-export, and CHANGELOG historical entries — all intentional. - No isinstance blast radius beyond N1 / the May-18 finding:
client.py:903andgrader_pipeline.py:405both checkisinstance(evaluator, SystemEvaluator)— these continue to work becauseCodeEvaluatoris a subclass. - Stale-base ontology_runtime exports: I initially worried the PR would drop the 12-symbol
ontology_runtimeblock from__init__.py(same pattern as PR #91). False alarm —git mergecorrectly preserves that block because the PR didn't touch those lines. OnlyCHANGELOG.mdconflicts. - Previously-flagged issues confirmed resolved on this head: SDK.md binary/linear language (fixed),
docs/implementation_plan_remote_function.mddesign mutation (reverted), notebook reformat noise (reverted — line counts identical to main), CHANGELOG "deprecated" claim (replaced with "backward-compatible alias"),docs/design.mdASCII art (fixed),_CODE_EVALUATORSconstant (renamed to_SYSTEM_EVALUATORS), deadCodeEvaluatorimports inclient.py/grader_pipeline.py/cli.py(cleaned). - Soft-known unfixed item from May 12:
GraderPipeline.add_code_grader(grader_pipeline.py:279) still has that name with the rewritten docstring "Adds a SystemEvaluator grader". Flagged as nit #7 on May 12, still unaddressed. Not blocking, but a free cleanup if the author is doing another pass.
Verdict — REQUEST_CHANGES
Two new items in addition to the two already-known open blockers (CHANGELOG conflict + subclass-not-alias):
- N1 —
SystemEvaluator()users getevaluator_name == "code_evaluator". One-line default change onevaluators.py:190. - N2 — Revert the stray blank-line hunk in
deploy/streaming_evaluation/main.py:21.
The rename itself is in much better shape than earlier rounds — once the CHANGELOG rebase + N1 + the May-18 alias-vs-subclass cleanup land, this is mergeable.
a56a8e6 to
e861dc3
Compare
|
Addressed all prior feedback. Thanks for following up on this! Appreciate it! |
caohy1988
left a comment
There was a problem hiding this comment.
Fresh re-review — PR #121 head e861dc3
Locally checked out, 3160 tests pass (pytest tests/, excluding live-BQ tests). Most prior round's findings landed correctly. The latest commit (Address PR feedback (N1, N2, and GraderPipeline nit)) does exactly what its message says. Two new issues surfaced; one is a tradeoff between two correctly-applied prior fixes, and the notebook/scope-leak concerns from earlier rounds have re-grown.
CHANGELOG conflict from prior rounds: clean now. git merge --no-commit origin/main from PR head → "Automatic merge went well", no <<<<<<< markers. GH's mergeable: UNKNOWN is computation-pending, not actually conflicting.
Open from prior rounds — verified RESOLVED
CodeEvaluatortrue alias (May 18 P1 / B2):evaluators.py:633isCodeEvaluator = SystemEvaluator.CodeEvaluator is SystemEvaluator→True.isinstance(CodeEvaluator.latency(), CodeEvaluator)→True. Locked in.- N1 (May 18):
SystemEvaluator.__init__default name flipped to"system_evaluator"atevaluators.py:192. The "preferred" symbol no longer emits a legacy report name. - N2 (May 18):
deploy/streaming_evaluation/main.pyis no longer in the diff. Stray-hunk scope leak reverted. - add_code_grader nit (May 12 nit #7): renamed to
add_system_graderatgrader_pipeline.py:279, aliasadd_code_grader = add_system_graderat:304. Tests updated.
New blocker
B1 — N1 + true-alias combination silently changes EvaluationReport.evaluator_name for legacy CodeEvaluator() callers
The two cleanly-applied fixes interact badly. Confirmed locally:
>>> from bigquery_agent_analytics import CodeEvaluator, SystemEvaluator
>>> CodeEvaluator is SystemEvaluator # True (alias)
>>> SystemEvaluator().name # 'system_evaluator' (N1)
>>> CodeEvaluator().name # 'system_evaluator' ← regressionBecause CodeEvaluator is now a pure alias for SystemEvaluator, CodeEvaluator() runs SystemEvaluator.__init__, which the N1 fix changed to default name="system_evaluator". So a caller using the documented backward-compat path now emits EvaluationReport.evaluator_name == "system_evaluator" instead of the legacy "code_evaluator".
Why it matters: Client._evaluate_code writes evaluator.name into EvaluationReport.evaluator_name at client.py:950. Any downstream consumer (dashboard, remote-function persisted reports, integration tests grepping for evaluator_name == "code_evaluator") silently misses now — no error, no warning, just a different string field.
The May 18 N1 ask assumed CodeEvaluator would stay a subclass (so legacy default lives on the legacy class). Since the author correctly switched back to a true alias to fix the bigger compatibility regression, the N1 fix has to be reconciled. Options:
- (a) Keep the alias, document the change. The CHANGELOG should explicitly say
CodeEvaluator()now emitsevaluator_name="system_evaluator". Migrate downstream filters. Simplest and probably right since0.4.0is a major-enough rename anyway. - (b) Re-add a thin
CodeEvaluator(SystemEvaluator)subclass solely to override the default name. LosesCodeEvaluator is SystemEvaluatoridentity and theisinstance(CodeEvaluator.latency(), CodeEvaluator)guarantee — i.e. re-introduces the May 18 P1. - (c) Provide a
code_evaluator()factory function that defaultsname="code_evaluator"and keep the class as the alias. Most awkward, but fully preserves both invariants.
(a) is the right call IMO — but it requires updating the CHANGELOG migration line. Without that, this is a silent behavioral regression masked as a rename.
Regression test that would lock the chosen path:
def test_code_evaluator_name_serialization():
# Whichever path is chosen, pin the serialized name explicitly.
assert CodeEvaluator().name == ...
assert CodeEvaluator.latency().name == "latency_evaluator" # factories are unaffectedHigh — repeats from May 12 review #4, regressed since May 18
H1 — Notebook reformat noise is back; PR is un-auditable on the notebook surface
Diff line counts vs origin/main:
examples/context_graph_adcp_demo.ipynb: main=3093 → PR=3115 (+22 net, 2094 lines of churn)
examples/e2e_notebook_demo.ipynb: main=1242 → PR=1299 (+57 net, 495 lines of churn)
examples/nba_agent_trace_analysis_notebook.ipynb: main=514 → 514 ( 6 lines — clean)
Counting only +/- lines that mention CodeEvaluator/SystemEvaluator/BigQueryTraceEvaluator/PerformanceEvaluator/GraderPipeline: 43 of 2595 (~1.7%). The other ~98% is JSON re-serialization noise:
- Cell field ordering (
id/cell_type/metadata/source/execution_count/outputsreshuffled). - HTML entity escaping (
-->→>, etc.). outputs: []andexecution_count: nulleither added or stripped.- (in
context_graph_adcp_demo.ipynb) the<table align="left">...Run in Colab...Vertex AI Workbench...</table>block churn — actually re-checked, the count of "Run in Colab" / "Vertex AI Workbench" strings is preserved (2 → 2), but the JSON structure around them was reordered.
The May 12 review #4 finding #3 called this out. The May 18 verification said "notebooks reverted — line counts identical to main"; that was true at head d442663 but a later commit re-introduced the reformat.
Either re-strip the notebook diffs (git checkout origin/main -- examples/*.ipynb then a targeted sed for the rename) or split notebook reformatting into its own PR. As-is, an auditor cannot tell whether any cell content actually changed.
H2 — CHANGELOG says "deprecated" but no DeprecationWarning is raised
CHANGELOG.md:12 says:
Kept
CodeEvaluatoras a backward-compatible alias (deprecated but supported).
evaluators.py:633 is a bare CodeEvaluator = SystemEvaluator. No warnings.warn(DeprecationWarning, …). Same finding as the May 12 review #4 finding #4. Was acknowledged as resolved on May 18 (CHANGELOG had been changed to "backward-compatible alias"). A later commit reverted the wording.
Pick one and lock it:
- Drop the "(deprecated but supported)" parenthetical — the language
CodeEvaluatoris just an alias to keep imports working. - Or add the actual deprecation warning — wrap the alias in a small
__class_getitem__/ module-level__getattr__that warns on access. PR #123 already does this forrender_ai_generate_judge_queryandsplit_judge_prompt_template; the same pattern applies here.
Silent passthrough is fine. Lying about it in CHANGELOG isn't.
H3 — Scope leak: dashboard/app.py and scripts/quality_report.py changes are not rename-related
dashboard/app.py: isort reordering of 4 imports (no logic / no CodeEvaluator refs)
scripts/quality_report.py: a single blank line removed at line 237 (no rename, no CodeEvaluator refs)
Same scope-leak concern as the May 12 review #4 finding #2 (docs/implementation_plan_remote_function.md design mutation) and the May 18 N2 (deploy/streaming_evaluation/main.py blank line). The author has now fixed N2 cleanly. These two are the same shape — unrelated changes attached to a rename PR. Revert both hunks.
Note: scripts/quality_report.py was massively rewritten in PR #174 (just reviewed). Once #174 merges, this PR will conflict in that file. Reverting the one-liner here avoids the future conflict.
Low / nit
L1 — add_code_grader = add_system_grader is a method-binding alias, not a proper subclass attribute
grader_pipeline.py:304 does add_code_grader = add_system_grader at class body. That's fine for Python's method-resolution — both names resolve to the same function — but help(pipeline.add_code_grader) will say add_system_grader because the function's __name__ doesn't get rewritten. Minor; not blocking.
If you want help() to show the legacy name on the legacy path, define a one-line wrapper:
def add_code_grader(self, *args, **kwargs):
"""Backward-compat alias for ``add_system_grader``."""
return self.add_system_grader(*args, **kwargs)Otherwise this is just a documentation polish issue.
Verified clean (recording for transparency)
- Tests:
pytest tests/ -q(excluding live BQ): 3160 passed, 27 skipped. - First-party rename completeness: only
__init__.py(re-export, intentional) andevaluators.py(alias definition, intentional) mentionCodeEvaluatoroutside tests/CHANGELOG. - Alias identity:
CodeEvaluator is SystemEvaluator→True;isinstance(CodeEvaluator.latency(), CodeEvaluator)→True. Factory methods (SystemEvaluator.latency,.error_rate, etc.) emit per-factory names (latency_evaluator,error_rate_evaluator) — unaffected by the default-name change in B1. - Mergeable:
git merge --no-commit origin/mainis clean (only auto-resolvedMmarkers inmaterialize_window.pyandtest_property_graph_spec.py, no conflicts).
Verdict — REQUEST_CHANGES
B1 (silent EvaluationReport.evaluator_name change for CodeEvaluator() callers) is the single substantive blocker — pick the migration story (option a, b, or c above) and reflect it in CHANGELOG + a regression test.
H1 (notebook reformat noise) and H3 (scope leak in dashboard/app.py + scripts/quality_report.py) are the same auditor-pain shape that came up multiple rounds. Strip them out so the diff is just the rename + the four directly-related files.
H2 (CHANGELOG "deprecated" claim) is a one-line fix: either delete "(deprecated but supported)" or add a DeprecationWarning.
Once the four are addressed, this PR is straightforward to land. The rename itself is in great shape.
|
Fresh review against current head The proposed follow-up comments check out against the source. Findings:
Additional process note: this PR is currently behind |
e861dc3 to
6e00e24
Compare
Rename CodeEvaluator to SystemEvaluator to align with its focus on system-level metrics. A CodeEvaluator alias is kept in evaluators.py for backward-compatibility.
- Set SystemEvaluator default name to 'code_evaluator' for backward compatibility. - Update docstring in grader_pipeline.py to use neutral language. - Rename internal CLI constant _CODE_EVALUATORS to _SYSTEM_EVALUATORS. - Fix ASCII art in docs/design.md. - Clean up CHANGELOG.md and SDK.md.
…s of SystemEvaluator
…aluator default name to system_evaluator, restore deleted blank line in Cloud Run entrypoint, and rename add_code_grader to add_system_grader
60db3e1 to
9644f2e
Compare
…ode_grader, and revert unrelated changes
9644f2e to
5a8ac26
Compare
|
Thanks Haiyuan for your feedback on this! The latest round of feedback has been addressed. |
caohy1988
left a comment
There was a problem hiding this comment.
Re-review — PR #121 head 5a8ac26
Checked out locally and ran the full battery. All five findings from the June 5 review are fixed, and fixed well. What remains is mechanical: two pyink failures and one isort failure on PR-touched files, plus a CHANGELOG rebase conflict against current main. Once those land, this is an approve.
June 5 findings — verified RESOLVED
- B1 (
CodeEvaluator().namemigration story) — the author picked option (a): keep the true alias, document the behavior change. CHANGELOG now reads "KeptCodeEvaluatoras a backward-compatible alias. Note that callingCodeEvaluator()now emitsevaluator_name="system_evaluator"." And the regression tests I asked for exist:tests/test_sdk_evaluators.py:875test_default_namepins bothCodeEvaluator().nameandSystemEvaluator().nameto"system_evaluator";test_prebuilt_isinstancepinsisinstance(CodeEvaluator.latency(), CodeEvaluator). Verified at runtime:CodeEvaluator is SystemEvaluator → True SystemEvaluator().name → 'system_evaluator' CodeEvaluator().name → 'system_evaluator' isinstance(latency(), CodeEvaluator) → True - H1 (notebook reformat noise) — collapsed completely.
context_graph_adcp_demo.ipynb2094 → 18 diff lines,e2e_notebook_demo.ipynb495 → 26,nba_…stays at 6. Audited every remaining +/- line: 100% rename-related (CodeEvaluator→SystemEvaluatorandadd_code_grader→add_system_grader). The diff is now fully auditable. - H2 (CHANGELOG "deprecated" claim) — the "(deprecated but supported)" parenthetical is gone; replaced with the honest alias + name-change note above.
- H3 (scope leak) —
dashboard/app.pyandscripts/quality_report.pyare no longer in the diff.deploy/streaming_evaluation/stays clean. - L1 (
add_code_graderbinding alias) — now a real wrapper method with its own docstring ("preserved for backwards compatibility, but isn't recommended for use") delegating toadd_system_grader.help()will show the right name.
Tests: pytest tests/ (excluding live-BQ) → 3178 passed, 27 skipped.
git diff --check: clean.
Non-blocking comments — please fix before merging (all mechanical)
1. pyink fails on two PR-touched files
would reformat src/bigquery_agent_analytics/grader_pipeline.py
would reformat tests/test_grader_pipeline.py
Both regressions come from the latest commit:
grader_pipeline.py:~314— double blank line between the newadd_code_graderwrapper andadd_llm_grader(pyink wants one blank line inside a class body).tests/test_grader_pipeline.py:164— pyink wants the chained call wrapped as:pipeline = GraderPipeline( WeightedStrategy(threshold=0.5) ).add_system_grader(SystemEvaluator.latency(threshold_ms=5000))
Fix: pyink src/bigquery_agent_analytics/grader_pipeline.py tests/test_grader_pipeline.py and commit.
2. isort fails on tests/test_sdk_evaluators.py
The CodeEvaluator import is mis-positioned:
from bigquery_agent_analytics.evaluators import AI_GENERATE_JUDGE_BATCH_QUERY
+from bigquery_agent_analytics.evaluators import CodeEvaluator
from bigquery_agent_analytics.evaluators import DEFAULT_ENDPOINT
from bigquery_agent_analytics.evaluators import EvaluationReport
from bigquery_agent_analytics.evaluators import LLM_JUDGE_BATCH_QUERY
-from bigquery_agent_analytics.evaluators import CodeEvaluatorFix: isort tests/test_sdk_evaluators.py and commit.
3. CHANGELOG.md conflicts with current main (13 commits behind)
Reproduced locally: git merge origin/main → CONFLICT (content): Merge conflict in CHANGELOG.md. main has moved (incl. the migration_v5 → context_graph rename, #300) and edited the same [Unreleased] section. Standard fix:
git fetch origin main
git rebase origin/main # keep both Unreleased entries in CHANGELOG.md
git push --force-with-leaseVerdict — APPROVE
The substance of this PR is done. The rename is complete and consistent, the alias semantics are pinned by tests, the diff is auditable, and the full suite is green (3178 passed). Approving now; please run pyink on the two files, isort on the one test, and rebase for the CHANGELOG conflict before merging — all three are mechanical and none affect the reviewed semantics.
Rename CodeEvaluator to SystemEvaluator to align with its focus on system-level metrics. A CodeEvaluator alias is kept in evaluators.py for backward-compatibility.