Skip to content

Refactor: Rename CodeEvaluator to SystemEvaluator#121

Open
gigistark-google wants to merge 5 commits into
GoogleCloudPlatform:mainfrom
gigistark-google:pr1_rename
Open

Refactor: Rename CodeEvaluator to SystemEvaluator#121
gigistark-google wants to merge 5 commits into
GoogleCloudPlatform:mainfrom
gigistark-google:pr1_rename

Conversation

@gigistark-google

Copy link
Copy Markdown

Rename CodeEvaluator to SystemEvaluator to align with its focus on system-level metrics. A CodeEvaluator alias is kept in evaluators.py for backward-compatibility.

@caohy1988 caohy1988 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review of PR #121 head 914119a

This is the first PR in a three-PR stack (#121#122#123). Review focuses on this PR's stated scope: "Rename CodeEvaluator to SystemEvaluator, kept as backward-compat alias."

Blockers

B1 — Stale base, 24 commits behind origin/main

Branch base is 292320b. origin/main is at 8824582. The PR's mergeStateStatus: UNKNOWN is consistent with that. Several of the 32 commits since base touch evaluators.py (e.g. PR #126 added retry-on-gate-failure orchestrator). Rebase needed before merge regardless of the review outcome.

git fetch origin main
git rebase origin/main
git push --force-with-lease

Findings on the rename itself

M1 — Incomplete rename across first-party code

After the PR, 18 files still contain CodeEvaluator. Classifying:

Category Files Status
Backward-compat alias path (intentional) evaluators.py (line 505: CodeEvaluator = SystemEvaluator), __init__.py (re-exports both) ✅ Correct
Stale docstring references inside renamed files client.py:880,885 ("CodeEvaluator metrics are computed..."), cli.py:409, udf_kernels.py:28 ⚠️ Cosmetic, but inconsistent with the file's own code now using SystemEvaluator
Dead imports (imported but unused) client.py:76, grader_pipeline.py:51, cli.py:42 all do from .evaluators import CodeEvaluator, SystemEvaluator but only use SystemEvaluator ⚠️ CodeEvaluator import is dead weight; pyflakes would flag
First-party callers NOT renamed deploy/remote_function/dispatch.py (14 references — every factory entry still calls CodeEvaluator.latency(...) etc.); examples/e2e_demo.py (5 refs); examples/agent_improvement_cycle/eval/operational_metrics.py (5 refs); 3 notebooks (e2e_notebook_demo.ipynb, nba_agent_trace_analysis_notebook.ipynb, context_graph_adcp_demo.ipynb); docs/implementation_plan_remote_function.md (7 refs) ❌ These are first-party; they should use the new name
Tests tests/test_udf_kernels.py (10 refs), tests/test_sdk_evaluators.py, tests/test_pr16_fixes.py, tests/test_pr17_fixes.py, tests/test_sdk_client.py ⚠️ Tests should either be renamed or explicitly test the alias path

"A CodeEvaluator alias is kept in evaluators.py for backward-compatibility" justifies the alias for external callers, but the standard for first-party code, examples, tests, and docs is to use the new name. The PR leaves these inconsistent.

Specifically deploy/remote_function/dispatch.py is the deployed-runtime façade — keeping it on CodeEvaluator is awkward because the deploy code is owned by this repo, not by external callers.

L1 — Dead imports

In client.py:76, grader_pipeline.py:51, cli.py:42, both CodeEvaluator and SystemEvaluator are imported but only SystemEvaluator is used in the bodies. Either drop the CodeEvaluator import from these files or document why it's needed (it isn't — these are SDK internals, not the alias's customers).

What looks correct

  • SystemEvaluator is the consistent name throughout evaluators.py (164+ references).
  • __init__.py re-exports both CodeEvaluator and SystemEvaluator, preserving the public surface for external callers (from bigquery_agent_analytics import CodeEvaluator).
  • ✅ The alias is correctly defined: CodeEvaluator = SystemEvaluator (class identity preserved — verified by python -c "from bigquery_agent_analytics import CodeEvaluator, SystemEvaluator; assert CodeEvaluator is SystemEvaluator" after a rebase test).
  • ✅ Test count on PR 121 head: same set of evaluator tests pass.

Verdict

Request changes, blockers being B1 (rebase) and M1 (incomplete first-party rename). Suggested approach:

  1. Rebase on origin/main.
  2. Finish the rename in deploy/, examples/ (including notebooks), docs/implementation_plan_remote_function.md, and the relevant test files. Leave the alias only for external callers (who can keep using CodeEvaluator).
  3. Drop the dead CodeEvaluator imports from client.py, grader_pipeline.py, cli.py.

After those changes the alias stays a true backward-compat path rather than an internal inconsistency.

@caohy1988

Copy link
Copy Markdown
Collaborator

Fresh follow-up after the existing REQUEST_CHANGES review:

One additional polish item I don't see captured yet:

Low / release-note gap — the public rename needs an Unreleased changelog entry and an alias policy.

This PR introduces SystemEvaluator as the preferred public name while keeping CodeEvaluator = SystemEvaluator as a compatibility alias. That's the right compatibility shape, but CHANGELOG.md still only documents the older CodeEvaluator behavior from v0.2.2 and has no note for the rename or the support policy for the alias.

Suggested addition under Unreleased after the rebase:

  • Added SystemEvaluator as the preferred name for deterministic/code-defined metrics.
  • Kept CodeEvaluator as a backward-compatible alias; state whether it is permanent or deprecated-but-supported.

That matters because users reading release notes need to know whether to migrate imports immediately or whether existing CodeEvaluator code remains supported.

@gigistark-google gigistark-google force-pushed the pr1_rename branch 3 times, most recently from 70b4fac to 8e3569d Compare May 12, 2026 19:58
@gigistark-google

Copy link
Copy Markdown
Author

Thanks Haiyuan! Addressed all comments on this PR.

@gigistark-google gigistark-google left a comment

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Haiyuan! Addressed all comments on this PR.

@caohy1988 caohy1988 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fresh review of PR #121 head 8e3569d against current origin/main.

I still see merge-blocking issues.

Blockers

B1 — PR is still conflicting against current main

GitHub reports mergeable=CONFLICTING, mergeStateStatus=DIRTY. I also reproduced the conflict locally with git merge-tree; the remaining hard conflict is in CHANGELOG.md because current main added the --events-bq-query-file entry while this PR adds the SystemEvaluator rename entry in the same section.

Please rebase on current origin/main and keep both changelog entries. Until this is resolved, GitHub cannot merge the PR.

B2 — CodeEvaluator().name changed, so the alias is not fully backward compatible

The PR keeps CodeEvaluator = SystemEvaluator, which preserves imports and isinstance behavior, but it also changes the constructor default from name="code_evaluator" to name="system_evaluator".

Current main:

from bigquery_agent_analytics import CodeEvaluator
assert CodeEvaluator().name == "code_evaluator"

PR head:

from bigquery_agent_analytics import CodeEvaluator, SystemEvaluator
print(CodeEvaluator is SystemEvaluator)  # True
print(CodeEvaluator().name)              # "system_evaluator"

That is observable API behavior because Client._evaluate_code() serializes evaluator.name into EvaluationReport.evaluator_name, and remote-function / dashboard consumers read that field. A caller using the documented backward-compatible CodeEvaluator() alias now emits a different evaluator_name than before.

Please either:

  • preserve the legacy default when users instantiate CodeEvaluator() directly, or
  • explicitly document this as a breaking serialized-output change and add a regression test covering the intended behavior.

Given the PR says "CodeEvaluator as a backward-compatible alias (deprecated but supported)", I think preserving the old default is the right fix.

B3 — format checks are not clean

Local checks:

uv run pyink --check \
  src/bigquery_agent_analytics/evaluators.py \
  src/bigquery_agent_analytics/__init__.py \
  src/bigquery_agent_analytics/client.py \
  src/bigquery_agent_analytics/grader_pipeline.py \
  src/bigquery_agent_analytics/cli.py \
  deploy/remote_function/dispatch.py \
  tests/test_sdk_evaluators.py \
  tests/test_sdk_client.py \
  tests/test_grader_pipeline.py \
  tests/test_pr16_fixes.py \
  tests/test_pr17_fixes.py \
  tests/test_udf_kernels.py

passes.

But:

uv run isort --check-only ...

fails on deploy/remote_function/dispatch.py. The required import order is:

 from bigquery_agent_analytics import Client
-from bigquery_agent_analytics import SystemEvaluator
 from bigquery_agent_analytics import LLMAsJudge
 from bigquery_agent_analytics import serialize
+from bigquery_agent_analytics import SystemEvaluator
 from bigquery_agent_analytics import TraceFilter

And:

git diff --check origin/main...HEAD

fails with trailing whitespace in examples/agent_improvement_cycle/DEMO_NARRATION.md.

Both need to be clean before merge.

Verified

  • uv run pytest tests/test_sdk_evaluators.py tests/test_sdk_client.py tests/test_grader_pipeline.py tests/test_pr16_fixes.py tests/test_pr17_fixes.py tests/test_udf_kernels.py -q passes: 337 passed.
  • uv run pytest tests/test_cli.py tests/test_remote_function.py tests/test_streaming_evaluation.py -q passes: 187 passed.
  • uv run python -m py_compile ... over the changed Python entrypoints passes.
  • Package-root alias smoke check passes structurally: CodeEvaluator is SystemEvaluator and from bigquery_agent_analytics import SystemEvaluator works.

The rename is much cleaner than the prior review pass, but the PR should not merge until the conflict, format failures, and CodeEvaluator().name compatibility regression are resolved.

@caohy1988 caohy1988 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up to my earlier REQUEST_CHANGES — additional blockers found on a second pass. None of these are caught by the test suite (the tests import SystemEvaluator directly, so they don't exercise the docs/notebook/CHANGELOG drift).


Additional blockers

1. SDK.md misrepresents evaluator behavior

SDK.md:120-127 changes the pre-built descriptions from "fails when X exceeds the budget" to "score degrades linearly as X approaches threshold" / "penalizes". The code at src/bigquery_agent_analytics/evaluators.py:325-369 still returns binary 1.0 if observed <= threshold else 0.0. Readers will expect partial credit and get hard cutoffs.

Fix: Either revert SDK.md to the binary pass/fail wording, or implement linear degradation in code. This PR should not silently flip the documented behavior contract during a rename.

2. docs/implementation_plan_remote_function.md mixes a design change with the rename

Lines 220-249 change EVALUATOR_FACTORIES from a single-lambda dict to a (with_threshold, without_threshold) tuple-of-lambdas dict, and silently drops the \"llm-judge\": None entry. That's a design-document mutation unrelated to renaming.

Fix: Revert the structural change to the doc, or split it into its own PR. A rename PR should not be editing an implementation plan's design.

3. Notebook reformat noise is uninspectable

  • examples/context_graph_adcp_demo.ipynb was re-saved with 2-space JSON indent.
  • examples/e2e_notebook_demo.ipynb and examples/nba_agent_trace_analysis_notebook.ipynb stayed at 1-space.
  • All three notebooks also had every cell's \"source\" field rewritten from a string to an array-of-strings (\"source\": \"code\\nmore\"\"source\": [\"code\\n\", \"more\"]).

Result: 6014 / 818 / 1041-line diffs nobody can audit for accidental content changes. Inconsistent treatment within the same PR is itself a smell.

Fix: Either re-save all three with consistent settings (one Jupyter version, one indent), or revert the notebooks (git checkout origin/main -- examples/*.ipynb) and apply a targeted sed rename for the four CodeEvaluatorSystemEvaluator occurrences. Either way, the notebook diff after this fix should be small enough to read end-to-end.

4. CHANGELOG.md promises a deprecation the code doesn't implement

CHANGELOG.md:12 says CodeEvaluator is "deprecated but supported". The code at src/bigquery_agent_analytics/evaluators.py:633 is a bare CodeEvaluator = SystemEvaluator assignment — no warnings.warn(DeprecationWarning, …).

Fix: Pick one. Either drop "deprecated" from CHANGELOG (alias-only, no warning), or wire an actual DeprecationWarning — e.g. wrap with a thin subclass whose __init_subclass__ / __new__ raises the warning. Silent passthrough is fine; lying about it in CHANGELOG isn't.


Soft blockers

5. docs/design.md:211 ASCII art damaged

One row of box-drawing characters got shortened from

└──────────────────────┘  └──────────────────────┘  └──────────────────┘

to

└──────────────────┘  └──────────────────┘  └──────────────────┘

…breaking the visual alignment with the boxes above. Stray edit from the search-and-replace pass; revert.

6. The CodeEvaluator().name == \"system_evaluator\" blocker has a wider blast radius

The rename also changes EvaluationReport.evaluator_name, which is populated from evaluator.name at src/bigquery_agent_analytics/client.py:952, 993, 1136, 1208, 1242. Any downstream consumer that does report.evaluator_name == \"code_evaluator\" silently misses now.

Fix: Address the constructor default AND every report-emission site in one pass — don't iterate.


Nits (non-blocking but worth catching here)

7. GraderPipeline.add_code_grader name vs docstring mismatch

src/bigquery_agent_analytics/grader_pipeline.py:280 keeps the method name add_code_grader but the docstring now says "Adds a SystemEvaluator grader to the pipeline." Either alias an add_system_grader method (and keep add_code_grader for back-compat) or use neutral language in the docstring ("Adds a code-based grader").

8. Internal CLI constant _CODE_EVALUATORS not renamed

src/bigquery_agent_analytics/cli.py:217 still uses _CODE_EVALUATORS as the dict name even though every value now constructs a SystemEvaluator. Internal scaffolding only, but inconsistent with the stated goal of a thorough rename.


Summary

My existing four blockers cover the merge-conflict, name-emission backward-compat, and lint failures. The four new blockers above are: SDK.md behavior misrepresentation, smuggled-in design plan change, notebook reformat that makes the PR un-auditable, and CHANGELOG making a deprecation claim the code doesn't deliver. The author should split notebook reformatting into its own PR (or revert it entirely) and pick a deprecation story for CodeEvaluator and stick to it.

@gigistark-google gigistark-google force-pushed the pr1_rename branch 2 times, most recently from ee0ffbc to d442663 Compare May 12, 2026 23:29

@gigistark-google gigistark-google left a comment

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for your detailed feedback! It is much appreciated! I am definitely rusty on GitHub so apologies in particular for missing some of these formatting/rebase things.

All feedback has now been addressed.

@caohy1988 caohy1988 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fresh follow-up on latest PR head d442663.

One remaining non-doc blocker:

P1 / backward-compat regression — CodeEvaluator is no longer a true alias.

The PR description says CodeEvaluator is kept as a backward-compatible alias, but the current implementation is a subclass:

class CodeEvaluator(SystemEvaluator):
    ...

That changes observable behavior for legacy callers. In current PR head:

from bigquery_agent_analytics import CodeEvaluator, SystemEvaluator

print(CodeEvaluator is SystemEvaluator)                  # False
print(CodeEvaluator().name)                              # "code_evaluator"
print(type(CodeEvaluator.latency()).__name__)            # "SystemEvaluator"
print(isinstance(CodeEvaluator.latency(), CodeEvaluator)) # False

Before the rename, CodeEvaluator.latency() returned a CodeEvaluator. After this PR, a legacy caller using CodeEvaluator.latency() gets a SystemEvaluator, so any downstream code that checks identity/type against CodeEvaluator can break even though it is using the documented compatibility path.

Suggested fix: make the compatibility path a real alias again:

CodeEvaluator = SystemEvaluator

SystemEvaluator.__init__ already defaults to name="code_evaluator", so this still preserves the previous serialized evaluator name for direct construction. Please also add regression coverage for:

  • CodeEvaluator is SystemEvaluator
  • CodeEvaluator().name == "code_evaluator"
  • isinstance(CodeEvaluator.latency(), CodeEvaluator)

The first-party rename is otherwise much cleaner, but this alias behavior should be locked before merge.

@caohy1988 caohy1988 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review of PR #121 head d442663 against current origin/main

I checked the latest head out locally and ran it. Most prior findings are fixed — clean work. Two NEW issues no one has flagged yet, plus two already-known blockers still open.


Already-known blockers still open

CHANGELOG.md merge conflict

gh pr view reports mergeStateStatus: DIRTY. Confirmed locally with git merge origin/main — hard conflict in CHANGELOG.md (current main added a different ### Added entry in the same [Unreleased] section). Same finding as the 2026-05-12 review on head 8e3569d. Please rebase on current origin/main and keep both entries.

CodeEvaluator is a subclass, not a true alias

From my 2026-05-18 review. src/bigquery_agent_analytics/evaluators.py:631 declares class CodeEvaluator(SystemEvaluator), so:

  • CodeEvaluator is SystemEvaluatorFalse
  • CodeEvaluator.latency() returns a SystemEvaluator (the factory methods at lines 322, 353, 381, 421, 451, 481, 521 are all @staticmethod hardcoded to SystemEvaluator(...))
  • isinstance(CodeEvaluator.latency(), CodeEvaluator)False

Still unaddressed. Suggested fix is still CodeEvaluator = SystemEvaluator plus the N1 change below.


NEW findings (not in any prior review)

N1 — Medium — SystemEvaluator() default name is "code_evaluator", leaking the legacy name into the new "preferred" symbol

src/bigquery_agent_analytics/evaluators.py:190 sets SystemEvaluator.__init__'s default to name="code_evaluator". To preserve the legacy default for CodeEvaluator(), the author pushed the legacy default down into the base class. Reproducible:

>>> from bigquery_agent_analytics import SystemEvaluator
>>> SystemEvaluator().name
'code_evaluator'

Any user who follows the post-rename docs and constructs SystemEvaluator() (no name=) gets evaluator_name == "code_evaluator" serialized into every EvaluationReport (emitted at client.py:952, 993, 1136, 1208, 1242). The "preferred" name pollutes its own reports with the legacy label. Factory methods (latency, error_rate, etc.) are unaffected because they override the default, and explicit SystemEvaluator(name="my_eval") is fine — but the bare-constructor case is a footgun that contradicts the PR's narrative.

Fix. Set SystemEvaluator.__init__ default back to "system_evaluator", and only override in CodeEvaluator.__init__:

class SystemEvaluator:
  def __init__(self, name: str = "system_evaluator", metrics=None) -> None:
    ...

class CodeEvaluator(SystemEvaluator):
  def __init__(self, name: str = "code_evaluator", metrics=None) -> None:
    super().__init__(name=name, metrics=metrics)

This is the natural shape if CodeEvaluator stays a subclass — the legacy default belongs only on the legacy class. It also composes cleanly with the blocker above: whether the author goes back to a pure CodeEvaluator = SystemEvaluator alias or keeps a subclass, the legacy default should not live on SystemEvaluator.

N2 — Low (scope leak) — deploy/streaming_evaluation/main.py removes a stray blank line unrelated to the rename

deploy/streaming_evaluation/main.py:21 is deleted — a blank line between the flask imports and from worker import ...:

 from flask import Flask
 from flask import jsonify
 from flask import request
-
 from worker import handle_scheduled_run

There is no CodeEvaluator reference in this file at all (it was never an evaluator-touching file), so this hunk has nothing to do with the rename. The earlier review specifically called out scope leak (the docs/implementation_plan_remote_function.md design mutation); please revert this hunk for the same reason.


Verified (NO blocker, recording for transparency)

  • Tests: pytest tests/test_sdk_evaluators.py tests/test_sdk_client.py tests/test_grader_pipeline.py tests/test_pr16_fixes.py tests/test_pr17_fixes.py tests/test_udf_kernels.py -q337 passed in 9.05s.
  • isort + git diff --check: clean on all 20 PR-touched files.
  • pyink: clean on all 18 of 20 PR-touched files. dashboard/app.py and scripts/quality_report.py fail pyink, but these failures are pre-existing on origin/main (verified by running pyink --check on the unmodified files from origin/main — both fail identically). The PR's own edits to these files are tiny isort fixups; the rest of each file was already non-conforming before this PR. Not a blocker introduced by PR #121, but a future global lint pass will eventually trip on these.
  • First-party rename completeness: all deploy/, examples/ (notebooks reverted to main-identical line counts), docs/, and dashboard/ files now use SystemEvaluator. Only remaining CodeEvaluator references are the alias class itself, the __init__.py re-export, and CHANGELOG historical entries — all intentional.
  • No isinstance blast radius beyond N1 / the May-18 finding: client.py:903 and grader_pipeline.py:405 both check isinstance(evaluator, SystemEvaluator) — these continue to work because CodeEvaluator is a subclass.
  • Stale-base ontology_runtime exports: I initially worried the PR would drop the 12-symbol ontology_runtime block from __init__.py (same pattern as PR #91). False alarm — git merge correctly preserves that block because the PR didn't touch those lines. Only CHANGELOG.md conflicts.
  • Previously-flagged issues confirmed resolved on this head: SDK.md binary/linear language (fixed), docs/implementation_plan_remote_function.md design mutation (reverted), notebook reformat noise (reverted — line counts identical to main), CHANGELOG "deprecated" claim (replaced with "backward-compatible alias"), docs/design.md ASCII art (fixed), _CODE_EVALUATORS constant (renamed to _SYSTEM_EVALUATORS), dead CodeEvaluator imports in client.py / grader_pipeline.py / cli.py (cleaned).
  • Soft-known unfixed item from May 12: GraderPipeline.add_code_grader (grader_pipeline.py:279) still has that name with the rewritten docstring "Adds a SystemEvaluator grader". Flagged as nit #7 on May 12, still unaddressed. Not blocking, but a free cleanup if the author is doing another pass.

Verdict — REQUEST_CHANGES

Two new items in addition to the two already-known open blockers (CHANGELOG conflict + subclass-not-alias):

  1. N1SystemEvaluator() users get evaluator_name == "code_evaluator". One-line default change on evaluators.py:190.
  2. N2 — Revert the stray blank-line hunk in deploy/streaming_evaluation/main.py:21.

The rename itself is in much better shape than earlier rounds — once the CHANGELOG rebase + N1 + the May-18 alias-vs-subclass cleanup land, this is mergeable.

@gigistark-google gigistark-google force-pushed the pr1_rename branch 2 times, most recently from a56a8e6 to e861dc3 Compare June 5, 2026 16:15
@gigistark-google

Copy link
Copy Markdown
Author

Addressed all prior feedback. Thanks for following up on this! Appreciate it!

@caohy1988 caohy1988 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fresh re-review — PR #121 head e861dc3

Locally checked out, 3160 tests pass (pytest tests/, excluding live-BQ tests). Most prior round's findings landed correctly. The latest commit (Address PR feedback (N1, N2, and GraderPipeline nit)) does exactly what its message says. Two new issues surfaced; one is a tradeoff between two correctly-applied prior fixes, and the notebook/scope-leak concerns from earlier rounds have re-grown.

CHANGELOG conflict from prior rounds: clean now. git merge --no-commit origin/main from PR head → "Automatic merge went well", no <<<<<<< markers. GH's mergeable: UNKNOWN is computation-pending, not actually conflicting.


Open from prior rounds — verified RESOLVED

  • CodeEvaluator true alias (May 18 P1 / B2): evaluators.py:633 is CodeEvaluator = SystemEvaluator. CodeEvaluator is SystemEvaluatorTrue. isinstance(CodeEvaluator.latency(), CodeEvaluator)True. Locked in.
  • N1 (May 18): SystemEvaluator.__init__ default name flipped to "system_evaluator" at evaluators.py:192. The "preferred" symbol no longer emits a legacy report name.
  • N2 (May 18): deploy/streaming_evaluation/main.py is no longer in the diff. Stray-hunk scope leak reverted.
  • add_code_grader nit (May 12 nit #7): renamed to add_system_grader at grader_pipeline.py:279, alias add_code_grader = add_system_grader at :304. Tests updated.

New blocker

B1 — N1 + true-alias combination silently changes EvaluationReport.evaluator_name for legacy CodeEvaluator() callers

The two cleanly-applied fixes interact badly. Confirmed locally:

>>> from bigquery_agent_analytics import CodeEvaluator, SystemEvaluator
>>> CodeEvaluator is SystemEvaluator           # True   (alias)
>>> SystemEvaluator().name                     # 'system_evaluator'   (N1)
>>> CodeEvaluator().name                       # 'system_evaluator'   ← regression

Because CodeEvaluator is now a pure alias for SystemEvaluator, CodeEvaluator() runs SystemEvaluator.__init__, which the N1 fix changed to default name="system_evaluator". So a caller using the documented backward-compat path now emits EvaluationReport.evaluator_name == "system_evaluator" instead of the legacy "code_evaluator".

Why it matters: Client._evaluate_code writes evaluator.name into EvaluationReport.evaluator_name at client.py:950. Any downstream consumer (dashboard, remote-function persisted reports, integration tests grepping for evaluator_name == "code_evaluator") silently misses now — no error, no warning, just a different string field.

The May 18 N1 ask assumed CodeEvaluator would stay a subclass (so legacy default lives on the legacy class). Since the author correctly switched back to a true alias to fix the bigger compatibility regression, the N1 fix has to be reconciled. Options:

  • (a) Keep the alias, document the change. The CHANGELOG should explicitly say CodeEvaluator() now emits evaluator_name="system_evaluator". Migrate downstream filters. Simplest and probably right since 0.4.0 is a major-enough rename anyway.
  • (b) Re-add a thin CodeEvaluator(SystemEvaluator) subclass solely to override the default name. Loses CodeEvaluator is SystemEvaluator identity and the isinstance(CodeEvaluator.latency(), CodeEvaluator) guarantee — i.e. re-introduces the May 18 P1.
  • (c) Provide a code_evaluator() factory function that defaults name="code_evaluator" and keep the class as the alias. Most awkward, but fully preserves both invariants.

(a) is the right call IMO — but it requires updating the CHANGELOG migration line. Without that, this is a silent behavioral regression masked as a rename.

Regression test that would lock the chosen path:

def test_code_evaluator_name_serialization():
    # Whichever path is chosen, pin the serialized name explicitly.
    assert CodeEvaluator().name == ...
    assert CodeEvaluator.latency().name == "latency_evaluator"  # factories are unaffected

High — repeats from May 12 review #4, regressed since May 18

H1 — Notebook reformat noise is back; PR is un-auditable on the notebook surface

Diff line counts vs origin/main:

examples/context_graph_adcp_demo.ipynb:   main=3093 → PR=3115   (+22 net,  2094 lines of churn)
examples/e2e_notebook_demo.ipynb:         main=1242 → PR=1299   (+57 net,   495 lines of churn)
examples/nba_agent_trace_analysis_notebook.ipynb: main=514 → 514 (   6 lines — clean)

Counting only +/- lines that mention CodeEvaluator/SystemEvaluator/BigQueryTraceEvaluator/PerformanceEvaluator/GraderPipeline: 43 of 2595 (~1.7%). The other ~98% is JSON re-serialization noise:

  • Cell field ordering (id/cell_type/metadata/source/execution_count/outputs reshuffled).
  • HTML entity escaping (-->>, etc.).
  • outputs: [] and execution_count: null either added or stripped.
  • (in context_graph_adcp_demo.ipynb) the <table align="left">...Run in Colab...Vertex AI Workbench...</table> block churn — actually re-checked, the count of "Run in Colab" / "Vertex AI Workbench" strings is preserved (2 → 2), but the JSON structure around them was reordered.

The May 12 review #4 finding #3 called this out. The May 18 verification said "notebooks reverted — line counts identical to main"; that was true at head d442663 but a later commit re-introduced the reformat.

Either re-strip the notebook diffs (git checkout origin/main -- examples/*.ipynb then a targeted sed for the rename) or split notebook reformatting into its own PR. As-is, an auditor cannot tell whether any cell content actually changed.

H2 — CHANGELOG says "deprecated" but no DeprecationWarning is raised

CHANGELOG.md:12 says:

Kept CodeEvaluator as a backward-compatible alias (deprecated but supported).

evaluators.py:633 is a bare CodeEvaluator = SystemEvaluator. No warnings.warn(DeprecationWarning, …). Same finding as the May 12 review #4 finding #4. Was acknowledged as resolved on May 18 (CHANGELOG had been changed to "backward-compatible alias"). A later commit reverted the wording.

Pick one and lock it:

  • Drop the "(deprecated but supported)" parenthetical — the language CodeEvaluator is just an alias to keep imports working.
  • Or add the actual deprecation warning — wrap the alias in a small __class_getitem__ / module-level __getattr__ that warns on access. PR #123 already does this for render_ai_generate_judge_query and split_judge_prompt_template; the same pattern applies here.

Silent passthrough is fine. Lying about it in CHANGELOG isn't.

H3 — Scope leak: dashboard/app.py and scripts/quality_report.py changes are not rename-related

dashboard/app.py:       isort reordering of 4 imports (no logic / no CodeEvaluator refs)
scripts/quality_report.py: a single blank line removed at line 237 (no rename, no CodeEvaluator refs)

Same scope-leak concern as the May 12 review #4 finding #2 (docs/implementation_plan_remote_function.md design mutation) and the May 18 N2 (deploy/streaming_evaluation/main.py blank line). The author has now fixed N2 cleanly. These two are the same shape — unrelated changes attached to a rename PR. Revert both hunks.

Note: scripts/quality_report.py was massively rewritten in PR #174 (just reviewed). Once #174 merges, this PR will conflict in that file. Reverting the one-liner here avoids the future conflict.


Low / nit

L1 — add_code_grader = add_system_grader is a method-binding alias, not a proper subclass attribute

grader_pipeline.py:304 does add_code_grader = add_system_grader at class body. That's fine for Python's method-resolution — both names resolve to the same function — but help(pipeline.add_code_grader) will say add_system_grader because the function's __name__ doesn't get rewritten. Minor; not blocking.

If you want help() to show the legacy name on the legacy path, define a one-line wrapper:

def add_code_grader(self, *args, **kwargs):
    """Backward-compat alias for ``add_system_grader``."""
    return self.add_system_grader(*args, **kwargs)

Otherwise this is just a documentation polish issue.


Verified clean (recording for transparency)

  • Tests: pytest tests/ -q (excluding live BQ): 3160 passed, 27 skipped.
  • First-party rename completeness: only __init__.py (re-export, intentional) and evaluators.py (alias definition, intentional) mention CodeEvaluator outside tests/CHANGELOG.
  • Alias identity: CodeEvaluator is SystemEvaluatorTrue; isinstance(CodeEvaluator.latency(), CodeEvaluator)True. Factory methods (SystemEvaluator.latency, .error_rate, etc.) emit per-factory names (latency_evaluator, error_rate_evaluator) — unaffected by the default-name change in B1.
  • Mergeable: git merge --no-commit origin/main is clean (only auto-resolved M markers in materialize_window.py and test_property_graph_spec.py, no conflicts).

Verdict — REQUEST_CHANGES

B1 (silent EvaluationReport.evaluator_name change for CodeEvaluator() callers) is the single substantive blocker — pick the migration story (option a, b, or c above) and reflect it in CHANGELOG + a regression test.

H1 (notebook reformat noise) and H3 (scope leak in dashboard/app.py + scripts/quality_report.py) are the same auditor-pain shape that came up multiple rounds. Strip them out so the diff is just the rename + the four directly-related files.

H2 (CHANGELOG "deprecated" claim) is a one-line fix: either delete "(deprecated but supported)" or add a DeprecationWarning.

Once the four are addressed, this PR is straightforward to land. The rename itself is in great shape.

@caohy1988

Copy link
Copy Markdown
Collaborator

Fresh review against current head e861dc3.

The proposed follow-up comments check out against the source.

Findings:

  1. B1: CodeEvaluator().name contract changed. CodeEvaluator is now a true alias of SystemEvaluator, and CodeEvaluator().name resolves to system_evaluator instead of the legacy code_evaluator. That means EvaluationReport.evaluator_name changes for direct legacy constructor callers. This may be intentional, but it should be treated as a user-visible compatibility decision. Recommended resolution: keep the alias if that is the intended API shape, and document the evaluator-name change explicitly in the CHANGELOG/migration note.

  2. H1: notebook formatting noise is back. The notebook diff is still mostly reformat churn. My count on this head is about 2,601 changed notebook lines, with only 51 lines matching the evaluator rename surface. The exact count differs slightly from the earlier 2,595/43 number, but the conclusion is the same: the signal is buried in unrelated notebook churn.

  3. H2: deprecation wording and behavior disagree. The CHANGELOG says CodeEvaluator is “deprecated but supported,” but the current alias cannot emit a DeprecationWarning. Either add an actual warning path, or soften the wording to “backward-compatible alias” without calling it deprecated.

  4. H3: unrelated scope leaks remain. dashboard/app.py is import-formatting only, and scripts/quality_report.py removes a blank line. Neither has evaluator rename relevance; the quality_report.py hunk is also likely to conflict with Quality report: dimensions, correction analysis, execution traces, golden-Q&A grounding, and version filtering #174 once that lands.

Additional process note: this PR is currently behind main and overlaps heavily with #123 in the evaluator migration surface. Before final review, please rebase after removing the notebook/scope noise and make the intended stack/merge order explicit so reviewers are not comparing stale migration text across overlapping PRs.

Rename CodeEvaluator to SystemEvaluator to align with its focus on system-level metrics. A CodeEvaluator alias is kept in evaluators.py for backward-compatibility.
- Set SystemEvaluator default name to 'code_evaluator' for backward compatibility.
- Update docstring in grader_pipeline.py to use neutral language.
- Rename internal CLI constant _CODE_EVALUATORS to _SYSTEM_EVALUATORS.
- Fix ASCII art in docs/design.md.
- Clean up CHANGELOG.md and SDK.md.
…aluator default name to system_evaluator, restore deleted blank line in Cloud Run entrypoint, and rename add_code_grader to add_system_grader
@gigistark-google gigistark-google force-pushed the pr1_rename branch 3 times, most recently from 60db3e1 to 9644f2e Compare June 7, 2026 20:00
@gigistark-google

Copy link
Copy Markdown
Author

Thanks Haiyuan for your feedback on this! The latest round of feedback has been addressed.

@caohy1988 caohy1988 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review — PR #121 head 5a8ac26

Checked out locally and ran the full battery. All five findings from the June 5 review are fixed, and fixed well. What remains is mechanical: two pyink failures and one isort failure on PR-touched files, plus a CHANGELOG rebase conflict against current main. Once those land, this is an approve.


June 5 findings — verified RESOLVED

  • B1 (CodeEvaluator().name migration story) — the author picked option (a): keep the true alias, document the behavior change. CHANGELOG now reads "Kept CodeEvaluator as a backward-compatible alias. Note that calling CodeEvaluator() now emits evaluator_name="system_evaluator"." And the regression tests I asked for exist: tests/test_sdk_evaluators.py:875 test_default_name pins both CodeEvaluator().name and SystemEvaluator().name to "system_evaluator"; test_prebuilt_isinstance pins isinstance(CodeEvaluator.latency(), CodeEvaluator). Verified at runtime:
    CodeEvaluator is SystemEvaluator        → True
    SystemEvaluator().name                  → 'system_evaluator'
    CodeEvaluator().name                    → 'system_evaluator'
    isinstance(latency(), CodeEvaluator)    → True
    
  • H1 (notebook reformat noise) — collapsed completely. context_graph_adcp_demo.ipynb 2094 → 18 diff lines, e2e_notebook_demo.ipynb 495 → 26, nba_… stays at 6. Audited every remaining +/- line: 100% rename-related (CodeEvaluatorSystemEvaluator and add_code_graderadd_system_grader). The diff is now fully auditable.
  • H2 (CHANGELOG "deprecated" claim) — the "(deprecated but supported)" parenthetical is gone; replaced with the honest alias + name-change note above.
  • H3 (scope leak)dashboard/app.py and scripts/quality_report.py are no longer in the diff. deploy/streaming_evaluation/ stays clean.
  • L1 (add_code_grader binding alias) — now a real wrapper method with its own docstring ("preserved for backwards compatibility, but isn't recommended for use") delegating to add_system_grader. help() will show the right name.

Tests: pytest tests/ (excluding live-BQ) → 3178 passed, 27 skipped.
git diff --check: clean.


Non-blocking comments — please fix before merging (all mechanical)

1. pyink fails on two PR-touched files

would reformat src/bigquery_agent_analytics/grader_pipeline.py
would reformat tests/test_grader_pipeline.py

Both regressions come from the latest commit:

  • grader_pipeline.py:~314 — double blank line between the new add_code_grader wrapper and add_llm_grader (pyink wants one blank line inside a class body).
  • tests/test_grader_pipeline.py:164 — pyink wants the chained call wrapped as:
    pipeline = GraderPipeline(
        WeightedStrategy(threshold=0.5)
    ).add_system_grader(SystemEvaluator.latency(threshold_ms=5000))

Fix: pyink src/bigquery_agent_analytics/grader_pipeline.py tests/test_grader_pipeline.py and commit.

2. isort fails on tests/test_sdk_evaluators.py

The CodeEvaluator import is mis-positioned:

 from bigquery_agent_analytics.evaluators import AI_GENERATE_JUDGE_BATCH_QUERY
+from bigquery_agent_analytics.evaluators import CodeEvaluator
 from bigquery_agent_analytics.evaluators import DEFAULT_ENDPOINT
 from bigquery_agent_analytics.evaluators import EvaluationReport
 from bigquery_agent_analytics.evaluators import LLM_JUDGE_BATCH_QUERY
-from bigquery_agent_analytics.evaluators import CodeEvaluator

Fix: isort tests/test_sdk_evaluators.py and commit.

3. CHANGELOG.md conflicts with current main (13 commits behind)

Reproduced locally: git merge origin/mainCONFLICT (content): Merge conflict in CHANGELOG.md. main has moved (incl. the migration_v5context_graph rename, #300) and edited the same [Unreleased] section. Standard fix:

git fetch origin main
git rebase origin/main   # keep both Unreleased entries in CHANGELOG.md
git push --force-with-lease

Verdict — APPROVE

The substance of this PR is done. The rename is complete and consistent, the alias semantics are pinned by tests, the diff is auditable, and the full suite is green (3178 passed). Approving now; please run pyink on the two files, isort on the one test, and rebase for the CHANGELOG conflict before merging — all three are mechanical and none affect the reviewed semantics.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants