Skip to content

Unify One-Sided & Side-by-Side Performance Metrics in the PerformanceEvaluator.#123

Open
gigistark-google wants to merge 8 commits into
GoogleCloudPlatform:mainfrom
gigistark-google:pr3_refactor
Open

Unify One-Sided & Side-by-Side Performance Metrics in the PerformanceEvaluator.#123
gigistark-google wants to merge 8 commits into
GoogleCloudPlatform:mainfrom
gigistark-google:pr3_refactor

Conversation

@gigistark-google

Copy link
Copy Markdown
Contributor

This change isolates system metrics in the system-evaluator and performance-metrics in the performance evaluator.

Note that in order to pass pyink --config pyproject.toml --check, some additional files were reformatted as part of this change.

@caohy1988 caohy1988 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review of PR #123 head 081a92d

Third PR in the stack (#121#122#123). Verified locally on a fresh checkout. The split-and-shim pattern is well executed — class identity is preserved across the backward-compat shims, the full test suite passes (1772 / 5 skipped), and lint is clean. But the PR also performs silent public-API removals that the CHANGELOG doesn't fully document, and that's the blocker.

What was verified clean

  • Test suite green: pytest tests/ -q1772 passed, 5 skipped, 17 warnings.

  • Lint clean: pyink --check src/ tests/ and isort --check-only src/ tests/ both pass.

  • Class identity preserved across shims:

    from bigquery_agent_analytics import CodeEvaluator, SystemEvaluator
    assert CodeEvaluator is SystemEvaluator                              # True
    
    from bigquery_agent_analytics.evaluators import CodeEvaluator as A
    from bigquery_agent_analytics.system_evaluator import SystemEvaluator as B
    assert A is B                                                         # True
    
    from bigquery_agent_analytics.trace_evaluator import BigQueryTraceEvaluator
    from bigquery_agent_analytics.performance_evaluator import PerformanceEvaluator
    assert BigQueryTraceEvaluator is PerformanceEvaluator                # True
    
    from bigquery_agent_analytics.grader_pipeline import GraderPipeline
    from bigquery_agent_analytics.aggregate_grader import AggregateGrader
    assert GraderPipeline is AggregateGrader                              # True

    The four shim files (evaluators.py, trace_evaluator.py, multi_trial.py, grader_pipeline.py) are now 17–239 lines each, re-exporting from the canonical new modules. Standard Python import semantics give a single class object, accessible from both old and new paths.

  • The split is clean: system_evaluator.py is pure to SystemEvaluator; performance_evaluator.py houses PerformanceEvaluator (with both one-sided and side-by-side judge prompts unified at lines 824-874); aggregate_grader.py houses AggregateGrader; multi_trial_performance_evaluator.py houses the multi-trial harness.

Blockers

B1 — Public API removed silently; CHANGELOG migration story is incomplete

The PR deletes four documented public symbols from evaluators.py:

  • render_ai_generate_judge_query
  • AI_GENERATE_JUDGE_BATCH_QUERY
  • LLM_JUDGE_BATCH_QUERY
  • split_judge_prompt_template

These were the SQL-batch path for LLM-as-judge — submitting AI.GENERATE via BigQuery rather than per-session API calls. The previous release, [0.2.3] (2026-04-27), was specifically a fix to that path:

Fixed

  • LLM-as-Judge AI.GENERATE path now executes against current BigQuery. Earlier versions emitted a table-valued [...]

Eight days later this PR removes the feature wholesale. The Unreleased CHANGELOG entry covers:

Purged obsolete criteria-list LLMAsJudge implementations, replacing them natively with PerformanceEvaluator [...]
Migration Story: Users should no longer construct _JudgeCriterion objects directly. Instead, use LLMAsJudge.add_criterion(...).

But this only addresses the _JudgeCriterion constructor change. It does not mention:

  1. render_ai_generate_judge_query is gone (grep -rn "render_ai_generate_judge_query" src/ → 0 hits).
  2. AI_GENERATE_JUDGE_BATCH_QUERY is gone.
  3. LLM_JUDGE_BATCH_QUERY is gone.
  4. split_judge_prompt_template is gone.
  5. Client.evaluate(LLMAsJudge_instance) now raises TypeErrorclient.py:864-912 shows the new dispatch only accepts SystemEvaluator or PerformanceEvaluator. Anyone who did client.evaluate(evaluator=LLMAsJudge.correctness()) (a documented pattern in the previous SDK.md and shipped tests) will now get a runtime error.
  6. The live integration test that guarded the BQ-side judge SQL is deleted (tests/test_ai_generate_judge_live.py, 203 lines).

Three asks:

  • Restore the public symbols as backward-compat shims in evaluators.py, even if they raise DeprecationWarning. They were public exports in the previous SDK.md.

  • Extend the CHANGELOG migration story to spell out:

    • That Client.evaluate(LLMAsJudge) no longer works and how to migrate (presumably PerformanceEvaluator + judge factory).
    • That the AI.GENERATE batch path is gone; per-session API calls are the new path.
    • That callers depending on render_ai_generate_judge_query / AI_GENERATE_JUDGE_BATCH_QUERY need to migrate.
  • Restore live integration coverage for the new per-session API path. The deleted file's docstring is explicit about why it existed:

    "These tests submit the exact SQL produced by render_ai_generate_judge_query to a real BigQuery project [...] Mocks alone won't catch that class of bug."

    The new _evaluate_performance path in client.py uses client.aio.models.generate_content (in performance_evaluator.py:850) — that's an entirely different integration contract, with no live test guarding it. The 0.2.3 fix's lesson applies in reverse: now that the SDK calls Vertex AI directly per-session, a mock-only test surface won't catch when (e.g.) the genai client signature changes or the response schema drifts. A skip-by-default live test for the new path would restore the safety net.

B2 — Stack inherits 24-commits-behind stale base

Same as #121 / #122. After the rebase, several files in PR 123's diff (notably client.py, dashboard/app.py, examples/*.ipynb) will need conflict resolution. The client.py change is particularly large (97+ / 406-) — worth re-checking after rebase that no surface intended to stay was lost during conflict resolution.

Concerns

C1 — client.py evaluate() docstring claim is inconsistent with implementation

client.py:875-876 reads:

"PerformanceEvaluator metrics use BQML's ML.GENERATE_TEXT for zero-ETL evaluation."

But the new _evaluate_performance dispatch (and PerformanceEvaluator.judge() at performance_evaluator.py:824-874) calls client.aio.models.generate_content directly through google-genai. There's no ML.GENERATE_TEXT call in the path. The docstring describes the deleted path, not the new one. Fix the docstring to match the actual implementation ("uses the Vertex AI Generative API per session") or, if BQML support is still planned, mark it as a TODO.

C2 — dashboard/app.py 142+/90- — what changed?

Not strictly in the evaluator/grader/multi-trial scope. Worth a sentence in the PR body explaining whether this is a related cleanup or just pyink reformatting (the PR body says "some additional files were reformatted as part of this change"). If it's purely formatting, the diff stat should be much smaller than 142+/90-. Suggest a glance at this file's actual changes to confirm.

C3 — Notebook diffs are massive

Three notebooks have major reformats: e2e_notebook_demo.ipynb (238/210), nba_agent_trace_analysis_notebook.ipynb (115/99), context_graph_adcp_demo.ipynb (1041/1019). These are notebook JSON, so the line counts are misleading — but if the PR body intent is "isolate system metrics in system_evaluator and performance metrics in performance_evaluator," the notebooks should change only insofar as their imports point at the new modules. A 1041-line change in context_graph_adcp_demo.ipynb is too big for that. Worth a confirmation: did the cells re-execute and serialize new outputs into the notebook diff? If yes, those changes should be a separate PR or stripped (the standard repo practice is nbstripout or equivalent for executed-cell noise).

C4 — examples/agent_improvement_cycle/agent/tools.py 35+/24- and improver_agent.py 64+/76-

Substantive logic changes in an example agent. Not in the stated PR scope ("isolate system / performance metrics"). What changed and why? If it's part of the rename refactor, OK. If it's an unrelated improvement, split it out.

What's already correct (acknowledge)

  • ✅ The shim pattern in evaluators.py, trace_evaluator.py, multi_trial.py, grader_pipeline.py is textbook. Re-exports preserve class identity. Hint to future readers via the module docstring ("Backward-compatibility module mapping for ...") is helpful.
  • system_evaluator.py and performance_evaluator.py are clean splits — no cross-module circular dependencies, no shared mutable state.
  • ✅ Lint clean across the entire src/ and tests/ (the PR description's caveat about "some additional files were reformatted as part of this change" matches the observable lint state).
  • ✅ Test suite green: 1772 passed.
  • utils.py (new file, 110 lines) hosts _parse_json_from_text — the right place for it after the evaluators split.
  • ✅ Unified one-sided / side-by-side judge prompts in PerformanceEvaluator (lines 824-874): the PR title's main claim is delivered.

Verdict

Request changes primarily on B1 (silent public API removal + missing CHANGELOG coverage + deleted live integration test with no replacement). The architecture of the refactor is sound; the unstated migration cost is what blocks merge.

B2 (rebase) and C1-C4 are normal cleanup. Once B1's symbols are either restored as deprecated shims or the CHANGELOG honestly states they're removed (with migration paths spelled out), this PR is mergeable.

@caohy1988

Copy link
Copy Markdown
Collaborator

Fresh follow-up after the existing REQUEST_CHANGES review:

High — bq-agent-sdk evaluate --evaluator=llm-judge is now broken by the same LLMAsJudge / Client.evaluate mismatch.

The review already calls out that Client.evaluate(LLMAsJudge_instance) now raises TypeError, but there is a concrete CLI path still constructing exactly that unsupported object:

  • src/bigquery_agent_analytics/cli.py maps correctness, hallucination, and sentiment to LLMAsJudge.*() in _LLM_JUDGES.
  • The command then calls client.evaluate(evaluator=ev, ...) unconditionally.
  • Client.evaluate() now accepts only SystemEvaluator | PerformanceEvaluator, so a real CLI invocation with --evaluator=llm-judge --criterion=correctness fails with Unsupported evaluator type: <class '...LLMAsJudge'>.

The current CLI test misses this because it mocks client.evaluate and only asserts that the constructed object is an LLMAsJudge; it never exercises the real Client.evaluate type gate.

Fix options:

  1. Keep LLMAsJudge supported in Client.evaluate() as a backward-compatible shim, or
  2. Change the CLI to build the new supported PerformanceEvaluator path for LLM judging, and update tests to exercise the real type compatibility instead of only the mock call argument.

Until this is fixed, this PR silently regresses a documented/user-facing CLI evaluation mode, not just the Python API.

@gigistark-google gigistark-google marked this pull request as draft June 5, 2026 20:37
@gigistark-google gigistark-google force-pushed the pr3_refactor branch 7 times, most recently from 676c5a7 to da0332a Compare June 5, 2026 21:07
@gigistark-google

Copy link
Copy Markdown
Contributor Author

Created a new commit that addresses the latest feedback and squashes in pr 122 (#122).

Regarding formatting, below are my thoughts:

  • Formatting Cleanup: dashboard/app.py, examples/agent_improvement_cycle/agent/tools.py, and examples/agent_improvement_cycle/agent_improvement/improver_agent.py were reformatted using pyink to align with the project style guidelines. There are no logical or functional changes in these files.
  • Notebook Diffs: The diffs in e2e_notebook_demo.ipynb, context_graph_adcp_demo.ipynb, and nba_agent_trace_analysis_notebook.ipynb are minimal format synchronizations (cell IDs, JSON metadata, and character escaping). The execution outputs are stripped (empty) in these notebooks to avoid noise.

@gigistark-google gigistark-google marked this pull request as ready for review June 5, 2026 21:19

@caohy1988 caohy1988 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fresh re-review — PR #123 head da0332a (PR #122 squashed in)

Locally checked out, 3115 tests pass (pytest tests/, excluding live-BQ tests). The June 5 squash-and-feedback commit landed all four May 11 blockers and the CLI-breakage comment. The architecture of the split (system / performance / aggregate / multi-trial modules with thin re-export shims) is clean.

The remaining issues are: a CHANGELOG that documents a behavior that no longer happens (the marquee migration line is stale after the LLMAsJudge re-add), massive notebook reformat noise, and scope-leak files the author has flagged as "pyink reformat only" but that shouldn't ride this PR.


Open from prior round — verified RESOLVED

  • B1a (May 11): the four deleted public symbols are restored in evaluators.py:
    • render_ai_generate_judge_query:304, raises DeprecationWarning
    • AI_GENERATE_JUDGE_BATCH_QUERY:344, template alias
    • LLM_JUDGE_BATCH_QUERY:394, template alias
    • split_judge_prompt_template:401, raises DeprecationWarning
    • __all__ lists them all at lines 38–41. Clean.
  • B1b (May 11): CHANGELOG migration story extended. New "Changed" block covers _JudgeCriterion, the AI.GENERATE removal, and points users at PerformanceEvaluator.
  • B1c (May 11): live integration coverage restored. tests/test_performance_evaluator_live.py (115 lines) replaces the deleted test_ai_generate_judge_live.py. Same skip-by-default gating pattern (BQAA_LIVE_BQ=1).
  • CLI breakage comment (May 11): Client.evaluate(LLMAsJudge_instance) now works. client.py:866 accepts SystemEvaluator | PerformanceEvaluator | LLMAsJudge; the new _evaluate_legacy_judge at :1017 runs the legacy path against Vertex per-session. Verified locally — no more TypeError.

Blocker

B1 — CHANGELOG migration story is stale: still says Client.evaluate(LLMAsJudge) raises TypeError, but the code accepts it

CHANGELOG.md (after the squash) contains:

  • Client.evaluate(LLMAsJudge) is no longer supported and will raise TypeError. Callers must migrate to using PerformanceEvaluator with the appropriate judge configurations.

But the code at client.py:912-918 is the explicit fix for the May 11 CLI comment:

elif isinstance(evaluator, LLMAsJudge):
    return self._evaluate_legacy_judge(
        evaluator,
        ...
    )

So Client.evaluate(LLMAsJudge_instance) returns an EvaluationReport, no TypeError. CHANGELOG ↔ code disagree on what users should expect. A reader of the CHANGELOG would (a) needlessly migrate, or (b) try the documented behavior and discover the code doesn't match the docs — eroding trust in everything else in this paragraph.

Two fixes:

  • Update the migration line to reflect the actual state:

    Client.evaluate(LLMAsJudge) continues to work as a backward-compatible path via the Gemini API per session (see _evaluate_legacy_judge). New code should prefer PerformanceEvaluator for batched multi-criterion judging.

  • Or commit to the documented removal and re-delete the LLMAsJudge dispatch. That re-introduces the CLI breakage, so it has to come with cli.py:_LLM_JUDGES migrating to construct PerformanceEvaluator instead. PR description has not signaled this intent.

Strongly recommend the first.

The "BigQuery-side AI.GENERATE batch SQL path has been removed in favor of Python-side per-session API calls" line is also slightly misleading now — AI_GENERATE_JUDGE_BATCH_QUERY / LLM_JUDGE_BATCH_QUERY / render_ai_generate_judge_query are all still exported (as deprecated shims with DeprecationWarning). Reword to "The default execution path no longer uses the AI.GENERATE batch SQL; templates and renderer are preserved as deprecated shims for callers with pre-created BQ ML models."


High

H1 — Notebook reformat noise is back; PR is un-auditable on the notebook surface

examples/context_graph_adcp_demo.ipynb:   main=3093 → PR=3115   (~+22 net, 2094 lines of churn)
examples/e2e_notebook_demo.ipynb:         main=1242 → PR=1299   (~+57 net,  495 lines of churn)
examples/nba_agent_trace_analysis_notebook.ipynb: main=514 → 530 (~+16 net, ~230 lines of churn)

Of 2819 total +/- diff lines in the three notebooks, 65 mention any renamed symbol (CodeEvaluator/SystemEvaluator/BigQueryTraceEvaluator/PerformanceEvaluator/GraderPipeline/AggregateGrader). ~2.3% rename-related, ~97.7% noise.

The June 5 comment says "minimal format synchronizations (cell IDs, JSON metadata, and character escaping). The execution outputs are stripped (empty)." Verified true on inspection — -->--> HTML escaping flips, cell-ID assignment, outputs: [] flattening. But "minimal" is misleading: the auditor cannot tell that the cells' visible content is unchanged without diffing line by line.

This is the same finding as the May 11 review section C3. Sample fix:

git checkout origin/main -- examples/*.ipynb
# Then apply only the rename hunks:
sed -i '' -e 's/BigQueryTraceEvaluator/PerformanceEvaluator/g' \
          -e 's/CodeEvaluator/SystemEvaluator/g' \
          -e 's/GraderPipeline/AggregateGrader/g' \
          examples/*.ipynb

Or split notebook reformat into a separate "Run all notebooks through nbstripout + reformat" PR.

H2 — Scope leak: dashboard/app.py (+142/-90), examples/agent_improvement_cycle/agent/tools.py (+59/-29), improver_agent.py (+64/-76) are pyink-only changes with zero rename relevance

Verified: zero CodeEvaluator|SystemEvaluator|BigQueryTraceEvaluator|PerformanceEvaluator|GraderPipeline references in all three files, both before and after. The June 5 comment correctly says "no logical or functional changes" — confirmed by spot-checking improver_agent.py: hunks are pyink dict-as-arg style (json.dumps(\n {\n ...\n }\n)json.dumps({\n ...\n})).

But: that is scope leak. The PR description is "Unify One-Sided & Side-by-Side Performance Metrics in the PerformanceEvaluator." A reader reviewing the diff has to mentally classify each of the 114 files as "in scope" or "drive-by formatter pass." Three of those files (~300 lines of diff) are pure drive-by.

Lift them into a separate "repo-wide pyink pass" PR. That PR can land in parallel; the review cost is much lower because the diff is uniformly mechanical. The May 11 review C2/C4 raised exactly this point and the author's response was "reformat is intentional." The blast radius of intentional reformat in a refactor PR is fatigue + miscategorization risk.

H3 — CHANGELOG migration paragraph indents inconsistently (formatting nit but readable badly)

### Changed
- ...
- ...
- **Unified One-Sided & Side-by-Side Performance Metrics** in the `PerformanceEvaluator`:
  - Purged ...
  - Decoupled ...
  - Overrode the backwards-compatible `LLMAsJudge` subclass ...
    - Removed `_JudgeCriterion` from public access ...
      - **Migration Story**:
        - Users should no longer construct `_JudgeCriterion` ...
        - `Client.evaluate(LLMAsJudge)` is no longer supported ...
        - The BigQuery-side `AI.GENERATE` batch SQL path ...

**Migration Story** is nested six levels deep under ### Changed. On the rendered CHANGELOG page it ends up as a tiny indent block that's easy to miss — and it's the most important part of the entry. Promote to a top-level subsection (e.g. ### Migration or a dedicated #### Migration story for the LLM-as-Judge refactor) directly under the "Unified" bullet's parent. Optional but materially improves discoverability.


Low / nit

L1 — LLM_JUDGE_BATCH_QUERY and AI_GENERATE_JUDGE_BATCH_QUERY are module-level constants but have no DeprecationWarning path

render_ai_generate_judge_query and split_judge_prompt_template raise DeprecationWarning on call. The two template-string constants (AI_GENERATE_JUDGE_BATCH_QUERY, LLM_JUDGE_BATCH_QUERY) at evaluators.py:344, 394 are bare assignments — accessing them silently returns the template. Symmetric with the function shims would mean a module-level __getattr__ that warns on first attribute access. Not blocking; minor consistency improvement.

L2 — evaluators.py is now 446 lines despite being described as a "Backward-compatibility module mapping for evaluators"

The module docstring at :18 reads """Backward-compatibility module mapping for evaluators.""". But the file then defines:

  • two long prompt templates (_CORRECTNESS_PROMPT, _HALLUCINATION_PROMPT)
  • the SQL template _AI_GENERATE_JUDGE_BATCH_QUERY_TEMPLATE (52 lines)
  • the legacy SQL template _LEGACY_LLM_JUDGE_BATCH_QUERY (43 lines)
  • the render_ai_generate_judge_query function
  • the split_judge_prompt_template function

These are concrete legacy implementations, not just compatibility re-exports. Either move them into a _legacy_judge.py submodule that evaluators.py re-exports, or update the module docstring to "Backward-compatibility module re-exports + legacy LLM-as-judge SQL templates." As-is the docstring mis-advertises the file.

L3 — Test rename pattern is internally consistent (good) but adds one redundant file

tests/test_grader_pipeline.pytests/test_aggregate_grader.py (164 lines). tests/test_sdk_evaluators.pytests/test_system_evaluator.py (398 lines diff). tests/test_multi_trial.pytests/test_multi_trial_performance_evaluator.py (46 lines). All clean.

But a new tests/test_evaluators.py (150 lines) also exists, testing the compat shims. That overlaps responsibility with the renamed files. Worth a one-line file docstring explicitly stating what test_evaluators.py covers that the renamed tests don't (the shim re-export identity assertions, presumably).


Verified clean (recording for transparency)

  • Tests: pytest tests/ -q (excluding live BQ): 3115 passed, 25 skipped.
  • Shim identity (verified each shim preserves class identity):
    from bigquery_agent_analytics.evaluators import CodeEvaluator as A
    from bigquery_agent_analytics.system_evaluator import SystemEvaluator as B
    assert A is B   # True
    from bigquery_agent_analytics.trace_evaluator import BigQueryTraceEvaluator
    from bigquery_agent_analytics.performance_evaluator import PerformanceEvaluator
    assert BigQueryTraceEvaluator is PerformanceEvaluator   # True
    from bigquery_agent_analytics.grader_pipeline import GraderPipeline
    from bigquery_agent_analytics.aggregate_grader import AggregateGrader
    assert GraderPipeline is AggregateGrader   # True
  • Client.evaluate dispatch: LLMAsJudge path is wired and dispatches to _evaluate_legacy_judge. Type gate at :919 correctly raises TypeError for unsupported types.
  • _evaluate_legacy_judge execution mode: marks report.details["execution_mode"] = "legacy_llm_judge" — distinguishable from the SystemEvaluator ("code_evaluator"-style) and PerformanceEvaluator ("performance_evaluator") paths. Good for downstream filtering.
  • Mergeable: GH reports MERGEABLE.

Verdict — REQUEST_CHANGES

B1 (CHANGELOG ↔ code disagreement on Client.evaluate(LLMAsJudge)) is the single substantive blocker — three CHANGELOG lines need to reflect the actual June 5 commit. Without this, users will read the CHANGELOG and migrate away from a path that still works.

H1 (notebook reformat noise) and H2 (dashboard/app.py + two examples files as drive-by pyink scope leak) are the same auditor-pain shape that came up in earlier rounds. Strip them out or split into their own PRs.

H3 + L1–L3 are tightening; bundle them in or skip.

The architecture of the split (system / performance / aggregate / multi-trial + shims preserving class identity) is the same as the May 11 review found — well executed. Once B1 lands and the noise is stripped, this is mergeable.

@caohy1988

Copy link
Copy Markdown
Collaborator

Fresh review against current head da0332a.

The proposed follow-up comments check out against the source.

Findings:

  1. B1: CHANGELOG contradicts the actual LLMAsJudge compatibility path. The CHANGELOG says Client.evaluate(LLMAsJudge) is no longer supported and raises TypeError, but Client.evaluate() explicitly accepts LLMAsJudge and dispatches to _evaluate_legacy_judge. There is also a unit test covering this working path. This should block until the marquee migration line is corrected: either the code should reject LLMAsJudge, or the CHANGELOG should say it remains supported through the legacy fallback while users are encouraged to migrate to PerformanceEvaluator.

  2. H1: notebook formatting noise is still large. I counted about 2,825 changed notebook lines, with only 65 matching the evaluator/judge rename surface. The exact total may differ by counting method, but the review issue is real: the rename-relevant diff is a small fraction of the notebook churn.

  3. H2: unrelated formatting-only scope leaks remain. dashboard/app.py, examples/agent_improvement_cycle/agent/tools.py, and examples/agent_improvement_cycle/agent_improvement/improver_agent.py are formatting-only/noise for this migration. I found no evaluator/judge rename relevance in those diffs.

  4. Deprecation nuance: the restored legacy functions now warn correctly, but the template constants are still plain constants and cannot warn on access. If the docs call all of those symbols “deprecated shims,” either clarify that constants are preserved compatibility exports without runtime warnings, or move to an accessor pattern in a later cleanup.

Additional process note: this PR is currently behind main and overlaps heavily with #121/#122 migration surfaces. Before final review, please rebase after removing the notebook/scope noise and make the intended stack/merge order explicit. That will make the code/doc compatibility story much easier to review.

Rename CodeEvaluator to SystemEvaluator to align with its focus on system-level metrics. A CodeEvaluator alias is kept in evaluators.py for backward-compatibility.
- Set SystemEvaluator default name to 'code_evaluator' for backward compatibility.
- Update docstring in grader_pipeline.py to use neutral language.
- Rename internal CLI constant _CODE_EVALUATORS to _SYSTEM_EVALUATORS.
- Fix ASCII art in docs/design.md.
- Clean up CHANGELOG.md and SDK.md.
…aluator default name to system_evaluator, restore deleted blank line in Cloud Run entrypoint, and rename add_code_grader to add_system_grader
This commit squashes PR 122 (Context Cache Hit Rate) and PR 123 (PerformanceEvaluator unification) rebased onto PR 121.

- Added context cache hit rate metric to SystemEvaluator and CLI.
- Unified one-sided and side-by-side performance metrics in PerformanceEvaluator.
- Decoupled system and performance modules cleanly.
- Preserved backward-compatible LLMAsJudge wrapper, restored deprecated SQL-batch path symbols in evaluators.py, and restored LLMAsJudge support in Client.evaluate().
- Added live integration test for PerformanceEvaluator LLM judge path.
- Fixed notebook demos to use the new SystemEvaluator and PerformanceEvaluator.
- Updated Client.evaluate docstring to reflect the new performance evaluation path.
- Reformatted dashboard/app.py, tools.py, and improver_agent.py using pyink (no logic changes).
- Verified notebook diffs are minimal format sync (cell IDs and JSON escaping, no execution outputs).

TAG=agy
CONV=7051de01-a124-4d4a-b178-81af860027bc
…ed renames, and document CodeEvaluator name change

@caohy1988 caohy1988 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review — PR #123 head 6e00e24

Checked out locally. The cleanup direction is right — dashboard/app.py is reverted, and the three notebooks from the June 5 H1 finding are collapsed to pure renames (18 / 26 / 6 diff lines). But the latest two commits introduced two new hard blockers, one of which breaks the test suite, and the June 5 B1 (stale CHANGELOG migration line) is still unfixed.


New blockers (introduced by commits 8d0c0d6 + 6e00e24)

NB1 — scripts/quality_report.py was "reverted" to a pre-#156 snapshot, deleting a shipped feature and breaking the test suite

The commit message says "revert unrelated changes in dashboard/app.py and scripts/quality_report.py" — but the revert went to the wrong baseline. Verified against the PR's own merge-base (9d7bbd0):

merge-base quality_report.py: 1504 lines  (has PR #156 scope-aware eval)
PR head    quality_report.py: 1228 lines  (pre-#156 snapshot)
PR's own diff vs merge-base:  +113 / -389
git log <merge-base>..origin/main -- scripts/quality_report.py: (empty — main never touched it since)

The PR deletes, among other things:

  • _load_agent_config() — scope config loading
  • _build_scope_context() — scope context for the LLM judge
  • get_eval_metrics(config_path)get_eval_metrics() (no scope support)
  • the --config CLI flag and the declined category support

That is the scope-aware quality eval shipped in PR #156 (f4aa037), wholesale removed by a PR whose stated scope is evaluator-module refactoring.

And it is CI-visible: the test suite no longer collects.

tests/test_quality_report_helpers.py:28: in <module>
    from quality_report import _AGENT_CONFIG_CACHE  # noqa: E402
E   ImportError: cannot import name '_AGENT_CONFIG_CACHE' from 'quality_report'
!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!

Fix — restore the file to exactly what the branch's merge-base (== current main) has:

git checkout $(git merge-base origin/main HEAD) -- scripts/quality_report.py
# or equivalently, since main never changed it:
git checkout origin/main -- scripts/quality_report.py
git commit -m "Restore scripts/quality_report.py to main (was rolled back past PR #156)"

After the restore, pytest tests/test_quality_report_helpers.py should collect and pass again. (With the broken file excluded, the rest of the suite is green: 3082 passed, 25 skipped — so this is the only suite breakage.)

NB2 — examples/migration_v5_demo_notebook.ipynb rewritten (3921 → 1263 lines) for zero rename benefit, and the file no longer exists on main

The same cleanup commit (8d0c0d6) stripped outputs/metadata from this notebook — a 4287-line diff (+839 / −3498). Two problems:

  1. The notebook contains zero references to any renamed symbol (verified: 0 matches for CodeEvaluator|GraderPipeline|BigQueryTraceEvaluator|MultiTrial at the merge-base). It needed no change at all; this is pure churn.
  2. Current main renamed migration_v5context_graph (PR #300, commit 19726aa), so examples/migration_v5_demo_notebook.ipynb doesn't exist on main anymore. The PR's edit guarantees a modify/delete conflict (or worse, silently resurrects a deleted file) at merge time.

Fix: drop the file from the branch entirely —

git checkout $(git merge-base origin/main HEAD) -- examples/migration_v5_demo_notebook.ipynb

and let the rebase (below) carry main's rename forward.


Still open from June 5

B1 — CHANGELOG still documents a TypeError the code no longer raises

CHANGELOG.md migration story still says:

Client.evaluate(LLMAsJudge) is no longer supported and will raise TypeError. Callers must migrate to using PerformanceEvaluator

while client.py:912 explicitly dispatches LLMAsJudge to _evaluate_legacy_judge. This was the single blocker in the June 5 review and it survived the two follow-up commits untouched. Suggested replacement wording (from the prior review):

Client.evaluate(LLMAsJudge) continues to work as a backward-compatible path via the Gemini API per session. New code should prefer PerformanceEvaluator for batched multi-criterion judging.

Also still worth softening the "AI.GENERATE batch SQL path has been removed" line, since render_ai_generate_judge_query / AI_GENERATE_JUDGE_BATCH_QUERY / LLM_JUDGE_BATCH_QUERY remain exported as deprecated shims.

H2 (partial) — examples/agent_improvement_cycle/agent/tools.py (+59/−29) and improver_agent.py (+140 reflow) are still pyink-only scope leak

dashboard/app.py was reverted (thanks), but these two are still in the diff with zero renamed-symbol relevance — pure string-reflow/format churn. Same ask as before: revert them or move them to a dedicated formatting PR.

H3 — Migration story still nested six levels deep in CHANGELOG

Unchanged from June 5. Non-blocking, but while you're editing the same paragraph for B1, promoting Migration Story to its own subsection costs nothing.

Rebase — 13 commits behind; SDK.md + CHANGELOG.md conflict

Reproduced locally: git merge origin/main → content conflicts in SDK.md and CHANGELOG.md. Main's migration_v5context_graph rename (#300) and other changes have landed since the branch's base. Rebase before the next push so NB2's modify/delete conflict and these content conflicts get resolved in one pass.


Verified RESOLVED from June 5

  • dashboard/app.py reverted — no longer in the diff.
  • Three notebooks (context_graph_adcp_demo, e2e_notebook_demo, nba_…) collapsed from ~2800 lines of churn to 18 / 26 / 6 rename-only lines.
  • CHANGELOG now documents the CodeEvaluator()evaluator_name="system_evaluator" behavior change (commit 8d0c0d6), and the "(deprecated…)" wording was dropped (commit 28664c9) — consistent with PR #121.
  • All four legacy public symbols remain restored with DeprecationWarning shims; Client.evaluate accepts SystemEvaluator | PerformanceEvaluator | LLMAsJudge; shim class-identity assertions all hold.
  • Suite (excluding the NB1-broken file): 3082 passed, 25 skipped.

Verdict — REQUEST_CHANGES

  1. NB1 — restore scripts/quality_report.py from main; this currently deletes PR #156's shipped feature and breaks test collection. One git checkout + commit.
  2. NB2 — drop the migration_v5_demo_notebook.ipynb rewrite; the file is churn-only and deleted on main.
  3. B1 — fix the stale TypeError migration line (third time flagging; it's a two-line CHANGELOG edit).
  4. Rebase onto current main (resolves SDK.md/CHANGELOG.md conflicts and the #300 rename).
  5. H2-partial / H3 — optional polish in the same push.

The evaluator-module refactor itself remains in good shape — every issue here is in the cleanup commits' blast radius, not the core split.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants