Unify One-Sided & Side-by-Side Performance Metrics in the PerformanceEvaluator.#123
Unify One-Sided & Side-by-Side Performance Metrics in the PerformanceEvaluator.#123gigistark-google wants to merge 8 commits into
Conversation
caohy1988
left a comment
There was a problem hiding this comment.
Review of PR #123 head 081a92d
Third PR in the stack (#121 → #122 → #123). Verified locally on a fresh checkout. The split-and-shim pattern is well executed — class identity is preserved across the backward-compat shims, the full test suite passes (1772 / 5 skipped), and lint is clean. But the PR also performs silent public-API removals that the CHANGELOG doesn't fully document, and that's the blocker.
What was verified clean
-
✅ Test suite green:
pytest tests/ -q→ 1772 passed, 5 skipped, 17 warnings. -
✅ Lint clean:
pyink --check src/ tests/andisort --check-only src/ tests/both pass. -
✅ Class identity preserved across shims:
from bigquery_agent_analytics import CodeEvaluator, SystemEvaluator assert CodeEvaluator is SystemEvaluator # True from bigquery_agent_analytics.evaluators import CodeEvaluator as A from bigquery_agent_analytics.system_evaluator import SystemEvaluator as B assert A is B # True from bigquery_agent_analytics.trace_evaluator import BigQueryTraceEvaluator from bigquery_agent_analytics.performance_evaluator import PerformanceEvaluator assert BigQueryTraceEvaluator is PerformanceEvaluator # True from bigquery_agent_analytics.grader_pipeline import GraderPipeline from bigquery_agent_analytics.aggregate_grader import AggregateGrader assert GraderPipeline is AggregateGrader # True
The four shim files (
evaluators.py,trace_evaluator.py,multi_trial.py,grader_pipeline.py) are now 17–239 lines each, re-exporting from the canonical new modules. Standard Python import semantics give a single class object, accessible from both old and new paths. -
✅ The split is clean:
system_evaluator.pyis pure toSystemEvaluator;performance_evaluator.pyhousesPerformanceEvaluator(with both one-sided and side-by-side judge prompts unified at lines 824-874);aggregate_grader.pyhousesAggregateGrader;multi_trial_performance_evaluator.pyhouses the multi-trial harness.
Blockers
B1 — Public API removed silently; CHANGELOG migration story is incomplete
The PR deletes four documented public symbols from evaluators.py:
render_ai_generate_judge_queryAI_GENERATE_JUDGE_BATCH_QUERYLLM_JUDGE_BATCH_QUERYsplit_judge_prompt_template
These were the SQL-batch path for LLM-as-judge — submitting AI.GENERATE via BigQuery rather than per-session API calls. The previous release, [0.2.3] (2026-04-27), was specifically a fix to that path:
Fixed
- LLM-as-Judge AI.GENERATE path now executes against current BigQuery. Earlier versions emitted a table-valued [...]
Eight days later this PR removes the feature wholesale. The Unreleased CHANGELOG entry covers:
Purged obsolete criteria-list
LLMAsJudgeimplementations, replacing them natively withPerformanceEvaluator[...]
Migration Story: Users should no longer construct_JudgeCriterionobjects directly. Instead, useLLMAsJudge.add_criterion(...).
But this only addresses the _JudgeCriterion constructor change. It does not mention:
render_ai_generate_judge_queryis gone (grep -rn "render_ai_generate_judge_query" src/→ 0 hits).AI_GENERATE_JUDGE_BATCH_QUERYis gone.LLM_JUDGE_BATCH_QUERYis gone.split_judge_prompt_templateis gone.Client.evaluate(LLMAsJudge_instance)now raisesTypeError—client.py:864-912shows the new dispatch only acceptsSystemEvaluatororPerformanceEvaluator. Anyone who didclient.evaluate(evaluator=LLMAsJudge.correctness())(a documented pattern in the previous SDK.md and shipped tests) will now get a runtime error.- The live integration test that guarded the BQ-side judge SQL is deleted (
tests/test_ai_generate_judge_live.py, 203 lines).
Three asks:
-
Restore the public symbols as backward-compat shims in
evaluators.py, even if they raiseDeprecationWarning. They were public exports in the previous SDK.md. -
Extend the CHANGELOG migration story to spell out:
- That
Client.evaluate(LLMAsJudge)no longer works and how to migrate (presumablyPerformanceEvaluator+ judge factory). - That the AI.GENERATE batch path is gone; per-session API calls are the new path.
- That callers depending on
render_ai_generate_judge_query/AI_GENERATE_JUDGE_BATCH_QUERYneed to migrate.
- That
-
Restore live integration coverage for the new per-session API path. The deleted file's docstring is explicit about why it existed:
"These tests submit the exact SQL produced by
render_ai_generate_judge_queryto a real BigQuery project [...] Mocks alone won't catch that class of bug."The new
_evaluate_performancepath inclient.pyusesclient.aio.models.generate_content(inperformance_evaluator.py:850) — that's an entirely different integration contract, with no live test guarding it. The 0.2.3 fix's lesson applies in reverse: now that the SDK calls Vertex AI directly per-session, a mock-only test surface won't catch when (e.g.) the genai client signature changes or the response schema drifts. A skip-by-default live test for the new path would restore the safety net.
B2 — Stack inherits 24-commits-behind stale base
Same as #121 / #122. After the rebase, several files in PR 123's diff (notably client.py, dashboard/app.py, examples/*.ipynb) will need conflict resolution. The client.py change is particularly large (97+ / 406-) — worth re-checking after rebase that no surface intended to stay was lost during conflict resolution.
Concerns
C1 — client.py evaluate() docstring claim is inconsistent with implementation
client.py:875-876 reads:
"
PerformanceEvaluatormetrics use BQML'sML.GENERATE_TEXTfor zero-ETL evaluation."
But the new _evaluate_performance dispatch (and PerformanceEvaluator.judge() at performance_evaluator.py:824-874) calls client.aio.models.generate_content directly through google-genai. There's no ML.GENERATE_TEXT call in the path. The docstring describes the deleted path, not the new one. Fix the docstring to match the actual implementation ("uses the Vertex AI Generative API per session") or, if BQML support is still planned, mark it as a TODO.
C2 — dashboard/app.py 142+/90- — what changed?
Not strictly in the evaluator/grader/multi-trial scope. Worth a sentence in the PR body explaining whether this is a related cleanup or just pyink reformatting (the PR body says "some additional files were reformatted as part of this change"). If it's purely formatting, the diff stat should be much smaller than 142+/90-. Suggest a glance at this file's actual changes to confirm.
C3 — Notebook diffs are massive
Three notebooks have major reformats: e2e_notebook_demo.ipynb (238/210), nba_agent_trace_analysis_notebook.ipynb (115/99), context_graph_adcp_demo.ipynb (1041/1019). These are notebook JSON, so the line counts are misleading — but if the PR body intent is "isolate system metrics in system_evaluator and performance metrics in performance_evaluator," the notebooks should change only insofar as their imports point at the new modules. A 1041-line change in context_graph_adcp_demo.ipynb is too big for that. Worth a confirmation: did the cells re-execute and serialize new outputs into the notebook diff? If yes, those changes should be a separate PR or stripped (the standard repo practice is nbstripout or equivalent for executed-cell noise).
C4 — examples/agent_improvement_cycle/agent/tools.py 35+/24- and improver_agent.py 64+/76-
Substantive logic changes in an example agent. Not in the stated PR scope ("isolate system / performance metrics"). What changed and why? If it's part of the rename refactor, OK. If it's an unrelated improvement, split it out.
What's already correct (acknowledge)
- ✅ The shim pattern in
evaluators.py,trace_evaluator.py,multi_trial.py,grader_pipeline.pyis textbook. Re-exports preserve class identity. Hint to future readers via the module docstring ("Backward-compatibility module mapping for ...") is helpful. - ✅
system_evaluator.pyandperformance_evaluator.pyare clean splits — no cross-module circular dependencies, no shared mutable state. - ✅ Lint clean across the entire src/ and tests/ (the PR description's caveat about "some additional files were reformatted as part of this change" matches the observable lint state).
- ✅ Test suite green: 1772 passed.
- ✅
utils.py(new file, 110 lines) hosts_parse_json_from_text— the right place for it after the evaluators split. - ✅ Unified one-sided / side-by-side judge prompts in
PerformanceEvaluator(lines 824-874): the PR title's main claim is delivered.
Verdict
Request changes primarily on B1 (silent public API removal + missing CHANGELOG coverage + deleted live integration test with no replacement). The architecture of the refactor is sound; the unstated migration cost is what blocks merge.
B2 (rebase) and C1-C4 are normal cleanup. Once B1's symbols are either restored as deprecated shims or the CHANGELOG honestly states they're removed (with migration paths spelled out), this PR is mergeable.
|
Fresh follow-up after the existing REQUEST_CHANGES review: High — The review already calls out that
The current CLI test misses this because it mocks Fix options:
Until this is fixed, this PR silently regresses a documented/user-facing CLI evaluation mode, not just the Python API. |
081a92d to
5131a5c
Compare
676c5a7 to
da0332a
Compare
|
Created a new commit that addresses the latest feedback and squashes in pr 122 (#122). Regarding formatting, below are my thoughts:
|
caohy1988
left a comment
There was a problem hiding this comment.
Fresh re-review — PR #123 head da0332a (PR #122 squashed in)
Locally checked out, 3115 tests pass (pytest tests/, excluding live-BQ tests). The June 5 squash-and-feedback commit landed all four May 11 blockers and the CLI-breakage comment. The architecture of the split (system / performance / aggregate / multi-trial modules with thin re-export shims) is clean.
The remaining issues are: a CHANGELOG that documents a behavior that no longer happens (the marquee migration line is stale after the LLMAsJudge re-add), massive notebook reformat noise, and scope-leak files the author has flagged as "pyink reformat only" but that shouldn't ride this PR.
Open from prior round — verified RESOLVED
- B1a (May 11): the four deleted public symbols are restored in
evaluators.py:render_ai_generate_judge_query—:304, raisesDeprecationWarningAI_GENERATE_JUDGE_BATCH_QUERY—:344, template aliasLLM_JUDGE_BATCH_QUERY—:394, template aliassplit_judge_prompt_template—:401, raisesDeprecationWarning__all__lists them all at lines 38–41. Clean.
- B1b (May 11): CHANGELOG migration story extended. New "Changed" block covers
_JudgeCriterion, theAI.GENERATEremoval, and points users atPerformanceEvaluator. - B1c (May 11): live integration coverage restored.
tests/test_performance_evaluator_live.py(115 lines) replaces the deletedtest_ai_generate_judge_live.py. Same skip-by-default gating pattern (BQAA_LIVE_BQ=1). - CLI breakage comment (May 11):
Client.evaluate(LLMAsJudge_instance)now works.client.py:866acceptsSystemEvaluator | PerformanceEvaluator | LLMAsJudge; the new_evaluate_legacy_judgeat:1017runs the legacy path against Vertex per-session. Verified locally — no moreTypeError.
Blocker
B1 — CHANGELOG migration story is stale: still says Client.evaluate(LLMAsJudge) raises TypeError, but the code accepts it
CHANGELOG.md (after the squash) contains:
Client.evaluate(LLMAsJudge)is no longer supported and will raiseTypeError. Callers must migrate to usingPerformanceEvaluatorwith the appropriate judge configurations.
But the code at client.py:912-918 is the explicit fix for the May 11 CLI comment:
elif isinstance(evaluator, LLMAsJudge):
return self._evaluate_legacy_judge(
evaluator,
...
)So Client.evaluate(LLMAsJudge_instance) returns an EvaluationReport, no TypeError. CHANGELOG ↔ code disagree on what users should expect. A reader of the CHANGELOG would (a) needlessly migrate, or (b) try the documented behavior and discover the code doesn't match the docs — eroding trust in everything else in this paragraph.
Two fixes:
-
Update the migration line to reflect the actual state:
Client.evaluate(LLMAsJudge)continues to work as a backward-compatible path via the Gemini API per session (see_evaluate_legacy_judge). New code should preferPerformanceEvaluatorfor batched multi-criterion judging. -
Or commit to the documented removal and re-delete the LLMAsJudge dispatch. That re-introduces the CLI breakage, so it has to come with
cli.py:_LLM_JUDGESmigrating to constructPerformanceEvaluatorinstead. PR description has not signaled this intent.
Strongly recommend the first.
The "BigQuery-side AI.GENERATE batch SQL path has been removed in favor of Python-side per-session API calls" line is also slightly misleading now — AI_GENERATE_JUDGE_BATCH_QUERY / LLM_JUDGE_BATCH_QUERY / render_ai_generate_judge_query are all still exported (as deprecated shims with DeprecationWarning). Reword to "The default execution path no longer uses the AI.GENERATE batch SQL; templates and renderer are preserved as deprecated shims for callers with pre-created BQ ML models."
High
H1 — Notebook reformat noise is back; PR is un-auditable on the notebook surface
examples/context_graph_adcp_demo.ipynb: main=3093 → PR=3115 (~+22 net, 2094 lines of churn)
examples/e2e_notebook_demo.ipynb: main=1242 → PR=1299 (~+57 net, 495 lines of churn)
examples/nba_agent_trace_analysis_notebook.ipynb: main=514 → 530 (~+16 net, ~230 lines of churn)
Of 2819 total +/- diff lines in the three notebooks, 65 mention any renamed symbol (CodeEvaluator/SystemEvaluator/BigQueryTraceEvaluator/PerformanceEvaluator/GraderPipeline/AggregateGrader). ~2.3% rename-related, ~97.7% noise.
The June 5 comment says "minimal format synchronizations (cell IDs, JSON metadata, and character escaping). The execution outputs are stripped (empty)." Verified true on inspection — --> ↔ --> HTML escaping flips, cell-ID assignment, outputs: [] flattening. But "minimal" is misleading: the auditor cannot tell that the cells' visible content is unchanged without diffing line by line.
This is the same finding as the May 11 review section C3. Sample fix:
git checkout origin/main -- examples/*.ipynb
# Then apply only the rename hunks:
sed -i '' -e 's/BigQueryTraceEvaluator/PerformanceEvaluator/g' \
-e 's/CodeEvaluator/SystemEvaluator/g' \
-e 's/GraderPipeline/AggregateGrader/g' \
examples/*.ipynbOr split notebook reformat into a separate "Run all notebooks through nbstripout + reformat" PR.
H2 — Scope leak: dashboard/app.py (+142/-90), examples/agent_improvement_cycle/agent/tools.py (+59/-29), improver_agent.py (+64/-76) are pyink-only changes with zero rename relevance
Verified: zero CodeEvaluator|SystemEvaluator|BigQueryTraceEvaluator|PerformanceEvaluator|GraderPipeline references in all three files, both before and after. The June 5 comment correctly says "no logical or functional changes" — confirmed by spot-checking improver_agent.py: hunks are pyink dict-as-arg style (json.dumps(\n {\n ...\n }\n) → json.dumps({\n ...\n})).
But: that is scope leak. The PR description is "Unify One-Sided & Side-by-Side Performance Metrics in the PerformanceEvaluator." A reader reviewing the diff has to mentally classify each of the 114 files as "in scope" or "drive-by formatter pass." Three of those files (~300 lines of diff) are pure drive-by.
Lift them into a separate "repo-wide pyink pass" PR. That PR can land in parallel; the review cost is much lower because the diff is uniformly mechanical. The May 11 review C2/C4 raised exactly this point and the author's response was "reformat is intentional." The blast radius of intentional reformat in a refactor PR is fatigue + miscategorization risk.
H3 — CHANGELOG migration paragraph indents inconsistently (formatting nit but readable badly)
### Changed
- ...
- ...
- **Unified One-Sided & Side-by-Side Performance Metrics** in the `PerformanceEvaluator`:
- Purged ...
- Decoupled ...
- Overrode the backwards-compatible `LLMAsJudge` subclass ...
- Removed `_JudgeCriterion` from public access ...
- **Migration Story**:
- Users should no longer construct `_JudgeCriterion` ...
- `Client.evaluate(LLMAsJudge)` is no longer supported ...
- The BigQuery-side `AI.GENERATE` batch SQL path ...**Migration Story** is nested six levels deep under ### Changed. On the rendered CHANGELOG page it ends up as a tiny indent block that's easy to miss — and it's the most important part of the entry. Promote to a top-level subsection (e.g. ### Migration or a dedicated #### Migration story for the LLM-as-Judge refactor) directly under the "Unified" bullet's parent. Optional but materially improves discoverability.
Low / nit
L1 — LLM_JUDGE_BATCH_QUERY and AI_GENERATE_JUDGE_BATCH_QUERY are module-level constants but have no DeprecationWarning path
render_ai_generate_judge_query and split_judge_prompt_template raise DeprecationWarning on call. The two template-string constants (AI_GENERATE_JUDGE_BATCH_QUERY, LLM_JUDGE_BATCH_QUERY) at evaluators.py:344, 394 are bare assignments — accessing them silently returns the template. Symmetric with the function shims would mean a module-level __getattr__ that warns on first attribute access. Not blocking; minor consistency improvement.
L2 — evaluators.py is now 446 lines despite being described as a "Backward-compatibility module mapping for evaluators"
The module docstring at :18 reads """Backward-compatibility module mapping for evaluators.""". But the file then defines:
- two long prompt templates (
_CORRECTNESS_PROMPT,_HALLUCINATION_PROMPT) - the SQL template
_AI_GENERATE_JUDGE_BATCH_QUERY_TEMPLATE(52 lines) - the legacy SQL template
_LEGACY_LLM_JUDGE_BATCH_QUERY(43 lines) - the
render_ai_generate_judge_queryfunction - the
split_judge_prompt_templatefunction
These are concrete legacy implementations, not just compatibility re-exports. Either move them into a _legacy_judge.py submodule that evaluators.py re-exports, or update the module docstring to "Backward-compatibility module re-exports + legacy LLM-as-judge SQL templates." As-is the docstring mis-advertises the file.
L3 — Test rename pattern is internally consistent (good) but adds one redundant file
tests/test_grader_pipeline.py → tests/test_aggregate_grader.py (164 lines). tests/test_sdk_evaluators.py → tests/test_system_evaluator.py (398 lines diff). tests/test_multi_trial.py → tests/test_multi_trial_performance_evaluator.py (46 lines). All clean.
But a new tests/test_evaluators.py (150 lines) also exists, testing the compat shims. That overlaps responsibility with the renamed files. Worth a one-line file docstring explicitly stating what test_evaluators.py covers that the renamed tests don't (the shim re-export identity assertions, presumably).
Verified clean (recording for transparency)
- Tests:
pytest tests/ -q(excluding live BQ): 3115 passed, 25 skipped. - Shim identity (verified each shim preserves class identity):
from bigquery_agent_analytics.evaluators import CodeEvaluator as A from bigquery_agent_analytics.system_evaluator import SystemEvaluator as B assert A is B # True from bigquery_agent_analytics.trace_evaluator import BigQueryTraceEvaluator from bigquery_agent_analytics.performance_evaluator import PerformanceEvaluator assert BigQueryTraceEvaluator is PerformanceEvaluator # True from bigquery_agent_analytics.grader_pipeline import GraderPipeline from bigquery_agent_analytics.aggregate_grader import AggregateGrader assert GraderPipeline is AggregateGrader # True
Client.evaluatedispatch:LLMAsJudgepath is wired and dispatches to_evaluate_legacy_judge. Type gate at:919correctly raisesTypeErrorfor unsupported types._evaluate_legacy_judgeexecution mode: marksreport.details["execution_mode"] = "legacy_llm_judge"— distinguishable from the SystemEvaluator ("code_evaluator"-style) and PerformanceEvaluator ("performance_evaluator") paths. Good for downstream filtering.- Mergeable: GH reports
MERGEABLE.
Verdict — REQUEST_CHANGES
B1 (CHANGELOG ↔ code disagreement on Client.evaluate(LLMAsJudge)) is the single substantive blocker — three CHANGELOG lines need to reflect the actual June 5 commit. Without this, users will read the CHANGELOG and migrate away from a path that still works.
H1 (notebook reformat noise) and H2 (dashboard/app.py + two examples files as drive-by pyink scope leak) are the same auditor-pain shape that came up in earlier rounds. Strip them out or split into their own PRs.
H3 + L1–L3 are tightening; bundle them in or skip.
The architecture of the split (system / performance / aggregate / multi-trial + shims preserving class identity) is the same as the May 11 review found — well executed. Once B1 lands and the noise is stripped, this is mergeable.
|
Fresh review against current head The proposed follow-up comments check out against the source. Findings:
Additional process note: this PR is currently behind |
Rename CodeEvaluator to SystemEvaluator to align with its focus on system-level metrics. A CodeEvaluator alias is kept in evaluators.py for backward-compatibility.
- Set SystemEvaluator default name to 'code_evaluator' for backward compatibility. - Update docstring in grader_pipeline.py to use neutral language. - Rename internal CLI constant _CODE_EVALUATORS to _SYSTEM_EVALUATORS. - Fix ASCII art in docs/design.md. - Clean up CHANGELOG.md and SDK.md.
…s of SystemEvaluator
…aluator default name to system_evaluator, restore deleted blank line in Cloud Run entrypoint, and rename add_code_grader to add_system_grader
This commit squashes PR 122 (Context Cache Hit Rate) and PR 123 (PerformanceEvaluator unification) rebased onto PR 121. - Added context cache hit rate metric to SystemEvaluator and CLI. - Unified one-sided and side-by-side performance metrics in PerformanceEvaluator. - Decoupled system and performance modules cleanly. - Preserved backward-compatible LLMAsJudge wrapper, restored deprecated SQL-batch path symbols in evaluators.py, and restored LLMAsJudge support in Client.evaluate(). - Added live integration test for PerformanceEvaluator LLM judge path. - Fixed notebook demos to use the new SystemEvaluator and PerformanceEvaluator. - Updated Client.evaluate docstring to reflect the new performance evaluation path. - Reformatted dashboard/app.py, tools.py, and improver_agent.py using pyink (no logic changes). - Verified notebook diffs are minimal format sync (cell IDs and JSON escaping, no execution outputs). TAG=agy CONV=7051de01-a124-4d4a-b178-81af860027bc
…ed renames, and document CodeEvaluator name change
… scripts/quality_report.py
4cc8869 to
6e00e24
Compare
caohy1988
left a comment
There was a problem hiding this comment.
Re-review — PR #123 head 6e00e24
Checked out locally. The cleanup direction is right — dashboard/app.py is reverted, and the three notebooks from the June 5 H1 finding are collapsed to pure renames (18 / 26 / 6 diff lines). But the latest two commits introduced two new hard blockers, one of which breaks the test suite, and the June 5 B1 (stale CHANGELOG migration line) is still unfixed.
New blockers (introduced by commits 8d0c0d6 + 6e00e24)
NB1 — scripts/quality_report.py was "reverted" to a pre-#156 snapshot, deleting a shipped feature and breaking the test suite
The commit message says "revert unrelated changes in dashboard/app.py and scripts/quality_report.py" — but the revert went to the wrong baseline. Verified against the PR's own merge-base (9d7bbd0):
merge-base quality_report.py: 1504 lines (has PR #156 scope-aware eval)
PR head quality_report.py: 1228 lines (pre-#156 snapshot)
PR's own diff vs merge-base: +113 / -389
git log <merge-base>..origin/main -- scripts/quality_report.py: (empty — main never touched it since)
The PR deletes, among other things:
_load_agent_config()— scope config loading_build_scope_context()— scope context for the LLM judgeget_eval_metrics(config_path)→get_eval_metrics()(no scope support)- the
--configCLI flag and thedeclinedcategory support
That is the scope-aware quality eval shipped in PR #156 (f4aa037), wholesale removed by a PR whose stated scope is evaluator-module refactoring.
And it is CI-visible: the test suite no longer collects.
tests/test_quality_report_helpers.py:28: in <module>
from quality_report import _AGENT_CONFIG_CACHE # noqa: E402
E ImportError: cannot import name '_AGENT_CONFIG_CACHE' from 'quality_report'
!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!
Fix — restore the file to exactly what the branch's merge-base (== current main) has:
git checkout $(git merge-base origin/main HEAD) -- scripts/quality_report.py
# or equivalently, since main never changed it:
git checkout origin/main -- scripts/quality_report.py
git commit -m "Restore scripts/quality_report.py to main (was rolled back past PR #156)"After the restore, pytest tests/test_quality_report_helpers.py should collect and pass again. (With the broken file excluded, the rest of the suite is green: 3082 passed, 25 skipped — so this is the only suite breakage.)
NB2 — examples/migration_v5_demo_notebook.ipynb rewritten (3921 → 1263 lines) for zero rename benefit, and the file no longer exists on main
The same cleanup commit (8d0c0d6) stripped outputs/metadata from this notebook — a 4287-line diff (+839 / −3498). Two problems:
- The notebook contains zero references to any renamed symbol (verified: 0 matches for
CodeEvaluator|GraderPipeline|BigQueryTraceEvaluator|MultiTrialat the merge-base). It needed no change at all; this is pure churn. - Current
mainrenamedmigration_v5→context_graph(PR #300, commit19726aa), soexamples/migration_v5_demo_notebook.ipynbdoesn't exist on main anymore. The PR's edit guarantees a modify/delete conflict (or worse, silently resurrects a deleted file) at merge time.
Fix: drop the file from the branch entirely —
git checkout $(git merge-base origin/main HEAD) -- examples/migration_v5_demo_notebook.ipynband let the rebase (below) carry main's rename forward.
Still open from June 5
B1 — CHANGELOG still documents a TypeError the code no longer raises
CHANGELOG.md migration story still says:
Client.evaluate(LLMAsJudge)is no longer supported and will raiseTypeError. Callers must migrate to usingPerformanceEvaluator…
while client.py:912 explicitly dispatches LLMAsJudge to _evaluate_legacy_judge. This was the single blocker in the June 5 review and it survived the two follow-up commits untouched. Suggested replacement wording (from the prior review):
Client.evaluate(LLMAsJudge)continues to work as a backward-compatible path via the Gemini API per session. New code should preferPerformanceEvaluatorfor batched multi-criterion judging.
Also still worth softening the "AI.GENERATE batch SQL path has been removed" line, since render_ai_generate_judge_query / AI_GENERATE_JUDGE_BATCH_QUERY / LLM_JUDGE_BATCH_QUERY remain exported as deprecated shims.
H2 (partial) — examples/agent_improvement_cycle/agent/tools.py (+59/−29) and improver_agent.py (+140 reflow) are still pyink-only scope leak
dashboard/app.py was reverted (thanks), but these two are still in the diff with zero renamed-symbol relevance — pure string-reflow/format churn. Same ask as before: revert them or move them to a dedicated formatting PR.
H3 — Migration story still nested six levels deep in CHANGELOG
Unchanged from June 5. Non-blocking, but while you're editing the same paragraph for B1, promoting Migration Story to its own subsection costs nothing.
Rebase — 13 commits behind; SDK.md + CHANGELOG.md conflict
Reproduced locally: git merge origin/main → content conflicts in SDK.md and CHANGELOG.md. Main's migration_v5 → context_graph rename (#300) and other changes have landed since the branch's base. Rebase before the next push so NB2's modify/delete conflict and these content conflicts get resolved in one pass.
Verified RESOLVED from June 5
dashboard/app.pyreverted — no longer in the diff.- Three notebooks (
context_graph_adcp_demo,e2e_notebook_demo,nba_…) collapsed from ~2800 lines of churn to 18 / 26 / 6 rename-only lines. - CHANGELOG now documents the
CodeEvaluator()→evaluator_name="system_evaluator"behavior change (commit8d0c0d6), and the "(deprecated…)" wording was dropped (commit28664c9) — consistent with PR #121. - All four legacy public symbols remain restored with
DeprecationWarningshims;Client.evaluateacceptsSystemEvaluator | PerformanceEvaluator | LLMAsJudge; shim class-identity assertions all hold. - Suite (excluding the NB1-broken file): 3082 passed, 25 skipped.
Verdict — REQUEST_CHANGES
- NB1 — restore
scripts/quality_report.pyfrom main; this currently deletes PR #156's shipped feature and breaks test collection. Onegit checkout+ commit. - NB2 — drop the
migration_v5_demo_notebook.ipynbrewrite; the file is churn-only and deleted on main. - B1 — fix the stale
TypeErrormigration line (third time flagging; it's a two-line CHANGELOG edit). - Rebase onto current main (resolves SDK.md/CHANGELOG.md conflicts and the #300 rename).
- H2-partial / H3 — optional polish in the same push.
The evaluator-module refactor itself remains in good shape — every issue here is in the cleanup commits' blast radius, not the core split.
This change isolates system metrics in the system-evaluator and performance-metrics in the performance evaluator.
Note that in order to pass
pyink --config pyproject.toml --check, some additional files were reformatted as part of this change.