GoogleCloudPlatform · gigistark-google · May 3, 2026 · May 12, 2026 · Jun 5, 2026 · Jun 5, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -7,6 +7,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+### Added
+
+- Added `SystemEvaluator` as the preferred name for deterministic/code-defined metrics.
+- Kept `CodeEvaluator` as a backward-compatible alias. Note that calling `CodeEvaluator()` now emits `evaluator_name="system_evaluator"`.
+
 ## [0.3.2] - 2026-05-22
 
 ### Release highlights
@@ -950,6 +955,18 @@ BQAA loop
   re-opens the choice for the session-aggregated `AI.GENERATE`
   tier with Option C (SQL / Python UDF) as the primary candidate.
   Unblocks the compile-harness PR.
+### Changed
+- **Renamed `CodeEvaluator` to `SystemEvaluator`** to align with its focus on system-level metrics. Kept `CodeEvaluator` as an alias for backward compatibility.
+- **Staged identical baseline copies and test renames** for lineage tracking.
+- **Unified One-Sided & Side-by-Side Performance Metrics** in the `PerformanceEvaluator`:
+  - Removed the obsolete criteria-list `LLMAsJudge` implementations; the Tone, Faithfulness, Correctness, and Efficiency evaluations are now folded into `PerformanceEvaluator`.
+  - Decoupled the system and performance modules, so `system_evaluator.py` now contains only `SystemEvaluator`.
+  - Overrode the backwards-compatible `LLMAsJudge` subclass in `evaluators.py` with required static factories for correctness, hallucination, and sentiment.
+    - Removed `_JudgeCriterion` from public access in `system_evaluator.py` (now internal to `evaluators.py`).
+      - **Migration Story**:
+        - Users should no longer construct `_JudgeCriterion` objects directly. Instead, use `LLMAsJudge.add_criterion(name, prompt_template, score_key, threshold)` to add criteria to a judge instance.
+        - `Client.evaluate(LLMAsJudge)` is no longer supported and raises `TypeError`. Callers must migrate to `PerformanceEvaluator` with the appropriate judge configurations.
+        - The BigQuery-side `AI.GENERATE` batch SQL path has been removed in favor of Python-side per-session API calls. Callers that depend on `render_ai_generate_judge_query` or the `AI_GENERATE_JUDGE_BATCH_QUERY` template constant must migrate to `PerformanceEvaluator`, which handles LLM evaluation programmatically. The legacy symbols are preserved as deprecated shims in `evaluators.py`.
 
 ## [0.2.3] - 2026-04-27
 

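The alias behavior noted in the `[Unreleased]` entry (calling `CodeEvaluator()` now emits `evaluator_name="system_evaluator"`) follows naturally if the old name is bound to the same class object as the new one. The sketch below is a minimal illustration under that assumption, not the library's actual implementation: the class body, the `describe` method, and the way `evaluator_name` is stored are all hypothetical stand-ins.

```python
class SystemEvaluator:
    """Hypothetical stand-in for an evaluator of deterministic metrics."""

    # Assumption: the reported name is fixed on the class, not derived
    # from whichever identifier the caller used to construct it.
    evaluator_name = "system_evaluator"

    def describe(self) -> dict:
        """Return the metadata this evaluator would emit."""
        return {"evaluator_name": self.evaluator_name}


# Backward-compatible alias: both names refer to the same class object.
CodeEvaluator = SystemEvaluator

legacy = CodeEvaluator()
print(legacy.describe())                    # {'evaluator_name': 'system_evaluator'}
print(isinstance(legacy, SystemEvaluator))  # True
```

Under this pattern, existing imports of `CodeEvaluator` keep working, but any downstream code that filters results on the old evaluator name would need to match `"system_evaluator"` instead.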
diff --git a/README.md b/README.md
@@ -25,10 +25,9 @@ regressions — all through BigQuery SQL or Python.
 - Observability dashboards (SQL and BigFrames)
 
 **Evaluation**
-- Code-based metrics (latency, turn count, error rate, token efficiency, cost)
-- LLM-as-Judge scoring (correctness, hallucination, sentiment)
-- Trajectory matching (exact, in-order, any-order)
-- Multi-trial evaluation with pass@k / pass^k
+- System metrics (latency, turn count, tool call error rate, token efficiency, time to first token, cost)
+- Performance metrics (correctness, hallucination, sentiment, efficiency, etc.)
+- Multi-trial system and performance metrics
 - Grader composition (weighted, binary, majority strategies)
 - Eval suite lifecycle management with graduation and saturation detection
 - Static quality validation (ambiguous tasks, class imbalance, suspicious thresholds)
@@ -123,10 +122,10 @@ src/bigquery_agent_analytics/
 │   └── formatter.py               # Output formatting (json/text/table)
 │
 ├── Evaluation
-│   ├── evaluators.py              # CodeEvaluator + LLMAsJudge
-│   ├── trace_evaluator.py         # Trajectory matching & replay
-│   ├── multi_trial.py             # Multi-trial runner + pass@k
-│   ├── grader_pipeline.py         # Grader composition pipeline
+│   ├── system_evaluator.py        # SystemEvaluator
+│   ├── performance_evaluator.py   # PerformanceEvaluator
+│   ├── multi_trial_performance_evaluator.py # MultiTrialPerformanceEvaluator
+│   ├── aggregate_grader.py        # AggregateGrader
 │   ├── eval_suite.py              # Eval suite lifecycle management
 │   └── eval_validator.py          # Static validation checks
 │