diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 10 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 7 additions & 8 deletions
@@ -7,6 +7,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+### Changed
+- **Renamed `CodeEvaluator` to `SystemEvaluator`** to align with its focus on system-level metrics. Kept `CodeEvaluator` as an alias for backward compatibility.
+- **Staged identical baseline copies and test renames** for lineage tracking.
+- **Unified One-Sided & Side-by-Side Performance Metrics** in the `PerformanceEvaluator`:
+  - Purged obsolete criteria-list `LLMAsJudge` implementations, replacing them natively with `PerformanceEvaluator` for folded Tone, Faithfulness, Correctness, and Efficiency evaluations.
+  - Decoupled the system and performance modules, so `system_evaluator.py` now contains only `SystemEvaluator`.
+  - Overrode the backwards-compatible `LLMAsJudge` subclass in `evaluators.py` with required static factories for correctness, hallucination, and sentiment.
+    - Removed `_JudgeCriterion` from public access in `system_evaluator.py` (now internal to `evaluators.py`).
+      - **Migration Story**: Users should no longer construct `_JudgeCriterion` objects directly. Instead, use `LLMAsJudge.add_criterion(name, prompt_template, score_key, threshold)` to add criteria to a judge instance.
+
 ## [0.2.3] - 2026-04-27
 
 ### Fixed
 
@@ -25,10 +25,9 @@ regressions — all through BigQuery SQL or Python.
 - Observability dashboards (SQL and BigFrames)
 
 **Evaluation**
-- Code-based metrics (latency, turn count, error rate, token efficiency, cost)
-- LLM-as-Judge scoring (correctness, hallucination, sentiment)
-- Trajectory matching (exact, in-order, any-order)
-- Multi-trial evaluation with pass@k / pass^k
+- System metrics (latency, turn count, tool call error rate, token efficiency, time to first token, cost)
+- Performance metrics (correctness, hallucination, sentiment, efficiency, etc.)
+- Multi-trial system and performance metrics
 - Grader composition (weighted, binary, majority strategies)
 - Eval suite lifecycle management with graduation and saturation detection
 - Static quality validation (ambiguous tasks, class imbalance, suspicious thresholds)
@@ -123,10 +122,10 @@ src/bigquery_agent_analytics/
 │   └── formatter.py               # Output formatting (json/text/table)
 │
 ├── Evaluation
-│   ├── evaluators.py              # SystemEvaluator + LLMAsJudge
-│   ├── trace_evaluator.py         # Trajectory matching & replay
-│   ├── multi_trial.py             # Multi-trial runner + pass@k
-│   ├── grader_pipeline.py         # Grader composition pipeline
+│   ├── system_evaluator.py        # SystemEvaluator
+│   ├── performance_evaluator.py   # PerformanceEvaluator
+│   ├── multi_trial_performance_evaluator.py # MultiTrialPerformanceEvaluator
+│   ├── aggregate_grader.py        # AggregateGrader
 │   ├── eval_suite.py              # Eval suite lifecycle management
 │   └── eval_validator.py          # Static validation checks
 │
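
The CHANGELOG entry in this diff says `CodeEvaluator` is kept as an alias of `SystemEvaluator` for backward compatibility. The diff does not show how the alias is defined, so the following is only a minimal sketch of one common way to do it. The class body, its `name` attribute, and the constructor signature are hypothetical stand-ins, not the project's actual implementation.

```python
class SystemEvaluator:
    """Hypothetical stand-in for the renamed class (real body not shown in this diff)."""

    def __init__(self, name: str = "system"):
        # `name` is an illustrative attribute, assumed for this sketch only.
        self.name = name


# A plain module-level alias: the old name points at the same class object,
# so existing imports, constructor calls, and isinstance checks keep working.
CodeEvaluator = SystemEvaluator

legacy = CodeEvaluator(name="latency")
print(type(legacy).__name__)
```

With a plain alias, objects created through the old name report the new class name, which is usually what a rename intends. If the project wanted to warn callers about the old name, it would need something beyond a bare alias; whether it does so is not visible here.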