17 changes: 17 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Added

- Added `SystemEvaluator` as the preferred name for deterministic/code-defined metrics.
- Kept `CodeEvaluator` as a backward-compatible alias. Note that calling `CodeEvaluator()` now emits `evaluator_name="system_evaluator"`.
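
The alias behaviour described above can be pictured with a minimal sketch. The class body below is a stand-in written for illustration, not the library's implementation, and the plain module-level assignment is an assumption about how the alias is wired up.

```python
class SystemEvaluator:
    """Illustrative stand-in only; the real class lives in the library."""

    def __init__(self, name: str = "system_evaluator") -> None:
        # Per the changelog entry, the emitted name is "system_evaluator"
        # regardless of which class name the caller used.
        self.evaluator_name = name


# Assumed aliasing style: both names refer to the same class object.
CodeEvaluator = SystemEvaluator

legacy = CodeEvaluator()
print(legacy.evaluator_name)
```

Code that filters results on the old evaluator name would need to match the new value instead.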

## [0.3.2] - 2026-05-22

### Release highlights
@@ -950,6 +955,18 @@ BQAA loop
re-opens the choice for the session-aggregated `AI.GENERATE`
tier with Option C (SQL / Python UDF) as the primary candidate.
Unblocks the compile-harness PR.

### Changed

- **Renamed `CodeEvaluator` to `SystemEvaluator`** to align with its focus on system-level metrics. Kept `CodeEvaluator` as an alias for backward compatibility.
- **Staged identical baseline copies and test renames** for lineage tracking.
- **Unified one-sided and side-by-side performance metrics** in the `PerformanceEvaluator`:
  - Removed the obsolete criteria-list `LLMAsJudge` implementations and replaced them with `PerformanceEvaluator`, which now covers the Tone, Faithfulness, Correctness, and Efficiency evaluations.
  - Decoupled the system and performance modules, so that `system_evaluator.py` contains only `SystemEvaluator`.
  - Overrode the backward-compatible `LLMAsJudge` subclass in `evaluators.py` with the required static factories for correctness, hallucination, and sentiment.
  - Removed `_JudgeCriterion` from public access in `system_evaluator.py` (it is now internal to `evaluators.py`).
- **Migration Story**:
  - Users should no longer construct `_JudgeCriterion` objects directly. Instead, use `LLMAsJudge.add_criterion(name, prompt_template, score_key, threshold)` to add criteria to a judge instance.
  - `Client.evaluate(LLMAsJudge)` is no longer supported and raises `TypeError`. Callers must migrate to `PerformanceEvaluator` with the appropriate judge configurations.
  - The BigQuery-side `AI.GENERATE` batch SQL path has been removed in favor of Python-side per-session API calls. Callers that depend on `render_ai_generate_judge_query` or the `AI_GENERATE_JUDGE_BATCH_QUERY` template constant must migrate to `PerformanceEvaluator`, which handles LLM evaluation programmatically. The legacy symbols are preserved as deprecated shims in `evaluators.py`.
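
As a rough sketch of the first migration step, the snippet below shows criteria being registered through `add_criterion` rather than by constructing `_JudgeCriterion` directly. Both classes here are hypothetical stand-ins so that the example runs on its own; only the `add_criterion(name, prompt_template, score_key, threshold)` parameter list comes from the entry above. The internals, the `{response}` placeholder, and the `None` return value are assumptions.

```python
from dataclasses import dataclass


@dataclass
class _JudgeCriterion:
    """Stand-in for the now-internal criterion record (assumed fields)."""
    name: str
    prompt_template: str
    score_key: str
    threshold: float


class LLMAsJudge:
    """Hypothetical stand-in; not the library's class."""

    def __init__(self) -> None:
        self._criteria: list[_JudgeCriterion] = []

    def add_criterion(self, name, prompt_template, score_key, threshold) -> None:
        # The judge builds the internal record; callers never touch it.
        self._criteria.append(
            _JudgeCriterion(name, prompt_template, score_key, threshold)
        )


judge = LLMAsJudge()
judge.add_criterion(
    name="tone",
    prompt_template="Rate the tone of this response: {response}",
    score_key="tone_score",
    threshold=0.7,
)
```

Keeping construction behind the method lets the criterion record change shape without breaking callers.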

## [0.2.3] - 2026-04-27

15 changes: 7 additions & 8 deletions README.md
@@ -25,10 +25,9 @@ regressions — all through BigQuery SQL or Python.
- Observability dashboards (SQL and BigFrames)

**Evaluation**
- Code-based metrics (latency, turn count, error rate, token efficiency, cost)
- LLM-as-Judge scoring (correctness, hallucination, sentiment)
- Trajectory matching (exact, in-order, any-order)
- Multi-trial evaluation with pass@k / pass^k
- System metrics (latency, turn count, tool call error rate, token efficiency, time to first token, cost)
- Performance metrics (correctness, hallucination, sentiment, efficiency, etc.)
- Multi-trial system and performance metrics
- Grader composition (weighted, binary, majority strategies)
- Eval suite lifecycle management with graduation and saturation detection
- Static quality validation (ambiguous tasks, class imbalance, suspicious thresholds)
@@ -123,10 +122,10 @@ src/bigquery_agent_analytics/
│ └── formatter.py # Output formatting (json/text/table)
├── Evaluation
│ ├── evaluators.py # CodeEvaluator + LLMAsJudge
│ ├── trace_evaluator.py # Trajectory matching & replay
│ ├── multi_trial.py # Multi-trial runner + pass@k
│ ├── grader_pipeline.py # Grader composition pipeline
│ ├── system_evaluator.py # SystemEvaluator
│ ├── performance_evaluator.py # PerformanceEvaluator
│ ├── multi_trial_performance_evaluator.py # MultiTrialPerformanceEvaluator
│ ├── aggregate_grader.py # AggregateGrader
│ ├── eval_suite.py # Eval suite lifecycle management
│ └── eval_validator.py # Static validation checks