Commit 081a92d

Unify One-Sided & Side-by-Side Performance Metrics in the PerformanceEvaluator, but don't add new metrics to the MultiTrialPerformanceEvaluator yet
- Purged obsolete criteria-list LLMAsJudge implementations, replacing them natively with PerformanceEvaluator for folded Tone, Faithfulness, Correctness, and Efficiency evaluations.
- Decoupled system and performance modules cleanly, making system_evaluator.py pure to SystemEvaluator.
- Overrode the backwards-compatible LLMAsJudge subclass in evaluators.py with required static factories for correctness, hallucination, and sentiment.
- PURGED criteria-list BQML execution code from client.py, and deleted legacy _criteria and _JudgeCriterion list validations throughout test suites.
- Fixed Jupyter event-loop context constraints via robust asyncio running event-loop setters inside Client._evaluate_performance.
- Refactored strip_markdown_fences in utils.py to drop trailing prose after fenced markdown closing backticks cleanly.
- Verified 1,997 collected unit tests PASSING 100% green successfully.

TAG=agy CONV=bf5607ce-a7fc-4a29-a7fb-c6074580e613
1 parent 3468e9c commit 081a92d
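
The commit message says `strip_markdown_fences` in utils.py now drops trailing prose after the closing backticks of a fenced block. The changed code is not shown on this page, so the following is only a minimal sketch of that behavior, assuming the helper takes the raw judge output as a string and returns the contents of the first fenced block. The function body, the regular expression, and the sample input are all assumptions, not the repository's implementation.

```python
import re

# Built at runtime so this example contains no literal fence of its own.
FENCE = "`" * 3

# Opening fence, optional language tag, then a lazy capture up to the next fence.
_FENCE_RE = re.compile(FENCE + r"[^\n]*\n(.*?)" + FENCE, re.DOTALL)


def strip_markdown_fences(text: str) -> str:
    """Hypothetical stand-in for the helper in utils.py.

    Returns the body of the first fenced block, discarding the fences, the
    language tag, and any prose before or after the block. Text without a
    fence is returned stripped of surrounding whitespace.
    """
    match = _FENCE_RE.search(text)
    if match is None:
        return text.strip()
    return match.group(1).strip()


# Sample judge output: prose, a fenced JSON payload, then trailing prose.
raw = "Here you go:\n" + FENCE + 'json\n{"score": 4}\n' + FENCE + "\nHope that helps!"
cleaned = strip_markdown_fences(raw)
unfenced = strip_markdown_fences('  {"score": 4}  ')
```

Capturing lazily up to the first closing fence is what discards the trailing prose; a greedy match would instead run to the last fence in the text.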

107 files changed

Lines changed: 3895 additions & 7867 deletions

CHANGELOG.md

Lines changed: 10 additions & 0 deletions
@@ -7,6 +7,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ## [Unreleased]

+### Changed
+- **Renamed `CodeEvaluator` to `SystemEvaluator`** to align with its focus on system-level metrics. Kept `CodeEvaluator` as an alias for backward compatibility.
+- **Staged identical baseline copies and test renames** for lineage tracking.
+- **Unified One-Sided & Side-by-Side Performance Metrics** in the `PerformanceEvaluator`:
+- Purged obsolete criteria-list `LLMAsJudge` implementations, replacing them natively with `PerformanceEvaluator` for folded Tone, Faithfulness, Correctness, and Efficiency evaluations.
+- Decoupled system and performance modules cleanly, making `system_evaluator.py` pure to `SystemEvaluator`.
+- Overrode the backwards-compatible `LLMAsJudge` subclass in `evaluators.py` with required static factories for correctness, hallucination, and sentiment.
+- Removed `_JudgeCriterion` from public access in `system_evaluator.py` (now internal to `evaluators.py`).
+- **Migration Story**: Users should no longer construct `_JudgeCriterion` objects directly. Instead, use `LLMAsJudge.add_criterion(name, prompt_template, score_key, threshold)` to add criteria to a judge instance.
+
 ## [0.2.3] - 2026-04-27

 ### Fixed
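
The Migration Story entry in this CHANGELOG hunk names `add_criterion(name, prompt_template, score_key, threshold)` as the replacement for building `_JudgeCriterion` objects by hand. Here is a rough sketch of what that call pattern might look like. `StandInJudge` is a hypothetical stand-in that copies only the four parameter names from the changelog; the storage format, the argument types, the `{response}` placeholder, and the example values are assumptions and do not describe the real `LLMAsJudge`.

```python
from dataclasses import dataclass, field


@dataclass
class StandInJudge:
    """Hypothetical stand-in; only the add_criterion parameters come from the changelog."""

    criteria: dict = field(default_factory=dict)

    def add_criterion(self, name, prompt_template, score_key, threshold):
        # Assumption: criteria are keyed by name and kept as plain dicts.
        self.criteria[name] = {
            "prompt_template": prompt_template,
            "score_key": score_key,
            "threshold": threshold,
        }


# Before: callers constructed _JudgeCriterion objects themselves.
# After: callers hand the same four fields to the judge instance.
judge = StandInJudge()
judge.add_criterion(
    name="politeness",  # hypothetical criterion name
    prompt_template="Rate the politeness of this response:\n{response}",
    score_key="politeness_score",  # hypothetical key in the judge's output
    threshold=0.7,  # hypothetical pass mark
)
```

With the real class, the same four arguments would be passed to an `LLMAsJudge` instance; the way that instance is created is not shown in this hunk.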

README.md

Lines changed: 7 additions & 8 deletions
@@ -25,10 +25,9 @@ regressions — all through BigQuery SQL or Python.
 - Observability dashboards (SQL and BigFrames)

 **Evaluation**
-- Code-based metrics (latency, turn count, error rate, token efficiency, cost)
-- LLM-as-Judge scoring (correctness, hallucination, sentiment)
-- Trajectory matching (exact, in-order, any-order)
-- Multi-trial evaluation with pass@k / pass^k
+- System metrics (latency, turn count, tool call error rate, token efficiency, time to first token, cost)
+- Performance Metrics (correctness, hallucination, sentiment, efficiency, etc)
+- Multi-trial system and performance metircs
 - Grader composition (weighted, binary, majority strategies)
 - Eval suite lifecycle management with graduation and saturation detection
 - Static quality validation (ambiguous tasks, class imbalance, suspicious thresholds)
@@ -123,10 +122,10 @@ src/bigquery_agent_analytics/
 │ └── formatter.py # Output formatting (json/text/table)

 ├── Evaluation
-│ ├── evaluators.py # SystemEvaluator + LLMAsJudge
-│ ├── trace_evaluator.py # Trajectory matching & replay
-│ ├── multi_trial.py # Multi-trial runner + pass@k
-── grader_pipeline.py # Grader composition pipeline
+│ ├── system_evaluator.py # SystemEvaluator
+│ ├── performance_evaluator.py # PerformanceEvaluator
+│ ├── multi_trial_performance_evaluator.py # MultiTrialPerformanceEvaluator
+── aggregate_grader.py # AggregateGrader
 │ ├── eval_suite.py # Eval suite lifecycle management
 │ └── eval_validator.py # Static validation checks
