Commit 8e3569d

Refactor: Rename CodeEvaluator to SystemEvaluator
Rename CodeEvaluator to SystemEvaluator to align with its focus on system-level metrics. A CodeEvaluator alias is kept in evaluators.py for backward compatibility.
1 parent c172bcb commit 8e3569d
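
The commit description says only that a `CodeEvaluator` alias is kept in `evaluators.py`. A minimal sketch of what such an alias could look like, assuming a plain module-level assignment; the class body below is a hypothetical stand-in, not the SDK's code:

```python
class SystemEvaluator:
    """Hypothetical stand-in for the renamed evaluator class."""

    def __init__(self, name: str = "system_evaluator"):
        self.name = name


# Assumed alias form: both names are bound to the same class object,
# so existing imports of CodeEvaluator keep working unchanged.
CodeEvaluator = SystemEvaluator

legacy = CodeEvaluator(name="legacy_check")
print(type(legacy).__name__)
```

Because the alias is the same object rather than a subclass, `isinstance` checks against either name behave identically.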

31 files changed

Lines changed: 4410 additions & 3997 deletions

CHANGELOG.md

Lines changed: 3 additions & 0 deletions
@@ -9,6 +9,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ### Added

+- Added `SystemEvaluator` as the preferred name for deterministic/code-defined metrics.
+- Kept `CodeEvaluator` as a backward-compatible alias (deprecated but supported).
+
 - **``bqaa-revalidate-extractors`` CLI** in
 `bigquery_agent_analytics.extractor_compilation.cli_revalidate`
 and

README.md

Lines changed: 1 addition & 1 deletion
@@ -123,7 +123,7 @@ src/bigquery_agent_analytics/
 │ └── formatter.py # Output formatting (json/text/table)

 ├── Evaluation
-│ ├── evaluators.py # CodeEvaluator + LLMAsJudge
+│ ├── evaluators.py # SystemEvaluator + LLMAsJudge
 │ ├── trace_evaluator.py # Trajectory matching & replay
 │ ├── multi_trial.py # Multi-trial runner + pass@k
 │ ├── grader_pipeline.py # Grader composition pipeline

SDK.md

Lines changed: 21 additions & 21 deletions
@@ -112,32 +112,32 @@ traces = client.list_traces(

 ## 3. Code-Based Evaluation (Deterministic Metrics)

-`CodeEvaluator` runs deterministic, code-defined metric functions against session summaries. Each metric returns a score between 0.0 and 1.0.
+`SystemEvaluator` runs deterministic, code-defined metric functions against session summaries. Each metric returns a score between 0.0 and 1.0.

 ### Pre-Built Evaluators

 The SDK ships with seven ready-to-use evaluators:

 ```python
-from bigquery_agent_analytics import CodeEvaluator
+from bigquery_agent_analytics import SystemEvaluator

-# Latency: fails when average latency exceeds the budget
-evaluator = CodeEvaluator.latency(threshold_ms=5000)
+# Latency: score degrades linearly as avg latency approaches threshold
+evaluator = SystemEvaluator.latency(threshold_ms=5000)

-# Turn count: fails when sessions use too many back-and-forth turns
-evaluator = CodeEvaluator.turn_count(max_turns=10)
+# Turn count: penalizes sessions with too many back-and-forth turns
+evaluator = SystemEvaluator.turn_count(max_turns=10)

-# Error rate: fails on high tool error rates
-evaluator = CodeEvaluator.error_rate(max_error_rate=0.1)
+# Error rate: penalizes high tool error rates
+evaluator = SystemEvaluator.error_rate(max_error_rate=0.1)

 # Token efficiency: checks total token usage stays within budget
-evaluator = CodeEvaluator.token_efficiency(max_tokens=50000)
+evaluator = SystemEvaluator.token_efficiency(max_tokens=50000)

 # Context cache hit rate: checks repeated prompt-prefix reuse
-evaluator = CodeEvaluator.context_cache_hit_rate(min_hit_rate=0.5)
+evaluator = SystemEvaluator.context_cache_hit_rate(min_hit_rate=0.5)

 # Cost per session: checks estimated USD cost stays under budget
-evaluator = CodeEvaluator.cost_per_session(
+evaluator = SystemEvaluator.cost_per_session(
     max_cost_usd=1.0,
     input_cost_per_1k=0.00025,
     output_cost_per_1k=0.00125,
@@ -173,7 +173,7 @@ Define your own metric functions and chain multiple metrics together:

 ```python
 evaluator = (
-    CodeEvaluator(name="my_quality_check")
+    SystemEvaluator(name="my_quality_check")
     .add_metric(
         name="latency",
         fn=lambda s: 1.0 - min(s.get("avg_latency_ms", 0) / 5000, 1.0),
@@ -216,7 +216,7 @@ Run evaluation across all sessions matching a filter:
 from bigquery_agent_analytics import TraceFilter

 report = client.evaluate(
-    evaluator=CodeEvaluator.latency(threshold_ms=3000),
+    evaluator=SystemEvaluator.latency(threshold_ms=3000),
     filters=TraceFilter(agent_id="my_agent"),
 )

@@ -561,7 +561,7 @@ pass_pow_k = compute_pass_pow_k(num_trials=10, num_passed=8) # ~0.107

 ## 7. Grader Composition Pipeline

-Combine multiple evaluators (`CodeEvaluator` + `LLMAsJudge` + custom functions) into a single aggregated verdict using configurable scoring strategies.
+Combine multiple evaluators (`SystemEvaluator` + `LLMAsJudge` + custom functions) into a single aggregated verdict using configurable scoring strategies.

 ### Scoring Strategies

@@ -575,7 +575,7 @@ Combine multiple evaluators (`CodeEvaluator` + `LLMAsJudge` + custom functions)

 ```python
 from bigquery_agent_analytics import (
-    CodeEvaluator, GraderPipeline, LLMAsJudge,
+    SystemEvaluator, GraderPipeline, LLMAsJudge,
     WeightedStrategy, GraderResult,
 )

@@ -588,8 +588,8 @@ pipeline = (
         },
         threshold=0.6,
     ))
-    .add_code_grader(CodeEvaluator.latency(threshold_ms=5000), weight=0.2)
-    .add_code_grader(CodeEvaluator.cost_per_session(max_cost_usd=0.50), weight=0.1)
+    .add_code_grader(SystemEvaluator.latency(threshold_ms=5000), weight=0.2)
+    .add_code_grader(SystemEvaluator.cost_per_session(max_cost_usd=0.50), weight=0.1)
     .add_llm_grader(LLMAsJudge.correctness(threshold=0.7), weight=0.7)
 )

@@ -618,8 +618,8 @@ from bigquery_agent_analytics import BinaryStrategy

 pipeline = (
     GraderPipeline(BinaryStrategy())
-    .add_code_grader(CodeEvaluator.latency(threshold_ms=3000))
-    .add_code_grader(CodeEvaluator.error_rate(max_error_rate=0.05))
+    .add_code_grader(SystemEvaluator.latency(threshold_ms=3000))
+    .add_code_grader(SystemEvaluator.error_rate(max_error_rate=0.05))
     .add_llm_grader(LLMAsJudge.hallucination(threshold=0.8))
 )

@@ -649,7 +649,7 @@ def business_rules_grader(context):

 pipeline = (
     GraderPipeline(BinaryStrategy())
-    .add_code_grader(CodeEvaluator.latency())
+    .add_code_grader(SystemEvaluator.latency())
     .add_custom_grader("business_rules", business_rules_grader)
 )
 ```
@@ -2057,7 +2057,7 @@ bigquery_agent_analytics/
 │ Core
 │ ├── client.py ← High-level SDK entry point
 │ ├── trace.py ← Trace/Span reconstruction & DAG rendering
-│ └── evaluators.py ← CodeEvaluator + LLMAsJudge + SQL templates
+│ └── evaluators.py ← SystemEvaluator + LLMAsJudge + SQL templates

 │ Evaluation Harness
 │ ├── trace_evaluator.py ← BigQueryTraceEvaluator, trajectory matching, replay
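
The SDK.md hunks above describe the latency score as degrading linearly toward the threshold, and the custom-metric example uses `1.0 - min(avg_latency_ms / 5000, 1.0)`. The sketch below restates that formula as a standalone function so the 0.0 to 1.0 range is easy to check. The function name `linear_latency_score` and the `threshold_ms` parameter are illustrative assumptions, and this is not necessarily how `SystemEvaluator.latency()` is implemented:

```python
def linear_latency_score(session: dict, threshold_ms: float = 5000.0) -> float:
    """Score a session summary on [0.0, 1.0]; lower latency scores higher.

    Hypothetical helper mirroring the lambda shown in the SDK.md diff.
    """
    avg_latency_ms = session.get("avg_latency_ms", 0)
    # The ratio is clamped at 1.0, so anything at or past the threshold scores 0.0.
    return 1.0 - min(avg_latency_ms / threshold_ms, 1.0)


for latency_ms in (0, 2500, 5000, 10000):
    print(latency_ms, linear_latency_score({"avg_latency_ms": latency_ms}))
```

A session with no `avg_latency_ms` key falls back to 0 and therefore receives the maximum score, which matches the `s.get("avg_latency_ms", 0)` default in the original lambda.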

deploy/remote_function/dispatch.py

Lines changed: 15 additions & 15 deletions
@@ -25,7 +25,7 @@
 from typing import Any

 from bigquery_agent_analytics import Client
-from bigquery_agent_analytics import CodeEvaluator
+from bigquery_agent_analytics import SystemEvaluator
 from bigquery_agent_analytics import LLMAsJudge
 from bigquery_agent_analytics import serialize
 from bigquery_agent_analytics import TraceFilter
@@ -137,43 +137,43 @@ def build_filters(params):


 def build_evaluator(params):
-    """Build CodeEvaluator from params dict."""
+    """Build SystemEvaluator from params dict."""
     metric = params.get("metric", "latency")
     threshold = params.get("threshold")
     fail_on_missing_telemetry = _bool_param(
         params.get("fail_on_missing_telemetry", False)
     )

     factories_with_t = {
-        "latency": lambda t: CodeEvaluator.latency(threshold_ms=t),
-        "error_rate": lambda t: CodeEvaluator.error_rate(
+        "latency": lambda t: SystemEvaluator.latency(threshold_ms=t),
+        "error_rate": lambda t: SystemEvaluator.error_rate(
             max_error_rate=t,
         ),
-        "turn_count": lambda t: CodeEvaluator.turn_count(
+        "turn_count": lambda t: SystemEvaluator.turn_count(
             max_turns=int(t),
         ),
-        "token_efficiency": lambda t: CodeEvaluator.token_efficiency(
+        "token_efficiency": lambda t: SystemEvaluator.token_efficiency(
             max_tokens=int(t),
         ),
-        "ttft": lambda t: CodeEvaluator.ttft(threshold_ms=t),
-        "cost": lambda t: CodeEvaluator.cost_per_session(
+        "ttft": lambda t: SystemEvaluator.ttft(threshold_ms=t),
+        "cost": lambda t: SystemEvaluator.cost_per_session(
             max_cost_usd=t,
         ),
     }
     factories_default = {
-        "latency": CodeEvaluator.latency,
-        "error_rate": CodeEvaluator.error_rate,
-        "turn_count": CodeEvaluator.turn_count,
-        "token_efficiency": CodeEvaluator.token_efficiency,
-        "ttft": CodeEvaluator.ttft,
-        "cost": CodeEvaluator.cost_per_session,
+        "latency": SystemEvaluator.latency,
+        "error_rate": SystemEvaluator.error_rate,
+        "turn_count": SystemEvaluator.turn_count,
+        "token_efficiency": SystemEvaluator.token_efficiency,
+        "ttft": SystemEvaluator.ttft,
+        "cost": SystemEvaluator.cost_per_session,
     }

     if metric == "context_cache_hit_rate":
         kwargs = {"fail_on_missing_telemetry": fail_on_missing_telemetry}
         if threshold is not None:
             kwargs["min_hit_rate"] = threshold
-        return SystemEvaluator.context_cache_hit_rate(**kwargs)
+        return SystemEvaluator.context_cache_hit_rate(**kwargs)

     if metric not in factories_with_t:
         raise ValueError(f"Unknown metric: {metric!r}")
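
The hunk above ends right after the unknown-metric check, so the final selection between the two factory tables is not shown. A self-contained sketch of the same dispatch-table pattern follows, with stub factories standing in for the `SystemEvaluator` presets. The stub return values, the default thresholds, and the final `if threshold is not None` branch are assumptions made for illustration:

```python
from typing import Any, Callable


# Hypothetical stand-ins for SystemEvaluator.latency / SystemEvaluator.turn_count.
def latency(threshold_ms: float = 5000.0) -> dict[str, Any]:
    return {"metric": "latency", "threshold": threshold_ms}


def turn_count(max_turns: int = 10) -> dict[str, Any]:
    return {"metric": "turn_count", "threshold": max_turns}


# One table adapts a generic threshold to each factory's keyword argument ...
FACTORIES_WITH_T: dict[str, Callable[[Any], dict[str, Any]]] = {
    "latency": lambda t: latency(threshold_ms=t),
    "turn_count": lambda t: turn_count(max_turns=int(t)),
}
# ... the other calls the factory with its own defaults.
FACTORIES_DEFAULT: dict[str, Callable[[], dict[str, Any]]] = {
    "latency": latency,
    "turn_count": turn_count,
}


def build_evaluator(params: dict[str, Any]) -> dict[str, Any]:
    metric = params.get("metric", "latency")
    threshold = params.get("threshold")
    if metric not in FACTORIES_WITH_T:
        raise ValueError(f"Unknown metric: {metric!r}")
    # Assumed completion: use the thresholded factory only when a threshold is given.
    if threshold is not None:
        return FACTORIES_WITH_T[metric](threshold)
    return FACTORIES_DEFAULT[metric]()


print(build_evaluator({"metric": "turn_count", "threshold": "7"}))
```

Keeping two tables lets each preset retain its own default when no threshold is passed, instead of hard-coding defaults a second time in the dispatcher.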

docs/design.md

Lines changed: 11 additions & 11 deletions
@@ -150,7 +150,7 @@ As demonstrated in the [e2e demo](../examples/e2e_demo.py):

 **Phase 2 — Evaluation:**
 1. `Client.get_trace()` retrieves all events for a session
-2. `CodeEvaluator` preset factories assess latency, turn count, error rate, token efficiency
+2. `SystemEvaluator` preset factories assess latency, turn count, error rate, token efficiency
 3. `LLMAsJudge.correctness()` performs semantic evaluation via BigQuery `AI.GENERATE`
 4. `BigQueryTraceEvaluator.evaluate_session()` performs trajectory matching against golden tool sequences

@@ -208,7 +208,7 @@ As demonstrated in the [e2e demo](../examples/e2e_demo.py):
 │ categorical_evaluator│ │ ontology_* (6 modules)│ │ cli │
 │ categorical_views │ │ (YAML → AI.GENERATE → │ │ (Typer commands) │
 │ (label evaluation) │ │ tables → PG → GQL) │ │ │
-└──────────────────────┘ └──────────────────────┘ └──────────────────┘
+└──────────────────┘ └──────────────────┘ └──────────────────┘

 ┌──────────────────┐ ┌───────────────────┐
 │ udf_kernels │ │ serialization │
@@ -248,7 +248,7 @@ Aggregations, filtering, joins, and even LLM evaluation (via `AI.GENERATE`) are
 LLM-based evaluation can run via (1) BigQuery `AI.GENERATE`, (2) legacy BigQuery ML `ML.GENERATE_TEXT`, or (3) the Gemini API directly. This maximizes compatibility across different GCP configurations.

 **Decision 4: Composition over inheritance.**
-The `GraderPipeline` composes `CodeEvaluator`, `LLMAsJudge`, and custom functions via a builder pattern rather than requiring them to share a common base class. The `BigQueryMemoryService` composes four internal services rather than extending a single monolithic class.
+The `GraderPipeline` composes `SystemEvaluator`, `LLMAsJudge`, and custom functions via a builder pattern rather than requiring them to share a common base class. The `BigQueryMemoryService` composes four internal services rather than extending a single monolithic class.

 ---

@@ -396,7 +396,7 @@ Each field generates a separate `AND` condition with a corresponding `bigquery.S

 This module contains two evaluator classes and the SQL templates that power batch evaluation.

-#### 4.3.1 `CodeEvaluator`
+#### 4.3.1 `SystemEvaluator`

 Deterministic evaluation using code-defined metric functions.

@@ -641,7 +641,7 @@ Combines heterogeneous evaluators into a unified verdict using a strategy patter

 ┌──────────────┼──────────────┐
 ▼ ▼ ▼
-CodeEvaluator LLMAsJudge Custom Fn
+SystemEvaluator LLMAsJudge Custom Fn
 (sync) (async) (sync)
 │ │ │
 ▼ ▼ ▼
@@ -1273,7 +1273,7 @@ results = client.query(formatted, job_config=job_config)

 ```
 Evaluation
-├── Deterministic (CodeEvaluator)
+├── Deterministic (SystemEvaluator)
 │ ├── Latency
 │ ├── Turn count
 │ ├── Error rate
@@ -1321,7 +1321,7 @@ All evaluation scores in the SDK are normalized to `[0.0, 1.0]`:

 | Mode | Evaluator | Where Computation Runs |
 |------|-----------|----------------------|
-| Single session (sync) | `CodeEvaluator.evaluate_session()` | Python |
+| Single session (sync) | `SystemEvaluator.evaluate_session()` | Python |
 | Single session (async) | `LLMAsJudge.evaluate_session()` | Gemini API |
 | Batch via Client | `Client.evaluate()` | BigQuery (SQL + AI.GENERATE) |
 | Trajectory matching | `BigQueryTraceEvaluator.evaluate_session()` | BigQuery (fetch) + Python (matching) |
@@ -1420,7 +1420,7 @@ Synchronous (user-facing):
 ├── Client.drift_detection()
 ├── Client.insights()
 ├── Client.deep_analysis()
-├── CodeEvaluator.evaluate_session()
+├── SystemEvaluator.evaluate_session()
 ├── EvalSuite.*
 ├── EvalValidator.*
 └── BigFramesEvaluator.*
@@ -1480,10 +1480,10 @@ results = await asyncio.gather(*[_run_one(t) for t in tasks])

 ## 10. Extensibility & Plugin Points

-### 10.1 Custom Metrics (CodeEvaluator)
+### 10.1 Custom Metrics (SystemEvaluator)

 ```python
-evaluator = CodeEvaluator(name="custom").add_metric(
+evaluator = SystemEvaluator(name="custom").add_metric(
     name="business_metric",
     fn=lambda session: your_scoring_logic(session),
     threshold=0.7,
@@ -1586,7 +1586,7 @@ All tests mock BigQuery — no GCP credentials or live BigQuery access is needed
 ```
 tests/
 ├── test_sdk_client.py # Client integration tests
-├── test_sdk_evaluators.py # CodeEvaluator + LLMAsJudge
+├── test_sdk_evaluators.py # SystemEvaluator + LLMAsJudge
 ├── test_sdk_trace.py # Trace/Span reconstruction
 ├── test_sdk_feedback.py # Drift detection
 ├── test_sdk_insights.py # Insights pipeline
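
Decision 4 in the design.md diff above says `GraderPipeline` composes evaluators through a builder rather than a shared base class, and the SDK.md hunks show weighted graders with a pass threshold. The sketch below illustrates that idea with plain callables. `MiniPipeline`, `add_grader`, and the weighted-average rule are hypothetical and do not reproduce the SDK's `GraderPipeline` or `WeightedStrategy`:

```python
from typing import Callable

# Any callable mapping a session summary to a score in [0.0, 1.0] can act as a grader.
Grader = Callable[[dict], float]


class MiniPipeline:
    """Hypothetical builder that composes graders without a common base class."""

    def __init__(self, threshold: float = 0.6):
        self.threshold = threshold
        self._graders: list[tuple[str, Grader, float]] = []

    def add_grader(self, name: str, fn: Grader, weight: float = 1.0) -> "MiniPipeline":
        self._graders.append((name, fn, weight))
        return self  # returning self enables method chaining

    def evaluate(self, session: dict) -> tuple[float, bool]:
        if not self._graders:
            raise ValueError("no graders registered")
        total_weight = sum(weight for _, _, weight in self._graders)
        # Assumed aggregation rule: weighted average of the individual scores.
        score = sum(fn(session) * weight for _, fn, weight in self._graders) / total_weight
        return score, score >= self.threshold


pipeline = (
    MiniPipeline(threshold=0.6)
    .add_grader("latency", lambda s: 1.0 - min(s.get("avg_latency_ms", 0) / 5000, 1.0), weight=0.2)
    .add_grader("correctness", lambda s: s.get("correctness", 0.0), weight=0.8)
)
score, passed = pipeline.evaluate({"avg_latency_ms": 2500, "correctness": 1.0})
print(round(score, 3), passed)
```

Because graders are only required to be callables, a deterministic metric, an LLM-backed score, or a business rule can be mixed freely, which is the point of preferring composition here.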

docs/hatteras_evaluation.md

Lines changed: 3 additions & 3 deletions
@@ -7,7 +7,7 @@ agent sessions into user-defined categories directly against traces stored in
 BigQuery, without relying on an external service.

 This should be implemented as a new categorical evaluation subsystem, not as
-an overload of the existing numeric `CodeEvaluator` / `LLMAsJudge` report
+an overload of the existing numeric `SystemEvaluator` / `LLMAsJudge` report
 path.

 The goal is to support Hatteras-like functionality inside the SDK:
@@ -22,7 +22,7 @@ The goal is to support Hatteras-like functionality inside the SDK:

 Today the SDK supports two major evaluation modes:

-- deterministic numeric scoring via `CodeEvaluator`
+- deterministic numeric scoring via `SystemEvaluator`
 - semantic numeric scoring via `LLMAsJudge`

 What is missing is a first-class way to answer questions like:
@@ -60,7 +60,7 @@ That capability is useful for:
 This design is not proposing:

 - a full clone of an external Hatteras service
-- a replacement for `CodeEvaluator`
+- a replacement for `SystemEvaluator`
 - a replacement for `LLMAsJudge`
 - a new remote function or Python UDF surface in the first phase
 - real-time ingestion-time classification in phase 1

docs/implementation_plan_concept_index_runtime.md

Lines changed: 1 addition & 1 deletion
@@ -165,7 +165,7 @@ Work: `bigquery_ontology/contrib/advertising/` stub with Yahoo's resolver (if co
 - `src/bigquery_ontology/graph_ddl_compiler.py` — add `compile_concept_index(ontology, binding, *, output_table) -> str`. Preserve `compile_graph()` contract byte-identically. No changes to existing function bodies.
 - `src/bigquery_ontology/cli.py:299` `compile` command gains `--emit-concept-index` and `--concept-index-table` flags. When absent, behavior is byte-identical to today.
 - `src/bigquery_ontology/__init__.py` — add `from .graph_ddl_compiler import compile_concept_index` so the new public function is importable as `from bigquery_ontology import compile_concept_index`, matching the existing pattern for `compile_graph` (`__init__.py:50` today).
-- `src/bigquery_agent_analytics/__init__.py` — add the new public surface to the try/except re-export block (same pattern as `Client`, `CodeEvaluator`, etc.):
+- `src/bigquery_agent_analytics/__init__.py` — add the new public surface to the try/except re-export block (same pattern as `Client`, `SystemEvaluator`, etc.):
   - `OntologyRuntime` from `.ontology_runtime`
   - `EntityResolver`, `ExactMatchResolver`, `SynonymResolver`, `Candidate`, `ResolveResult` from `.entity_resolver`
   - `ConceptIndexMismatchError`, `ConceptIndexProvenanceMissing`, `ConceptIndexInconsistentPair`, `ConceptIndexRefreshed` from `.ontology_runtime`
