5 changes: 5 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -7,6 +7,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Added

- Added `SystemEvaluator` as the preferred name for deterministic/code-defined metrics.
- Kept `CodeEvaluator` as a backward-compatible alias. Note that calling `CodeEvaluator()` now emits `evaluator_name="system_evaluator"`.
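
A minimal sketch of what the alias note above implies, assuming the alias is a plain name binding and that the default evaluator name is set in the constructor. The stub class below is hypothetical and only stands in for the SDK class; the real implementation may differ (for example, it could subclass or emit a deprecation warning).

```python
class SystemEvaluator:
    """Hypothetical stand-in for the SDK class; only the default name matters here."""

    def __init__(self, name: str = "system_evaluator") -> None:
        # Assumption: the default evaluator name lives on the constructor.
        self.name = name


# Assumption: the backward-compatible alias is a simple rebinding of the same class.
CodeEvaluator = SystemEvaluator

legacy = CodeEvaluator()
print(legacy.name)                       # the old spelling reports the new default name
print(CodeEvaluator is SystemEvaluator)  # both names refer to one class
```

Under these assumptions, code that filters reports on the evaluator name would need to match `system_evaluator` even when it still constructs `CodeEvaluator`.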

## [0.3.4] - 2026-06-10

### Release highlights
2 changes: 1 addition & 1 deletion README.md
@@ -164,7 +164,7 @@ src/bigquery_agent_analytics/
│ └── formatter.py # Output formatting (json/text/table)
├── Evaluation
│ ├── evaluators.py # CodeEvaluator + LLMAsJudge
│ ├── evaluators.py # SystemEvaluator + LLMAsJudge
│ ├── trace_evaluator.py # Trajectory matching & replay
│ ├── multi_trial.py # Multi-trial runner + pass@k
│ ├── grader_pipeline.py # Grader composition pipeline
42 changes: 21 additions & 21 deletions SDK.md
@@ -112,32 +112,32 @@ traces = client.list_traces(

## 3. Code-Based Evaluation (Deterministic Metrics)

`CodeEvaluator` runs deterministic, code-defined metric functions against session summaries. Each metric returns a score between 0.0 and 1.0.
`SystemEvaluator` runs deterministic, code-defined metric functions against session summaries. Each metric returns a score between 0.0 and 1.0.
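
To illustrate what a metric function with a score between 0.0 and 1.0 can look like, here is a small standalone sketch. It mirrors the shape of the `latency` lambda used in the custom-metric example in SDK.md; the function name `latency_score`, the `budget_ms` parameter, and the extra lower clamp are assumptions for illustration, and the pre-built `latency` evaluator may compute its score differently.

```python
def latency_score(summary: dict, budget_ms: float = 5000.0) -> float:
    """Map average latency onto [0.0, 1.0]; 1.0 is best, 0.0 means at or over budget."""
    avg_ms = summary.get("avg_latency_ms", 0)
    # Fraction of the budget consumed, clamped so the score stays within [0.0, 1.0].
    used = max(0.0, min(avg_ms / budget_ms, 1.0))
    return 1.0 - used


print(latency_score({"avg_latency_ms": 2500}))   # half the budget used
print(latency_score({"avg_latency_ms": 12000}))  # over budget, clamped
print(latency_score({}))                         # missing field defaults to 0 ms
```

A session summary is treated here as a plain dict, which is consistent with the `s.get(...)` call in the custom-metric example but is otherwise an assumption.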

### Pre-Built Evaluators

The SDK ships with seven ready-to-use evaluators:

```python
from bigquery_agent_analytics import CodeEvaluator
from bigquery_agent_analytics import SystemEvaluator

# Latency: fails when average latency exceeds the budget
evaluator = CodeEvaluator.latency(threshold_ms=5000)
# Latency: fails when average latency exceeds the threshold
evaluator = SystemEvaluator.latency(threshold_ms=5000)

# Turn count: fails when sessions use too many back-and-forth turns
evaluator = CodeEvaluator.turn_count(max_turns=10)
# Turn count: fails when session turns exceed the max turns
evaluator = SystemEvaluator.turn_count(max_turns=10)

# Error rate: fails on high tool error rates
evaluator = CodeEvaluator.error_rate(max_error_rate=0.1)
# Error rate: fails when tool error rate exceeds the max error rate
evaluator = SystemEvaluator.error_rate(max_error_rate=0.1)

# Token efficiency: checks total token usage stays within budget
evaluator = CodeEvaluator.token_efficiency(max_tokens=50000)
evaluator = SystemEvaluator.token_efficiency(max_tokens=50000)

# Context cache hit rate: checks repeated prompt-prefix reuse
evaluator = CodeEvaluator.context_cache_hit_rate(min_hit_rate=0.5)
evaluator = SystemEvaluator.context_cache_hit_rate(min_hit_rate=0.5)

# Cost per session: checks estimated USD cost stays under budget
evaluator = CodeEvaluator.cost_per_session(
evaluator = SystemEvaluator.cost_per_session(
max_cost_usd=1.0,
input_cost_per_1k=0.00025,
output_cost_per_1k=0.00125,
@@ -173,7 +173,7 @@ Define your own metric functions and chain multiple metrics together:

```python
evaluator = (
CodeEvaluator(name="my_quality_check")
SystemEvaluator(name="my_quality_check")
.add_metric(
name="latency",
fn=lambda s: 1.0 - min(s.get("avg_latency_ms", 0) / 5000, 1.0),
@@ -216,7 +216,7 @@ Run evaluation across all sessions matching a filter:
from bigquery_agent_analytics import TraceFilter

report = client.evaluate(
evaluator=CodeEvaluator.latency(threshold_ms=3000),
evaluator=SystemEvaluator.latency(threshold_ms=3000),
filters=TraceFilter(agent_id="my_agent"),
)

@@ -561,7 +561,7 @@ pass_pow_k = compute_pass_pow_k(num_trials=10, num_passed=8) # ~0.107

## 7. Grader Composition Pipeline

Combine multiple evaluators (`CodeEvaluator` + `LLMAsJudge` + custom functions) into a single aggregated verdict using configurable scoring strategies.
Combine multiple evaluators (`SystemEvaluator` + `LLMAsJudge` + custom functions) into a single aggregated verdict using configurable scoring strategies.

### Scoring Strategies

@@ -575,7 +575,7 @@ Combine multiple evaluators (`CodeEvaluator` + `LLMAsJudge` + custom functions)

```python
from bigquery_agent_analytics import (
CodeEvaluator, GraderPipeline, LLMAsJudge,
SystemEvaluator, GraderPipeline, LLMAsJudge,
WeightedStrategy, GraderResult,
)

@@ -588,8 +588,8 @@ pipeline = (
},
threshold=0.6,
))
.add_code_grader(CodeEvaluator.latency(threshold_ms=5000), weight=0.2)
.add_code_grader(CodeEvaluator.cost_per_session(max_cost_usd=0.50), weight=0.1)
.add_system_grader(SystemEvaluator.latency(threshold_ms=5000), weight=0.2)
.add_system_grader(SystemEvaluator.cost_per_session(max_cost_usd=0.50), weight=0.1)
.add_llm_grader(LLMAsJudge.correctness(threshold=0.7), weight=0.7)
)

@@ -618,8 +618,8 @@ from bigquery_agent_analytics import BinaryStrategy

pipeline = (
GraderPipeline(BinaryStrategy())
.add_code_grader(CodeEvaluator.latency(threshold_ms=3000))
.add_code_grader(CodeEvaluator.error_rate(max_error_rate=0.05))
.add_system_grader(SystemEvaluator.latency(threshold_ms=3000))
.add_system_grader(SystemEvaluator.error_rate(max_error_rate=0.05))
.add_llm_grader(LLMAsJudge.hallucination(threshold=0.8))
)

@@ -649,7 +649,7 @@ def business_rules_grader(context):

pipeline = (
GraderPipeline(BinaryStrategy())
.add_code_grader(CodeEvaluator.latency())
.add_system_grader(SystemEvaluator.latency())
.add_custom_grader("business_rules", business_rules_grader)
)
```
@@ -2076,7 +2076,7 @@ bigquery_agent_analytics/
│ Core
│ ├── client.py ← High-level SDK entry point
│ ├── trace.py ← Trace/Span reconstruction & DAG rendering
│ └── evaluators.py ← CodeEvaluator + LLMAsJudge + SQL templates
│ └── evaluators.py ← SystemEvaluator + LLMAsJudge + SQL templates
│ Evaluation Harness
│ ├── trace_evaluator.py ← BigQueryTraceEvaluator, trajectory matching, replay
30 changes: 15 additions & 15 deletions deploy/remote_function/dispatch.py
@@ -25,9 +25,9 @@
from typing import Any

from bigquery_agent_analytics import Client
from bigquery_agent_analytics import CodeEvaluator
from bigquery_agent_analytics import LLMAsJudge
from bigquery_agent_analytics import serialize
from bigquery_agent_analytics import SystemEvaluator
from bigquery_agent_analytics import TraceFilter
from bigquery_agent_analytics._deploy_runtime import resolve_client_options

@@ -137,43 +137,43 @@ def build_filters(params):


def build_evaluator(params):
"""Build CodeEvaluator from params dict."""
"""Build SystemEvaluator from params dict."""
metric = params.get("metric", "latency")
threshold = params.get("threshold")
fail_on_missing_telemetry = _bool_param(
params.get("fail_on_missing_telemetry", False)
)

factories_with_t = {
"latency": lambda t: CodeEvaluator.latency(threshold_ms=t),
"error_rate": lambda t: CodeEvaluator.error_rate(
"latency": lambda t: SystemEvaluator.latency(threshold_ms=t),
"error_rate": lambda t: SystemEvaluator.error_rate(
max_error_rate=t,
),
"turn_count": lambda t: CodeEvaluator.turn_count(
"turn_count": lambda t: SystemEvaluator.turn_count(
max_turns=int(t),
),
"token_efficiency": lambda t: CodeEvaluator.token_efficiency(
"token_efficiency": lambda t: SystemEvaluator.token_efficiency(
max_tokens=int(t),
),
"ttft": lambda t: CodeEvaluator.ttft(threshold_ms=t),
"cost": lambda t: CodeEvaluator.cost_per_session(
"ttft": lambda t: SystemEvaluator.ttft(threshold_ms=t),
"cost": lambda t: SystemEvaluator.cost_per_session(
max_cost_usd=t,
),
}
factories_default = {
"latency": CodeEvaluator.latency,
"error_rate": CodeEvaluator.error_rate,
"turn_count": CodeEvaluator.turn_count,
"token_efficiency": CodeEvaluator.token_efficiency,
"ttft": CodeEvaluator.ttft,
"cost": CodeEvaluator.cost_per_session,
"latency": SystemEvaluator.latency,
"error_rate": SystemEvaluator.error_rate,
"turn_count": SystemEvaluator.turn_count,
"token_efficiency": SystemEvaluator.token_efficiency,
"ttft": SystemEvaluator.ttft,
"cost": SystemEvaluator.cost_per_session,
}

if metric == "context_cache_hit_rate":
kwargs = {"fail_on_missing_telemetry": fail_on_missing_telemetry}
if threshold is not None:
kwargs["min_hit_rate"] = threshold
return CodeEvaluator.context_cache_hit_rate(**kwargs)
return SystemEvaluator.context_cache_hit_rate(**kwargs)

if metric not in factories_with_t:
raise ValueError(f"Unknown metric: {metric!r}")
24 changes: 12 additions & 12 deletions docs/design.md
@@ -150,7 +150,7 @@ As demonstrated in the [e2e demo](../examples/e2e_demo.py):

**Phase 2 — Evaluation:**
1. `Client.get_trace()` retrieves all events for a session
2. `CodeEvaluator` preset factories assess latency, turn count, error rate, token efficiency
2. `SystemEvaluator` preset factories assess latency, turn count, error rate, token efficiency
3. `LLMAsJudge.correctness()` performs semantic evaluation via BigQuery `AI.GENERATE`
4. `BigQueryTraceEvaluator.evaluate_session()` performs trajectory matching against golden tool sequences

@@ -204,11 +204,11 @@ As demonstrated in the [e2e demo](../examples/e2e_demo.py):
└──────────────────┘ └───────────────────┘ │ world-change detect) │
└──────────────────────┘

┌──────────────────────┐ ┌──────────────────────┐ ┌──────────────────┐
┌──────────────────────┐ ┌──────────────────────┐ ┌──────────────────┐
│ categorical_evaluator│ │ ontology_* (6 modules)│ │ cli │
│ categorical_views │ │ (YAML → AI.GENERATE → │ │ (Typer commands) │
│ (label evaluation) │ │ tables → PG → GQL) │ │ │
└──────────────────────┘ └──────────────────────┘ └──────────────────┘
└──────────────────────┘ └──────────────────────┘ └──────────────────┘

┌──────────────────┐ ┌───────────────────┐
│ udf_kernels │ │ serialization │
@@ -248,7 +248,7 @@ Aggregations, filtering, joins, and even LLM evaluation (via `AI.GENERATE`) are
LLM-based evaluation can run via (1) BigQuery `AI.GENERATE`, (2) legacy BigQuery ML `ML.GENERATE_TEXT`, or (3) the Gemini API directly. This maximizes compatibility across different GCP configurations.

**Decision 4: Composition over inheritance.**
The `GraderPipeline` composes `CodeEvaluator`, `LLMAsJudge`, and custom functions via a builder pattern rather than requiring them to share a common base class. The `BigQueryMemoryService` composes four internal services rather than extending a single monolithic class.
The `GraderPipeline` composes `SystemEvaluator`, `LLMAsJudge`, and custom functions via a builder pattern rather than requiring them to share a common base class. The `BigQueryMemoryService` composes four internal services rather than extending a single monolithic class.
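
The composition idea in Decision 4 can be sketched in a few lines. This is a standalone illustration, not the SDK's `GraderPipeline`: `MiniPipeline`, `add_grader`, `fast_enough`, and `Judge` are hypothetical names, and the weighted average is only one possible scoring strategy. The point is that a plain function and an unrelated class can both be plugged in without sharing a base class.

```python
from typing import Callable, Dict, List, Tuple

# Any callable that maps a context dict to a score in [0.0, 1.0] qualifies as a grader.
Grader = Callable[[Dict], float]


class MiniPipeline:
    """Hypothetical builder that composes graders instead of inheriting from them."""

    def __init__(self, threshold: float = 0.6) -> None:
        self.threshold = threshold
        self._graders: List[Tuple[str, Grader, float]] = []

    def add_grader(self, name: str, fn: Grader, weight: float = 1.0) -> "MiniPipeline":
        self._graders.append((name, fn, weight))
        return self  # returning self enables builder-style chaining

    def evaluate(self, context: Dict) -> Dict:
        total_weight = sum(weight for _, _, weight in self._graders)
        if total_weight <= 0:
            raise ValueError("pipeline needs at least one grader with positive weight")
        # Weighted average of all grader scores.
        score = sum(fn(context) * weight for _, fn, weight in self._graders) / total_weight
        return {"score": score, "passed": score >= self.threshold}


def fast_enough(context: Dict) -> float:
    """Plain function grader: full marks when latency is within 5000 ms."""
    return 1.0 if context.get("avg_latency_ms", 0) <= 5000 else 0.0


class Judge:
    """Unrelated class grader: reads a precomputed judge score from the context."""

    def __call__(self, context: Dict) -> float:
        return context.get("judge_score", 0.0)


pipeline = (
    MiniPipeline(threshold=0.6)
    .add_grader("latency", fast_enough, weight=1.0)
    .add_grader("judge", Judge(), weight=3.0)
)
result = pipeline.evaluate({"avg_latency_ms": 1200, "judge_score": 0.5})
print(result)
```

Because the pipeline depends only on the callable signature, adding a new kind of grader requires no change to existing classes, which is the trade-off the decision describes.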

---

@@ -396,7 +396,7 @@ Each field generates a separate `AND` condition with a corresponding `bigquery.S

This module contains two evaluator classes and the SQL templates that power batch evaluation.

#### 4.3.1 `CodeEvaluator`
#### 4.3.1 `SystemEvaluator`

Deterministic evaluation using code-defined metric functions.

@@ -641,7 +641,7 @@ Combines heterogeneous evaluators into a unified verdict using a strategy patter
┌──────────────┼──────────────┐
▼ ▼ ▼
CodeEvaluator LLMAsJudge Custom Fn
SystemEvaluator LLMAsJudge Custom Fn
(sync) (async) (sync)
│ │ │
▼ ▼ ▼
@@ -1273,7 +1273,7 @@ results = client.query(formatted, job_config=job_config)

```
Evaluation
├── Deterministic (CodeEvaluator)
├── Deterministic (SystemEvaluator)
│ ├── Latency
│ ├── Turn count
│ ├── Error rate
@@ -1321,7 +1321,7 @@ All evaluation scores in the SDK are normalized to `[0.0, 1.0]`:

| Mode | Evaluator | Where Computation Runs |
|------|-----------|----------------------|
| Single session (sync) | `CodeEvaluator.evaluate_session()` | Python |
| Single session (sync) | `SystemEvaluator.evaluate_session()` | Python |
| Single session (async) | `LLMAsJudge.evaluate_session()` | Gemini API |
| Batch via Client | `Client.evaluate()` | BigQuery (SQL + AI.GENERATE) |
| Trajectory matching | `BigQueryTraceEvaluator.evaluate_session()` | BigQuery (fetch) + Python (matching) |
@@ -1420,7 +1420,7 @@ Synchronous (user-facing):
├── Client.drift_detection()
├── Client.insights()
├── Client.deep_analysis()
├── CodeEvaluator.evaluate_session()
├── SystemEvaluator.evaluate_session()
├── EvalSuite.*
├── EvalValidator.*
└── BigFramesEvaluator.*
@@ -1480,10 +1480,10 @@ results = await asyncio.gather(*[_run_one(t) for t in tasks])

## 10. Extensibility & Plugin Points

### 10.1 Custom Metrics (CodeEvaluator)
### 10.1 Custom Metrics (SystemEvaluator)

```python
evaluator = CodeEvaluator(name="custom").add_metric(
evaluator = SystemEvaluator(name="custom").add_metric(
name="business_metric",
fn=lambda session: your_scoring_logic(session),
threshold=0.7,
@@ -1586,7 +1586,7 @@ All tests mock BigQuery — no GCP credentials or live BigQuery access is needed
```
tests/
├── test_sdk_client.py # Client integration tests
├── test_sdk_evaluators.py # CodeEvaluator + LLMAsJudge
├── test_sdk_evaluators.py # SystemEvaluator + LLMAsJudge
├── test_sdk_trace.py # Trace/Span reconstruction
├── test_sdk_feedback.py # Drift detection
├── test_sdk_insights.py # Insights pipeline
6 changes: 3 additions & 3 deletions docs/hatteras_evaluation.md
@@ -7,7 +7,7 @@ agent sessions into user-defined categories directly against traces stored in
BigQuery, without relying on an external service.

This should be implemented as a new categorical evaluation subsystem, not as
an overload of the existing numeric `CodeEvaluator` / `LLMAsJudge` report
an overload of the existing numeric `SystemEvaluator` / `LLMAsJudge` report
path.

The goal is to support Hatteras-like functionality inside the SDK:
@@ -22,7 +22,7 @@ The goal is to support Hatteras-like functionality inside the SDK:

Today the SDK supports two major evaluation modes:

- deterministic numeric scoring via `CodeEvaluator`
- deterministic numeric scoring via `SystemEvaluator`
- semantic numeric scoring via `LLMAsJudge`

What is missing is a first-class way to answer questions like:
@@ -60,7 +60,7 @@ That capability is useful for:
This design is not proposing:

- a full clone of an external Hatteras service
- a replacement for `CodeEvaluator`
- a replacement for `SystemEvaluator`
- a replacement for `LLMAsJudge`
- a new remote function or Python UDF surface in the first phase
- real-time ingestion-time classification in phase 1
2 changes: 1 addition & 1 deletion docs/implementation_plan_concept_index_runtime.md
@@ -165,7 +165,7 @@ Work: `bigquery_ontology/contrib/advertising/` stub with Yahoo's resolver (if co
- `src/bigquery_ontology/graph_ddl_compiler.py` — add `compile_concept_index(ontology, binding, *, output_table) -> str`. Preserve `compile_graph()` contract byte-identically. No changes to existing function bodies.
- `src/bigquery_ontology/cli.py:299` — `compile` command gains `--emit-concept-index` and `--concept-index-table` flags. When absent, behavior is byte-identical to today.
- `src/bigquery_ontology/__init__.py` — add `from .graph_ddl_compiler import compile_concept_index` so the new public function is importable as `from bigquery_ontology import compile_concept_index`, matching the existing pattern for `compile_graph` (`__init__.py:50` today).
- `src/bigquery_agent_analytics/__init__.py` — add the new public surface to the try/except re-export block (same pattern as `Client`, `CodeEvaluator`, etc.):
- `src/bigquery_agent_analytics/__init__.py` — add the new public surface to the try/except re-export block (same pattern as `Client`, `SystemEvaluator`, etc.):
- `OntologyRuntime` from `.ontology_runtime`
- `EntityResolver`, `ExactMatchResolver`, `SynonymResolver`, `Candidate`, `ResolveResult` from `.entity_resolver`
- `ConceptIndexMismatchError`, `ConceptIndexProvenanceMissing`, `ConceptIndexInconsistentPair`, `ConceptIndexRefreshed` from `.ontology_runtime`