3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -9,6 +9,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Added

- Added `SystemEvaluator` as the preferred name for deterministic/code-defined metrics.
- Kept `CodeEvaluator` as a backward-compatible alias (deprecated but supported).

- **``bqaa-revalidate-extractors`` CLI** in
`bigquery_agent_analytics.extractor_compilation.cli_revalidate`
and
Expand Down
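The changelog entry above keeps `CodeEvaluator` as a backward-compatible alias, but this diff does not show how the alias is defined. The following is a minimal sketch of one common approach, assuming a plain module-level name binding; the class body is a hypothetical stand-in, not the SDK's implementation.

```python
class SystemEvaluator:
    """Hypothetical stand-in for the SDK class; only the naming matters here."""

    def __init__(self, name: str = "evaluator") -> None:
        self.name = name


# Assumption: the old name is bound to the same class object, so existing
# imports, isinstance checks, and constructor calls keep working unchanged.
CodeEvaluator = SystemEvaluator

legacy = CodeEvaluator(name="my_quality_check")
print(type(legacy).__name__)  # prints "SystemEvaluator"
```

With a binding like this, code written against the old name produces instances of the new class, which is why the rename can ship without breaking callers.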
2 changes: 1 addition & 1 deletion README.md
@@ -123,7 +123,7 @@ src/bigquery_agent_analytics/
│ └── formatter.py # Output formatting (json/text/table)
├── Evaluation
│ ├── evaluators.py # CodeEvaluator + LLMAsJudge
│ ├── evaluators.py # SystemEvaluator + LLMAsJudge
│ ├── trace_evaluator.py # Trajectory matching & replay
│ ├── multi_trial.py # Multi-trial runner + pass@k
│ ├── grader_pipeline.py # Grader composition pipeline
42 changes: 21 additions & 21 deletions SDK.md
@@ -112,32 +112,32 @@ traces = client.list_traces(

## 3. Code-Based Evaluation (Deterministic Metrics)

`CodeEvaluator` runs deterministic, code-defined metric functions against session summaries. Each metric returns a score between 0.0 and 1.0.
`SystemEvaluator` runs deterministic, code-defined metric functions against session summaries. Each metric returns a score between 0.0 and 1.0.

### Pre-Built Evaluators

The SDK ships with seven ready-to-use evaluators:

```python
from bigquery_agent_analytics import CodeEvaluator
from bigquery_agent_analytics import SystemEvaluator

# Latency: fails when average latency exceeds the budget
evaluator = CodeEvaluator.latency(threshold_ms=5000)
# Latency: score degrades linearly as avg latency approaches threshold
evaluator = SystemEvaluator.latency(threshold_ms=5000)

# Turn count: fails when sessions use too many back-and-forth turns
evaluator = CodeEvaluator.turn_count(max_turns=10)
# Turn count: penalizes sessions with too many back-and-forth turns
evaluator = SystemEvaluator.turn_count(max_turns=10)

# Error rate: fails on high tool error rates
evaluator = CodeEvaluator.error_rate(max_error_rate=0.1)
# Error rate: penalizes high tool error rates
evaluator = SystemEvaluator.error_rate(max_error_rate=0.1)

# Token efficiency: checks total token usage stays within budget
evaluator = CodeEvaluator.token_efficiency(max_tokens=50000)
evaluator = SystemEvaluator.token_efficiency(max_tokens=50000)

# Context cache hit rate: checks repeated prompt-prefix reuse
evaluator = CodeEvaluator.context_cache_hit_rate(min_hit_rate=0.5)
evaluator = SystemEvaluator.context_cache_hit_rate(min_hit_rate=0.5)

# Cost per session: checks estimated USD cost stays under budget
evaluator = CodeEvaluator.cost_per_session(
evaluator = SystemEvaluator.cost_per_session(
max_cost_usd=1.0,
input_cost_per_1k=0.00025,
output_cost_per_1k=0.00125,
@@ -173,7 +173,7 @@ Define your own metric functions and chain multiple metrics together:

```python
evaluator = (
CodeEvaluator(name="my_quality_check")
SystemEvaluator(name="my_quality_check")
.add_metric(
name="latency",
fn=lambda s: 1.0 - min(s.get("avg_latency_ms", 0) / 5000, 1.0),
@@ -216,7 +216,7 @@ Run evaluation across all sessions matching a filter:
from bigquery_agent_analytics import TraceFilter

report = client.evaluate(
evaluator=CodeEvaluator.latency(threshold_ms=3000),
evaluator=SystemEvaluator.latency(threshold_ms=3000),
filters=TraceFilter(agent_id="my_agent"),
)

@@ -561,7 +561,7 @@ pass_pow_k = compute_pass_pow_k(num_trials=10, num_passed=8) # ~0.107

## 7. Grader Composition Pipeline

Combine multiple evaluators (`CodeEvaluator` + `LLMAsJudge` + custom functions) into a single aggregated verdict using configurable scoring strategies.
Combine multiple evaluators (`SystemEvaluator` + `LLMAsJudge` + custom functions) into a single aggregated verdict using configurable scoring strategies.

### Scoring Strategies

@@ -575,7 +575,7 @@ Combine multiple evaluators (`CodeEvaluator` + `LLMAsJudge` + custom functions)

```python
from bigquery_agent_analytics import (
CodeEvaluator, GraderPipeline, LLMAsJudge,
SystemEvaluator, GraderPipeline, LLMAsJudge,
WeightedStrategy, GraderResult,
)

@@ -588,8 +588,8 @@ pipeline = (
},
threshold=0.6,
))
.add_code_grader(CodeEvaluator.latency(threshold_ms=5000), weight=0.2)
.add_code_grader(CodeEvaluator.cost_per_session(max_cost_usd=0.50), weight=0.1)
.add_code_grader(SystemEvaluator.latency(threshold_ms=5000), weight=0.2)
.add_code_grader(SystemEvaluator.cost_per_session(max_cost_usd=0.50), weight=0.1)
.add_llm_grader(LLMAsJudge.correctness(threshold=0.7), weight=0.7)
)

@@ -618,8 +618,8 @@ from bigquery_agent_analytics import BinaryStrategy

pipeline = (
GraderPipeline(BinaryStrategy())
.add_code_grader(CodeEvaluator.latency(threshold_ms=3000))
.add_code_grader(CodeEvaluator.error_rate(max_error_rate=0.05))
.add_code_grader(SystemEvaluator.latency(threshold_ms=3000))
.add_code_grader(SystemEvaluator.error_rate(max_error_rate=0.05))
.add_llm_grader(LLMAsJudge.hallucination(threshold=0.8))
)

@@ -649,7 +649,7 @@ def business_rules_grader(context):

pipeline = (
GraderPipeline(BinaryStrategy())
.add_code_grader(CodeEvaluator.latency())
.add_code_grader(SystemEvaluator.latency())
.add_custom_grader("business_rules", business_rules_grader)
)
```
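The weighted pipeline earlier in this file assigns weights of 0.2, 0.1, and 0.7 with a pass threshold of 0.6. How `WeightedStrategy` combines grader scores is not shown in this diff, so the sketch below assumes a weight-normalized average compared against the threshold. `weighted_verdict` is a hypothetical helper, not an SDK function, and the scores are made-up inputs.

```python
def weighted_verdict(scores, weights, threshold):
    """Return (final_score, passed) for per-grader scores in [0.0, 1.0].

    Assumption: the final score is the weight-normalized average of the
    grader scores, and the verdict passes when it meets the threshold.
    """
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    final = sum(scores[name] * weights[name] for name in scores) / total_weight
    return final, final >= threshold


# Weights mirror the example pipeline; the scores are illustrative only.
weights = {"latency": 0.2, "cost": 0.1, "correctness": 0.7}
scores = {"latency": 1.0, "cost": 0.5, "correctness": 0.8}

final, passed = weighted_verdict(scores, weights, threshold=0.6)
print(f"{final:.2f}", passed)
```

Under this assumption the heavily weighted correctness grader dominates: a perfect latency score cannot rescue a session whose correctness score is low.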
@@ -2057,7 +2057,7 @@ bigquery_agent_analytics/
│ Core
│ ├── client.py ← High-level SDK entry point
│ ├── trace.py ← Trace/Span reconstruction & DAG rendering
│ └── evaluators.py ← CodeEvaluator + LLMAsJudge + SQL templates
│ └── evaluators.py ← SystemEvaluator + LLMAsJudge + SQL templates
│ Evaluation Harness
│ ├── trace_evaluator.py ← BigQueryTraceEvaluator, trajectory matching, replay
30 changes: 15 additions & 15 deletions deploy/remote_function/dispatch.py
@@ -25,7 +25,7 @@
from typing import Any

from bigquery_agent_analytics import Client
from bigquery_agent_analytics import CodeEvaluator
from bigquery_agent_analytics import SystemEvaluator
from bigquery_agent_analytics import LLMAsJudge
from bigquery_agent_analytics import serialize
from bigquery_agent_analytics import TraceFilter
@@ -137,43 +137,43 @@ def build_filters(params):


def build_evaluator(params):
"""Build CodeEvaluator from params dict."""
"""Build SystemEvaluator from params dict."""
metric = params.get("metric", "latency")
threshold = params.get("threshold")
fail_on_missing_telemetry = _bool_param(
params.get("fail_on_missing_telemetry", False)
)

factories_with_t = {
"latency": lambda t: CodeEvaluator.latency(threshold_ms=t),
"error_rate": lambda t: CodeEvaluator.error_rate(
"latency": lambda t: SystemEvaluator.latency(threshold_ms=t),
"error_rate": lambda t: SystemEvaluator.error_rate(
max_error_rate=t,
),
"turn_count": lambda t: CodeEvaluator.turn_count(
"turn_count": lambda t: SystemEvaluator.turn_count(
max_turns=int(t),
),
"token_efficiency": lambda t: CodeEvaluator.token_efficiency(
"token_efficiency": lambda t: SystemEvaluator.token_efficiency(
max_tokens=int(t),
),
"ttft": lambda t: CodeEvaluator.ttft(threshold_ms=t),
"cost": lambda t: CodeEvaluator.cost_per_session(
"ttft": lambda t: SystemEvaluator.ttft(threshold_ms=t),
"cost": lambda t: SystemEvaluator.cost_per_session(
max_cost_usd=t,
),
}
factories_default = {
"latency": CodeEvaluator.latency,
"error_rate": CodeEvaluator.error_rate,
"turn_count": CodeEvaluator.turn_count,
"token_efficiency": CodeEvaluator.token_efficiency,
"ttft": CodeEvaluator.ttft,
"cost": CodeEvaluator.cost_per_session,
"latency": SystemEvaluator.latency,
"error_rate": SystemEvaluator.error_rate,
"turn_count": SystemEvaluator.turn_count,
"token_efficiency": SystemEvaluator.token_efficiency,
"ttft": SystemEvaluator.ttft,
"cost": SystemEvaluator.cost_per_session,
}

if metric == "context_cache_hit_rate":
kwargs = {"fail_on_missing_telemetry": fail_on_missing_telemetry}
if threshold is not None:
kwargs["min_hit_rate"] = threshold
return CodeEvaluator.context_cache_hit_rate(**kwargs)
return SystemEvaluator.context_cache_hit_rate(**kwargs)

if metric not in factories_with_t:
raise ValueError(f"Unknown metric: {metric!r}")
Expand Down
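The `build_evaluator` hunk above picks a factory by metric name from two lookup tables. The rest of the function lies outside the visible hunk, so the sketch below assumes one plausible completion: call the zero-argument default when `threshold` is missing, otherwise pass the threshold through. `make_latency` and `make_turn_count`, along with their default values, are hypothetical stand-ins that return plain dicts rather than SDK evaluators.

```python
from typing import Any, Callable


def make_latency(threshold_ms: float = 5000.0) -> dict:
    """Hypothetical stand-in for a latency evaluator factory."""
    return {"metric": "latency", "threshold_ms": threshold_ms}


def make_turn_count(max_turns: int = 10) -> dict:
    """Hypothetical stand-in for a turn-count evaluator factory."""
    return {"metric": "turn_count", "max_turns": max_turns}


# One table adapts a generic threshold to each factory's keyword argument;
# the other calls the factory with its own defaults.
FACTORIES_WITH_T: dict[str, Callable[[Any], dict]] = {
    "latency": lambda t: make_latency(threshold_ms=t),
    "turn_count": lambda t: make_turn_count(max_turns=int(t)),
}
FACTORIES_DEFAULT: dict[str, Callable[[], dict]] = {
    "latency": make_latency,
    "turn_count": make_turn_count,
}


def build_evaluator(params: dict) -> dict:
    metric = params.get("metric", "latency")
    threshold = params.get("threshold")
    if metric not in FACTORIES_WITH_T:
        raise ValueError(f"Unknown metric: {metric!r}")
    # Assumed completion: fall back to factory defaults without a threshold.
    if threshold is None:
        return FACTORIES_DEFAULT[metric]()
    return FACTORIES_WITH_T[metric](threshold)


print(build_evaluator({"metric": "turn_count", "threshold": "7"}))
```

Keeping two tables lets each metric coerce the incoming threshold to the type its factory expects (for example `int(t)` for turn counts) while leaving the factory's own defaults untouched when no threshold is supplied.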
22 changes: 11 additions & 11 deletions docs/design.md
@@ -150,7 +150,7 @@ As demonstrated in the [e2e demo](../examples/e2e_demo.py):

**Phase 2 — Evaluation:**
1. `Client.get_trace()` retrieves all events for a session
2. `CodeEvaluator` preset factories assess latency, turn count, error rate, token efficiency
2. `SystemEvaluator` preset factories assess latency, turn count, error rate, token efficiency
3. `LLMAsJudge.correctness()` performs semantic evaluation via BigQuery `AI.GENERATE`
4. `BigQueryTraceEvaluator.evaluate_session()` performs trajectory matching against golden tool sequences

@@ -208,7 +208,7 @@ As demonstrated in the [e2e demo](../examples/e2e_demo.py):
│ categorical_evaluator│ │ ontology_* (6 modules)│ │ cli │
│ categorical_views │ │ (YAML → AI.GENERATE → │ │ (Typer commands) │
│ (label evaluation) │ │ tables → PG → GQL) │ │ │
└──────────────────────┘ └──────────────────────┘ └──────────────────┘
└──────────────────┘ └──────────────────┘ └──────────────────┘

┌──────────────────┐ ┌───────────────────┐
│ udf_kernels │ │ serialization │
@@ -248,7 +248,7 @@ Aggregations, filtering, joins, and even LLM evaluation (via `AI.GENERATE`) are
LLM-based evaluation can run via (1) BigQuery `AI.GENERATE`, (2) legacy BigQuery ML `ML.GENERATE_TEXT`, or (3) the Gemini API directly. This maximizes compatibility across different GCP configurations.

**Decision 4: Composition over inheritance.**
The `GraderPipeline` composes `CodeEvaluator`, `LLMAsJudge`, and custom functions via a builder pattern rather than requiring them to share a common base class. The `BigQueryMemoryService` composes four internal services rather than extending a single monolithic class.
The `GraderPipeline` composes `SystemEvaluator`, `LLMAsJudge`, and custom functions via a builder pattern rather than requiring them to share a common base class. The `BigQueryMemoryService` composes four internal services rather than extending a single monolithic class.

---

@@ -396,7 +396,7 @@ Each field generates a separate `AND` condition with a corresponding `bigquery.S

This module contains two evaluator classes and the SQL templates that power batch evaluation.

#### 4.3.1 `CodeEvaluator`
#### 4.3.1 `SystemEvaluator`

Deterministic evaluation using code-defined metric functions.

@@ -641,7 +641,7 @@ Combines heterogeneous evaluators into a unified verdict using a strategy patter
┌──────────────┼──────────────┐
▼ ▼ ▼
CodeEvaluator LLMAsJudge Custom Fn
SystemEvaluator LLMAsJudge Custom Fn
(sync) (async) (sync)
│ │ │
▼ ▼ ▼
@@ -1273,7 +1273,7 @@ results = client.query(formatted, job_config=job_config)

```
Evaluation
├── Deterministic (CodeEvaluator)
├── Deterministic (SystemEvaluator)
│ ├── Latency
│ ├── Turn count
│ ├── Error rate
@@ -1321,7 +1321,7 @@ All evaluation scores in the SDK are normalized to `[0.0, 1.0]`:

| Mode | Evaluator | Where Computation Runs |
|------|-----------|----------------------|
| Single session (sync) | `CodeEvaluator.evaluate_session()` | Python |
| Single session (sync) | `SystemEvaluator.evaluate_session()` | Python |
| Single session (async) | `LLMAsJudge.evaluate_session()` | Gemini API |
| Batch via Client | `Client.evaluate()` | BigQuery (SQL + AI.GENERATE) |
| Trajectory matching | `BigQueryTraceEvaluator.evaluate_session()` | BigQuery (fetch) + Python (matching) |
@@ -1420,7 +1420,7 @@ Synchronous (user-facing):
├── Client.drift_detection()
├── Client.insights()
├── Client.deep_analysis()
├── CodeEvaluator.evaluate_session()
├── SystemEvaluator.evaluate_session()
├── EvalSuite.*
├── EvalValidator.*
└── BigFramesEvaluator.*
@@ -1480,10 +1480,10 @@ results = await asyncio.gather(*[_run_one(t) for t in tasks])

## 10. Extensibility & Plugin Points

### 10.1 Custom Metrics (CodeEvaluator)
### 10.1 Custom Metrics (SystemEvaluator)

```python
evaluator = CodeEvaluator(name="custom").add_metric(
evaluator = SystemEvaluator(name="custom").add_metric(
name="business_metric",
fn=lambda session: your_scoring_logic(session),
threshold=0.7,
@@ -1586,7 +1586,7 @@ All tests mock BigQuery — no GCP credentials or live BigQuery access is needed
```
tests/
├── test_sdk_client.py # Client integration tests
├── test_sdk_evaluators.py # CodeEvaluator + LLMAsJudge
├── test_sdk_evaluators.py # SystemEvaluator + LLMAsJudge
├── test_sdk_trace.py # Trace/Span reconstruction
├── test_sdk_feedback.py # Drift detection
├── test_sdk_insights.py # Insights pipeline
6 changes: 3 additions & 3 deletions docs/hatteras_evaluation.md
@@ -7,7 +7,7 @@ agent sessions into user-defined categories directly against traces stored in
BigQuery, without relying on an external service.

This should be implemented as a new categorical evaluation subsystem, not as
an overload of the existing numeric `CodeEvaluator` / `LLMAsJudge` report
an overload of the existing numeric `SystemEvaluator` / `LLMAsJudge` report
path.

The goal is to support Hatteras-like functionality inside the SDK:
@@ -22,7 +22,7 @@ The goal is to support Hatteras-like functionality inside the SDK:

Today the SDK supports two major evaluation modes:

- deterministic numeric scoring via `CodeEvaluator`
- deterministic numeric scoring via `SystemEvaluator`
- semantic numeric scoring via `LLMAsJudge`

What is missing is a first-class way to answer questions like:
@@ -60,7 +60,7 @@ That capability is useful for:
This design is not proposing:

- a full clone of an external Hatteras service
- a replacement for `CodeEvaluator`
- a replacement for `SystemEvaluator`
- a replacement for `LLMAsJudge`
- a new remote function or Python UDF surface in the first phase
- real-time ingestion-time classification in phase 1
2 changes: 1 addition & 1 deletion docs/implementation_plan_concept_index_runtime.md
@@ -165,7 +165,7 @@ Work: `bigquery_ontology/contrib/advertising/` stub with Yahoo's resolver (if co
- `src/bigquery_ontology/graph_ddl_compiler.py` — add `compile_concept_index(ontology, binding, *, output_table) -> str`. Preserve `compile_graph()` contract byte-identically. No changes to existing function bodies.
- `src/bigquery_ontology/cli.py:299` — `compile` command gains `--emit-concept-index` and `--concept-index-table` flags. When absent, behavior is byte-identical to today.
- `src/bigquery_ontology/__init__.py` — add `from .graph_ddl_compiler import compile_concept_index` so the new public function is importable as `from bigquery_ontology import compile_concept_index`, matching the existing pattern for `compile_graph` (`__init__.py:50` today).
- `src/bigquery_agent_analytics/__init__.py` — add the new public surface to the try/except re-export block (same pattern as `Client`, `CodeEvaluator`, etc.):
- `src/bigquery_agent_analytics/__init__.py` — add the new public surface to the try/except re-export block (same pattern as `Client`, `SystemEvaluator`, etc.):
- `OntologyRuntime` from `.ontology_runtime`
- `EntityResolver`, `ExactMatchResolver`, `SynonymResolver`, `Candidate`, `ResolveResult` from `.entity_resolver`
- `ConceptIndexMismatchError`, `ConceptIndexProvenanceMissing`, `ConceptIndexInconsistentPair`, `ConceptIndexRefreshed` from `.ontology_runtime`