Commit ee0ffbc

Finalize SystemEvaluator transition and address PR reviews
- Set SystemEvaluator default name to 'code_evaluator' for backward compatibility.
- Update docstring in grader_pipeline.py to use neutral language.
- Rename internal CLI constant _CODE_EVALUATORS to _SYSTEM_EVALUATORS.
- Fix ASCII art in docs/design.md.
- Clean up CHANGELOG.md and SDK.md.
1 parent 3454132 commit ee0ffbc

16 files changed

Lines changed: 3762 additions & 4169 deletions

CHANGELOG.md

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### Added

 - Added `SystemEvaluator` as the preferred name for deterministic/code-defined metrics.
-- Kept `CodeEvaluator` as a backward-compatible alias (deprecated but supported).
+- Kept `CodeEvaluator` as a backward-compatible alias.
 - **Compiled-extractor rollout guide** at
 [`docs/extractor_compilation_rollout_guide.md`](docs/extractor_compilation_rollout_guide.md).
 Operational playbook for the Phase C pipeline (issue
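
Note on the alias: the changelog entry does not show how `CodeEvaluator` is kept. One common way to provide such an alias is a plain module-level name binding, sketched below. The class body is a hypothetical stand-in and not the SDK's implementation; only the default name `'code_evaluator'` is taken from the commit message, and the constructor signature is an assumption.

```python
class SystemEvaluator:
    """Hypothetical stand-in; the real SDK constructor is not shown in this commit."""

    def __init__(self, name: str = "code_evaluator") -> None:
        # Default name taken from the commit message (kept for backward compatibility).
        self.name = name


# Backward-compatible alias: both names are bound to the same class object,
# so existing imports and isinstance checks keep working.
CodeEvaluator = SystemEvaluator

legacy = CodeEvaluator()
print(legacy.name, CodeEvaluator is SystemEvaluator)  # prints: code_evaluator True
```

With a plain binding like this, no deprecation warning is emitted when the old name is used, which would be consistent with the "(deprecated but supported)" wording being dropped from the entry.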

SDK.md

Lines changed: 3 additions & 3 deletions
@@ -121,13 +121,13 @@ The SDK ships with seven ready-to-use evaluators:
 ```python
 from bigquery_agent_analytics import SystemEvaluator

-# Latency: score degrades linearly as avg latency approaches threshold
+# Latency: fails when average latency exceeds the threshold
 evaluator = SystemEvaluator.latency(threshold_ms=5000)

-# Turn count: penalizes sessions with too many back-and-forth turns
+# Turn count: fails when session turns exceed the max turns
 evaluator = SystemEvaluator.turn_count(max_turns=10)

-# Error rate: penalizes high tool error rates
+# Error rate: fails when tool error rate exceeds the max error rate
 evaluator = SystemEvaluator.error_rate(max_error_rate=0.1)

 # Token efficiency: checks total token usage stays within budget

dashboard/app.py

Lines changed: 4 additions & 3 deletions
@@ -1,8 +1,9 @@
-import streamlit as st
-import pandas as pd
+import re
+
 from google.cloud import bigquery
+import pandas as pd
 import plotly.express as px
-import re
+import streamlit as st

 # --- 1. Page Configuration ---
 st.set_page_config(page_title="Agent Analytics V2", layout="wide")

deploy/remote_function/dispatch.py

Lines changed: 1 addition & 1 deletion
@@ -25,9 +25,9 @@
 from typing import Any

 from bigquery_agent_analytics import Client
-from bigquery_agent_analytics import SystemEvaluator
 from bigquery_agent_analytics import LLMAsJudge
 from bigquery_agent_analytics import serialize
+from bigquery_agent_analytics import SystemEvaluator
 from bigquery_agent_analytics import TraceFilter
 from bigquery_agent_analytics._deploy_runtime import resolve_client_options

deploy/streaming_evaluation/main.py

Lines changed: 0 additions & 1 deletion
@@ -19,7 +19,6 @@
 from flask import Flask
 from flask import jsonify
 from flask import request
-
 from worker import handle_scheduled_run

 app = Flask(__name__)

docs/design.md

Lines changed: 1 addition & 1 deletion
@@ -208,7 +208,7 @@ As demonstrated in the [e2e demo](../examples/e2e_demo.py):
 │ categorical_evaluator│ │ ontology_* (6 modules)│ │ cli │
 │ categorical_views │ │ (YAML → AI.GENERATE → │ │ (Typer commands) │
 │ (label evaluation) │ │ tables → PG → GQL) │ │ │
-└──────────────────┘ └──────────────────┘ └──────────────────┘
+└──────────────────────┘ └──────────────────────┘ └──────────────────┘

 ┌──────────────────┐ ┌───────────────────┐
 │ udf_kernels │ │ serialization │

docs/implementation_plan_remote_function.md

Lines changed: 7 additions & 24 deletions
@@ -220,30 +220,13 @@ Dispatch logic:
 ```python
 # Map CLI --evaluator to SDK factory
 EVALUATOR_FACTORIES = {
-"latency": (
-lambda t: SystemEvaluator.latency(threshold_ms=t),
-lambda: SystemEvaluator.latency(),
-),
-"error_rate": (
-lambda t: SystemEvaluator.error_rate(max_error_rate=t),
-lambda: SystemEvaluator.error_rate(),
-),
-"turn_count": (
-lambda t: SystemEvaluator.turn_count(max_turns=int(t)),
-lambda: SystemEvaluator.turn_count(),
-),
-"token_efficiency": (
-lambda t: SystemEvaluator.token_efficiency(max_tokens=int(t)),
-lambda: SystemEvaluator.token_efficiency(),
-),
-"ttft": (
-lambda t: SystemEvaluator.ttft(threshold_ms=t),
-lambda: SystemEvaluator.ttft(),
-),
-"cost": (
-lambda t: SystemEvaluator.cost_per_session(max_cost_usd=t),
-lambda: SystemEvaluator.cost_per_session(),
-),
+"latency": lambda t: SystemEvaluator.latency(threshold_ms=t),
+"error_rate": lambda t: SystemEvaluator.error_rate(max_error_rate=t),
+"turn_count": lambda t: SystemEvaluator.turn_count(max_turns=int(t)),
+"token_efficiency": lambda t: SystemEvaluator.token_efficiency(max_tokens=int(t)),
+"ttft": lambda t: SystemEvaluator.ttft(threshold_ms=t),
+"cost": lambda t: SystemEvaluator.cost_per_session(max_cost_usd=t),
+"llm-judge": None, # special handling
 }
 # context_cache_hit_rate is special-cased so callers can pass
 # fail_on_missing_telemetry in addition to threshold/min_hit_rate.
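
Note on the simplified table: the hunk shows only the mapping, not the code that consumes it. As a rough sketch of how a caller might resolve an evaluator name against a table of this shape, consider the following. `StubEvaluator` and `build_evaluator` are hypothetical names introduced for illustration so the example runs without the SDK; the real dispatch code in the plan may differ.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class StubEvaluator:
    """Hypothetical stand-in for SystemEvaluator so the sketch runs without the SDK."""

    metric: str
    threshold: float


# Same shape as the table in the diff: one threshold-taking factory per name,
# with None marking an entry that needs special handling.
EVALUATOR_FACTORIES: Dict[str, Optional[Callable[[float], StubEvaluator]]] = {
    "latency": lambda t: StubEvaluator("latency", t),
    "turn_count": lambda t: StubEvaluator("turn_count", int(t)),
    "llm-judge": None,  # special handling
}


def build_evaluator(name: str, threshold: float) -> StubEvaluator:
    """Hypothetical helper: resolve an evaluator name to an evaluator instance."""
    if name not in EVALUATOR_FACTORIES:
        raise ValueError(f"unknown evaluator: {name!r}")
    factory = EVALUATOR_FACTORIES[name]
    if factory is None:
        # Entries mapped to None are dispatched elsewhere; that path is out of scope here.
        raise NotImplementedError(f"{name!r} is dispatched separately")
    return factory(threshold)
```

Because each entry in the new table is a single one-argument lambda, this sketch assumes the caller always supplies a threshold. How the real code handles a missing threshold after the zero-argument variants were removed is not visible in this hunk.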

examples/agent_improvement_cycle/DEMO_NARRATION.md

Lines changed: 1 addition & 1 deletion
@@ -217,5 +217,5 @@ By default, the script runs a single cycle and stops. The `--auto` flag enables
 ## [CLOSING]

 That's the agent improvement cycle. Capture sessions with the BigQuery Agent Analytics Plugin, evaluate quality with the SDK's LLM judge,
-check operational metrics with the SDK's SystemEvaluator, optimize prompts with Vertex AI, and measure the results — all automated, all repeatable.
+check operational metrics with the SDK's SystemEvaluator, optimize prompts with Vertex AI, and measure the results — all automated, all repeatable.
 The golden eval set grows with every cycle, so failures you discover today become regression tests for tomorrow.
