GoogleCloudPlatform · haiyuan-eng-google · Jun 17, 2026 · May 3, 2026 · May 12, 2026 · Jun 5, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -7,6 +7,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+### Added
+
+- `SystemEvaluator` as the preferred name for deterministic, code-defined metrics.
+- `CodeEvaluator` is kept as a backward-compatible alias. Calling `CodeEvaluator()` now emits `evaluator_name="system_evaluator"`.
+
 ## [0.3.4] - 2026-06-10
 
 ### Release highlights
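
Reviewer note (illustrative, not part of the patch): the changelog entry above says `CodeEvaluator` stays available as an alias and that `CodeEvaluator()` now emits `evaluator_name="system_evaluator"`. One common way to get that behavior is a plain module-level alias. The sketch below assumes that approach; the class body, the `report()` method, and the default-name mechanism are hypothetical stand-ins, and the SDK may implement the alias differently (for example with a subclass or a deprecation warning).

```python
# Hypothetical stand-in for the SDK class; only the naming behavior is modeled.
class SystemEvaluator:
    """Toy evaluator that records the name it reports in results."""

    def __init__(self, name: str = "system_evaluator"):
        self.name = name

    def report(self) -> dict:
        # The real SDK emits richer reports; this keeps only the name field.
        return {"evaluator_name": self.name}


# Backward-compatible alias: the old spelling points at the same class object.
CodeEvaluator = SystemEvaluator

legacy = CodeEvaluator()
print(legacy.report())                   # {'evaluator_name': 'system_evaluator'}
print(CodeEvaluator is SystemEvaluator)  # True
```

Under this assumption, code that still imports `CodeEvaluator` keeps working, but anything that matches on the previously emitted default name would see the new value.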

diff --git a/README.md b/README.md
@@ -164,7 +164,7 @@ src/bigquery_agent_analytics/
 │   └── formatter.py               # Output formatting (json/text/table)
 │
 ├── Evaluation
-│   ├── evaluators.py              # CodeEvaluator + LLMAsJudge
+│   ├── evaluators.py              # SystemEvaluator + LLMAsJudge
 │   ├── trace_evaluator.py         # Trajectory matching & replay
 │   ├── multi_trial.py             # Multi-trial runner + pass@k
 │   ├── grader_pipeline.py         # Grader composition pipeline

diff --git a/SDK.md b/SDK.md
@@ -112,32 +112,32 @@ traces = client.list_traces(
 
 ## 3. Code-Based Evaluation (Deterministic Metrics)
 
-`CodeEvaluator` runs deterministic, code-defined metric functions against session summaries. Each metric returns a score between 0.0 and 1.0.
+`SystemEvaluator` runs deterministic, code-defined metric functions against session summaries. Each metric returns a score between 0.0 and 1.0.
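
Reviewer note (illustrative, not part of the patch): the custom-metric hunk further down in this file scores latency with `1.0 - min(s.get("avg_latency_ms", 0) / 5000, 1.0)`. The sketch below pulls that expression into a standalone function to show how a raw value maps onto the 0.0 to 1.0 range. The function name and the `budget_ms` parameter are hypothetical; only the formula and the `avg_latency_ms` key come from the diff.

```python
def latency_score(session: dict, budget_ms: float = 5000.0) -> float:
    """Linear score: 1.0 at zero latency, 0.0 at or beyond the budget."""
    avg_ms = session.get("avg_latency_ms", 0)
    return 1.0 - min(avg_ms / budget_ms, 1.0)


print(latency_score({"avg_latency_ms": 1250}))   # 0.75
print(latency_score({"avg_latency_ms": 10000}))  # 0.0
print(latency_score({}))                         # 1.0
```

Note that with a default of `0`, a session that lacks latency telemetry scores a perfect 1.0 under this formula. Whether that is desirable depends on the use case.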
 
 ### Pre-Built Evaluators
 
 The SDK ships with seven ready-to-use evaluators:
 
 ```python
-from bigquery_agent_analytics import CodeEvaluator
+from bigquery_agent_analytics import SystemEvaluator
 
-# Latency: fails when average latency exceeds the budget
-evaluator = CodeEvaluator.latency(threshold_ms=5000)
+# Latency: fails when average latency exceeds the threshold
+evaluator = SystemEvaluator.latency(threshold_ms=5000)
 
-# Turn count: fails when sessions use too many back-and-forth turns
-evaluator = CodeEvaluator.turn_count(max_turns=10)
+# Turn count: fails when a session's turn count exceeds max_turns
+evaluator = SystemEvaluator.turn_count(max_turns=10)
 
-# Error rate: fails on high tool error rates
-evaluator = CodeEvaluator.error_rate(max_error_rate=0.1)
+# Error rate: fails when the tool error rate exceeds max_error_rate
+evaluator = SystemEvaluator.error_rate(max_error_rate=0.1)
 
 # Token efficiency: checks total token usage stays within budget
-evaluator = CodeEvaluator.token_efficiency(max_tokens=50000)
+evaluator = SystemEvaluator.token_efficiency(max_tokens=50000)
 
 # Context cache hit rate: checks repeated prompt-prefix reuse
-evaluator = CodeEvaluator.context_cache_hit_rate(min_hit_rate=0.5)
+evaluator = SystemEvaluator.context_cache_hit_rate(min_hit_rate=0.5)
 
 # Cost per session: checks estimated USD cost stays under budget
-evaluator = CodeEvaluator.cost_per_session(
+evaluator = SystemEvaluator.cost_per_session(
     max_cost_usd=1.0,
     input_cost_per_1k=0.00025,
     output_cost_per_1k=0.00125,
@@ -173,7 +173,7 @@ Define your own metric functions and chain multiple metrics together:
 
 ```python
 evaluator = (
-    CodeEvaluator(name="my_quality_check")
+    SystemEvaluator(name="my_quality_check")
     .add_metric(
         name="latency",
         fn=lambda s: 1.0 - min(s.get("avg_latency_ms", 0) / 5000, 1.0),
@@ -216,7 +216,7 @@ Run evaluation across all sessions matching a filter:
 from bigquery_agent_analytics import TraceFilter
 
 report = client.evaluate(
-    evaluator=CodeEvaluator.latency(threshold_ms=3000),
+    evaluator=SystemEvaluator.latency(threshold_ms=3000),
     filters=TraceFilter(agent_id="my_agent"),
 )
 
@@ -561,7 +561,7 @@ pass_pow_k = compute_pass_pow_k(num_trials=10, num_passed=8)  # ~0.107
 
 ## 7. Grader Composition Pipeline
 
-Combine multiple evaluators (`CodeEvaluator` + `LLMAsJudge` + custom functions) into a single aggregated verdict using configurable scoring strategies.
+Combine multiple evaluators (`SystemEvaluator` + `LLMAsJudge` + custom functions) into a single aggregated verdict using configurable scoring strategies.
 
 ### Scoring Strategies
 
@@ -575,7 +575,7 @@ Combine multiple evaluators (`CodeEvaluator` + `LLMAsJudge` + custom functions)
 
 ```python
 from bigquery_agent_analytics import (
-    CodeEvaluator, GraderPipeline, LLMAsJudge,
+    SystemEvaluator, GraderPipeline, LLMAsJudge,
     WeightedStrategy, GraderResult,
 )
 
@@ -588,8 +588,8 @@ pipeline = (
         },
         threshold=0.6,
     ))
-    .add_code_grader(CodeEvaluator.latency(threshold_ms=5000), weight=0.2)
-    .add_code_grader(CodeEvaluator.cost_per_session(max_cost_usd=0.50), weight=0.1)
+    .add_system_grader(SystemEvaluator.latency(threshold_ms=5000), weight=0.2)
+    .add_system_grader(SystemEvaluator.cost_per_session(max_cost_usd=0.50), weight=0.1)
     .add_llm_grader(LLMAsJudge.correctness(threshold=0.7), weight=0.7)
 )
 
@@ -618,8 +618,8 @@ from bigquery_agent_analytics import BinaryStrategy
 
 pipeline = (
     GraderPipeline(BinaryStrategy())
-    .add_code_grader(CodeEvaluator.latency(threshold_ms=3000))
-    .add_code_grader(CodeEvaluator.error_rate(max_error_rate=0.05))
+    .add_system_grader(SystemEvaluator.latency(threshold_ms=3000))
+    .add_system_grader(SystemEvaluator.error_rate(max_error_rate=0.05))
     .add_llm_grader(LLMAsJudge.hallucination(threshold=0.8))
 )
 
@@ -649,7 +649,7 @@ def business_rules_grader(context):
 
 pipeline = (
     GraderPipeline(BinaryStrategy())
-    .add_code_grader(CodeEvaluator.latency())
+    .add_system_grader(SystemEvaluator.latency())
     .add_custom_grader("business_rules", business_rules_grader)
 )
 ```
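
Reviewer note (illustrative, not part of the patch): the pipeline examples in this file attach weights of 0.2, 0.1, and 0.7 to three graders with a threshold of 0.6, and use `BinaryStrategy` elsewhere. The sketch below shows one plausible reading of those two strategies: a weight-normalized average compared against a threshold, and an all-must-pass rule. Both function names, the grader keys, and the per-grader scores are hypothetical, and the SDK's actual aggregation may differ.

```python
from typing import Dict, Tuple


def weighted_verdict(
    scores: Dict[str, float], weights: Dict[str, float], threshold: float
) -> Tuple[float, bool]:
    """Combine per-grader scores into a weight-normalized average."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    combined = sum(scores[name] * w for name, w in weights.items()) / total_weight
    return combined, combined >= threshold


def binary_verdict(passed: Dict[str, bool]) -> bool:
    """Assumed all-must-pass rule: a single failing grader fails the run."""
    return all(passed.values())


# Weights and threshold mirror the diff; the scores are made up.
weights = {"latency": 0.2, "cost": 0.1, "correctness": 0.7}
scores = {"latency": 1.0, "cost": 0.5, "correctness": 0.8}

combined, ok = weighted_verdict(scores, weights, threshold=0.6)
print(round(combined, 2), ok)                                   # 0.81 True
print(binary_verdict({"latency": True, "error_rate": False}))   # False
```

With these weights the LLM grader dominates: a correctness score of 0.8 contributes 0.56 of the 0.81 total on its own.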
@@ -2076,7 +2076,7 @@ bigquery_agent_analytics/
 │   Core
 │   ├── client.py              ← High-level SDK entry point
 │   ├── trace.py               ← Trace/Span reconstruction & DAG rendering
-│   └── evaluators.py          ← CodeEvaluator + LLMAsJudge + SQL templates
+│   └── evaluators.py          ← SystemEvaluator + LLMAsJudge + SQL templates
 │
 │   Evaluation Harness
 │   ├── trace_evaluator.py     ← BigQueryTraceEvaluator, trajectory matching, replay

diff --git a/deploy/remote_function/dispatch.py b/deploy/remote_function/dispatch.py
@@ -25,9 +25,9 @@
 from typing import Any
 
 from bigquery_agent_analytics import Client
-from bigquery_agent_analytics import CodeEvaluator
 from bigquery_agent_analytics import LLMAsJudge
 from bigquery_agent_analytics import serialize
+from bigquery_agent_analytics import SystemEvaluator
 from bigquery_agent_analytics import TraceFilter
 from bigquery_agent_analytics._deploy_runtime import resolve_client_options
 
@@ -137,43 +137,43 @@ def build_filters(params):
 
 
 def build_evaluator(params):
-  """Build CodeEvaluator from params dict."""
+  """Build a SystemEvaluator from a params dict."""
   metric = params.get("metric", "latency")
   threshold = params.get("threshold")
   fail_on_missing_telemetry = _bool_param(
       params.get("fail_on_missing_telemetry", False)
   )
 
   factories_with_t = {
-      "latency": lambda t: CodeEvaluator.latency(threshold_ms=t),
-      "error_rate": lambda t: CodeEvaluator.error_rate(
+      "latency": lambda t: SystemEvaluator.latency(threshold_ms=t),
+      "error_rate": lambda t: SystemEvaluator.error_rate(
           max_error_rate=t,
       ),
-      "turn_count": lambda t: CodeEvaluator.turn_count(
+      "turn_count": lambda t: SystemEvaluator.turn_count(
           max_turns=int(t),
       ),
-      "token_efficiency": lambda t: CodeEvaluator.token_efficiency(
+      "token_efficiency": lambda t: SystemEvaluator.token_efficiency(
           max_tokens=int(t),
       ),
-      "ttft": lambda t: CodeEvaluator.ttft(threshold_ms=t),
-      "cost": lambda t: CodeEvaluator.cost_per_session(
+      "ttft": lambda t: SystemEvaluator.ttft(threshold_ms=t),
+      "cost": lambda t: SystemEvaluator.cost_per_session(
           max_cost_usd=t,
       ),
   }
   factories_default = {
-      "latency": CodeEvaluator.latency,
-      "error_rate": CodeEvaluator.error_rate,
-      "turn_count": CodeEvaluator.turn_count,
-      "token_efficiency": CodeEvaluator.token_efficiency,
-      "ttft": CodeEvaluator.ttft,
-      "cost": CodeEvaluator.cost_per_session,
+      "latency": SystemEvaluator.latency,
+      "error_rate": SystemEvaluator.error_rate,
+      "turn_count": SystemEvaluator.turn_count,
+      "token_efficiency": SystemEvaluator.token_efficiency,
+      "ttft": SystemEvaluator.ttft,
+      "cost": SystemEvaluator.cost_per_session,
   }
 
   if metric == "context_cache_hit_rate":
     kwargs = {"fail_on_missing_telemetry": fail_on_missing_telemetry}
     if threshold is not None:
       kwargs["min_hit_rate"] = threshold
-    return CodeEvaluator.context_cache_hit_rate(**kwargs)
+    return SystemEvaluator.context_cache_hit_rate(**kwargs)
 
   if metric not in factories_with_t:
     raise ValueError(f"Unknown metric: {metric!r}")
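
Reviewer note (illustrative, not part of the patch): `build_evaluator` keeps two lookup tables, one for factories that take an explicit threshold and one for factories called with their defaults. The hunk ends before the final dispatch, so the last three lines of the function below are an assumption about how the tables are used. `StubEvaluator` and its default values are hypothetical stand-ins so that the sketch runs without the SDK installed.

```python
from typing import Any, Callable, Dict


class StubEvaluator:
    """Hypothetical stand-in for SystemEvaluator; records how it was built."""

    def __init__(self, metric: str, **kwargs: Any):
        self.metric = metric
        self.kwargs = kwargs

    @classmethod
    def latency(cls, threshold_ms: float = 5000.0) -> "StubEvaluator":
        return cls("latency", threshold_ms=threshold_ms)

    @classmethod
    def turn_count(cls, max_turns: int = 10) -> "StubEvaluator":
        return cls("turn_count", max_turns=max_turns)


# Factories that receive an explicit threshold, coerced where needed.
FACTORIES_WITH_T: Dict[str, Callable[[Any], StubEvaluator]] = {
    "latency": lambda t: StubEvaluator.latency(threshold_ms=t),
    "turn_count": lambda t: StubEvaluator.turn_count(max_turns=int(t)),
}
# Factories called with no arguments, so each preset's own default applies.
FACTORIES_DEFAULT: Dict[str, Callable[[], StubEvaluator]] = {
    "latency": StubEvaluator.latency,
    "turn_count": StubEvaluator.turn_count,
}


def build_evaluator(params: Dict[str, Any]) -> StubEvaluator:
    metric = params.get("metric", "latency")
    threshold = params.get("threshold")
    if metric not in FACTORIES_WITH_T:
        raise ValueError(f"Unknown metric: {metric!r}")
    # Assumed dispatch: explicit threshold wins, otherwise use preset defaults.
    if threshold is not None:
        return FACTORIES_WITH_T[metric](threshold)
    return FACTORIES_DEFAULT[metric]()


print(build_evaluator({"metric": "turn_count", "threshold": "7"}).kwargs)
# {'max_turns': 7}
```

Keeping the two tables separate means a missing threshold never has to be translated into a sentinel value; the preset's own default is used instead.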

diff --git a/docs/design.md b/docs/design.md
@@ -150,7 +150,7 @@ As demonstrated in the [e2e demo](../examples/e2e_demo.py):
 
 **Phase 2 — Evaluation:**
 1. `Client.get_trace()` retrieves all events for a session
-2. `CodeEvaluator` preset factories assess latency, turn count, error rate, token efficiency
+2. `SystemEvaluator` preset factories assess latency, turn count, error rate, and token efficiency
 3. `LLMAsJudge.correctness()` performs semantic evaluation via BigQuery `AI.GENERATE`
 4. `BigQueryTraceEvaluator.evaluate_session()` performs trajectory matching against golden tool sequences
 
@@ -204,11 +204,11 @@ As demonstrated in the [e2e demo](../examples/e2e_demo.py):
    └──────────────────┘  └───────────────────┘  │ world-change detect) │
                                                 └──────────────────────┘
 
-   ┌──────────────────────┐  ┌──────────────────────┐  ┌──────────────────┐
+   ┌──────────────────────┐  ┌───────────────────────┐  ┌──────────────────┐
    │ categorical_evaluator│  │ ontology_* (6 modules)│  │      cli         │
    │ categorical_views    │  │ (YAML → AI.GENERATE → │  │ (Typer commands) │
    │ (label evaluation)   │  │  tables → PG → GQL)   │  │                  │
-   └──────────────────────┘  └──────────────────────┘  └──────────────────┘
+   └──────────────────────┘  └───────────────────────┘  └──────────────────┘
 
    ┌──────────────────┐  ┌───────────────────┐
    │ udf_kernels      │  │ serialization     │
@@ -248,7 +248,7 @@ Aggregations, filtering, joins, and even LLM evaluation (via `AI.GENERATE`) are
 LLM-based evaluation can run via (1) BigQuery `AI.GENERATE`, (2) legacy BigQuery ML `ML.GENERATE_TEXT`, or (3) the Gemini API directly. This maximizes compatibility across different GCP configurations.
 
 **Decision 4: Composition over inheritance.**
-The `GraderPipeline` composes `CodeEvaluator`, `LLMAsJudge`, and custom functions via a builder pattern rather than requiring them to share a common base class. The `BigQueryMemoryService` composes four internal services rather than extending a single monolithic class.
+The `GraderPipeline` composes `SystemEvaluator`, `LLMAsJudge`, and custom functions via a builder pattern rather than requiring them to share a common base class. The `BigQueryMemoryService` composes four internal services rather than extending a single monolithic class.
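
Reviewer note (illustrative, not part of the patch): a minimal sketch of the composition-over-inheritance idea in Decision 4, assuming graders only need to be callable on a session dict. `TinyPipeline`, `LatencyCheck`, `has_final_answer`, and the `final_answer` key are hypothetical names; this is not the SDK's `GraderPipeline`.

```python
from typing import Callable, Dict, List, Tuple

Grader = Callable[[dict], float]


class TinyPipeline:
    """Hypothetical miniature pipeline: holds graders, shares no base class."""

    def __init__(self) -> None:
        self._graders: List[Tuple[str, Grader]] = []

    def add_grader(self, name: str, fn: Grader) -> "TinyPipeline":
        self._graders.append((name, fn))
        return self  # returning self is what enables builder-style chaining

    def run(self, session: dict) -> Dict[str, float]:
        return {name: fn(session) for name, fn in self._graders}


class LatencyCheck:
    """A class-based grader; it is usable because it defines __call__."""

    def __init__(self, threshold_ms: float) -> None:
        self.threshold_ms = threshold_ms

    def __call__(self, session: dict) -> float:
        return 1.0 if session.get("avg_latency_ms", 0) <= self.threshold_ms else 0.0


def has_final_answer(session: dict) -> float:
    """A plain-function grader with no relation to LatencyCheck."""
    return 1.0 if session.get("final_answer") else 0.0


pipeline = (
    TinyPipeline()
    .add_grader("latency", LatencyCheck(threshold_ms=5000))
    .add_grader("answered", has_final_answer)
)
print(pipeline.run({"avg_latency_ms": 1200, "final_answer": "42"}))
# {'latency': 1.0, 'answered': 1.0}
```

The pipeline depends only on the call signature, so a class instance and a bare function can sit side by side without a shared parent type.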
 
 ---
 
@@ -396,7 +396,7 @@ Each field generates a separate `AND` condition with a corresponding `bigquery.S
 
 This module contains two evaluator classes and the SQL templates that power batch evaluation.
 
-#### 4.3.1 `CodeEvaluator`
+#### 4.3.1 `SystemEvaluator`
 
 Deterministic evaluation using code-defined metric functions.
 
@@ -641,7 +641,7 @@ Combines heterogeneous evaluators into a unified verdict using a strategy patter
                              │
               ┌──────────────┼──────────────┐
               ▼              ▼              ▼
-        CodeEvaluator   LLMAsJudge    Custom Fn
+       SystemEvaluator   LLMAsJudge    Custom Fn
         (sync)          (async)        (sync)
               │              │              │
               ▼              ▼              ▼
@@ -1273,7 +1273,7 @@ results = client.query(formatted, job_config=job_config)
 
 ```
 Evaluation
-├── Deterministic (CodeEvaluator)
+├── Deterministic (SystemEvaluator)
 │   ├── Latency
 │   ├── Turn count
 │   ├── Error rate
@@ -1321,7 +1321,7 @@ All evaluation scores in the SDK are normalized to `[0.0, 1.0]`:
 
 | Mode | Evaluator | Where Computation Runs |
 |------|-----------|----------------------|
-| Single session (sync) | `CodeEvaluator.evaluate_session()` | Python |
+| Single session (sync) | `SystemEvaluator.evaluate_session()` | Python |
 | Single session (async) | `LLMAsJudge.evaluate_session()` | Gemini API |
 | Batch via Client | `Client.evaluate()` | BigQuery (SQL + AI.GENERATE) |
 | Trajectory matching | `BigQueryTraceEvaluator.evaluate_session()` | BigQuery (fetch) + Python (matching) |
@@ -1420,7 +1420,7 @@ Synchronous (user-facing):
 ├── Client.drift_detection()
 ├── Client.insights()
 ├── Client.deep_analysis()
-├── CodeEvaluator.evaluate_session()
+├── SystemEvaluator.evaluate_session()
 ├── EvalSuite.*
 ├── EvalValidator.*
 └── BigFramesEvaluator.*
@@ -1480,10 +1480,10 @@ results = await asyncio.gather(*[_run_one(t) for t in tasks])
 
 ## 10. Extensibility & Plugin Points
 
-### 10.1 Custom Metrics (CodeEvaluator)
+### 10.1 Custom Metrics (SystemEvaluator)
 
 ```python
-evaluator = CodeEvaluator(name="custom").add_metric(
+evaluator = SystemEvaluator(name="custom").add_metric(
     name="business_metric",
     fn=lambda session: your_scoring_logic(session),
     threshold=0.7,
@@ -1586,7 +1586,7 @@ All tests mock BigQuery — no GCP credentials or live BigQuery access is needed
 ```
 tests/
 ├── test_sdk_client.py              # Client integration tests
-├── test_sdk_evaluators.py          # CodeEvaluator + LLMAsJudge
+├── test_sdk_evaluators.py          # SystemEvaluator + LLMAsJudge
 ├── test_sdk_trace.py               # Trace/Span reconstruction
 ├── test_sdk_feedback.py            # Drift detection
 ├── test_sdk_insights.py            # Insights pipeline

diff --git a/docs/hatteras_evaluation.md b/docs/hatteras_evaluation.md
@@ -7,7 +7,7 @@ agent sessions into user-defined categories directly against traces stored in
 BigQuery, without relying on an external service.
 
 This should be implemented as a new categorical evaluation subsystem, not as
-an overload of the existing numeric `CodeEvaluator` / `LLMAsJudge` report
+an overload of the existing numeric `SystemEvaluator` / `LLMAsJudge` report
 path.
 
 The goal is to support Hatteras-like functionality inside the SDK:
@@ -22,7 +22,7 @@ The goal is to support Hatteras-like functionality inside the SDK:
 
 Today the SDK supports two major evaluation modes:
 
-- deterministic numeric scoring via `CodeEvaluator`
+- deterministic numeric scoring via `SystemEvaluator`
 - semantic numeric scoring via `LLMAsJudge`
 
 What is missing is a first-class way to answer questions like:
@@ -60,7 +60,7 @@ That capability is useful for:
 This design is not proposing:
 
 - a full clone of an external Hatteras service
-- a replacement for `CodeEvaluator`
+- a replacement for `SystemEvaluator`
 - a replacement for `LLMAsJudge`
 - a new remote function or Python UDF surface in the first phase
 - real-time ingestion-time classification in phase 1

diff --git a/docs/implementation_plan_concept_index_runtime.md b/docs/implementation_plan_concept_index_runtime.md
@@ -165,7 +165,7 @@ Work: `bigquery_ontology/contrib/advertising/` stub with Yahoo's resolver (if co
 - `src/bigquery_ontology/graph_ddl_compiler.py` — add `compile_concept_index(ontology, binding, *, output_table) -> str`. Preserve `compile_graph()` contract byte-identically. No changes to existing function bodies.
 - `src/bigquery_ontology/cli.py:299` — `compile` command gains `--emit-concept-index` and `--concept-index-table` flags. When absent, behavior is byte-identical to today.
 - `src/bigquery_ontology/__init__.py` — add `from .graph_ddl_compiler import compile_concept_index` so the new public function is importable as `from bigquery_ontology import compile_concept_index`, matching the existing pattern for `compile_graph` (`__init__.py:50` today).
-- `src/bigquery_agent_analytics/__init__.py` — add the new public surface to the try/except re-export block (same pattern as `Client`, `CodeEvaluator`, etc.):
+- `src/bigquery_agent_analytics/__init__.py` — add the new public surface to the try/except re-export block (same pattern as `Client`, `SystemEvaluator`, etc.):
   - `OntologyRuntime` from `.ontology_runtime`
   - `EntityResolver`, `ExactMatchResolver`, `SynonymResolver`, `Candidate`, `ResolveResult` from `.entity_resolver`
   - `ConceptIndexMismatchError`, `ConceptIndexProvenanceMissing`, `ConceptIndexInconsistentPair`, `ConceptIndexRefreshed` from `.ontology_runtime`