Status: Proposal
Parent PRD: Unified Analytics Interface
Related: Remote Function Rationale
Date: 2026-03-14
This document proposes adding a second SQL integration path for the BigQuery Agent Analytics SDK based on BigQuery Python UDFs, while retaining the existing Remote Function path as the primary full-fidelity interface.
This is the right product shape.
It is not the right engineering move to replace Remote Function with Python UDF. The two mechanisms solve different problems:
- Remote Function is the right fit for the current multiplexed `agent_analytics(operation STRING, params JSON) RETURNS JSON` contract.
- Python UDF is the right fit for typed, scalar, row-level or vectorized analytical kernels that run directly inside BigQuery.
The recommended approach is:
- Keep Remote Function for the broad SDK surface area.
- Add Python UDF as an independent, complementary interface.
- Reuse SDK logic only where it can be expressed as pure Python kernels over typed inputs.
- Do not attempt to force the current JSON-RPC-like remote function contract into Python UDF as the primary design.
Yes, keeping both is a good approach, but only if the two paths are deliberately separated.
That separation matters because the platform contracts are different:
- Python UDFs are Preview / Pre-GA.
- Python UDFs implement a scalar function in Python.
- Python UDFs do not support `JSON` types.
- Python UDFs can use third-party packages and can access external services only when created with a Cloud resource connection.
- Python UDFs run on BigQuery-managed resources, which removes Cloud Functions / Cloud Run deployment from the architecture.
Official references:
- BigQuery Python UDFs: https://docs.cloud.google.com/bigquery/docs/user-defined-functions-python
- BigQuery UDFs overview: https://docs.cloud.google.com/bigquery/docs/user-defined-functions
- BigQuery Remote Functions: https://docs.cloud.google.com/bigquery/docs/remote-functions
The product decision is therefore:
- Remote Function remains the best interface for the current operation-dispatch design.
- Python UDF should be added only for the subset of SDK capabilities that naturally map to typed UDF signatures and in-engine execution.
The design has to respect the current BigQuery Python UDF contract.
From the current BigQuery Python UDF documentation:
- Runtime is limited to `python-3.11`
- Python UDFs are persistent only; temporary Python UDFs are not supported
- Python UDFs cannot be used in materialized views
- Python UDF query results are not cached
- `JSON`, `RANGE`, `INTERVAL`, and `GEOGRAPHY` types are not supported
- Network access is blocked unless the UDF is created with `WITH CONNECTION`
- PyPI packages and Cloud Storage Python libraries are supported
- Vectorized UDFs are documented for Python, but `OPTIONS(vectorized = true)` is not yet supported in the Python UDF preview (only JavaScript UDFs)
These constraints drive several design decisions:
- The current Remote Function API cannot be ported as-is. Reason: it depends on `JSON` input/output and a multiplexed operation-dispatch contract.
- Full `Client` reuse is not the right goal. Reason: `Client` is built around issuing BigQuery jobs, loading trace rows, and assembling rich Python objects such as `Trace`, `EvaluationReport`, and `InsightsReport`.
- Python UDF support should be built around small analytical kernels, not around a UDF that internally reimplements the whole SDK client.
- Complex outputs should use either:
  - typed scalar return values, or
  - `STRING` returns carrying JSON text when structured output is necessary.
- Session-level analytics will often require a SQL + UDF split: SQL does the scan and aggregation; Python UDF does the reusable scoring or parsing logic.
The SDK should expose two SQL-native paths:
| Path | Best for | Contract shape | Infra model |
|---|---|---|---|
| Remote Function | Full SDK access from SQL | Multiplexed, JSON-based | Cloud Function + BQ connection |
| Python UDF | Direct typed functions in BigQuery | Scalar / vectorized typed UDFs | BigQuery-managed runtime |
Use this positioning:
- Remote Function = broadest capability coverage
- Python UDF = lowest operational friction for deterministic kernels
This avoids a false choice. It also gives users a clean progression:
- Start with Python UDFs for lightweight scoring and canonical event logic.
- Move to Remote Function when they need richer objects, broader method coverage, or JSON-based multi-operation dispatch.
The goal is not "all SDK features in Python UDF." That is not realistic under current BigQuery constraints.
The right goal is:
Support as much of the SDK as possible by extracting the subset of logic that is deterministic, pure-Python, typed, and meaningful at row or batch level.
These parts of the SDK map well to Python UDFs:
| SDK area | Current source | Python UDF fit | Design |
|---|---|---|---|
| Error detection | event_semantics.py | Strong | BOOL helpers such as is_error_event |
| Tool outcome classification | event_semantics.py | Strong | STRING helpers such as tool_outcome |
| Response text extraction | event_semantics.py | Good | parse-wrapper plus STRING extraction from a JSON-formatted STRING payload |
| Latency scoring | evaluators.py | Strong | FLOAT64 score kernel |
| Turn-count scoring | evaluators.py | Strong | FLOAT64 score kernel |
| Error-rate scoring | evaluators.py | Strong | FLOAT64 score kernel |
| TTFT scoring | evaluators.py | Strong | FLOAT64 score kernel |
| Cost scoring | evaluators.py | Strong | wider FLOAT64 score kernel over token and pricing inputs |
These kernels are exactly the kind of logic that benefits from direct SQL invocation with no external deployment surface.
These capabilities can be supported, but not as a single "call the SDK" primitive:
| SDK area | Python UDF fit | Required redesign |
|---|---|---|
| `Client.evaluate(SystemEvaluator, filters)` | Partial | SQL builds per-session summaries first; UDF computes scores from summary fields |
| `Client.deep_analysis()` / question distribution | Partial | SQL does grouping / embeddings / top-k; UDF can help with categorization or normalization |
| `Client.drift_detection()` | Partial | SQL computes set logic; UDF may help with text normalization or thresholding |
| `Client.insights()` | Partial | Best split into SQL extraction + optional UDF post-processing; not a direct port |
These SDK capabilities should remain Remote Function or Python-library-only:
| SDK area | Why Python UDF is a poor fit |
|---|---|
| `Client.get_trace()` / `get_session_trace()` | Rich trace reconstruction, nested objects, and event-row loading do not map cleanly to scalar UDFs |
| `Client.list_traces()` | Returns collections, not scalar computation kernels |
| `Client.doctor()` | Environment and schema diagnostics are not a natural UDF workload |
| `ViewManager` | DDL management is not a UDF responsibility |
| `ContextGraphManager` / GQL / audit export | Multi-table graph workflows and property graph DDL are outside UDF scope |
| Memory service | Stateful retrieval and semantic search are not good scalar UDF targets |
| Trial runner / eval suite / grader pipeline | Orchestration constructs, not in-query scalar kernels |
The Python UDF path should not depend on instantiating the full `Client` inside the UDF body.
Instead, the SDK should expose a new internal layer:
bigquery_agent_analytics/
    client.py               # BigQuery job orchestration
    evaluators.py           # existing evaluator logic
    event_semantics.py      # existing canonical predicates
    udf_kernels.py          # new: pure functions reused by Python UDFs
    udf_serialization.py    # new: STRING envelope helpers if needed
It keeps the contracts clear:
- `Client` remains the orchestration layer
- Remote Function remains the broad SQL bridge
- Python UDFs reuse only pure business logic that is stable and typed
That is maintainable. Reusing the entire client inside a Python UDF is not.
The current evaluator score math is not implemented as standalone top-level
functions today. It lives inside factory-method closures such as
SystemEvaluator.latency() and SystemEvaluator.error_rate() in
evaluators.py.
That means the first implementation step is a deliberate refactor:
- extract each `_score` closure into a top-level pure function in `udf_kernels.py`
- update the existing evaluator factories to call those shared functions
- add parity tests proving exact behavior matches the current implementation
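The refactor steps above can be sketched roughly as follows. This is a minimal illustration, not the SDK's actual code: the names `score_latency` and `latency_evaluator`, the shape of the summary dict, and the ratio-decay formula above the threshold are all assumptions. Only the rule that missing or non-positive input scores 1.0 is taken from the behavior this document describes.

```python
from typing import Callable, Optional


def score_latency(avg_latency_ms: Optional[float], threshold_ms: float) -> float:
    """Hypothetical top-level kernel of the kind udf_kernels.py would hold."""
    # Edge case described in this document: missing or non-positive input -> 1.0.
    if avg_latency_ms is None or avg_latency_ms <= 0:
        return 1.0
    if avg_latency_ms <= threshold_ms:
        return 1.0
    # ASSUMPTION: ratio decay above the threshold; the real formula may differ.
    return max(0.0, threshold_ms / avg_latency_ms)


def latency_evaluator(threshold_ms: float) -> Callable[[dict], float]:
    """Stand-in for an evaluator factory; the closure now only delegates."""

    def _score(session_summary: dict) -> float:
        # All score math lives in the shared kernel, so the Python SDK path
        # and the SQL-facing UDF path cannot drift apart.
        return score_latency(session_summary.get("avg_latency_ms"), threshold_ms)

    return _score
```

A parity test would then feed the same inputs to the pre-refactor closure and to the extracted kernel and require identical outputs.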
Those parity tests must cover the existing edge cases, including the current
"missing or non-positive input returns 1.0" behavior.
The recommended API surface is a family of typed functions, not a single multiplexed UDF.
These should be the first Python UDFs because they are simple, stable, and useful across many queries.
CREATE FUNCTION `PROJECT.UDF_DATASET.bqaa_is_error_event`(
  event_type STRING,
  error_message STRING,
  status STRING
) RETURNS BOOL
LANGUAGE python
...

CREATE FUNCTION `PROJECT.UDF_DATASET.bqaa_tool_outcome`(
  event_type STRING,
  status STRING
) RETURNS STRING
LANGUAGE python
...

CREATE FUNCTION `PROJECT.UDF_DATASET.bqaa_extract_response_text`(
  content_json STRING
) RETURNS STRING
LANGUAGE python
...

`bqaa_extract_response_text` is not a direct lift of the current helper. The existing `extract_response_text()` accepts a parsed Python dict. The Python UDF version therefore needs a thin parse wrapper that accepts `STRING`, calls `json.loads()`, then delegates to the shared extraction helper.
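A minimal sketch of such a parse wrapper is shown here. The `extract_response_text` body is only a stand-in: the `{"response": ...}` payload shape is an assumption made for illustration, and the real lookup rules live in `event_semantics.py`. Mapping malformed or non-object JSON to `NULL` rather than raising is also an assumption.

```python
import json
from typing import Any, Optional


def extract_response_text(content: dict[str, Any]) -> Optional[str]:
    """Stand-in for the shared helper that operates on a parsed dict."""
    # ASSUMPTION: payload shape {"response": "..."}; the real helper may differ.
    value = content.get("response")
    return value if isinstance(value, str) else None


def bqaa_extract_response_text(content_json: Optional[str]) -> Optional[str]:
    """Thin wrapper: STRING in, parse, delegate, STRING (or None/NULL) out."""
    if content_json is None:
        return None
    try:
        parsed = json.loads(content_json)
    except (TypeError, ValueError):
        # ASSUMPTION: malformed JSON yields NULL instead of failing the query.
        return None
    if not isinstance(parsed, dict):
        return None
    return extract_response_text(parsed)
```

Keeping the parsing in the wrapper means the shared helper keeps its current dict-based signature and the Python SDK callers are unaffected.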
These should map directly to the existing SystemEvaluator math:
CREATE FUNCTION `PROJECT.UDF_DATASET.bqaa_score_latency`(
  avg_latency_ms FLOAT64,
  threshold_ms FLOAT64
) RETURNS FLOAT64
LANGUAGE python
...

CREATE FUNCTION `PROJECT.UDF_DATASET.bqaa_score_error_rate`(
  tool_calls INT64,
  tool_errors INT64,
  max_error_rate FLOAT64
) RETURNS FLOAT64
LANGUAGE python
...

CREATE FUNCTION `PROJECT.UDF_DATASET.bqaa_score_turn_count`(
  turn_count INT64,
  max_turns INT64
) RETURNS FLOAT64
LANGUAGE python
...

CREATE FUNCTION `PROJECT.UDF_DATASET.bqaa_score_cost`(
  input_tokens INT64,
  output_tokens INT64,
  max_cost_usd FLOAT64,
  input_cost_per_1k FLOAT64,
  output_cost_per_1k FLOAT64
) RETURNS FLOAT64
LANGUAGE python
...

`bqaa_score_cost` is the widest deterministic kernel in the first batch. It cannot reuse the simpler two- or three-argument signatures used by the other scoring functions because the current `cost_per_session()` evaluator depends on both token counts and pricing parameters.
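To show why the cost kernel needs all five inputs, here is a hedged sketch of what the pure function might look like. The name `score_cost`, the treatment of missing token counts as zero, and the ratio-decay formula above the budget are assumptions; applying the "non-positive input returns 1.0" rule to cost is also an assumption carried over from the other evaluators, not confirmed behavior of `cost_per_session()`.

```python
from typing import Optional


def score_cost(
    input_tokens: Optional[int],
    output_tokens: Optional[int],
    max_cost_usd: float,
    input_cost_per_1k: float,
    output_cost_per_1k: float,
) -> float:
    """Hypothetical cost kernel: token counts plus pricing in, score out."""
    # ASSUMPTION: NULL token counts are treated as zero tokens.
    in_tok = input_tokens or 0
    out_tok = output_tokens or 0

    # Session cost is derived from both token counts and both prices,
    # which is why this kernel is wider than the other score functions.
    cost_usd = (in_tok / 1000.0) * input_cost_per_1k + (
        out_tok / 1000.0
    ) * output_cost_per_1k

    # ASSUMPTION: non-positive cost scores 1.0, mirroring the other kernels.
    if cost_usd <= 0 or cost_usd <= max_cost_usd:
        return 1.0
    # ASSUMPTION: ratio decay once the budget is exceeded.
    return max(0.0, max_cost_usd / cost_usd)
```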
These kernels let users express session evaluation in plain SQL:
WITH session_summary AS (
  SELECT
    session_id,
    AVG(latency_ms) AS avg_latency_ms,
    COUNTIF(event_type = 'TOOL_STARTING') AS tool_calls,
    COUNTIF(event_type = 'TOOL_ERROR') AS tool_errors,
    COUNTIF(event_type = 'USER_MESSAGE_RECEIVED') AS turn_count
  FROM `PROJECT.DATASET.agent_events`
  GROUP BY session_id
)
SELECT
  session_id,
  `PROJECT.UDF_DATASET.bqaa_score_latency`(avg_latency_ms, 5000.0) AS latency,
  `PROJECT.UDF_DATASET.bqaa_score_error_rate`(
    tool_calls, tool_errors, 0.1
  ) AS error_rate,
  `PROJECT.UDF_DATASET.bqaa_score_turn_count`(turn_count, 10) AS turn_score
FROM session_summary;

Where row-wise scalar UDFs become too slow or repetitive, add vectorized Python UDFs that accept a batch of typed columns and return one score column.
This is the right place to support:
- batch latency scoring
- batch cost scoring
- batch normalized labels from free-text fields
For outputs that need richer structure, allow a separate family of UDFs that
return STRING containing JSON text:
CREATE FUNCTION `PROJECT.UDF_DATASET.bqaa_eval_summary_json`(
  avg_latency_ms FLOAT64,
  tool_calls INT64,
  tool_errors INT64,
  turn_count INT64,
  threshold_ms FLOAT64,
  max_error_rate FLOAT64,
  max_turns INT64
) RETURNS STRING
LANGUAGE python
...

This is acceptable as a secondary pattern, but it should not become the primary API style.
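As a rough illustration of the JSON-in-`STRING` pattern, the body of such a function could look like the sketch below. The helper `_ratio_score`, the function name `eval_summary_json`, the output key names, and the scoring formula are all hypothetical stand-ins for the shared kernels; only the argument list follows the signature above.

```python
import json
from typing import Optional


def _ratio_score(value: Optional[float], limit: float) -> float:
    """Illustrative stand-in for the shared score kernels (formula assumed)."""
    if value is None or value <= 0:
        return 1.0
    return 1.0 if value <= limit else max(0.0, limit / value)


def eval_summary_json(
    avg_latency_ms: Optional[float],
    tool_calls: Optional[int],
    tool_errors: Optional[int],
    turn_count: Optional[int],
    threshold_ms: float,
    max_error_rate: float,
    max_turns: int,
) -> str:
    """Return all three scores as one JSON-formatted string."""
    # Error rate is undefined without tool calls; treat it as missing.
    error_rate = None
    if tool_calls and tool_errors is not None:
        error_rate = tool_errors / tool_calls

    summary = {
        "latency": _ratio_score(avg_latency_ms, threshold_ms),
        "error_rate": _ratio_score(error_rate, max_error_rate),
        "turn_count": _ratio_score(turn_count, max_turns),
    }
    # sort_keys keeps the text stable across calls, which helps diffing.
    return json.dumps(summary, sort_keys=True)
```

On the SQL side, callers would unpack the result with BigQuery's JSON functions over strings (for example `JSON_VALUE`), which is the extra parsing cost that makes this a secondary pattern.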
These anti-patterns should be explicitly avoided.
Avoid:
-- Not recommended
CREATE FUNCTION fn(operation STRING, params STRING) RETURNS STRING

Why this is the wrong default:
- it recreates a JSON-RPC envelope without JSON type support
- it loses type safety
- it is harder to document and optimize
- it hides which inputs are actually required
- it becomes a second dispatch surface to maintain
The Python UDF should not become a wrapper that:
- accepts a session id
- creates a BigQuery client
- submits another BigQuery job
- waits for the result
- returns a string
That would be operationally fragile, hard to reason about, and contrary to the point of in-engine computation.
Python UDF should be documented as a subset interface. If the docs imply "everything the SDK can do is now callable directly as a Python UDF," the docs will overpromise and drift.
If this proposal is accepted, the implementation should add a separate deploy surface:
deploy/
    remote_function/
    python_udf/
        register.sql
        generate_sql.py
        README.md
examples/
    python_udf_evaluation.sql
    python_udf_event_semantics.sql
src/bigquery_agent_analytics/
    udf_kernels.py
    udf_sql_templates.py
tests/
    test_udf_kernels.py
    test_udf_sql_generation.py
docs/
    python_udf_support_design.md
The `udf_kernels.py` module should contain only:
- pure functions
- typed inputs and outputs
- no BigQuery client creation
- no file IO
- no environment-dependent behavior
That gives one source of truth for both Python and SQL-facing implementations.
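One way the "single source of truth" could work in practice is for a generator script to embed kernel source text in the registration DDL. The sketch below is illustrative only: `render_create_function` is a hypothetical helper, and the `runtime_version` and `entry_point` option names reflect my reading of the BigQuery Python UDF documentation, so they should be checked against the current docs before use.

```python
import textwrap


def render_create_function(
    function_path: str,
    args: list[tuple[str, str]],
    return_type: str,
    entry_point: str,
    kernel_source: str,
) -> str:
    """Render one CREATE FUNCTION statement around a kernel's source text."""
    arg_list = ",\n".join(f"  {name} {sql_type}" for name, sql_type in args)
    body = textwrap.dedent(kernel_source).strip()
    # The body is wrapped in a raw triple-quoted SQL string, so the kernel
    # source itself must not contain three consecutive single quotes.
    return (
        f"CREATE OR REPLACE FUNCTION `{function_path}`(\n{arg_list}\n"
        f") RETURNS {return_type}\n"
        "LANGUAGE python\n"
        # ASSUMPTION: option names as documented for the Python UDF preview.
        f'OPTIONS (runtime_version = "python-3.11", '
        f'entry_point = "{entry_point}")\n'
        f"AS r'''\n{body}\n''';"
    )
```

Because the function takes source text rather than importing the kernel, the same rendered statement can be asserted on in `test_udf_sql_generation.py` without a BigQuery connection.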
This is the realistic support target.
| SDK capability | Remote Function | Python UDF | Recommended path |
|---|---|---|---|
| Trace reconstruction | Full | No | Remote Function / Python SDK |
| Deterministic score math | Full | Full | Both |
| Canonical event predicates | N/A | Full | Python UDF |
| Session-level eval from pre-aggregated SQL | Good | Good | Both |
| LLM judge | Good | Maybe, but not preferred | Remote Function or BigQuery SQL AI |
| Insights extraction | Good | Partial | Remote Function or SQL AI pipeline |
| Drift detection | Good | Partial | Remote Function plus SQL primitives |
| Context graph / GQL / audit | Full | No | Remote Function / Python SDK |
| View creation | No | No | CLI / Python SDK |
| Health diagnostics | No | No | CLI / Python SDK |
Python UDF should be described as:
"Fastest path for deterministic, typed analytics inside BigQuery."
Remote Function should still be described as:
"Broadest SQL interface for SDK capabilities."
- Add `udf_kernels.py`
- Move reusable evaluator math into standalone pure functions
- Move reusable event semantic helpers into a UDF-safe layer
- Add unit tests proving parity with existing `SystemEvaluator` behavior
- Register event semantics UDFs
- Register deterministic score UDFs
- Publish example SQL for session scoring
- Document region-replication guidance for utility datasets
Note: BigQuery UDFs are region-scoped. The BigQuery UDF docs recommend maintaining UDFs in each region, or using dataset replication for utility datasets.
- Deferred: BigQuery Python UDF preview does not support `OPTIONS(vectorized = true)`. The option is currently only available for JavaScript UDFs. Vectorized Python UDFs will be added when BigQuery extends `vectorized` support to the Python runtime.
- When available: add batch-oriented UDFs where scalar calls become inefficient; benchmark against pure SQL expressions and Remote Function equivalents.
- Add JSON-string return helpers only where the structure is worth the extra parsing overhead
The docs should say this explicitly:
- Python UDF support is complementary
- Remote Function support remains first-class
- Python UDF support is aimed at deterministic and typed kernels
- Complex, orchestration-heavy, or JSON-heavy SDK features remain better served by Remote Function or direct Python usage
Recommended wording:
The SDK supports two SQL-native execution paths. Use Remote Function for the broadest method coverage and JSON-shaped outputs. Use Python UDFs for direct in-BigQuery execution of deterministic analytical kernels with typed inputs.
Yes, adding independent Python UDF support is a good approach.
But only under these conditions:
- Remote Function stays in place as the full-surface SQL bridge.
- Python UDF support is positioned as a parallel subset interface.
- The Python UDF API is typed and function-family-based, not a second JSON-RPC dispatcher.
- The implementation reuses pure kernels, not the full `Client` orchestration layer.
If those constraints are followed, Python UDF support improves the SDK:
- less deployment overhead for deterministic SQL-native users
- better fit for direct BigQuery evaluation patterns
- no damage to the existing Remote Function design
If those constraints are not followed, Python UDF support will become a second, worse Remote Function implementation.
The next step should be a narrow implementation plan, not immediate broad feature parity.
Recommended first cut:
- Event semantics UDFs
- Deterministic score kernels for `latency`, `error_rate`, `turn_count`, `ttft`, and `cost`
- Example SQL showing session pre-aggregation plus UDF scoring
- Benchmark and doc comparison versus Remote Function
That is the highest-confidence path to "support as much as possible" without building the wrong abstraction.