Proposal: Automated Agent Quality Scorecard #63

Description

@haiyuan-eng-google

Author: Gayathri Radhakrishnan

Date: April 20, 2026

1. Executive Summary

This proposal aims to evolve the BigQuery Agent Analytics (BQ AA) platform from a reactive diagnostic tool into a proactive fleet-management system. The project implements a systematic "AI Judge" that automatically evaluates and grades every agent interaction, turning raw telemetry into actionable performance KPIs.

2. The "Evaluation Gap"

The current Closed-Loop RCA is an industry-leading tool for deep-diving into why a specific session failed. However, as agent deployments scale, manual root-cause analysis becomes a bottleneck. Organizations need a way to:

  • Identify high-performing vs. low-performing agent versions at a glance.
  • Monitor global quality trends without manual intervention.
  • Flag policy or safety violations in real time across thousands of logs.

3. Proposed Solution: The Quality Scorecard

I propose building a modular evaluation pipeline that sits on top of the BigQuery event logs. This system will utilize BigQuery’s native AI capabilities (AI.GENERATE) to "grade" sessions across three key pillars:

  • Helpfulness Score (1–5): Did the agent resolve the user’s intent effectively?
  • Accuracy & Grounding (1–5): Did the agent use the available tools correctly and avoid hallucinations?
  • Policy Compliance (Pass/Fail): Did the response adhere to GRC standards (e.g., no PII leakage, authorized tool usage)?
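
To make the three pillars concrete, here is a minimal Python sketch of how a judge prompt and its structured output might be handled. It assumes the judge is asked to return JSON. The prompt wording, the `Scorecard` type, and the field names `helpfulness`, `accuracy`, and `policy_pass` are all hypothetical, since the proposal does not fix a prompt or an output schema. The model call itself (for example via AI.GENERATE) is left out.

```python
import json
from dataclasses import dataclass

# Hypothetical prompt template; the proposal does not specify the judge prompt.
JUDGE_PROMPT = (
    "You are grading one agent session.\n"
    "Return JSON with keys: helpfulness (integer 1-5), "
    "accuracy (integer 1-5), policy_pass (true or false).\n"
    "Session transcript:\n{transcript}"
)


@dataclass(frozen=True)
class Scorecard:
    session_id: str
    helpfulness: int   # 1-5: did the agent resolve the user's intent?
    accuracy: int      # 1-5: correct tool use, no hallucinations
    policy_pass: bool  # Pass/Fail: policy compliance


def build_prompt(transcript: str) -> str:
    """Fill the (assumed) judge prompt with one session's transcript."""
    return JUDGE_PROMPT.format(transcript=transcript)


def parse_judge_response(session_id: str, raw: str) -> Scorecard:
    """Parse the judge's JSON reply and reject out-of-range grades."""
    data = json.loads(raw)
    for key in ("helpfulness", "accuracy"):
        value = data[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{key} must be an integer from 1 to 5, got {value!r}")
    if not isinstance(data["policy_pass"], bool):
        raise ValueError("policy_pass must be true or false")
    return Scorecard(session_id, data["helpfulness"], data["accuracy"], data["policy_pass"])
```

Validating the reply before storing it keeps malformed or out-of-range judge output from silently skewing the aggregated metrics.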

4. Key Features & Flexibility

  • Data-Agnostic Design: The evaluation logic is decoupled from specific table names. It can be pointed at any existing logs table or a fresh "v4" schema, and requires only standard session_id and content fields to function.
  • Fleet-Level Benchmarking: Aggregates scores into a "Leaderboard" view, allowing the team to compare performance across different regions, model versions, or system prompts.
  • Automated Triage: Flags sessions that score below a configurable threshold for immediate human-in-the-loop (HITL) review.
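
The leaderboard and triage features above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the proposed implementation: each score is a plain dict, the threshold of 3 is a placeholder, the field names (`helpfulness`, `accuracy`, `policy_pass`, `model_version`) are hypothetical, and treating a policy failure as an automatic review trigger is an assumption the proposal does not state. In practice this logic would more likely live in SQL over the scored table.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical cutoff; the proposal leaves the threshold unspecified.
REVIEW_THRESHOLD = 3


def needs_review(score: dict, threshold: int = REVIEW_THRESHOLD) -> bool:
    """Flag a session for HITL review on a low score or a policy failure."""
    return (
        score["helpfulness"] < threshold
        or score["accuracy"] < threshold
        or not score["policy_pass"]  # assumption: any policy failure is reviewed
    )


def leaderboard(scores: list[dict], group_by: str) -> list[dict]:
    """Average scores per group (e.g. model version), best helpfulness first."""
    groups = defaultdict(list)
    for score in scores:
        groups[score[group_by]].append(score)
    rows = []
    for key, members in groups.items():
        rows.append({
            group_by: key,
            "sessions": len(members),
            "avg_helpfulness": mean(s["helpfulness"] for s in members),
            "avg_accuracy": mean(s["accuracy"] for s in members),
            "policy_pass_rate": mean(1 if s["policy_pass"] else 0 for s in members),
        })
    return sorted(rows, key=lambda r: r["avg_helpfulness"], reverse=True)
```

Passing the grouping column as a parameter mirrors the data-agnostic goal: the same aggregation works whether sessions are compared by region, model version, or system prompt.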

5. Technical Impact for the Team

  • Showcases Platform Capability: Demonstrates the power of using BigQuery as a Governance and Evaluation engine, not just a storage layer.
  • Zero Infrastructure Friction: Operates entirely within the BigQuery ecosystem—no external APIs, new IAM permissions, or complex deployments required.
  • Modular Architecture: The "Judge" logic can be reused as a template for enterprise customers looking to build their own internal audit trails.

6. Implementation Roadmap

  • Phase 1: Develop the SQL-based "AI Judge" prompt and test on a sample dataset.
  • Phase 2: Create the aggregated agent_quality_metrics table for reporting.
  • Phase 3: Integrate a "Global Agent Health" visualization into the existing dashboard.
