Auto-Optimize (GSO)

Auto-Optimize is a benchmark-driven optimization pipeline that measures Genie Space accuracy, diagnoses failures, and iteratively applies metadata patches until quality thresholds are met. It is powered by the Genie Space Optimizer (GSO) engine — a separate Python package at packages/genie-space-optimizer/.

Overview

Unlike the Fix Agent (which applies targeted patches from scan findings), Auto-Optimize runs a closed-loop pipeline: it generates benchmarks, evaluates Genie's generated SQL against expected answers using specialized judges, identifies failure patterns, proposes and tests metadata changes, and only commits changes that pass multi-stage evaluation gates.

The 6-Task Pipeline

The optimization runs as a Databricks Lakeflow Job with six sequential tasks:

┌───────────┐   ┌──────────┐   ┌────────────┐   ┌────────────┐   ┌──────────┐   ┌────────┐
│ Preflight │──▶│ Baseline │──▶│ Enrichment │──▶│ Lever Loop │──▶│ Finalize │──▶│ Deploy │
│           │   │   Eval   │   │            │   │            │   │          │   │        │
└───────────┘   └──────────┘   └────────────┘   └────────────┘   └──────────┘   └────────┘
     1               2               3                4               5             6

Task Details

#	Task	Purpose	Key Actions
1	Preflight	Validate prerequisites	Check SP permissions, verify benchmark questions, validate table access, ensure Prompt Registry is enabled
2	Baseline Evaluation	Measure current accuracy	Run all benchmark questions through Genie, evaluate with 9 judges, establish baseline score
3	Enrichment	Gather optimization context	Proactive metadata enrichment — profile tables, analyze query patterns, identify improvement opportunities
4	Lever Loop	Iterative optimization	The core loop: cluster failures → pick levers → generate patches → 3-gate evaluation → accept or rollback
5	Finalize	Consolidate results	Merge accepted patches, validate final configuration, compute final accuracy
6	Deploy	Apply to space	Optionally apply the optimized configuration to the live Genie Space

The 5 Lever Categories

Levers are categories of metadata changes the optimizer can apply. Each lever targets a different aspect of the space configuration:

Lever	Target	Examples
Tables/Columns	Table and column metadata	Add descriptions, synonyms, entity matching, format assistance
Metric Views	Pre-computed metric definitions	Add metric views for common aggregations
TVFs (Table-Valued Functions)	Custom SQL functions	Add TVFs for complex business logic
Join Specs	Table relationship definitions	Add or refine join specifications between tables
Instructions/Example SQL	Behavioral guidance	Add text instructions, example SQL pairs, SQL snippets (filters, measures, expressions)

The lever loop's strategist analyzes current failure patterns and selects the lever category most likely to address them.

3-Gate Evaluation

Before accepting any set of patches, the optimizer runs them through three progressively broader evaluation gates:

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Gate 1:    │────▶│   Gate 2:    │────▶│   Gate 3:    │
│   Slice      │     │   P0         │     │   Full       │
│              │     │              │     │              │
│ Test on the  │     │ Test on high │     │ Test on all  │
│ failing      │     │ priority     │     │ benchmark    │
│ questions    │     │ questions    │     │ questions    │
│ only         │     │              │     │              │
└──────────────┘     └──────────────┘     └──────────────┘

Gate	Scope	Purpose
Slice	Only the questions that currently fail	Quick check — did the patches fix the targeted failures?
P0	High-priority / critical questions	Regression check — did we break anything important?
Full	All benchmark questions	Complete evaluation — net accuracy improvement

If a patch set fails at any gate, it is rolled back and the optimizer tries a different approach. The strategist records the failure in a reflection buffer to avoid retrying the same strategy.

9 Specialized Judges

Accuracy evaluation uses 9 specialized judges that compare Genie's generated SQL against expected benchmark answers:

Each judge evaluates a different dimension of SQL correctness (e.g., table selection, join logic, filter conditions, aggregation, column selection, output format). A question is considered "correct" when it passes the required subset of judges.

Judge prompts are managed via MLflow Prompt Registry, providing version control and traceability for evaluation criteria.

Convergence

The lever loop terminates when one of three conditions is met:

Status	Condition	Meaning
`CONVERGED`	Accuracy target reached (typically ≥ 85%)	Optimization succeeded
`STALLED`	No improvement across consecutive iterations	Further optimization is unlikely
`MAX_ITERATIONS`	Iteration limit reached	Time-boxed stop

The IQ Scanner checks for terminal GSO runs when evaluating checks 11 and 12. A CONVERGED run with best_accuracy ≥ 85% satisfies both checks.

Data Persistence

Auto-Optimize stores all state in 12 Delta tables under GSO_CATALOG.GSO_SCHEMA:

Table	Contents
`genie_opt_runs`	Run metadata: status, accuracy, timestamps, config
`genie_opt_iterations`	Per-iteration evaluation results
`genie_opt_patches`	All patches generated (accepted and rejected)
`genie_opt_suggestions`	Strategist suggestions per iteration
`genie_opt_eval_results`	Detailed per-question evaluation results
`genie_opt_asi_results`	ASI (judge) results per question per iteration
`genie_opt_benchmarks`	Benchmark question definitions
`genie_opt_enrichments`	Proactive enrichment data
`genie_opt_lever_configs`	Lever configuration per run
`genie_opt_space_snapshots`	Space config snapshots (before/after)
`genie_opt_failure_clusters`	Failure pattern clusters
`genie_opt_reflection`	Reflection buffer (what worked, what didn't)

The Workbench frontend reads this data through backend/routers/auto_optimize.py, which queries Lakebase synced tables (preferred) or falls back to direct Delta queries via the SP.

Permission Model

The optimization job runs entirely as the app's Service Principal (SP). See Authentication & Permissions for the full security model, including:

Why jobs can't use OBO
How user authorization is verified before job submission
What SP permissions are required

MLflow Integration

Experiment tracking: each optimization run is tracked as an MLflow experiment
Prompt Registry: judge prompts are versioned in MLflow Prompt Registry, enabling reproducible evaluations
MLFLOW_EXPERIMENT_ID: configured in app.yaml, validated at startup

MLflow Prompt Registry must be enabled on the workspace. If disabled, the preflight task will fail with FEATURE_DISABLED.

Triggering from the UI

Users trigger optimization from the Optimize tab in the Space Detail view:

The UI calls GET /api/auto-optimize/permissions/{space_id} to pre-check SP access
User configures options (apply mode, levers, and optional LLM model) and clicks "Optimize"
POST /api/auto-optimize/trigger starts the job (see trigger flow)
The UI polls GET /api/auto-optimize/runs/{run_id}/status for progress
On completion, the user can review patches and choose to apply or discard

Model Selection

Auto-Optimize can run with a per-run llm_model selected from the curated compatibility list returned by GET /api/models. If the trigger request omits llm_model, the backend uses the workspace-wide LLM_MODEL default.

Selected models are validated against the curated compatibility list before job submission. The shared list also serves Create Agent, so GPT 5.5 endpoints are intentionally excluded until Create Agent supports the Responses API tool-calling shape. The chosen model is stored on genie_opt_runs.llm_model, passed through the GSO job widgets, and shown in optimization history.

Source Files

packages/genie-space-optimizer/ — the GSO engine package
backend/routers/auto_optimize.py — 16 API endpoints for GSO management
backend/services/gso_lakebase.py — synced table reads
backend/main.py — _ensure_gso_job_run_as() startup hook
databricks.yml — job definition for the optimization DAG

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Auto-Optimize (GSO)

Overview

The 6-Task Pipeline

Task Details

The 5 Lever Categories

3-Gate Evaluation

9 Specialized Judges

Convergence

Data Persistence

Permission Model

MLflow Integration

Triggering from the UI

Model Selection

Source Files

Related Documentation

Uh oh!

FilesExpand file tree

07-auto-optimize.md

Latest commit

History

07-auto-optimize.md

File metadata and controls

Auto-Optimize (GSO)

Overview

The 6-Task Pipeline

Task Details

The 5 Lever Categories

3-Gate Evaluation

9 Specialized Judges

Convergence

Data Persistence

Permission Model

MLflow Integration

Triggering from the UI

Model Selection

Source Files

Related Documentation