Operations Guide

This guide covers day-to-day operations for a deployed Genie Workbench instance: Lakebase management, MLflow configuration, monitoring, and GSO job management.

Lakebase

Schema and Tables

On first startup, the app creates the genie schema and all of its tables; the app's service principal (SP) owns everything it creates. Data is stored in the databricks_postgres database:

| Table | Purpose |
| --- | --- |
| scan_results | IQ scan history: score, maturity, checks, findings, timestamps |
| starred_spaces | User-starred spaces for quick access |
| seen_spaces | Tracks which spaces the user has visited |
| optimization_runs | Legacy optimization accuracy records (used by scanner checks 11–12) |
| agent_sessions | Create agent session persistence (message history, step state) |

Lakebase state is tied to the Databricks App service principal that created these objects. For normal updates, keep the same app instance and update through the same install path: ./scripts/deploy.sh --update for local terminal installs, or rerun notebooks/install.py for notebook installs. If you create a new app instance, use a fresh Lakebase project instead of pointing the new app at the old app's genie schema.

Credential Refresh

Lakebase credentials are auto-generated via the Databricks SDK (postgres.generate_database_credential for autoscaling, database.generate_database_credential for provisioned). These OAuth tokens expire after ~1 hour, so the app recreates the asyncpg connection pool every 50 minutes to stay ahead of expiration.
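The refresh cycle described above can be sketched as follows. This is a minimal illustration, not the app's actual code: `RefreshingPool` and `make_pool` are hypothetical names, and `make_pool` stands in for "generate a fresh credential and open a new asyncpg pool with it". Only the 50-minute interval comes from the text.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional

REFRESH_INTERVAL_SECONDS = 50 * 60  # stay ahead of the ~1 hour token expiry


class RefreshingPool:
    """Holds a connection pool and swaps in a fresh one on a fixed interval.

    `make_pool` is a hypothetical async factory. In a real app it would
    generate a new database credential and open a new pool with it.
    """

    def __init__(
        self,
        make_pool: Callable[[], Awaitable[Any]],
        interval: float = REFRESH_INTERVAL_SECONDS,
    ) -> None:
        self._make_pool = make_pool
        self._interval = interval
        self.pool: Optional[Any] = None
        self._task: Optional[asyncio.Task] = None

    async def start(self) -> None:
        self.pool = await self._make_pool()
        self._task = asyncio.create_task(self._refresh_loop())

    async def _refresh_loop(self) -> None:
        while True:
            await asyncio.sleep(self._interval)
            try:
                new_pool = await self._make_pool()
            except Exception as exc:
                # Keep serving from the old pool and retry next interval.
                print(f"credential refresh failed: {exc}")
                continue
            old_pool, self.pool = self.pool, new_pool
            if old_pool is not None:
                await old_pool.close()  # assumes an async close(), as on asyncpg pools

    async def stop(self) -> None:
        if self._task is not None:
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass
```

Building the new pool before closing the old one means a failed credential call leaves the existing pool in place until its token actually expires, which matches the "Connection errors after ~1 hour" symptom in the troubleshooting table.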

Graceful Degradation

If LAKEBASE_HOST is not configured (no Lakebase attached), the app falls back to in-memory dictionaries. All features keep working, but nothing is persisted:

  • Scan results are lost on restart
  • Starred spaces are lost on restart
  • Agent sessions are lost on restart
  • The Admin Dashboard shows no historical data
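A minimal sketch of this fallback, assuming the only switch is whether LAKEBASE_HOST is set. `InMemoryScanStore` and `lakebase_configured` are hypothetical names used for illustration; the app's real storage layer may be organized differently.

```python
import os
from typing import Dict, List, Mapping, Optional


class InMemoryScanStore:
    """Fallback store: results live in a dict and vanish on restart."""

    def __init__(self) -> None:
        self._results: Dict[str, List[dict]] = {}

    def save(self, space_id: str, result: dict) -> None:
        self._results.setdefault(space_id, []).append(result)

    def history(self, space_id: str) -> List[dict]:
        # Return a copy so callers cannot mutate the stored list.
        return list(self._results.get(space_id, []))


def lakebase_configured(env: Optional[Mapping[str, str]] = None) -> bool:
    """True when LAKEBASE_HOST is set to a non-empty value."""
    env = os.environ if env is None else env
    return bool(env.get("LAKEBASE_HOST", "").strip())
```

A caller would check `lakebase_configured()` once at startup and construct either the database-backed store or `InMemoryScanStore()`; because both expose the same methods, the rest of the app does not need to know which one it received.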

Troubleshooting Lakebase

| Symptom | Cause | Fix |
| --- | --- | --- |
| "Failed to list spaces" | Lakebase not attached | Re-run deploy.sh --update or rerun notebooks/install.py with Lakebase enabled |
| Connection errors after ~1 hour | Token refresh failed | Check app logs for credential generation errors |
| Tables not created | SP lacks CONNECT or CREATE ON DATABASE | Re-run deploy.sh --update or rerun notebooks/install.py to re-create the SP role and grants |
| permission denied for sequence scan_results_id_seq | New app is reusing Lakebase objects owned by an older app SP | Reuse the original app instance or move the new app to a fresh Lakebase project |

MLflow

Experiment Tracking

LLM calls in the fix agent, create agent, and optimization pipeline are traced via MLflow. Tracing is optional and is controlled by the MLFLOW_EXPERIMENT_ID environment variable in app.yaml.

At startup, the app validates that the experiment ID exists in the workspace. If it doesn't, tracing is silently disabled (the variable is cleared).
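The startup check can be sketched like this. It is an illustration under stated assumptions, not the app's code: `validate_experiment` is a hypothetical name, and `experiment_exists` is a caller-supplied lookup (for example, a thin wrapper around an MLflow client call) so the sketch does not depend on a particular MLflow API.

```python
import os
from typing import Callable, MutableMapping, Optional


def validate_experiment(
    experiment_exists: Callable[[str], bool],
    env: Optional[MutableMapping[str, str]] = None,
) -> bool:
    """Return True if tracing stays enabled, False if it is off or was disabled.

    `experiment_exists` is a hypothetical lookup supplied by the caller.
    """
    env = os.environ if env is None else env
    experiment_id = env.get("MLFLOW_EXPERIMENT_ID", "").strip()
    if not experiment_id:
        return False  # tracing was never enabled
    try:
        found = experiment_exists(experiment_id)
    except Exception:
        found = False  # treat lookup errors like a missing experiment
    if not found:
        # Clear the variable so later code sees tracing as disabled.
        env.pop("MLFLOW_EXPERIMENT_ID", None)
        return False
    return True
```

Because the variable is cleared without raising, a mistyped experiment ID shows up only as missing traces, so it is worth confirming the ID in the workspace when traces do not appear.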

Prompt Registry

Auto-Optimize requires MLflow Prompt Registry for versioned judge prompts. If Prompt Registry is not enabled on the workspace, the optimization preflight task will fail with FEATURE_DISABLED.

Configuration

# In app.yaml
- name: MLFLOW_TRACKING_URI
  value: "databricks"
- name: MLFLOW_REGISTRY_URI
  value: "databricks-uc"
- name: MLFLOW_EXPERIMENT_ID
  value: "<your-experiment-id>"

The experiment ID is workspace-specific. The local terminal installer can create one during setup. The notebook installer leaves Create Agent tracing dormant and deploys with no app-level experiment ID. You can still enable tracing manually by setting MLFLOW_EXPERIMENT_ID in app.yaml before deploying.

Monitoring

App Logs

databricks apps logs <app-name> --profile <profile>

App Status

databricks apps get <app-name> --profile <profile>

Verify Workspace Files

databricks workspace list /Workspace/Users/<email>/<app-name>/backend --profile <profile>

Key Log Patterns

| Log Pattern | Meaning |
| --- | --- |
| OBO: using user token for /api/... | Request authenticated via user's OBO token |
| OBO: no x-forwarded-access-token, using SP | No user token; using SP (expected for health checks) |
| OBO token lacks genie scope, retrying with service principal | Genie API scope fallback triggered |
| Lakebase pool created | Database connection established |
| Lakebase pool re-created (credential refresh) | Scheduled 50-minute token refresh |
| Failed to persist scan result | Lakebase write failed (check connectivity) |
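To get a quick picture of how often each pattern occurs, saved log output can be tallied with a small script. This is a sketch: the substrings are taken from the patterns listed above, while the short labels and the `summarize` function are hypothetical names chosen for illustration.

```python
from collections import Counter
from typing import Iterable

# Substring to look for -> short label (labels are illustrative only).
LOG_PATTERNS = {
    "OBO: using user token for": "user_token",
    "OBO: no x-forwarded-access-token, using SP": "sp_no_user_token",
    "OBO token lacks genie scope": "sp_scope_fallback",
    "Lakebase pool created": "pool_created",
    "Lakebase pool re-created": "pool_refreshed",
    "Failed to persist scan result": "persist_failed",
}


def summarize(lines: Iterable[str]) -> Counter:
    """Count how many lines match each known pattern (first match wins)."""
    counts: Counter = Counter()
    for line in lines:
        for needle, label in LOG_PATTERNS.items():
            if needle in line:
                counts[label] += 1
                break
    return counts
```

For example, after redirecting the output of the App Logs command to a file, `summarize(open("app.log"))` returns the per-pattern counts. A non-zero `persist_failed` count, or `pool_refreshed` entries that stop appearing, points back to the Lakebase troubleshooting table.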

GSO Job Management

Job Creation

The optimization job is created automatically by the active install path. Local terminal installs use deploy.sh and databricks bundle deploy -t app, with Terraform state scoped to the deployer. Notebook installs create or reset the same gso-optimization-job through the SDK/Jobs API from generated workspace assets.

Job Reuse

If the job already exists (from a previous deploy), it is reused. To force recreation:

  1. Delete the job in the Databricks UI
  2. Re-run ./scripts/deploy.sh --update for local terminal installs, or rerun notebooks/install.py for notebook installs

ensure_job_run_as Self-Healing

At app startup, _ensure_gso_job_run_as() checks that the optimization job's run_as matches the current app SP. If they don't match (e.g., the app was redeployed with a different SP), the job is automatically updated. This avoids manual reconfiguration when the app identity changes.
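The comparison at the heart of this check can be sketched as a pure function. This is not the body of _ensure_gso_job_run_as(); `run_as_patch` is a hypothetical helper, and it assumes job settings shaped like the Jobs API, where run_as carries a service_principal_name field.

```python
from typing import Optional


def run_as_patch(job_settings: dict, app_sp: str) -> Optional[dict]:
    """Return the settings patch that makes the job run as `app_sp`.

    Returns None when the job already runs as that service principal,
    so the caller can skip the update call entirely.
    """
    current = (job_settings.get("run_as") or {}).get("service_principal_name")
    if current == app_sp:
        return None
    return {"run_as": {"service_principal_name": app_sp}}
```

A caller would fetch the job's current settings, pass them in with the app SP's identifier, and send the returned patch as the new settings of a job update only when it is not None, which keeps a startup with a matching run_as free of write calls.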

Bundle Management

For local terminal installs, the GSO job is managed by Databricks Asset Bundles (DABs):

# Deploy/update the job (done automatically by deploy.sh)
databricks bundle deploy -t app --profile <profile>

Important: Do NOT run databricks bundle deploy -t dev for production deployments. It creates orphan jobs prefixed with [dev username] that have their own, separate Terraform state.

The app target uses mode: development to get per-deployer Terraform state, and sets presets.name_prefix: "" so that job names carry no prefix.

For notebook installs, the GSO job is managed by scripts.deploy_lib.gso_job with Jobs API reset/update semantics. It uploads notebooks under /Workspace/Users/<user>/.genie-workbench-deploy/<app-name>/gso/jobs and stores the GSO wheel in the UC volume under /Volumes/<catalog>/genie_space_optimizer/app_artifacts/.

Post-Deploy: Genie Space Access

After deploying, the app's SP needs access to Genie Spaces for API fallback and optimization:

  1. The installer grants the SP access to your existing Genie Spaces
  2. For spaces created after install, share them with the SP (CAN_MANAGE)
  3. Grant the SP SELECT on referenced schemas:
GRANT SELECT ON SCHEMA <catalog>.<schema> TO `<service-principal-name>`;

See Authentication & Permissions for the full permission model.

Related Documentation