Release v1.10.0: SDK auth migration, summarization agent, and Lakebase fixes by forrestmurray-db · Pull Request #125 · databricks-solutions/project-0xfffff

forrestmurray-db · 2026-04-13T18:40:48Z

Summary

SDK Auth Migration: Replace manual token storage with Databricks SDK-based authentication (resolve_databricks_token()). Removes DatabricksTokenDB model, token input fields, and DATABRICKS_TOKEN env var mutations. All services now use SDK auth.
Summarization Agent Overhaul: Refactor summarization to a tool-based agent with span data resolution. Add facilitator visibility into summarization status/results, job tracking via SummarizationJob table, and resummarize capability.
Lakebase Fixes: Switch to do_connect token injection, fix connection pool settings, and update specs with pool requirements and service principal permissions.
Docs: Update facilitator guide for Lakebase and Git-based deployment, fix setup prerequisites.
Bug Fixes: Deduplicate convertTraceToTraceData for summary propagation, handle databricks_host with existing https:// prefix, resolve available-models without mlflow intake config.

Changes (63 files, +4839 / -1206)

Auth (12 commits)

Add resolve_databricks_token() utility using Databricks SDK
Remove DatabricksTokenDB model and databricks_tokens table
Remove token input fields from IntakePage and DBSQLExportPage
Replace token_storage patterns across all services and routers
Update TypeScript models and service docstrings

Summarization (7 commits)

Refactor to tool-based agent with span data resolution
Add facilitator visibility into summarization status and results
New SummarizationJob model and migration (0018)
Fix summary propagation through convertTraceToTraceData
Use SDK auth and separate DB session for background tasks

Lakebase & Database (3 commits)

Switch to do_connect token injection for Lakebase
Fix pool settings for Databricks SQL connections
Update specs with connection pool requirements

Docs (4 commits)

Update facilitator guide for Lakebase and Git-based deployment
Add service principal permissions to AUTHENTICATION_SPEC
Fix Lakebase setup prerequisites

Test plan

Verify SDK auth works end-to-end (token resolution, service initialization)
Test summarization agent with tool-based flow
Confirm facilitator dashboard shows summarization status
Verify Lakebase connection pool behavior
Run just test-server — all backend tests pass
Run just ui-test-unit — all frontend tests pass
Run just e2e — end-to-end tests pass

🤖 Generated with Claude Code

Replace the hardcoded MODEL_MAPPING with a live API call to Databricks serving-endpoints. The backend uses async httpx to avoid blocking the event loop, and the frontend fetches models via useAvailableModels and builds options dynamically with buildModelOptions. All components now store and pass endpoint names directly instead of translating between display names and backend names. Also switches model prefetching from an eager useEffect in WorkflowContext to intent-based prefetchQuery on hover/focus of navigation buttons, and clears Databricks auth env vars that can override token auth in the MLflow intake service. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace stale hasMlflowConfig references in DiscoveryAnalysisTab with modelOptions.length checks to match the switch to dynamic model listing. Fix discovery-complete endpoint returning 404 for facilitators whose workshop_id is NULL by also checking against workshop.facilitator_id. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Prevent worktree contents from being tracked. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…vice init Add a public resolve_databricks_token() function that uses the Databricks SDK for auth (service principal on Apps, CLI profile locally) with a fallback to DATABRICKS_TOKEN env var. Remove the token_storage/db_service fallback chain from DatabricksService.__init__. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
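
For reviewers, a minimal sketch of the resolution order described above (SDK auth first, DATABRICKS_TOKEN as the fallback). This is not the repository's resolve_databricks_token(): the name resolve_token_sketch and the injected sdk_headers callable are assumptions made so the example runs without the SDK installed.

```python
import os
from typing import Callable, Mapping, Optional


def resolve_token_sketch(
    sdk_headers: Callable[[], Mapping[str, str]],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return a bearer token from SDK-style auth headers, else DATABRICKS_TOKEN.

    `sdk_headers` is a hypothetical stand-in for whatever produces the
    Authorization header (service principal on Apps, CLI profile locally).
    """
    env = os.environ if env is None else env
    try:
        auth = sdk_headers().get("Authorization", "")
        if auth.startswith("Bearer "):
            return auth[len("Bearer "):]
    except Exception:
        # SDK auth is unavailable in this environment; try the env var next.
        pass
    token = env.get("DATABRICKS_TOKEN", "")
    if not token:
        raise RuntimeError("No Databricks credentials available")
    return token
```

In a deployment, sdk_headers would presumably wrap the Databricks SDK (for example WorkspaceClient().config.authenticate, assuming it returns request headers); injecting it keeps the fallback logic easy to unit test.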

MLflow uses whatever Databricks auth the SDK provides. Stop setting DATABRICKS_TOKEN in the environment — only set DATABRICKS_HOST so the SDK knows which workspace to target. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Mark databricks_token as deprecated with empty default in Python models (MLflowIntakeConfig, MLflowIntakeConfigCreate, DBSQLExportRequest, DatabricksConfig) and optional in TypeScript models. SDK auth is used instead. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…outer Replace 10+ token_storage.get_token / db_service.get_databricks_token fallback chains with resolve_databricks_token(). Remove all os.environ["DATABRICKS_TOKEN"] mutations. Update test mocks to patch resolve_databricks_token instead of token_storage. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…outers Update discovery_service (7 refs), judge_service, draft_rubric_grouping, database_service, databricks router, dbsql_export router. Remove set/get_databricks_token methods from database_service. Update test mocks to patch resolve_databricks_token. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove the token persistence infrastructure:
- DatabricksTokenDB SQLAlchemy model from database.py
- databricks_tokens from postgres_manager ALLOWED_TABLES and CREATE TABLE
- DatabricksTokenDB import from database_service.py
- test_token_storage_service.py (5 tests for deleted functionality)
- Update postgres_manager test expectations
token_storage_service.py is kept for Custom LLM API key storage.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…Page Users no longer need to provide Databricks tokens — the backend uses SDK auth (service principal on Apps, CLI profile locally). Remove all token state, localStorage persistence, form fields, and validation from both pages. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove os.environ["DATABRICKS_TOKEN"] and DATABRICKS_CLIENT_ID/SECRET pop() calls from alignment_service, judge_service, dbsql_export_service, and database_service. The SDK handles auth automatically — only DATABRICKS_HOST needs to be set for MLflow to know which workspace. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

AUTHENTICATION_SPEC:
- Rewrite Architecture Context to describe the two-layer model accurately
- Add new "Databricks API Authentication" section with token resolution contract, environment-specific behavior, MLflow auth, and what was removed
- Add "Future: Per-User Auth" subsection for OBO pattern
- Add 8 success criteria for Databricks API auth
- Mark SDK Auth Migration as complete in implementation log
BUILD_AND_DEPLOY_SPEC:
- Mark DATABRICKS_TOKEN as optional (SDK auth preferred) in env vars table
- Update Databricks Apps Authentication section to reference resolve_databricks_token() and link to AUTHENTICATION_SPEC
JUDGE_EVALUATION_SPEC:
- Fix troubleshooting note: "host, token" → "host, experiment ID + SDK auth"
- Add SDK Auth Migration to implementation log
README.md:
- Add keyword index entries: PAT, SDK auth, resolve_databricks_token, service principal, DATABRICKS_TOKEN, DATABRICKS_CLIENT_ID, OAuth, CLI profile
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Document the Databricks resources the app's service principal needs access to: MLflow Experiment (Can edit), Model Serving Endpoints (Can query), SQL Warehouse (Can use), Unity Catalog Volume (Can read and write). Note which are required vs optional. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Lakebase (PostgreSQL) is the primary production database. Its OAuth tokens are refreshed via WorkspaceClient().config.oauth_token() every 15 minutes. Split permissions into core (Lakebase, MLflow, Serving Endpoints) vs optional (SQL Warehouse, UC Volume). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

AUTHENTICATION_SPEC:
- Add "Lakebase Connection Pool" section with token lifecycle, do_connect injection pattern, required pool settings, credential API, and setup prerequisites, all with links to Databricks docs
- Update Lakebase row in permissions table to reference generate_database_credential
- Add 7 Lakebase connection pool success criteria
- Add implementation log entry
BUILD_AND_DEPLOY_SPEC:
- Add Lakebase env vars (PGHOST, PGDATABASE, PGUSER, PGPORT, PGSSLMODE, PGAPPNAME, ENDPOINT_NAME, DATABASE_ENV) to environment variables table
- Add implementation log section with SDK auth and Lakebase pool entries
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ings Replace the creator-based connection factory with the recommended do_connect event pattern from Databricks docs. Key changes:
- OAuthTokenManager → LakebaseCredentialManager using generate_database_credential(endpoint=ENDPOINT_NAME) API
- Token injection via do_connect event (not creator callable)
- pool_recycle: 300s → 3600s (was causing excessive connection churn)
- pool_pre_ping: True → False (conflicts with do_connect injection)
- max_overflow: 10 → 5 (caps at 20 total across 2 workers)
- postgres_manager: pool created once with custom OAuthConnection class, never recreated on token refresh
- database.py: _reset_connection_pool no longer calls force_refresh
Reference: https://docs.databricks.com/aws/en/lakebase/connect/custom-app.html
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
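
To illustrate the do_connect pattern named in this commit, here is a rough sketch of a cached credential plus a listener that uses SQLAlchemy's do_connect signature (dialect, conn_rec, cargs, cparams). CachedCredential, make_do_connect_hook, and the 900-second TTL are assumptions for illustration; the actual LakebaseCredentialManager and its refresh policy may differ.

```python
import threading
import time
from typing import Callable, Optional


class CachedCredential:
    """Caches a short-lived password and refetches it once the TTL has elapsed."""

    def __init__(
        self,
        fetch: Callable[[], str],
        ttl_seconds: float = 900.0,  # assumed refresh interval
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        # The lock keeps concurrent pool checkouts from refetching in parallel.
        with self._lock:
            now = self._clock()
            if self._token is None or now >= self._expires_at:
                self._token = self._fetch()
                self._expires_at = now + self._ttl
            return self._token


def make_do_connect_hook(credential: CachedCredential):
    """Build a listener matching SQLAlchemy's do_connect event signature."""

    def inject_password(dialect, conn_rec, cargs, cparams):
        # Mutating cparams in place changes what the DBAPI connect() receives,
        # so each new physical connection picks up the current credential.
        cparams["password"] = credential.get()

    return inject_password
```

With SQLAlchemy available, the hook would be registered once per engine with event.listen(engine, "do_connect", make_do_connect_hook(credential)), so the pool itself never needs to be rebuilt when the token rotates.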

Remove databricks_token from CSV upload body type, make DatabricksConfig.token optional, update ApiService/WorkshopsService docstrings to reflect SDK auth. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

When Lakebase is added as a Databricks App resource, the platform automatically creates a Postgres role for the service principal. Manual databricks_create_role() is only needed for external/additional identities outside the App resource integration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ndency
- Add summarization_enabled, summarization_model, summarization_guidance columns to WorkshopDB
- Add summary (JSON) column to TraceDB for structured milestone views
- Add corresponding Pydantic model fields and DB service methods
- Add pydantic-ai-slim[openai] dependency
- Create TRACE_SUMMARIZATION_SPEC with success criteria
- Create implementation plan
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ingestion
- PUT /workshops/{id}/summarization-settings for facilitator config
- POST /workshops/{id}/resummarize for on-demand re-summarization
- Background summarization triggered after MLflow trace ingestion
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…odelOptions The settings agent used a function name that doesn't exist in modelMapping.ts. Fixed to follow the same pattern as other components: useAvailableModels() + buildModelOptions(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…s fork The FastAPI lifespan bootstrap ran migrations in each worker process, requiring interprocess locks and never applying new migrations after initial deploy. Move migration execution to gunicorn's on_starting hook which runs exactly once in the master process before any workers fork. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
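
A sketch of what the gunicorn side of this change might look like. on_starting is a real gunicorn server hook that runs in the master process; run_migrations here is a hypothetical placeholder, not the project's migration runner.

```python
# gunicorn.conf.py (sketch)

def run_migrations() -> int:
    """Hypothetical placeholder: apply pending migrations, return how many ran."""
    return 0


def on_starting(server) -> None:
    """Gunicorn server hook, called in the master before any worker is forked."""
    applied = run_migrations()
    # The server object gunicorn passes in exposes its logger as `server.log`.
    server.log.info("migrations applied: %d", applied)
```

Because the hook runs once in the master, workers need no interprocess lock, and a redeploy picks up new migrations on the next start.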

…nd tasks
- Use resolve_databricks_token() instead of stored PAT (SDK auth compat)
- Create new SessionLocal() inside background tasks to avoid using the request-scoped DB session after it's closed
- Add logging for summarization completion
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
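
The session handling described here can be sketched as a small helper that gives each background task its own session. run_with_own_session is a hypothetical name; session_factory stands in for SessionLocal and only needs commit(), rollback(), and close().

```python
from typing import Any, Callable, TypeVar

T = TypeVar("T")


def run_with_own_session(
    session_factory: Callable[[], Any],
    work: Callable[[Any], T],
) -> T:
    """Open a session owned by the task, commit or roll back, then always close."""
    session = session_factory()
    try:
        result = work(session)
        session.commit()
        return result
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()
```

The request handler passes the factory (not its own session) to the background task, so nothing in the task touches a session that the request has already closed.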

Fix two indentation errors in workshops.py caused by removing 'if databricks_token:' gatekeeping without dedenting the body. Remove orphaned 'else: no token' branch. Update test fixtures for databricks and dbsql_export routers to match new no-arg create_databricks_service() and DBSQLExportService() APIs. 810 passed, 0 failed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Migration 0017_remove_databricks_host branched from 0016 alongside the existing 0017_add_summarization, creating multiple heads. Renumbered to 0019 with down_revision pointing to 0018. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The frontend no longer sends experiment_id (it comes from MLFLOW_EXPERIMENT_ID env var). The endpoint now resolves it from the environment when not provided in the request body. 810 passed, 0 failed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
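
A minimal sketch of the fallback described above, assuming the request value wins when present. resolve_experiment_id is a hypothetical helper name; only the MLFLOW_EXPERIMENT_ID variable comes from the commit message.

```python
import os
from typing import Mapping, Optional


def resolve_experiment_id(
    requested: Optional[str],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Prefer the request body value, else fall back to MLFLOW_EXPERIMENT_ID."""
    env = os.environ if env is None else env
    value = requested or env.get("MLFLOW_EXPERIMENT_ID")
    if not value:
        raise ValueError(
            "experiment_id missing from request and MLFLOW_EXPERIMENT_ID is unset"
        )
    return value
```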

Variable was renamed to workspace_host via get_databricks_host() but the call to _run_summarization_background still referenced the old name. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…nvalidation
- Remove duplicate QueryClient from App.tsx; main.tsx's configured client is now the single provider
- Set global staleTime: 30s to prevent refetch storms on navigation; remove 12 redundant per-hook staleTime overrides
- Add 7 selector hooks (useWorkshopPhase, useWorkshopDisplayConfig, useWorkshopMeta, useWorkshopDiscoveryConfig, useWorkshopAnnotationConfig, useWorkshopEvalConfig, useWorkshopSummarizationConfig) using TanStack Query select option, so components only re-render when their slice changes
- Migrate 15 components from useWorkshop() to selector hooks
- Fix mutation anti-patterns: remove unnecessary workshop invalidation from annotation submit; eliminate setQueryData + invalidateQueries in 4 mutations (toggle notes, JSONPath, span filter, summarization)
- Update 12 test files to mock new selector hooks
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Summarization jobs could get stuck with no way to stop them. This adds a cancel endpoint that cancels the asyncio background task and a cancel button in the SummarizationSettings progress UI. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Deleted the .env.local file containing Databricks configuration.
- Updated justfile to include log level configuration for uvicorn commands.
- Removed unused databricks_host property from Body_upload_csv_and_log_to_mlflow_workshops model.
- Made experiment_id optional in MLflowIntakeConfigCreate type.
- Added cancelSummarizationJob API method to ApiService and WorkshopsService.
- Enhanced logging in summarization background tasks and services for better traceability.
- Refactored DiscoveryService to use unified SDK authentication for Databricks LLM calls.

…roup display and focus handling
- Added a section in DraftRubricPanel to display proposed groups for faster review, including an apply and dismiss button.
- Updated DraftRubricSidebar to handle focus changes, allowing the sidebar to expand when inputs are focused.
- Improved the layout and styling of proposed group displays in both components for better user experience.

Strip wrapping quotes and whitespace from experiment IDs loaded from env and request inputs so Databricks MLflow lookups resolve correctly.
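
A small sketch of the normalization this commit describes. normalize_experiment_id is a hypothetical name, and treating an empty result as None is an assumption.

```python
from typing import Optional


def normalize_experiment_id(raw: Optional[str]) -> Optional[str]:
    """Strip surrounding whitespace and matching wrapping quotes from an ID."""
    if raw is None:
        return None
    cleaned = raw.strip()
    # Peel off matching single or double quotes, e.g. from a quoted env value.
    while len(cleaned) >= 2 and cleaned[0] == cleaned[-1] and cleaned[0] in "'\"":
        cleaned = cleaned[1:-1].strip()
    return cleaned or None
```

A value such as ' "123" ' (as can happen when an env file quotes the ID) comes back as 123.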

Improve discovery for one-at-a-time eval mode by removing trace-count selection at start, injecting workshop use-case and milestone summary context into follow-up and analysis prompts, and tracking milestone references through follow-up answers and findings evidence. Add clickable origin badges that scroll facilitators to the referenced trace or milestone in the feed.

Add question-level lineage refs and markdown link rendering so findings can cite trace milestones and follow-up questions inline, with navigation that resolves question links back to participant-selected context.

Add eval-mode workshop support with per-trace criteria CRUD, rubric rendering, scoring aggregation, and mode-gated routing/UI so eval and legacy workshop flows can evolve independently. (cherry picked from commit e85d269f7d53b67fa8fb8aeec1003da42ab0a72d)

…128) Define and normalize MLflow experiment IDs during ingestion, switch trace search to the non-deprecated locations API, and ensure Databricks hosts always include a protocol to prevent endpoint listing failures.
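
The host handling can be sketched as a one-function normalizer. ensure_protocol is a hypothetical name, and defaulting to https:// and trimming a trailing slash are assumptions.

```python
def ensure_protocol(host: str) -> str:
    """Return the host with a scheme, leaving an existing http(s):// prefix alone."""
    host = host.strip().rstrip("/")
    if host.startswith(("https://", "http://")):
        return host
    return f"https://{host}"
```

This covers both cases mentioned in the PR: a bare hostname gains a scheme, and a value that already starts with https:// is not prefixed twice.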

Introduce a social discovery mode with threaded trace/milestone comments, voting, facilitator @assistant/@agent mentions, and SSE streaming updates while keeping analysis mode behind a toggle. Also include eval-mode regression hotfixes for Databricks/MLflow intake behavior with expanded unit coverage.

Add facilitator-only comment deletion and make social thread vote/delete interactions respond instantly with optimistic updates, so moderation and rating actions feel reliable under live SSE updates.

…ation timeout
- Extend get_display_text with optional milestone context enrichment so LLM judges can reason about agent trajectory, not just final response
- JudgeService now includes trace milestone summaries when evaluating (normal workshop mode)
- Fix summarization background job idle-in-transaction timeout by using short-lived DB sessions per write instead of one long-lived session across the entire batch of LLM calls

The lockfile was generated against pypi-proxy.dev.databricks.com which is unavailable in the deployment environment.

Drop client package-lock.json and add client .npmrc to avoid proxy-pinned tarball URLs during Databricks Apps npm installs.

forrestmurray-db and others added 30 commits April 10, 2026 10:51

chore: add .claude/worktrees/ to gitignore

bbd882c

feat(summarization): add PydanticAI-based trace summarization service…

ecf37de

… with batch support Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat(summarization): add MilestoneView component with tab toggle in T…

89f26dc

…raceViewer Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat(summarization): add facilitator settings UI for trace summarization

98f37fd

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat(summarization): regenerate API client with summarization endpoints

c3d4e6f

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chore(deploy): exclude .claude and htmlcov from databricks sync

50883f2

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge fix/async-models-endpoint-and-prefetch into release/v1.10.0

c148648

Merge feature/sdk-auth-migration into release/v1.10.0

8c46900

# Conflicts: # specs/BUILD_AND_DEPLOY_SPEC.md

forrestmurray-db and others added 30 commits April 15, 2026 11:40

fix(auth): fix workspace_url -> workspace_host in resummarize endpoint

dae3088

Merge branch 'chore/tanstack-refactor' into release/v1.10.0

75895f6

add eval mode docs and plans

02088a7

Merge branch 'feat/summarization-cancel' into release/v1.10.0

9ff4dfa

chore: update discovery sidebar expansion

a0d515a

fix: normalize MLflow experiment ID inputs

d5eca67

reformat value_from --> valueFrom

d748db8

experiment -> env from resources

90a0ce9

rename default experiment value key

34a32ae

feat: add inline evidence lineage links for discovery findings

a2476d5

fix: add facilitator comment moderation controls

1c39694

fix(build): remove unused @databricks/design-system dependency

3820a83

fix(build): repoint uv.lock from Databricks proxy to pypi.org

3aba148

fix(build): add pinned requirements.txt for app installs

2424a49

fix(build): force npmjs registry for app builds

6bd0f27

fix(app): stabilize Databricks app startup diagnostics

c62a38a

fix(release): integrate DNB alignment hotfixes (#147)

7f1691f

fix(migration): use postgres-safe boolean defaults

97b397f

forrestmurray-db wants to merge 88 commits into main from release/v1.10.0
