This document explains how to work with the Casper's Kitchens repository through Claude.
Casper's Kitchens uses a three-layer architecture:
Purpose: Infrastructure deployment - "put the pieces in place"
- Defined in databricks.yml
- Syncs files to workspace
- Creates the Job definition itself
- Manages bundle-level resources
Deploy: databricks bundle deploy -t <target>
Purpose: Runtime orchestration - "give users a dashboard to operate it"
- Main job: "Casper's Initializer"
- Stage-based tasks with dependency management
- Runtime parameterization (CATALOG, LLM_MODEL, etc.)
- UI-driven control (users can select which tasks to run)
- Visual DAG in Databricks Jobs UI
- Observable execution (logs, retries, task-level reruns)
Run: databricks bundle run caspers [--params "CATALOG=mycatalog"]
Why not just use DABs? DABs deploys infrastructure, but the Job provides:
- Runtime flexibility (change parameters without redeploying)
- Non-technical access (no CLI required)
- Selective execution (pick which demo components to run)
- Built-in workflow management (retries, parallelization, observability)
Purpose: Track dynamically created resources - "enable cleanup"
The Problem: When stages run, they create resources that DABs doesn't know about:
- Catalogs (with runtime-specified names via CATALOG parameter)
- Delta pipelines
- Model endpoints
- Databricks Apps
- Database instances (Lakebase)
- Volumes, schemas, tables
Running databricks bundle destroy only removes the Job definition - all runtime resources are orphaned!
The Solution: /utils/uc_state/ tracks all created resources in a UC table (<CATALOG>._caspers_state.resources)
Cleanup workflow:
databricks bundle run cleanup # Delete runtime resources via uc_state
databricks bundle destroy # Delete bundle resources via DABs
Location: /utils/uc_state/ - See README.md for API usage
Each target is a preset configuration for different demo scenarios, defined in databricks.yml.
# Read the databricks.yml file - look for the "targets:" section
# Each target has:
# - A name (default, complaints, free, all, etc.)
# - A "tasks:" list defining which stages to run
# - A "parameters:" list with default values
Key insights about targets:
- Shared stages - Some stages appear under multiple targets; these must work in every target that uses them
- Target-specific stages - Some stages only appear in certain targets
- Parameter variance - Different targets may have different parameters or different default values
When making changes, always check databricks.yml to understand:
- Which target(s) does this affect?
- Are there stages that are shared vs unique to this target?
- Are there different parameters for different targets?
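One way to answer the "shared vs unique" question is to invert the target-to-stage mapping. The sketch below assumes you have already extracted, for each target, the list of stage names from databricks.yml; the dictionary shape, the function names, and the example data are illustrative assumptions, not the repository's actual configuration.

```python
from collections import defaultdict


def stage_usage(targets: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each stage name to the sorted list of targets that include it."""
    usage: dict[str, list[str]] = defaultdict(list)
    for target, stages in targets.items():
        for stage in stages:
            usage[stage].append(target)
    return {stage: sorted(names) for stage, names in usage.items()}


def shared_stages(targets: dict[str, list[str]]) -> dict[str, list[str]]:
    """Keep only the stages that appear in more than one target."""
    return {s: t for s, t in stage_usage(targets).items() if len(t) > 1}


# Illustrative data only; read the real task lists from databricks.yml.
example_targets = {
    "default": ["canonical_data", "refunder_agent", "lakeflow"],
    "free": ["canonical_data", "lakeflow"],
}
print(shared_stages(example_targets))
```

Any stage returned by `shared_stages` needs to be tested against every target listed next to it.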
Stages are thin orchestration layers that:
- Accept parameters from the Job
- Call other notebooks/scripts (the actual implementation)
- Create resources and register them with uc_state
Stages are NOT where the work happens - they orchestrate other code.
When asked to work on a feature, follow this process:
- Find the stage in /stages/<stage_name>.ipynb
- Read the stage notebook to understand:
  - What parameters does it accept? (look for dbutils.widgets.get())
  - What does it call? (look for %run commands, dbutils.notebook.run(), or API calls like w.pipelines.create())
  - What resources does it create? (look for API calls to create resources)
  - What does it register with uc_state? (look for state.add() calls)
- Follow the call chain:
  - If it calls another notebook with %run /path/to/notebook, read that notebook
  - If it creates a resource with a path parameter (e.g., notebook_path), read that notebook
  - If it references code in /data/, /jobs/, /apps/, or /pipelines/, read that code
- Understand dependencies:
  - Check databricks.yml for the depends_on field to see what must run before this stage
  - Understand what data/resources this stage expects to exist
  - Understand what data/resources this stage produces for downstream stages
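The "read the stage notebook" step can be partly mechanized. The following sketch scans the code cells of an .ipynb file for the markers listed above (`dbutils.widgets.get()`, `%run`, `dbutils.notebook.run()`, `state.add()`). The function name, the regular expressions, and the shape of the returned dictionary are assumptions made for illustration; it is a first pass, not a substitute for reading the notebook.

```python
import json
import re

PATTERNS = {
    # Widget names passed to dbutils.widgets.get("NAME")
    "parameters": re.compile(r"dbutils\.widgets\.get\(\s*[\"']([^\"']+)[\"']"),
    # Notebook paths following a %run magic at the start of a line
    "run_targets": re.compile(r"^\s*%run\s+(\S+)", re.MULTILINE),
    # Notebook paths passed to dbutils.notebook.run("path", ...)
    "notebook_runs": re.compile(r"dbutils\.notebook\.run\(\s*[\"']([^\"']+)[\"']"),
}


def scan_notebook(ipynb_text: str) -> dict:
    """Collect parameters, call targets, and state.add() counts from code cells."""
    notebook = json.loads(ipynb_text)
    findings = {name: [] for name in PATTERNS}
    findings["state_add_calls"] = 0
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # ignore markdown cells
        source = cell.get("source", "")
        if isinstance(source, list):  # .ipynb stores source as a list of lines
            source = "".join(source)
        for name, pattern in PATTERNS.items():
            findings[name].extend(pattern.findall(source))
        findings["state_add_calls"] += len(re.findall(r"\bstate\.add\(", source))
    return findings
```

Usage would look like `scan_notebook(open("stages/<stage_name>.ipynb").read())`; each path it reports is the next file to read in the call chain.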
Question: "How does the data generator work?"
Process:
- Check databricks.yml - which target am I working with? Which stage generates data?
- For default target, I see a task called Raw_Data with notebook_path: ${workspace.root_path}/stages/raw_data
- Read /stages/raw_data.ipynb - what does it call?
- Follow any %run commands or notebook paths to find the actual generator implementation
- For free target, I see a task called Canonical_Data - this is a different data generation approach
- Read /stages/canonical_data.ipynb and follow its call chain
The key: Don't assume you know the structure - trace it by reading the files.
# List top-level directories
ls -la .
# Each directory typically has a purpose:
# /stages/ - Look here for stage orchestrators
# /data/ - Look here for data generation code
# /jobs/ - Look here for streaming job implementations
# /apps/ - Look here for Databricks App code
# /pipelines/ - Look here for pipeline transformation logic
# /utils/ - Look here for shared utilities
# /demos/ - Look here for standalone demo materials
- Stage orchestrator → /stages/<name>.ipynb
  - Accepts Job parameters
  - Calls implementation code
  - Creates resources
  - Registers with uc_state
  - Keep thin - no business logic here
- Implementation → Depends on type:
  - Data generation → /data/
  - Streaming job → /jobs/
  - App → /apps/
  - Pipeline → /pipelines/
  - Utility → /utils/
- Resource metadata → Tracked in uc_state
This is the most common modification pattern.
- Understand the requirement:
  - What does this stage need to do?
  - Which target(s) should it be in?
  - What stages must run before it?
  - What resources will it create?
- Check for similar stages:
  - Look at existing stages in /stages/ to understand patterns
  - Find a similar stage to use as a template
- Create the stage orchestrator:
  - Location: /stages/<name>.ipynb
  - Accept Job parameters via dbutils.widgets.get()
  - Call implementation code (create this next)
  - Create resources and register with uc_state
- Create the implementation:
  - Determine appropriate location based on type (data/jobs/apps/pipelines)
  - Put business logic here, not in the stage orchestrator
- Update databricks.yml:
  - Add task to relevant target(s) under tasks: section
  - Define dependencies with depends_on (which stages must run first?)
  - Add any new parameters to parameters: section
  - Look at existing tasks for patterns to follow
- Test (see Testing section below)
- Expose as Job parameter when:
  - Different values needed for different targets
  - Users might want to customize (catalog name, model choice)
  - Behavior should differ by environment (e.g., continuous vs triggered pipeline mode)
- Hardcode when:
  - Same value always used
  - Internal implementation detail
  - No variation needed
How to check: Look at existing parameters in databricks.yml to see patterns.
Testing is intricate: every iteration requires a full cleanup, redeploy, and run cycle.
Before deploying or making changes, check if there's already an active deployment.
Quick check:
# Check if .databricks/bundle exists
ls -la .databricks/bundle/
# If it exists, check which target(s) are deployed
ls -la .databricks/bundle/
Detailed deployment info:
- View deployment metadata:
  # Shows deployment ID, timestamp, and full file list
  cat .databricks/bundle/<target>/deployment.json | jq '.'
  # Quick summary
  cat .databricks/bundle/<target>/deployment.json | jq '{
    deployment_id: .id,
    timestamp: .timestamp,
    cli_version: .cli_version,
    file_count: (.files | length)
  }'
- Check deployed resources:
  # See what Databricks resources were created
  cat .databricks/bundle/<target>/terraform/terraform.tfstate | jq -r '
    .resources[]
    | select(.type == "databricks_job")
    | {name: .name, id: .instances[0].attributes.id, url: .instances[0].attributes.url}
  '
- Get job URL directly:
  # Extract the Casper's Initializer job URL
  cat .databricks/bundle/<target>/terraform/terraform.tfstate | \
    jq -r '.resources[] | select(.type == "databricks_job" and .name == "caspers") | .instances[0].attributes.url'
- Check when last deployed:
  # See deployment timestamp
  cat .databricks/bundle/<target>/deployment.json | jq -r '.timestamp'
What you can learn:
- Is there an active deployment? (directory exists)
- Which target was deployed? (subdirectory name: default, free, complaints, all)
- When was it deployed? (timestamp in deployment.json)
- What files were synced? (files array in deployment.json)
- What resources exist? (terraform.tfstate)
- What's the job URL? (terraform state for databricks_job)
Example output:
{
  "deployment_id": "83e2c41e-41c0-4642-b77a-25104a76ea47",
  "timestamp": "2026-01-08T21:16:49.957861Z",
  "cli_version": "0.275.0",
  "file_count": 78
}
{
  "name": "caspers",
  "id": "653921609021816",
  "url": "https://dbc-cad2eaf0-2163.cloud.databricks.com/#job/653921609021816"
}
Use cases:
- Before hot fix: Check if there's a running deployment to preserve
- Before full redeploy: See what's currently deployed
- Debugging: Verify which files were actually synced
- Finding job URL: Get direct link to Databricks Jobs UI
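If jq is not available, the same quick summary can be produced with the Python standard library. This sketch reads the fields used in the jq queries above (`id`, `timestamp`, `cli_version`, `files`); the function name and the `repo_root` parameter are assumptions for illustration.

```python
import json
from pathlib import Path
from typing import Optional


def deployment_summary(target: str, repo_root: str = ".") -> Optional[dict]:
    """Summarize .databricks/bundle/<target>/deployment.json, or None if absent."""
    path = Path(repo_root) / ".databricks" / "bundle" / target / "deployment.json"
    if not path.is_file():
        return None  # no local record of a deployment for this target
    data = json.loads(path.read_text())
    return {
        "deployment_id": data.get("id"),
        "timestamp": data.get("timestamp"),
        "cli_version": data.get("cli_version"),
        "file_count": len(data.get("files", [])),
    }
```

A `None` result means there is no local deployment record for that target, which is also what you would expect right after deleting the local cache.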
Cache bug: After bundle destroy, redeployment sometimes doesn't fully redeploy due to stale local state in .databricks/ and .bundle/ directories. Files may appear to deploy successfully but are not actually synced to the workspace.
Workaround: Delete local cache directories before redeploying:
rm -rf .databricks .bundle
- Clean existing deployment
  databricks bundle run cleanup --params "CATALOG=testcatalog"
  databricks bundle destroy -t <target>
- Clear local cache (if experiencing cache issues)
  rm -rf .databricks .bundle
- Deploy changes
  databricks bundle deploy -t <target>
- CRITICAL: Verify files were synced to workspace
  Before running anything, verify files were actually deployed. This is critical due to the known cache bug.
  Step 1: Understand sync rules
  Check databricks.yml for the sync: section to see what's included/excluded:
  cat databricks.yml | grep -A 10 "^sync:"
  Step 2: Get workspace path
  # Extract workspace path from databricks.yml
  USER=$(databricks current-user me --output json | jq -r .userName)
  WORKSPACE_PATH="/Workspace/Users/$USER/$(cat databricks.yml | grep root_path | grep -o 'caspers-kitchens[^"]*' | head -1)"
  echo "Workspace path: $WORKSPACE_PATH"
  Step 3: Verify critical files for your target
  Check that stage files needed for your target exist:
  # List all stages that were deployed
  databricks workspace list $WORKSPACE_PATH/stages 2>&1
  # For specific target, verify its stages exist:
  # - free target needs: canonical_data, lakeflow
  # - default target needs: canonical_data, refunder_agent, refunder_stream, lakebase, apps, lakeflow
  # Check databricks.yml under targets.<target>.tasks to see which stages are needed
  Step 4: Verify excluded directories are NOT synced
  Based on sync.exclude in databricks.yml, check that those directories don't exist:
  # Example: Check .git is excluded
  databricks workspace list $WORKSPACE_PATH/.git 2>&1 | grep "doesn't exist"
  # If you see "doesn't exist", it's correctly excluded ✓
  # If you see the directory contents, something went wrong ✗
  Example verification script:
  echo "=== Verifying File Sync ===" && \
  echo "Key stages:" && \
  databricks workspace list $WORKSPACE_PATH/stages 2>&1 && \
  echo "" && \
  echo "Checking excluded .git:" && \
  databricks workspace list $WORKSPACE_PATH/.git 2>&1
  If files are missing or stale:
  - Delete local cache: rm -rf .databricks .bundle
  - Redeploy: databricks bundle deploy -t <target>
  - Verify again before proceeding
  Important: Don't check file contents unless debugging a specific issue. File existence is what matters for sync verification. A fresh deploy is a fresh deploy.
- Run the job
  databricks bundle run caspers --params "CATALOG=testcatalog"
- Validate
  - Manual: Check Databricks UI for expected resources
  - Automated: Use Databricks SDK/API to query resources (see below)
This requires deep Databricks API knowledge - specifics depend on what you're building.
You can use the Databricks SDK to programmatically validate:
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Check catalog exists
try:
    catalog = w.catalogs.get("testcatalog")
    print(f"✅ Catalog created: {catalog.name}")
except Exception as exc:  # e.g. the catalog is missing or not visible to you
    print(f"❌ Catalog not found: {exc}")

# Check pipeline exists (names can be None, so guard before lower())
pipelines = w.pipelines.list_pipelines()
caspers_pipelines = [p for p in pipelines if "caspers" in (p.name or "").lower()]
print(f"✅ Found {len(caspers_pipelines)} pipelines")

# Check endpoint exists
try:
    endpoint = w.serving_endpoints.get("caspers_refund_agent")
    print(f"✅ Endpoint exists: {endpoint.state}")
except Exception as exc:
    print(f"❌ Endpoint not found: {exc}")

# Check app exists
apps = w.apps.list()
caspers_apps = [a for a in apps if "refund" in (a.name or "").lower()]
print(f"✅ Found {len(caspers_apps)} apps")

# Check job exists and status
jobs = w.jobs.list()
caspers_jobs = [j for j in jobs if j.settings and "Casper" in (j.settings.name or "")]
if caspers_jobs:
    job_id = caspers_jobs[0].job_id
    # list_runs returns an iterator, not a list, so take only its first item
    latest_run = next(iter(w.jobs.list_runs(job_id=job_id, limit=1)), None)
    if latest_run:
        print(f"✅ Latest run: {latest_run.state.life_cycle_state}")
The validation logic is dynamic - you'll need to determine the right API calls based on what you're building.
When to use this: A pipeline is running, one stage failed, and you need to fix it without disrupting the entire deployment.
- Pipeline is running with live data flowing
- One task/stage fails (e.g., Refund_Recommender_Agent has a schema error)
- Other stages are working fine
- Full redeploy would disrupt everything and waste resources
Steps:
- Diagnose the issue
  - Check the task failure logs in Databricks UI
  - Identify which notebook/code needs fixing
- Export the current workspace notebook
  # Get your workspace path from databricks.yml (workspace.file_path)
  # Default: /Workspace/Users/<email>/caspers-kitchens-demo
  databricks workspace export \
    /Workspace/Users/<your-email>/caspers-kitchens-demo/stages/<stage_name> \
    --format SOURCE
  This shows you the current code in the workspace.
- Create the fixed version
  Create a fixed notebook file locally (e.g., /tmp/stage_fixed.py) with your changes.
- Push the fix to workspace
  databricks workspace import \
    --file /tmp/stage_fixed.py \
    --language PYTHON \
    --format SOURCE \
    --overwrite \
    /Workspace/Users/<your-email>/caspers-kitchens-demo/stages/<stage_name>
- Verify the fix was applied
  # Export again and check for your changes
  databricks workspace export \
    /Workspace/Users/<your-email>/caspers-kitchens-demo/stages/<stage_name> \
    --format SOURCE | grep "<search_term>"
- Repair the failed task in Databricks UI
  - Navigate to Jobs UI
  - Find the "Casper's Initializer" job run
  - Click on the failed task
  - Click "Repair" button
  - The task will rerun with the fixed code
- CRITICAL: Backport changes to local repo
  Once the repair succeeds, you MUST backport the changes:
  # Apply the same fix to your local files
  # Edit ./stages/<stage_name>.ipynb with the same changes
  # Commit to maintain sync
  git add stages/<stage_name>.ipynb
  git commit -m "Fix: <description of the fix>"
  git push
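Because the import step overwrites the workspace copy, it can help to review exactly what will change before pushing. The sketch below diffs the exported source against the local fix with the standard library's `difflib`; the function names are hypothetical, and both inputs are assumed to be plain-text SOURCE-format notebooks that you have already read into strings.

```python
import difflib


def review_fix(exported_source: str, fixed_source: str) -> list[str]:
    """Return a unified diff between the workspace export and the local fix."""
    return list(
        difflib.unified_diff(
            exported_source.splitlines(),
            fixed_source.splitlines(),
            fromfile="workspace (exported)",
            tofile="local fix",
            lineterm="",
        )
    )


def changed_lines(diff: list[str]) -> list[str]:
    """Keep only added/removed lines; the first two entries are file headers."""
    return [line for line in diff[2:] if line.startswith(("+", "-"))]
```

An empty diff means the fixed file is identical to what is already in the workspace, so pushing it would not change the failing task. The same diff is a convenient record of what to backport to the local .ipynb afterwards.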
Problem: The refunder agent failed with "column 'location' cannot be resolved" because the canonical dataset uses location_id instead of location.
Fix applied:
# 1. Create fixed notebook with JOIN clause
cat > /tmp/refunder_agent_fixed.py << 'EOF'
# Databricks notebook source
# ... (notebook content with fixes)
# Changed: SELECT body, event_type, order_id, ts, location
# To: SELECT ae.body, ae.event_type, ae.order_id, ae.ts, loc.name as location
# FROM ${CATALOG}.lakeflow.all_events ae
# LEFT JOIN ${CATALOG}.simulator.locations loc ON ae.location_id = loc.location_id
EOF
# 2. Push to workspace
databricks workspace import \
--file /tmp/refunder_agent_fixed.py \
--language PYTHON \
--format SOURCE \
--overwrite \
/Workspace/Users/nick.karpov@databricks.com/caspers-kitchens-demo/stages/refunder_agent
# 3. Verify
databricks workspace export \
/Workspace/Users/nick.karpov@databricks.com/caspers-kitchens-demo/stages/refunder_agent \
--format SOURCE | grep "LEFT JOIN"
# 4. In UI: Click "Repair" on failed task
# 5. Backport to local
# Edit ./stages/refunder_agent.ipynb with same changes
git add stages/refunder_agent.ipynb
git commit -m "Fix: Add JOIN for location_id schema change"
DO:
- Use this for emergency fixes during active pipelines
- Always verify changes before repairing
- Always backport to local repo after success
- Document what you fixed and why
DON'T:
- Use this as primary development workflow
- Skip backporting (leads to local/remote drift)
- Apply complex multi-file changes this way
- Forget to commit after backporting
Use full redeploy instead if:
- Multiple files need changes
- Changes affect bundle configuration (databricks.yml)
- No active pipeline running (nothing to preserve)
- Testing phase (not production repair)
- Why fragile: If stages don't register resources, cleanup fails and resources are orphaned
- When touching: Any time you create Databricks resources (catalogs, pipelines, endpoints, apps, database instances)
- Best practice: Always call state.add() immediately after resource creation
- Location: Check /utils/uc_state/README.md for API details
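The "register immediately" rule can be enforced by wrapping creation and registration in one helper, so no later step can fail in between. The sketch below is illustrative only: `InMemoryState` is a stand-in written for this example, and the `add(kind, resource_id)` signature is an assumption; check /utils/uc_state/README.md for the real uc_state API.

```python
from typing import Callable


class InMemoryState:
    """Stand-in for uc_state, used only to make this example runnable."""

    def __init__(self) -> None:
        self.resources: list[tuple[str, str]] = []

    def add(self, kind: str, resource_id: str) -> None:  # hypothetical signature
        self.resources.append((kind, resource_id))


def create_and_register(state, kind: str, create: Callable[[], str]) -> str:
    """Create a resource and record it before any later step can fail."""
    resource_id = create()
    state.add(kind, resource_id)  # register right away so cleanup can find it
    return resource_id


state = InMemoryState()
pipeline_id = create_and_register(state, "pipelines", lambda: "pipeline-123")
```

If a later step in the stage raises, the pipeline is already recorded, so a cleanup run can still find and delete it.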
- Why fragile: Downstream stages depend on specific table schemas
- When touching: Pipeline definitions, transformations, or any stage that creates tables
- Best practice:
  - Trace the full dependency chain first (what consumes this data?)
  - Check all stages whose depends_on includes this stage
  - Understand what schema they expect
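A schema check of this kind reduces to a set difference between the columns a table provides and the columns each consumer expects. The sketch below is a hedged illustration: the function name and the consumer mapping are hypothetical, and the column sets would need to be gathered by hand or from the tables themselves.

```python
def schema_breaks(produced: set[str], consumers: dict[str, set[str]]) -> dict[str, set[str]]:
    """Return, per consumer, the expected columns that the producer does not provide."""
    breaks = {}
    for name, needed in consumers.items():
        missing = needed - produced
        if missing:
            breaks[name] = missing
    return breaks


# Illustrative column sets modeled on the location/location_id mismatch.
produced_columns = {"body", "event_type", "order_id", "ts", "location_id"}
expected_by_consumer = {
    "refunder_agent": {"body", "event_type", "order_id", "ts", "location"},
}
print(schema_breaks(produced_columns, expected_by_consumer))
```

With the illustrative data above, the check flags that `refunder_agent` expects a `location` column the producer no longer provides, which is the same class of failure as the "column 'location' cannot be resolved" error.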
- Why fragile: Complex time-based state management for streaming
- Location: Check /data/canonical/ for canonical streaming source
- When touching: Streaming replay behavior, checkpoint logic
- Best practice: Understand the checkpoint mechanism before modifying
- See: /data/canonical/README.md for details
- Why fragile: Changes to shared stages must work across all targets
- When touching: Any stage that appears in multiple targets in databricks.yml
- Best practice:
  - Check which targets use this stage
  - Test all affected targets, not just one
  - Consider if behavior should be parameterized per-target
- Why fragile: Parameters flow through multiple layers: Job → Stage → Implementation
- When touching: Adding new parameters or modifying parameter handling
- Best practice:
  - Trace the full parameter path
  - Ensure it's plumbed through all layers
  - Check databricks.yml parameters, stage parameter parsing, and implementation usage
- Why fragile: Stages create resources that others depend on (endpoints, tables, etc.)
- When touching: Creation/deletion order, stage dependencies
- Best practice:
  - Check databricks.yml for depends_on relationships
  - Understand the full dependency graph
  - Ensure uc_state deletion happens in correct order
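A depends_on graph can be turned into a concrete ordering with the standard library's `graphlib` (Python 3.9+). In the sketch below the stage names and their dependencies are illustrative, and treating deletion order as the reverse of run order is an assumption about what "correct order" means here, not something confirmed by uc_state.

```python
from graphlib import TopologicalSorter


def run_order(depends_on: dict[str, list[str]]) -> list[str]:
    """Order stages so each one comes after everything it depends on."""
    # TopologicalSorter expects a mapping of node -> predecessors,
    # which is exactly the shape of a depends_on mapping.
    return list(TopologicalSorter(depends_on).static_order())


def deletion_order(depends_on: dict[str, list[str]]) -> list[str]:
    """Assumed teardown order: dependents are removed before their dependencies."""
    return list(reversed(run_order(depends_on)))


# Hypothetical dependencies; read the real ones from databricks.yml.
example = {
    "canonical_data": [],
    "lakeflow": ["canonical_data"],
    "refunder_agent": ["lakeflow"],
}
print(run_order(example))
print(deletion_order(example))
```

`TopologicalSorter` raises `graphlib.CycleError` if the dependencies are circular, which is itself a useful check when editing depends_on.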
- Why fragile: Local state in .databricks/ and .bundle/ can become stale
- Symptom: bundle deploy succeeds but files are not actually synced to workspace
- When it happens: After bundle destroy and redeploy
- Best practice:
  - Always verify files in workspace UI after deploying
  - Delete .databricks/ and .bundle/ before redeploying if files are stale
  - Check timestamps in workspace to confirm files updated
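For scripted test cycles, the cache-clearing step can be written as a small helper that is equivalent to `rm -rf .databricks .bundle` but reports what it removed. The function name and the `repo_root` parameter are assumptions for illustration.

```python
import shutil
from pathlib import Path

CACHE_DIRS = (".databricks", ".bundle")


def clear_bundle_cache(repo_root: str = ".") -> list[str]:
    """Delete the local bundle state directories; return the names removed."""
    removed = []
    for name in CACHE_DIRS:
        path = Path(repo_root) / name
        if path.is_dir():
            shutil.rmtree(path)
            removed.append(name)
    return removed
```

An empty return value means there was no local state to clear, so a stale cache is unlikely to be the cause of a sync problem.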
Context: Migrating from live generator to canonical dataset approach
Preferred data source: /data/canonical/ (pre-generated 90-day dataset with streaming replay)
How to check current state:
- Look at databricks.yml under each target
- Check which stage is used for data generation (look for tasks with raw_data or canonical_data)
- free target uses canonical_data stage
- Other targets may still use raw_data stage (old generator)
Goal: Make canonical the default for all targets
Why canonical is better:
- Reliable (no dying generators that can't restart)
- Flexible (start at any day, run at any speed)
- Portable (34.5 MB file, easy to ship)
- Reproducible (same dataset across all environments)
See: /data/canonical/README.md for comprehensive documentation
When working on specific areas, read these docs:
- Target structure & stage dependencies: databricks.yml
- Canonical data source: /data/canonical/README.md
- State management: /utils/uc_state/README.md
- Overall project context: README.md
- Claude guide (this file): claude.md
Pattern: READMEs in subdirectories provide detailed documentation for that area.
- Understand the requirement:
  - What feature/fix is being requested?
  - Which target(s) is this for?
- Discover the relevant code:
  - Check databricks.yml to understand target structure
  - Find relevant stage(s) in /stages/
  - Trace execution tree by reading stage → implementation
  - Identify what resources are created/consumed
- Assess impact:
  - Which stages are affected?
  - Are any shared stages involved? (check which targets use them)
  - What downstream stages depend on this?
  - Will schema/interface change affect others?
- Plan the change:
  - What code needs to be modified?
  - Does databricks.yml need updating?
  - Should this be parameterized?
  - How will uc_state track any new resources?
- Make the change:
  - Follow code organization patterns
  - Keep stages thin (orchestration only)
  - Put business logic in implementation files
  - Register resources with uc_state
- Validate:
  - Describe testing approach
  - Provide validation code if applicable
  - Consider which targets need testing
- About scope: "Which target(s) should this apply to?"
- About parameters: "Should this be configurable at runtime?"
- About testing: "How do you want to validate this?"
- About structure: "I need to trace the execution tree - let me read X first"
Always read before assuming:
- Don't hardcode assumptions about stage structure
- Trace execution by reading files
- Check databricks.yml for actual target configuration
- Look for READMEs in subdirectories for detailed docs
databricks bundle deploy -t <target>
databricks bundle run caspers [--params "CATALOG=mycatalog"]
databricks bundle run cleanup [--params "CATALOG=mycatalog"]
databricks bundle destroy -t <target>
rm -rf .databricks .bundle # If cache issues
# See all targets and their structure
cat databricks.yml
# See all stages
ls -la stages/
# See directory organization
ls -la .
- Which target(s) is this for? (check databricks.yml)
- Which stage(s) will this touch? (look in /stages/)
- What does that stage call? (read the stage notebook and trace execution)
- What resources will be created? (look for resource creation APIs)
- What depends on this stage? (check depends_on in databricks.yml)
- Is this stage shared across targets? (check if it appears in multiple targets)
- Should this be parameterized? (look at existing parameters for patterns)
- How will we validate it works? (manual UI check or API validation)
- After deployment, did files actually sync? (check workspace UI before running)
Core Principle: The repository structure is dynamic. Always discover by reading files rather than assume based on hardcoded documentation. Trace execution trees, check dependencies, and understand the full context before making changes.