DocumentDrivenDX
diff --git a/Makefile b/Makefile
Lines changed: 3 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 18 additions & 0 deletions
diff --git a/examples/Dockerfile.demo b/examples/Dockerfile.demo
Lines changed: 26 additions & 0 deletions
diff --git a/examples/SCREENCAST_SCRIPT.md b/examples/SCREENCAST_SCRIPT.md
Lines changed: 226 additions & 0 deletions
@@ -47,6 +47,9 @@ test-unit: ## Run unit tests only
 test-integration: ## Run integration tests only
 	uv run pytest tests/integration/
 
+test-demo: ## Run demo script as acceptance test
+	uv run python examples/demo.py
+
 coverage: ## Run tests with coverage report
 	uv run pytest --cov=src --cov-report=term-missing --cov-report=html
 
 
@@ -18,6 +18,24 @@ Python library for working with table schemas in Universal Metadata Format (UMF)
 - **Domain Type Inference**: Automatic detection of domain types (SSN, NPI, phone, state codes, etc.)
 - **Change Management**: UMF diffing, atomic change application, and git-based changelogs
 
+## Demo
+
+![tablespec demo](examples/tablespec-demo.gif)
+
+The demo walks through loading a UMF schema, generating SQL/PySpark/JSON schemas, type mappings, domain type inference, Great Expectations baseline generation, LLM prompt generation, UMF diffing, and PySpark validation with sample data generation.
+
+Run it yourself:
+
+```bash
+# Run the demo (requires tablespec[spark])
+uv run python examples/demo.py
+
+# Run as acceptance test
+uv run pytest tests/integration/test_demo.py
+```
+
+A [narrated screencast](examples/tablespec-demo-narrated.mp4) and [asciinema recording](examples/tablespec-demo.cast) are also available.
+
 ## Installation
 
 ### Using uv (recommended)
 
@@ -0,0 +1,26 @@
+FROM eclipse-temurin:21-jre-noble
+
+# System deps
+RUN apt-get update && apt-get install -y --no-install-recommends \
+        python3 python3-pip python3-venv curl ca-certificates git \
+    && rm -rf /var/lib/apt/lists/*
+
+# Install uv
+COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
+
+WORKDIR /app
+
+# Copy project files
+COPY pyproject.toml uv.lock README.md ./
+COPY src/ src/
+COPY examples/ examples/
+
+# Install tablespec with spark extra
+RUN uv sync --extra spark --no-dev --frozen
+
+# Suppress Spark noise
+ENV SPARK_LOG_LEVEL=ERROR
+ENV PYTHONUNBUFFERED=1
+
+# Discard stderr in the entrypoint to hide Spark log noise (this also hides Python tracebacks)
+ENTRYPOINT ["sh", "-c", "uv run python examples/demo.py 2>/dev/null"]
@@ -0,0 +1,226 @@
+# tablespec Screencast Script
+
+**Runtime:** ~3 minutes
+**Tools:** Docker, VHS (charmbracelet/vhs)
+**Record:** `vhs examples/demo.tape`
+
+---
+
+## COLD OPEN
+
+> Terminal appears with a dark theme. Two comment lines fade in:
+
+```
+# tablespec — Universal Metadata Format for table schemas
+# A complete walkthrough: schema loading, generation, validation, and PySpark
+```
+
+**NARRATOR:** tablespec is a Python library for defining, validating, and
+generating table schemas using a single YAML-based format called UMF —
+Universal Metadata Format. Let's see what it can do.
+
+---
+
+## SCENE 1 — "The Schema"
+
+> The example UMF YAML file is displayed on screen.
+
+```yaml
+version: "1.0"
+table_name: Medical_Claims
+description: Healthcare claims and billing information
+columns:
+  - name: claim_id
+    data_type: VARCHAR
+    length: 50
+    nullable:
+      MD: false   # Medicaid
+      MP: false   # Medicare Part D
+      ME: false   # Medicare
+  - name: claim_amount
+    data_type: DECIMAL
+    precision: 10
+    scale: 2
+    nullable:
+      MD: true
+      MP: true
+      ME: true
+  - name: provider_id
+    data_type: VARCHAR
+    length: 20
+```
+
+**NARRATOR:** This is a UMF schema for a healthcare claims table. Three
+columns, each with a data type and a size (length, or precision and scale).
+Nullability is set per Line of Business (LOB): Medicaid, Medicare Part D,
+and Medicare. One YAML file is the single source of truth for everything downstream.
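The per-LOB nullable setting in Scene 1 can be read as a small lookup. The sketch below is illustrative only: it uses plain dicts rather than tablespec's Pydantic model, `required_columns` is a hypothetical helper, and treating a missing nullable entry as "nullable" is an assumption, not documented tablespec behavior.

```python
# Plain-dict stand-in for the YAML shown in Scene 1 (not the tablespec model).
COLUMNS = [
    {"name": "claim_id", "data_type": "VARCHAR", "length": 50,
     "nullable": {"MD": False, "MP": False, "ME": False}},
    {"name": "claim_amount", "data_type": "DECIMAL", "precision": 10, "scale": 2,
     "nullable": {"MD": True, "MP": True, "ME": True}},
    {"name": "provider_id", "data_type": "VARCHAR", "length": 20},
]


def required_columns(columns, lob):
    """Return the names of columns that must be non-null for one LOB code.

    Assumption: a column with no nullable entry for the LOB counts as nullable.
    """
    return [
        col["name"]
        for col in columns
        if col.get("nullable", {}).get(lob, True) is False
    ]


print(required_columns(COLUMNS, "MD"))  # ['claim_id']
```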
+
+---
+
+## SCENE 2 — "The Build"
+
+> Docker image builds with PySpark 4.0, Java 21, and tablespec.
+
+**NARRATOR:** We're building a Docker image with PySpark 4.0 and
+tablespec installed. This mirrors what you'd have on Databricks — same
+Spark, same library, same behavior.
+
+---
+
+## SCENE 3 — "The Demo"
+
+The demo runs in Docker. Each section appears sequentially:
+
+### ACT 1: Load & Inspect (Section 1)
+
+> Output shows table name, description, columns, nullable config.
+
+**NARRATOR:** We load the UMF YAML into a Pydantic model. Every field is
+type-checked. The nullable config tells us which columns are required in
+which LOB — claim_id is required everywhere, but claim_amount can be null.
+
+---
+
+### ACT 2: Schema Generation (Section 2)
+
+> SQL DDL, PySpark StructType, and JSON Schema appear in sequence.
+
+**NARRATOR:** From one UMF file, we generate three schema formats. SQL DDL
+for data warehouses. PySpark StructType for Spark jobs. JSON Schema for API
+validation. One source, many targets.
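To make "one source, many targets" concrete, here is a minimal sketch that renders SQL DDL from the same column metadata. It does not use tablespec's generators; `sql_type` and `create_table_ddl` are hypothetical names, and the output dialect is generic SQL chosen for illustration.

```python
def sql_type(col):
    """Render one column's SQL type from UMF-style metadata."""
    data_type = col["data_type"].upper()
    if data_type == "VARCHAR" and "length" in col:
        return f"VARCHAR({col['length']})"
    if data_type == "DECIMAL" and "precision" in col:
        return f"DECIMAL({col['precision']},{col.get('scale', 0)})"
    return data_type


def create_table_ddl(table_name, columns):
    """Build a CREATE TABLE statement, one column definition per line."""
    body = ",\n".join(f"  {col['name']} {sql_type(col)}" for col in columns)
    return f"CREATE TABLE {table_name} (\n{body}\n);"


columns = [
    {"name": "claim_id", "data_type": "VARCHAR", "length": 50},
    {"name": "claim_amount", "data_type": "DECIMAL", "precision": 10, "scale": 2},
    {"name": "provider_id", "data_type": "VARCHAR", "length": 20},
]
print(create_table_ddl("Medical_Claims", columns))
```

The same loop over `columns` could emit a PySpark StructType or a JSON Schema instead; only the per-type rendering function changes.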
+
+---
+
+### ACT 3: Type Mappings (Section 3)
+
+> A table shows VARCHAR -> StringType -> string -> StringType across systems.
+
+**NARRATOR:** The type mapping engine converts between UMF, PySpark, JSON
+Schema, and Great Expectations. VARCHAR becomes StringType in Spark, string
+in JSON, StringType in GX. DECIMAL stays DECIMAL with precision preserved.
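A type mapping like the one narrated in Act 3 can be pictured as a lookup table. In this sketch the VARCHAR row follows the narration; the DECIMAL row (`DecimalType` for Spark and GX, `number` for JSON Schema) is an assumption, and `TYPE_MAP` and `map_type` are hypothetical names rather than tablespec's API.

```python
# UMF type -> target system -> type name.
TYPE_MAP = {
    "VARCHAR": {"pyspark": "StringType", "json_schema": "string", "gx": "StringType"},
    # Assumed row: the script only says DECIMAL keeps its precision.
    "DECIMAL": {"pyspark": "DecimalType", "json_schema": "number", "gx": "DecimalType"},
}


def map_type(umf_type, target):
    """Look up the target-system type name for a UMF type (case-insensitive)."""
    try:
        return TYPE_MAP[umf_type.upper()][target]
    except KeyError as exc:
        raise ValueError(f"no mapping for {umf_type!r} -> {target!r}") from exc


print(map_type("VARCHAR", "json_schema"))  # string
```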
+
+---
+
+### ACT 4: Domain Type Inference (Section 4)
+
+> Column names are matched to domain types with confidence scores.
+
+**NARRATOR:** tablespec ships with 42 domain types. Feed it a column name
+like "provider_npi" and it recognizes it as an NPI — National Provider
+Identifier — with 100% confidence. It even knows the validation rule:
+a 10-digit regex. state_code maps to US state codes. member_email maps to
+email. All automatic.
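Name-based domain inference, as described in Act 4, can be approximated with a list of name patterns. This is a rough sketch under stated assumptions: the three rules, the rule names, the fixed confidence of 1.0, and `infer_domain_type` itself are all hypothetical, and tablespec's actual registry and scoring may work differently.

```python
import re

# Hypothetical rules: (domain type, column-name pattern, value-validation regex).
DOMAIN_RULES = [
    ("npi", re.compile(r"(^|_)npi($|_)"), r"^\d{10}$"),
    ("us_state_code", re.compile(r"(^|_)state(_code)?$"), r"^[A-Z]{2}$"),
    ("email", re.compile(r"(^|_)e?mail($|_)"), r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
]


def infer_domain_type(column_name):
    """Return the first rule whose name pattern matches, or None."""
    name = column_name.lower()
    for domain_type, name_pattern, value_regex in DOMAIN_RULES:
        if name_pattern.search(name):
            # A name match is reported with full confidence in this sketch.
            return {"domain_type": domain_type, "confidence": 1.0,
                    "value_regex": value_regex}
    return None


for column in ("provider_npi", "state_code", "member_email", "claim_id"):
    print(column, "->", infer_domain_type(column))
```

The value regex returned with each match is what a downstream validator would apply to the column's data, for example the 10-digit check for NPIs.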
+
+---
+
+### ACT 5: Great Expectations Baseline (Section 5)
+
+> 13 expectations are generated with severity levels.
+
+**NARRATOR:** From the same UMF, we generate a baseline Great Expectations
+suite. 13 expectations: column existence, type validation, nullability
+constraints, length checks. Each tagged with a severity — critical for
+data integrity, warning for quality, info for structural checks. No manual
+GX authoring needed.
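The baseline generation in Act 5 can be sketched as a pass over the columns that emits one expectation per rule. The expectation type names below are real Great Expectations expectations, but everything else is an assumption: the dicts loosely follow the pre-1.0 expectation-config shape, the severity assignments are guesses based on the narration, nullability is simplified to a single boolean, and `baseline_expectations` is a hypothetical helper.

```python
def baseline_expectations(columns):
    """Build baseline expectation configs as plain dicts (not GX objects)."""
    suite = []

    def add(expectation_type, severity, **kwargs):
        suite.append({
            "expectation_type": expectation_type,
            "kwargs": kwargs,
            "meta": {"severity": severity},  # assumed tagging scheme
        })

    for col in columns:
        name = col["name"]
        add("expect_column_to_exist", "info", column=name)
        if col.get("nullable") is False:
            add("expect_column_values_to_not_be_null", "critical", column=name)
        if "length" in col:
            add("expect_column_value_lengths_to_be_between", "warning",
                column=name, max_value=col["length"])
    return suite


demo_columns = [
    {"name": "claim_id", "data_type": "VARCHAR", "length": 50, "nullable": False},
    {"name": "claim_amount", "data_type": "DECIMAL", "nullable": True},
]
for expectation in baseline_expectations(demo_columns):
    print(expectation["meta"]["severity"], expectation["expectation_type"])
```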
+
+---
+
+### ACT 6: LLM Prompt Generation (Section 6)
+
+> Prompt lengths and a preview are displayed.
+
+**NARRATOR:** tablespec generates structured prompts for LLMs. A
+documentation prompt asks an AI to analyze the table's business purpose,
+data flow, and compliance considerations. A validation prompt asks it to
+generate multi-column GX rules that go beyond what baseline can do
+automatically. The prompts include all column metadata, sample values, and
+domain context.
+
+---
+
+### ACT 7: UMF Diffing (Section 7)
+
+> Two changes detected: a new column and a modified description.
+
+**NARRATOR:** Schema evolution tracking. We modified the claims table —
+added a service_date column and updated a description. UMFDiff detects
+both changes instantly. This powers changelog generation and schema review
+workflows.
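The change detection narrated in Act 7 amounts to comparing two schema versions field by field. The sketch below works on plain dicts and is not UMFDiff; `diff_umf` and its `(kind, path)` tuples are hypothetical and only illustrate the idea.

```python
def diff_umf(old, new):
    """Return (change_kind, path) tuples describing how `new` differs from `old`."""
    changes = []
    if old.get("description") != new.get("description"):
        changes.append(("modified", "description"))

    old_cols = {c["name"]: c for c in old.get("columns", [])}
    new_cols = {c["name"]: c for c in new.get("columns", [])}

    for name in sorted(new_cols.keys() - old_cols.keys()):
        changes.append(("added", f"columns.{name}"))
    for name in sorted(old_cols.keys() - new_cols.keys()):
        changes.append(("removed", f"columns.{name}"))
    for name in sorted(old_cols.keys() & new_cols.keys()):
        if old_cols[name] != new_cols[name]:
            changes.append(("modified", f"columns.{name}"))
    return changes


old = {
    "description": "Healthcare claims and billing information",
    "columns": [{"name": "claim_id", "data_type": "VARCHAR", "length": 50}],
}
new = {
    "description": "Healthcare claims, billing, and service dates",
    "columns": [
        {"name": "claim_id", "data_type": "VARCHAR", "length": 50},
        {"name": "service_date", "data_type": "DATE"},
    ],
}
print(diff_umf(old, new))
```

A list of structured changes like this is the kind of input a changelog generator or a schema review step could consume.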
+
+---
+
+### ACT 8: Spark Session (Section 8)
+
+> Spark 4.0.1 session is created. A DataFrame with 5 claims is displayed.
+
+**NARRATOR:** Now we enter PySpark territory. We create a local Spark
+session — the same factory function works on Databricks, it auto-detects
+the environment. We create a sample DataFrame with five claims, including
+one with a NULL amount.
+
+---
+
+### ACT 9: Profiling (Section 9)
+
+> SparkToUmfMapper infers column types from the DataFrame.
+
+**NARRATOR:** SparkToUmfMapper goes the other direction — from a Spark
+DataFrame back to UMF. It infers column names, types, and nullability.
+Useful for onboarding existing tables that don't have a UMF spec yet.
+
+---
+
+### ACT 10: Validation (Section 10)
+
+> One validation error: claim_amount has the wrong data type.
+
+**NARRATOR:** TableValidator checks the DataFrame against the UMF spec.
+It found one issue — claim_amount is a double in Spark but DECIMAL in the
+spec. This is exactly the kind of type drift that causes silent data
+corruption in pipelines. The validator returns a structured error DataFrame
+you can write to a monitoring table.
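The type-drift check in Act 10 can be illustrated without Spark by comparing declared types against `(name, type)` pairs, which is the shape PySpark's `DataFrame.dtypes` returns. This is a sketch, not TableValidator: `expected_spark_dtype` and `find_type_drift` are hypothetical, only VARCHAR and DECIMAL are handled, and the result is a list of dicts rather than an error DataFrame.

```python
def expected_spark_dtype(col):
    """Map a UMF-style column to the dtype string Spark would report for it."""
    data_type = col["data_type"].upper()
    if data_type == "VARCHAR":
        return "string"
    if data_type == "DECIMAL":
        return f"decimal({col['precision']},{col['scale']})"
    raise ValueError(f"unsupported type in this sketch: {data_type}")


def find_type_drift(columns, observed_dtypes):
    """Compare declared columns against observed (name, dtype) pairs."""
    observed = dict(observed_dtypes)
    errors = []
    for col in columns:
        name, expected = col["name"], expected_spark_dtype(col)
        if name not in observed:
            errors.append({"column": name, "error": "missing_column",
                           "expected": expected, "actual": None})
        elif observed[name] != expected:
            errors.append({"column": name, "error": "type_mismatch",
                           "expected": expected, "actual": observed[name]})
    return errors


spec = [
    {"name": "claim_id", "data_type": "VARCHAR", "length": 50},
    {"name": "claim_amount", "data_type": "DECIMAL", "precision": 10, "scale": 2},
    {"name": "provider_id", "data_type": "VARCHAR", "length": 20},
]
# Stand-in for df.dtypes on the demo DataFrame.
dtypes = [("claim_id", "string"), ("claim_amount", "double"), ("provider_id", "string")]
print(find_type_drift(spec, dtypes))
```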
+
+---
+
+### ACT 11: Sample Data Generation (Section 11)
+
+> Split-format UMF is prepared. 100 rows of claims and providers are generated.
+
+**NARRATOR:** Finally, sample data generation. We save the UMF specs in
+split format — the git-friendly directory structure — and generate 100 rows
+for each table. The generator respects column types, nullable rules, and
+produces realistic values. Provider NPIs are 10 digits. State codes are
+real states. Foreign keys are coherent across tables.
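The properties claimed in Act 11 (typed values, nullable amounts, coherent foreign keys) can be shown with a seeded generator. This sketch is not tablespec's generator: the function names, the short state list, and the 10% null rate are assumptions, and the NPIs here are just 10 random digits without the check digit that real NPIs carry.

```python
import random
from decimal import Decimal

STATES = ["CA", "NY", "TX", "WA"]  # small illustrative subset


def generate_providers(n, rng):
    """Generate provider rows with 10-digit NPIs and real state codes."""
    return [
        {
            "provider_id": f"P{i:04d}",
            "provider_npi": "".join(rng.choice("0123456789") for _ in range(10)),
            "state_code": rng.choice(STATES),
        }
        for i in range(n)
    ]


def generate_claims(n, providers, rng, null_rate=0.1):
    """Generate claim rows whose provider_id always refers to an existing provider."""
    claims = []
    for i in range(n):
        if rng.random() < null_rate:
            amount = None  # claim_amount is nullable in the spec
        else:
            amount = Decimal(rng.randint(0, 999_999)) / Decimal(100)
        claims.append({
            "claim_id": f"C{i:06d}",
            "claim_amount": amount,
            "provider_id": rng.choice(providers)["provider_id"],
        })
    return claims


rng = random.Random(42)  # fixed seed so reruns produce the same rows
providers = generate_providers(100, rng)
claims = generate_claims(100, providers, rng)
print(len(providers), len(claims))  # 100 100
```

Generating the parent table first and sampling its keys for the child table is what keeps the foreign keys coherent.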
+
+---
+
+## CLOSING
+
+> "Demo complete!" banner appears.
+
+**NARRATOR:** That's tablespec. One YAML schema drives SQL generation,
+Spark schemas, Great Expectations, domain inference, validation, profiling,
+LLM prompts, and sample data. Define once, use everywhere.
+
+---
+
+## Production Notes
+
+**To record the screencast:**
+
+```bash
+# Build the Docker image first (one-time)
+docker build -t tablespec-demo -f examples/Dockerfile.demo .
+
+# Record with VHS
+vhs examples/demo.tape
+```
+
+**Outputs:**
+- `examples/demo.gif` — animated GIF for README / docs
+- `examples/demo.mp4` — video for presentations
+
+**To customize:**
+- Edit `examples/demo.tape` for timing, theme, font
+- Edit `examples/demo.py` to add/remove sections
+- The VHS tape runs the demo inside Docker for reproducibility