| 1 | +# tablespec Screencast Script |
| 2 | + |
| 3 | +**Runtime:** ~3 minutes |
| 4 | +**Tools:** Docker, VHS (charmbracelet/vhs) |
| 5 | +**Record:** `vhs examples/demo.tape` |
| 6 | + |
| 7 | +--- |
| 8 | + |
| 9 | +## COLD OPEN |
| 10 | + |
| 11 | +> Terminal appears with a dark theme. Two comment lines fade in: |
| 12 | +
|
| 13 | +``` |
| 14 | +# tablespec — Universal Metadata Format for table schemas |
| 15 | +# A complete walkthrough: schema loading, generation, validation, and PySpark |
| 16 | +``` |
| 17 | + |
| 18 | +**NARRATOR:** tablespec is a Python library for defining, validating, and |
| 19 | +generating table schemas using a single YAML-based format called UMF — |
| 20 | +Universal Metadata Format. Let's see what it can do. |
| 21 | + |
| 22 | +--- |
| 23 | + |
| 24 | +## SCENE 1 — "The Schema" |
| 25 | + |
| 26 | +> The example UMF YAML file is displayed on screen. |
| 27 | +
|
| 28 | +```yaml |
| 29 | +version: "1.0" |
| 30 | +table_name: Medical_Claims |
| 31 | +description: Healthcare claims and billing information |
| 32 | +columns: |
| 33 | + - name: claim_id |
| 34 | + data_type: VARCHAR |
| 35 | + length: 50 |
| 36 | + nullable: |
| 37 | + MD: false # Medicaid |
| 38 | + MP: false # Medicare Part D |
| 39 | + ME: false # Medicare |
| 40 | + - name: claim_amount |
| 41 | + data_type: DECIMAL |
| 42 | + precision: 10 |
| 43 | + scale: 2 |
| 44 | + nullable: |
| 45 | + MD: true |
| 46 | + MP: true |
| 47 | + ME: true |
| 48 | + - name: provider_id |
| 49 | + data_type: VARCHAR |
| 50 | + length: 20 |
| 51 | +``` |
| 52 | +
|
| 53 | +**NARRATOR:** This is a UMF schema for a healthcare claims table. Three |
| 54 | +columns, each with a data type and a length or precision. Two of them also |
| 55 | +set nullability per Line of Business: Medicaid, Medicare Part D, and |
| 56 | +Medicare. One YAML file is the single source of truth for everything downstream. |
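
For readers following along off-screen, here is a minimal sketch of how the per-LOB nullable map in the schema can be read. The dictionary layout and the `required_columns` helper are hypothetical illustrations, not tablespec's model or API, and treating a column with no nullable map as nullable is an assumption.

```python
# Hypothetical in-memory form of the schema shown above; not tablespec's model.
SPEC = {
    "table_name": "Medical_Claims",
    "columns": [
        {"name": "claim_id", "data_type": "VARCHAR", "length": 50,
         "nullable": {"MD": False, "MP": False, "ME": False}},
        {"name": "claim_amount", "data_type": "DECIMAL", "precision": 10, "scale": 2,
         "nullable": {"MD": True, "MP": True, "ME": True}},
        {"name": "provider_id", "data_type": "VARCHAR", "length": 20},
    ],
}


def required_columns(spec, lob):
    """Return names of columns explicitly marked non-nullable for one LOB."""
    return [
        col["name"]
        for col in spec["columns"]
        # Assumption: a column without a nullable map is treated as nullable.
        if col.get("nullable", {}).get(lob, True) is False
    ]


print(required_columns(SPEC, "MD"))
```
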
| 57 | +
|
| 58 | +--- |
| 59 | +
|
| 60 | +## SCENE 2 — "The Build" |
| 61 | +
|
| 62 | +> Docker image builds with PySpark 4.0, Java 21, and tablespec. |
| 63 | +
|
| 64 | +**NARRATOR:** We're building a Docker image with PySpark 4.0 and |
| 65 | +tablespec installed. This mirrors what you'd have on Databricks — same |
| 66 | +Spark, same library, same behavior. |
| 67 | +
|
| 68 | +--- |
| 69 | +
|
| 70 | +## SCENE 3 — "The Demo" |
| 71 | +
|
| 72 | +The demo runs in Docker. Each section appears sequentially: |
| 73 | +
|
| 74 | +### ACT 1: Load & Inspect (Section 1) |
| 75 | +
|
| 76 | +> Output shows table name, description, columns, nullable config. |
| 77 | +
|
| 78 | +**NARRATOR:** We load the UMF YAML into a Pydantic model. Every field is |
| 79 | +type-checked. The nullable config tells us which columns are required in |
| 80 | +which LOB — claim_id is required everywhere, but claim_amount can be null. |
| 81 | +
|
| 82 | +--- |
| 83 | +
|
| 84 | +### ACT 2: Schema Generation (Section 2) |
| 85 | +
|
| 86 | +> SQL DDL, PySpark StructType, and JSON Schema appear in sequence. |
| 87 | +
|
| 88 | +**NARRATOR:** From one UMF file, we generate three schema formats. SQL DDL |
| 89 | +for data warehouses. PySpark StructType for Spark jobs. JSON Schema for API |
| 90 | +validation. One source, many targets. |
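
The "one source, many targets" idea can be sketched for the SQL DDL case as follows. This is an illustration under stated assumptions, not tablespec's generator: `sql_type` and `to_ddl` are hypothetical helpers, and nullability is left out because the schema defines it per LOB.

```python
# Columns from the example schema, as plain dictionaries (hypothetical layout).
COLUMNS = [
    {"name": "claim_id", "data_type": "VARCHAR", "length": 50},
    {"name": "claim_amount", "data_type": "DECIMAL", "precision": 10, "scale": 2},
    {"name": "provider_id", "data_type": "VARCHAR", "length": 20},
]


def sql_type(col):
    """Render one column's SQL type, including length or precision/scale."""
    if col["data_type"] == "VARCHAR":
        return f"VARCHAR({col['length']})"
    if col["data_type"] == "DECIMAL":
        return f"DECIMAL({col['precision']},{col['scale']})"
    return col["data_type"]


def to_ddl(table_name, columns):
    """Build a CREATE TABLE statement from the column dictionaries."""
    lines = [f"  {c['name']} {sql_type(c)}" for c in columns]
    return f"CREATE TABLE {table_name} (\n" + ",\n".join(lines) + "\n);"


print(to_ddl("Medical_Claims", COLUMNS))
```
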
| 91 | +
|
| 92 | +--- |
| 93 | +
|
| 94 | +### ACT 3: Type Mappings (Section 3) |
| 95 | +
|
| 96 | +> A table shows VARCHAR -> StringType -> string -> StringType across systems. |
| 97 | +
|
| 98 | +**NARRATOR:** The type mapping engine converts between UMF, PySpark, JSON |
| 99 | +Schema, and Great Expectations. VARCHAR becomes StringType in Spark, string |
| 100 | +in JSON, StringType in GX. DECIMAL stays DECIMAL with precision preserved. |
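
A type mapping of this kind is essentially a lookup table keyed by source type and target system. The sketch below is hypothetical and is not tablespec's mapping engine: the VARCHAR row follows the narration, while the DECIMAL row and the target keys (`pyspark`, `json_schema`, `gx`) are assumptions made for illustration.

```python
# Per-target type names for each UMF type.
# The VARCHAR row follows the narration; the DECIMAL row is an assumption.
TYPE_MAP = {
    "VARCHAR": {"pyspark": "StringType", "json_schema": "string", "gx": "StringType"},
    "DECIMAL": {"pyspark": "DecimalType", "json_schema": "number", "gx": "DecimalType"},
}


def map_type(umf_type, target):
    """Look up the target-system name for a UMF type, failing loudly if unknown."""
    try:
        return TYPE_MAP[umf_type.upper()][target]
    except KeyError:
        raise ValueError(f"no mapping for {umf_type!r} -> {target!r}") from None


print(map_type("VARCHAR", "pyspark"))
```
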
| 101 | +
|
| 102 | +--- |
| 103 | +
|
| 104 | +### ACT 4: Domain Type Inference (Section 4) |
| 105 | +
|
| 106 | +> Column names are matched to domain types with confidence scores. |
| 107 | +
|
| 108 | +**NARRATOR:** tablespec ships with 42 domain types. Feed it a column name |
| 109 | +like "provider_npi" and it recognizes it as an NPI — National Provider |
| 110 | +Identifier — with 100% confidence. It even knows the validation rule: |
| 111 | +a 10-digit regex. state_code maps to US state codes. member_email maps to |
| 112 | +email. All automatic. |
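
Name-based inference can be pictured as a list of column-name patterns, each paired with a domain type and a value-level validation regex. The rules below are hypothetical and far smaller than the 42 domain types mentioned in the narration; the domain names are invented for illustration and confidence scoring is omitted.

```python
import re

# Hypothetical rules: (column-name pattern, domain type, value-validation regex).
DOMAIN_RULES = [
    # NPI: checks the 10-digit shape only.
    (re.compile(r"(^|_)npi$"), "npi", r"^\d{10}$"),
    # State code: checks the two-letter shape only, not membership in a state list.
    (re.compile(r"(^|_)state(_code)?$"), "us_state_code", r"^[A-Z]{2}$"),
    # Email: a deliberately loose shape check.
    (re.compile(r"(^|_)email$"), "email", r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
]


def infer_domain(column_name):
    """Return (domain_type, value_regex) for the first matching rule, else None."""
    name = column_name.lower()
    for pattern, domain, value_regex in DOMAIN_RULES:
        if pattern.search(name):
            return domain, value_regex
    return None


for column in ("provider_npi", "state_code", "member_email", "claim_id"):
    print(column, infer_domain(column))
```
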
| 113 | +
|
| 114 | +--- |
| 115 | +
|
| 116 | +### ACT 5: Great Expectations Baseline (Section 5) |
| 117 | +
|
| 118 | +> 13 expectations are generated with severity levels. |
| 119 | +
|
| 120 | +**NARRATOR:** From the same UMF, we generate a baseline Great Expectations |
| 121 | +suite. 13 expectations: column existence, type validation, nullability |
| 122 | +constraints, length checks. Each tagged with a severity — critical for |
| 123 | +data integrity, warning for quality, info for structural checks. No manual |
| 124 | +GX authoring needed. |
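
To make the baseline idea concrete, the sketch below derives expectation configurations from column metadata as plain dictionaries. The expectation type names are standard Great Expectations names, but the function, the single-boolean `nullable` field, and the severity assignments are assumptions for illustration; this is not tablespec's generator and it does not reproduce the 13 expectations from the demo.

```python
# Simplified columns: nullable is a single boolean here (an assumption;
# the UMF example defines it per LOB).
COLUMNS = [
    {"name": "claim_id", "length": 50, "nullable": False},
    {"name": "claim_amount", "nullable": True},
    {"name": "provider_id", "length": 20},
]


def baseline_expectations(columns):
    """Build expectation configs as dicts: type name, kwargs, and a severity tag."""
    out = []
    for col in columns:
        name = col["name"]
        # Structural check: the column must exist.
        out.append({"type": "expect_column_to_exist",
                    "kwargs": {"column": name},
                    "meta": {"severity": "info"}})
        # Integrity check: only for columns explicitly marked non-nullable.
        if col.get("nullable") is False:
            out.append({"type": "expect_column_values_to_not_be_null",
                        "kwargs": {"column": name},
                        "meta": {"severity": "critical"}})
        # Quality check: values must fit the declared length.
        if "length" in col:
            out.append({"type": "expect_column_value_lengths_to_be_between",
                        "kwargs": {"column": name, "max_value": col["length"]},
                        "meta": {"severity": "warning"}})
    return out


for expectation in baseline_expectations(COLUMNS):
    print(expectation["meta"]["severity"], expectation["type"], expectation["kwargs"])
```
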
| 125 | +
|
| 126 | +--- |
| 127 | +
|
| 128 | +### ACT 6: LLM Prompt Generation (Section 6) |
| 129 | +
|
| 130 | +> Prompt lengths and a preview are displayed. |
| 131 | +
|
| 132 | +**NARRATOR:** tablespec generates structured prompts for LLMs. A |
| 133 | +documentation prompt asks an AI to analyze the table's business purpose, |
| 134 | +data flow, and compliance considerations. A validation prompt asks it to |
| 135 | +generate multi-column GX rules that go beyond what the baseline can do |
| 136 | +automatically. The prompts include all column metadata, sample values, and |
| 137 | +domain context. |
| 138 | +
|
| 139 | +--- |
| 140 | +
|
| 141 | +### ACT 7: UMF Diffing (Section 7) |
| 142 | +
|
| 143 | +> Two changes detected: a new column and a modified description. |
| 144 | +
|
| 145 | +**NARRATOR:** Schema evolution tracking. We modified the claims table — |
| 146 | +added a service_date column and updated a description. UMFDiff detects |
| 147 | +both changes instantly. This powers changelog generation and schema review |
| 148 | +workflows. |
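
The kind of comparison described here can be sketched with plain dictionaries. `diff_specs` is a hypothetical helper, not the UMFDiff API, and the example assumes the updated description is the table-level one.

```python
def diff_specs(old, new):
    """Return a list of (change_kind, path) tuples between two spec dictionaries."""
    changes = []
    if old.get("description") != new.get("description"):
        changes.append(("modified", "description"))

    old_cols = {c["name"]: c for c in old["columns"]}
    new_cols = {c["name"]: c for c in new["columns"]}

    # Sort names so the result is stable regardless of set ordering.
    for name in sorted(new_cols.keys() - old_cols.keys()):
        changes.append(("added", f"columns.{name}"))
    for name in sorted(old_cols.keys() - new_cols.keys()):
        changes.append(("removed", f"columns.{name}"))
    for name in sorted(old_cols.keys() & new_cols.keys()):
        if old_cols[name] != new_cols[name]:
            changes.append(("modified", f"columns.{name}"))
    return changes


OLD = {
    "description": "Healthcare claims and billing information",
    "columns": [{"name": "claim_id", "data_type": "VARCHAR", "length": 50}],
}
NEW = {
    "description": "Healthcare claims, billing, and service dates",
    "columns": [
        {"name": "claim_id", "data_type": "VARCHAR", "length": 50},
        {"name": "service_date", "data_type": "DATE"},
    ],
}

print(diff_specs(OLD, NEW))
```
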
| 149 | +
|
| 150 | +--- |
| 151 | +
|
| 152 | +### ACT 8: Spark Session (Section 8) |
| 153 | +
|
| 154 | +> Spark 4.0.1 session is created. A DataFrame with 5 claims is displayed. |
| 155 | +
|
| 156 | +**NARRATOR:** Now we enter PySpark territory. We create a local Spark |
| 157 | +session. The same factory function also works on Databricks, where it auto-detects |
| 158 | +the environment. We create a sample DataFrame with five claims, including |
| 159 | +one with a NULL amount. |
| 160 | +
|
| 161 | +--- |
| 162 | +
|
| 163 | +### ACT 9: Profiling (Section 9) |
| 164 | +
|
| 165 | +> SparkToUmfMapper infers column types from the DataFrame. |
| 166 | +
|
| 167 | +**NARRATOR:** SparkToUmfMapper goes the other direction — from a Spark |
| 168 | +DataFrame back to UMF. It infers column names, types, and nullability. |
| 169 | +Useful for onboarding existing tables that don't have a UMF spec yet. |
| 170 | +
|
| 171 | +--- |
| 172 | +
|
| 173 | +### ACT 10: Validation (Section 10) |
| 174 | +
|
| 175 | +> One validation error: claim_amount has the wrong data type. |
| 176 | +
|
| 177 | +**NARRATOR:** TableValidator checks the DataFrame against the UMF spec. |
| 178 | +It found one issue — claim_amount is a double in Spark but DECIMAL in the |
| 179 | +spec. This is exactly the kind of type drift that causes silent data |
| 180 | +corruption in pipelines. The validator returns a structured error DataFrame |
| 181 | +you can write to a monitoring table. |
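
The type drift in this act can be illustrated without Spark by comparing type strings in the form PySpark reports them (for example via `dict(df.dtypes)`, where a `DecimalType(10, 2)` column appears as `decimal(10,2)`). The helper and the error record layout below are hypothetical; this is a sketch of the check, not TableValidator itself.

```python
# Expected types derived from the spec, and actual types as Spark would report them.
EXPECTED = {"claim_id": "string", "claim_amount": "decimal(10,2)", "provider_id": "string"}
ACTUAL = {"claim_id": "string", "claim_amount": "double", "provider_id": "string"}


def type_errors(expected, actual):
    """Return one structured error record per missing or mistyped column."""
    errors = []
    for column, want in expected.items():
        got = actual.get(column)
        if got is None:
            errors.append({"column": column, "error": "missing_column",
                           "expected": want, "actual": None})
        elif got != want:
            errors.append({"column": column, "error": "type_mismatch",
                           "expected": want, "actual": got})
    return errors


for record in type_errors(EXPECTED, ACTUAL):
    print(record)
```

Records shaped like this are easy to turn into rows of an error table, which matches the monitoring use described in the narration.
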
| 182 | +
|
| 183 | +--- |
| 184 | +
|
| 185 | +### ACT 11: Sample Data Generation (Section 11) |
| 186 | +
|
| 187 | +> Split-format UMF is prepared. 100 rows of claims and providers are generated. |
| 188 | +
|
| 189 | +**NARRATOR:** Finally, sample data generation. We save the UMF specs in |
| 190 | +split format — the git-friendly directory structure — and generate 100 rows |
| 191 | +for each table. The generator respects column types and nullable rules, and |
| 192 | +produces realistic values. Provider NPIs are 10 digits. State codes are |
| 193 | +real states. Foreign keys are coherent across tables. |
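
Foreign-key coherence comes down to generating the parent table first and drawing child keys from it. The sketch below is a hypothetical stand-in for the generator, using only the standard library; the column names, ID formats, and the 10% null rate are assumptions, and the NPIs here only satisfy the 10-digit shape.

```python
import random


def generate(n_providers, n_claims, seed=0):
    """Generate providers first, then claims whose provider_id references them."""
    rng = random.Random(seed)  # seeded so repeated runs give the same rows
    providers = [
        {
            "provider_id": f"P{i:04d}",
            # 10 random digits: matches the NPI shape only.
            "provider_npi": "".join(rng.choice("0123456789") for _ in range(10)),
        }
        for i in range(n_providers)
    ]
    claims = [
        {
            "claim_id": f"C{i:06d}",
            # Draw the foreign key from existing providers to keep tables coherent.
            "provider_id": rng.choice(providers)["provider_id"],
            # claim_amount is nullable in the schema, so emit an occasional None.
            "claim_amount": None if rng.random() < 0.1 else round(rng.uniform(10, 5000), 2),
        }
        for i in range(n_claims)
    ]
    return providers, claims


providers, claims = generate(n_providers=100, n_claims=100)
print(providers[0])
print(claims[0])
```
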
| 194 | +
|
| 195 | +--- |
| 196 | +
|
| 197 | +## CLOSING |
| 198 | +
|
| 199 | +> "Demo complete!" banner appears. |
| 200 | +
|
| 201 | +**NARRATOR:** That's tablespec. One YAML schema drives SQL generation, |
| 202 | +Spark schemas, Great Expectations, domain inference, validation, profiling, |
| 203 | +LLM prompts, and sample data. Define once, use everywhere. |
| 204 | +
|
| 205 | +--- |
| 206 | +
|
| 207 | +## Production Notes |
| 208 | +
|
| 209 | +**To record the screencast:** |
| 210 | +
|
| 211 | +```bash |
| 212 | +# Build the Docker image first (one-time) |
| 213 | +docker build -t tablespec-demo -f examples/Dockerfile.demo . |
| 214 | + |
| 215 | +# Record with VHS |
| 216 | +vhs examples/demo.tape |
| 217 | +``` |
| 218 | + |
| 219 | +**Outputs:** |
| 220 | +- `examples/demo.gif` — animated GIF for README / docs |
| 221 | +- `examples/demo.mp4` — video for presentations |
| 222 | + |
| 223 | +**To customize:** |
| 224 | +- Edit `examples/demo.tape` for timing, theme, font |
| 225 | +- Edit `examples/demo.py` to add/remove sections |
| 226 | +- The VHS tape runs the demo inside Docker for reproducibility |