Compress screencast idle time and split Spark scene into sub-scenes

easel · claude · easel · commit e24b4bd56991 · 2026-03-19T00:48:14.000-04:00
Add sentinel markers in scene.py so screencast.sh can detect Spark
sub-scene boundaries (profile, validate, sample). Replace monolithic
Spark narration with 4 focused clips. Add cast file compression step
in narrate.sh to cap idle gaps at 2s and realign timing log, fixing
audio/video desync.
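
For reference, the idle-gap capping described above can be sketched on a toy
list of timestamps. This is a minimal illustration, not the code in
narrate.sh: the names compress_idle and realign and the sample values are
hypothetical. Note that narrate.sh re-runs the capping on the timing-log
offsets by themselves, whereas this sketch assumes a variant that maps each
offset through the shifts computed from the cast events.

```python
from bisect import bisect_right

MAX_IDLE = 2.0  # seconds; the cap named in this commit


def compress_idle(timestamps, max_idle=MAX_IDLE):
    """Cap every gap between consecutive timestamps at max_idle seconds.

    Returns (compressed, shifts); shifts[i] is the total time removed
    up to and including event i.
    """
    compressed, shifts = [], []
    shift = 0.0
    prev = 0.0
    for ts in timestamps:
        gap = ts - prev
        if gap > max_idle:
            shift += gap - max_idle
        compressed.append(round(ts - shift, 6))
        shifts.append(shift)
        prev = ts
    return compressed, shifts


def realign(offset, timestamps, compressed, shifts):
    """Map an original-time offset onto the compressed timeline.

    Applies the shift of the last event at or before the offset, then
    clamps so the result never lands after the next compressed event.
    """
    i = bisect_right(timestamps, offset) - 1
    mapped = offset - (shifts[i] if i >= 0 else 0.0)
    if i + 1 < len(compressed):
        mapped = min(mapped, compressed[i + 1])
    return round(mapped, 6)


# Hypothetical event times with a 5s gap and a 10s gap.
events = [0.5, 1.0, 6.0, 6.5, 16.5]
compressed, shifts = compress_idle(events)
print(compressed)                                  # [0.5, 1.0, 3.0, 3.5, 5.5]
print(realign(6.2, events, compressed, shifts))    # 3.2
```

Mapping markers through the cast's own shift table keeps a scene marker at the
same distance from its neighbouring events after compression, even when the
long pauses do not line up with the marker boundaries.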

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
diff --git a/examples/SCREENCAST_SCRIPT.md b/examples/SCREENCAST_SCRIPT.md
@@ -149,48 +149,50 @@ workflows.
 
 ---
 
-### ACT 8: Spark Session (Section 8)
+### ACT 8: CLI Authoring Commands (Scene 13)
+
+> CLI commands add a column, modify its type, rename with alias, set a domain type, and remove a validation expectation.
+
+**NARRATOR:** CLI commands for schema authoring. Add a column, modify its
+type, rename it with alias preservation. Assign domain types from the
+built-in registry. And manage validation expectations — all without
+touching YAML directly.
+
+---
+
+### ACT 9a: Spark Session (Section 14)
 
 > Spark 4.0.1 session is created. A DataFrame with 5 claims is displayed.
 
-**NARRATOR:** Now we enter PySpark territory. We create a local Spark
-session — the same factory function works on Databricks, it auto-detects
-the environment. We create a sample DataFrame with five claims, including
-one with a NULL amount.
+**NARRATOR:** Now we enter PySpark territory. Creating a Spark session
+and a sample DataFrame with five claims.
 
 ---
 
-### ACT 9: Profiling (Section 9)
+### ACT 9b: Profiling (Section 9)
 
 > SparkToUmfMapper infers column types from the DataFrame.
 
-**NARRATOR:** SparkToUmfMapper goes the other direction — from a Spark
-DataFrame back to UMF. It infers column names, types, and nullability.
-Useful for onboarding existing tables that don't have a UMF spec yet.
+**NARRATOR:** SparkToUmfMapper infers a UMF schema from the DataFrame.
+Column names, types, and nullability, all detected automatically.
 
 ---
 
-### ACT 10: Validation (Section 10)
+### ACT 9c: Validation (Section 10)
 
 > One validation error: claim_amount has the wrong data type.
 
 **NARRATOR:** TableValidator checks the DataFrame against the UMF spec.
-It found one issue — claim_amount is a double in Spark but DECIMAL in the
-spec. This is exactly the kind of type drift that causes silent data
-corruption in pipelines. The validator returns a structured error DataFrame
-you can write to a monitoring table.
+It catches type drift that causes silent data corruption.
 
 ---
 
-### ACT 11: Sample Data Generation (Section 11)
+### ACT 9d: Sample Data Generation (Section 11)
 
-> Split-format UMF is prepared. 100 rows of claims and providers are generated.
+> 100 rows of claims and providers are generated from UMF specs.
 
-**NARRATOR:** Finally, sample data generation. We save the UMF specs in
-split format — the git-friendly directory structure — and generate 100 rows
-for each table. The generator respects column types, nullable rules, and
-produces realistic values. Provider NPIs are 10 digits. State codes are
-real states. Foreign keys are coherent across tables.
+**NARRATOR:** Sample data generation from UMF specs. 100 rows per table,
+respecting types, nullable rules, and domain constraints.
 
 ---
 
diff --git a/examples/narrate.sh b/examples/narrate.sh
@@ -42,7 +42,14 @@ gen_clip "gx"       "Generate a full Great Expectations suite deterministically
 gen_clip "prompts"  "Generate structured prompts for LLMs. Documentation prompts. Validation rule prompts. All the column metadata and domain context is included automatically."
 gen_clip "diff"     "Schema evolution tracking. Modify a table and see exactly what changed. Added columns. Modified descriptions."
 gen_clip "sql_plan" "Generate full SQL execution plans from UMF metadata. Joins, column derivations, survivorship logic, aggregations. All computed automatically from the schema relationships."
-gen_clip "spark"    "Now the PySpark features. Starting a Spark session. Creating DataFrames. Profiling schemas. Validating data against UMF specs. And generating sample data. All from the same UMF metadata."
+gen_clip "context"  "Context-aware validation. Different nullable rules for each Line of Business, all from one YAML file."
+gen_clip "compat"   "Schema evolution safety. Check backward compatibility before deploying changes."
+gen_clip "excel"    "Export to Excel for domain experts. Import their edits back with no data loss."
+gen_clip "cli"      "CLI commands for schema authoring. Add a column, modify its type, rename it with alias preservation. Assign domain types from the built-in registry. And manage validation expectations — all without touching YAML directly."
+gen_clip "spark_session"   "Now we enter PySpark territory. Creating a Spark session and a sample DataFrame with five claims."
+gen_clip "spark_profile"   "SparkToUmfMapper infers a UMF schema from the DataFrame. Column names, types, and nullability, all detected automatically."
+gen_clip "spark_validate"  "TableValidator checks the DataFrame against the UMF spec. It catches type drift that causes silent data corruption."
+gen_clip "spark_sample"    "Sample data generation from UMF specs. 100 rows per table, respecting types, nullable rules, and domain constraints."
 gen_clip "close"    "That's tablespec. Define once. Use everywhere."
 
 echo
@@ -61,7 +68,14 @@ CLIP_gx=${CLIP_DUR[gx]}
 CLIP_prompts=${CLIP_DUR[prompts]}
 CLIP_diff=${CLIP_DUR[diff]}
 CLIP_sql_plan=${CLIP_DUR[sql_plan]}
-CLIP_spark=${CLIP_DUR[spark]}
+CLIP_context=${CLIP_DUR[context]}
+CLIP_compat=${CLIP_DUR[compat]}
+CLIP_excel=${CLIP_DUR[excel]}
+CLIP_cli=${CLIP_DUR[cli]}
+CLIP_spark_session=${CLIP_DUR[spark_session]}
+CLIP_spark_profile=${CLIP_DUR[spark_profile]}
+CLIP_spark_validate=${CLIP_DUR[spark_validate]}
+CLIP_spark_sample=${CLIP_DUR[spark_sample]}
 CLIP_close=${CLIP_DUR[close]}
 EOF
 
@@ -91,14 +105,89 @@ print(f'{last_ts:.0f}')
 echo "  Recorded: ${CAST}  (${CAST_DUR}s)"
 echo
 
+# ─── Step 2.5: Compress idle gaps in cast file ──────────────────
+
+echo "Step 2.5: Compressing idle gaps (max 2s)..."
+
+python3 -c "
+import json
+MAX_IDLE = 2.0
+CAST = '$CAST'
+with open(CAST) as f:
+    lines = f.readlines()
+header = lines[0]
+events = [json.loads(l) for l in lines[1:]]
+shift = 0
+prev_orig = 0
+for evt in events:
+    orig_ts = evt[0]
+    gap = orig_ts - prev_orig
+    if gap > MAX_IDLE:
+        shift += gap - MAX_IDLE
+    evt[0] = round(orig_ts - shift, 6)
+    prev_orig = orig_ts
+with open(CAST, 'w') as f:
+    f.write(header)
+    for evt in events:
+        f.write(json.dumps(evt) + '\n')
+print(f'  Compressed: removed {shift:.1f}s of idle time')
+last_ts = events[-1][0] if events else 0
+print(f'  New duration: {last_ts:.0f}s')
+"
+
+# Apply same compression to timing log
+python3 -c "
+import json
+MAX_IDLE = 2.0
+CAST = '$CAST'
+TIMING_LOG = '/tmp/screencast_timing.log'
+
+# Rebuild the same shift table from the cast
+with open(CAST) as f:
+    lines = f.readlines()
+# The cast is already compressed, but we need the original timestamps
+# to compute shifts. Re-read to get compressed timestamps.
+# Instead, compress the timing log offsets using the same algorithm.
+
+with open(TIMING_LOG) as f:
+    tlines = f.readlines()
+start_epoch = tlines[0].strip()
+
+# Collect all offsets
+entries = []
+for line in tlines[1:]:
+    parts = line.strip().split()
+    if len(parts) == 2:
+        entries.append((parts[0], float(parts[1])))
+
+# Apply same gap compression
+shift = 0
+prev_orig = 0
+compressed_entries = []
+for name, offset in entries:
+    gap = offset - prev_orig
+    if gap > MAX_IDLE:
+        shift += gap - MAX_IDLE
+    compressed_entries.append((name, round(offset - shift, 2)))
+    prev_orig = offset
+
+# Write back
+with open(TIMING_LOG, 'w') as f:
+    f.write(start_epoch + '\n')
+    for name, offset in compressed_entries:
+        f.write(f'{name} {offset}\n')
+print('  Timing log compressed to match')
+"
+echo
+
 # ─── Step 4: Convert to GIF and MP4 ─────────────────────────────
 
 echo "Step 3: Converting to GIF and MP4..."
 
 GIF="examples/tablespec-demo.gif"
 MP4="examples/tablespec-demo.mp4"
 
-agg --font-family "JetBrains Mono" \
+agg --font-family "JetBrainsMono Nerd Font" \
     --font-size 16 \
     --theme asciinema \
     "$CAST" "$GIF" 2>/dev/null
diff --git a/examples/scene.py b/examples/scene.py
@@ -12,6 +12,9 @@
   gx        Great Expectations baseline
   prompts   LLM prompt generation
   diff      UMF diffing & change detection
+  context   Context-aware nullable expectations
+  compat    Compatibility checking
+  excel     Excel round-trip
   spark     PySpark sections 8-11 (session, profile, validate, sample data)
 """
 
@@ -191,6 +194,82 @@ def scene_diff():
         print(f"  {c.description()}")
 
 
+def scene_context():
+    from tablespec import BaselineExpectationGenerator
+
+    umf_with_context = {
+        "table_name": "Enrollments",
+        "context_column": "LOB",
+        "columns": [
+            {"name": "member_id", "data_type": "VARCHAR", "nullable": {"MD": False, "MP": True, "ME": False}},
+            {"name": "LOB", "data_type": "VARCHAR"},
+        ],
+    }
+
+    gen = BaselineExpectationGenerator()
+    exps = gen.generate_baseline_expectations(umf_with_context, include_structural=False)
+    row_cond = [e for e in exps if "row_condition" in e.get("kwargs", {})]
+
+    print(f"{len(exps)} expectations generated ({len(row_cond)} context-aware):\n")
+    for exp in row_cond:
+        col = exp["kwargs"].get("column", "")
+        cond = exp["kwargs"]["row_condition"]
+        print(f"  {exp['type']}")
+        print(f"    column={col}  row_condition={cond}")
+
+    print(f"\nDifferent LOBs get different nullable rules — from one YAML.")
+
+
+def scene_compat():
+    from tablespec import UMFColumn, check_compatibility, load_umf_from_yaml
+
+    claims = load_umf_from_yaml(str(CLAIMS_YAML))
+    modified = deepcopy(claims)
+    modified.columns[1].data_type = "INTEGER"  # Narrowing DECIMAL -> INTEGER
+    modified.columns.append(
+        UMFColumn(name="diagnosis_code", data_type="VARCHAR", description="ICD-10 code")
+    )
+
+    report = check_compatibility(claims, modified)
+
+    print(f"Backward compatible: {report.is_backward_compatible}")
+    print(f"Forward compatible:  {report.is_forward_compatible}")
+    print(f"Issues found: {len(report.issues)}\n")
+    for issue in report.issues:
+        print(f"  [{issue.severity:8s}] {issue.component}: {issue.description}")
+
+
+def scene_excel():
+    from tablespec import UMFToExcelConverter, load_umf_from_yaml
+
+    claims = load_umf_from_yaml(str(CLAIMS_YAML))
+
+    with tempfile.NamedTemporaryFile(suffix=".xlsx", delete=False) as f:
+        excel_path = Path(f.name)
+
+    try:
+        exporter = UMFToExcelConverter()
+        workbook = exporter.convert(claims)
+        workbook.save(str(excel_path))
+        size = excel_path.stat().st_size
+
+        print(f"Exported to Excel: {excel_path.name} ({size:,} bytes)")
+        print(f"Sheets: {workbook.sheetnames}")
+
+        cols_sheet = workbook["Columns"]
+        headers = [cell.value for cell in cols_sheet[1] if cell.value]
+        nullable_headers = [h for h in headers if h.startswith("Nullable")]
+        print(f"Nullable columns in Excel: {nullable_headers}")
+
+        # Show a few rows from the Columns sheet
+        print(f"\nColumns sheet preview:")
+        for row in cols_sheet.iter_rows(min_row=1, max_row=4, values_only=True):
+            vals = [str(v) if v is not None else "" for v in row[:5]]
+            print(f"  {' | '.join(vals)}")
+    finally:
+        excel_path.unlink(missing_ok=True)
+
+
 def scene_spark():
     import time
 
@@ -228,6 +307,7 @@ def scene_spark():
         Row(claim_id="CLM-005", claim_amount=4100.00, provider_id="PRV002"),
     ])
     claims_df.show()
+    print("###MARK:spark_profile###", flush=True)
 
     # --- Section 9: Profile ---
     print(f"{'=' * 60}")
@@ -238,6 +318,7 @@ def scene_spark():
     inferred = mapper.map_dataframe_to_umf(claims_df, table_name="InferredClaims")
     for col in inferred["columns"]:
         print(f"  {col['name']:20s} -> {col['data_type']:10s} nullable={col['nullable']}")
+    print("###MARK:spark_validate###", flush=True)
 
     # --- Section 10: Validate ---
     print(f"\n{'=' * 60}")
@@ -257,6 +338,7 @@ def scene_spark():
         error_df.select("error_type", "column_name", "error_message").show(truncate=60)
         print("(Expected: Spark infers double, UMF spec says DECIMAL)")
     umf_path.unlink(missing_ok=True)
+    print("###MARK:spark_sample###", flush=True)
 
     # --- Section 11: Sample Data ---
     print(f"{'=' * 60}")
@@ -416,6 +498,58 @@ def scene_sql_plan():
         print(f"... ({len(lines)} total lines)")
 
 
+def scene_cli():
+    """Demonstrate the CLI mutation commands."""
+    import subprocess
+
+    from tablespec import load_umf_from_yaml
+
+    with tempfile.NamedTemporaryFile(
+        mode="w", suffix=".json", delete=False, dir=tempfile.gettempdir()
+    ) as f:
+        umf = load_umf_from_yaml(str(CLAIMS_YAML))
+        d = umf.model_dump(mode="json", exclude_none=True)
+        f.write(json.dumps(d, indent=2))
+        umf_path = f.name
+
+    def run_cmd(args: list[str]) -> None:
+        cmd = ["uv", "run", "tablespec", *args]
+        print(f"$ tablespec {' '.join(args)}")
+        result = subprocess.run(cmd, capture_output=True, text=True)
+        if result.stdout.strip():
+            print(result.stdout.strip())
+        if result.stderr.strip() and result.returncode != 0:
+            print(result.stderr.strip())
+        print()
+
+    print("--- Column Mutations ---\n")
+    run_cmd(["column-add", umf_path, "--name", "service_date", "--type", "DATE", "--description", "Date of service"])
+    run_cmd(["column-modify", umf_path, "--name", "service_date", "--type", "DATETIME"])
+    run_cmd(["column-rename", umf_path, "--from", "service_date", "--to", "svc_dt", "--keep-alias"])
+
+    print("--- Domain Assignment ---\n")
+    run_cmd(["domains-set", umf_path, "--column", "provider_id", "--type", "npi"])
+
+    print("--- Validation Management ---\n")
+    run_cmd(["validation-remove", umf_path, "--type", "expect_column_values_to_not_be_null", "--column", "claim_id"])
+
+    # Show final state
+    d = json.loads(Path(umf_path).read_text())
+    print("Final columns:")
+    for col in d["columns"]:
+        dt = col.get("domain_type", "")
+        alias = col.get("aliases", [])
+        extras = []
+        if dt:
+            extras.append(f"domain={dt}")
+        if alias:
+            extras.append(f"aliases={alias}")
+        extra_str = f"  ({', '.join(extras)})" if extras else ""
+        print(f"  {col['name']:20s} {col['data_type']:10s}{extra_str}")
+
+    Path(umf_path).unlink(missing_ok=True)
+
+
 # ─── Dispatch ─────────────────────────────────────────────────────
 
 SCENES = {
@@ -427,8 +561,12 @@ def scene_sql_plan():
     "gx": scene_gx,
     "prompts": scene_prompts,
     "diff": scene_diff,
+    "context": scene_context,
+    "compat": scene_compat,
+    "excel": scene_excel,
     "sql_plan": scene_sql_plan,
     "spark": scene_spark,
+    "cli": scene_cli,
 }
 
 if __name__ == "__main__":
diff --git a/examples/screencast.sh b/examples/screencast.sh