Commit e24b4bd

easel and claude committed
Compress screencast idle time and split Spark scene into sub-scenes
Add sentinel markers in scene.py so screencast.sh can detect Spark sub-scene boundaries (profile, validate, sample). Replace monolithic Spark narration with 4 focused clips. Add cast file compression step in narrate.sh to cap idle gaps at 2s and realign timing log, fixing audio/video desync.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent dd9be32 commit e24b4bd
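
The idle-gap cap and timing-log realignment described in the commit message can be sketched as one pair of functions. This is a minimal illustration, not the code in this commit: `compress` and `realign` are hypothetical names, and the sketch assumes the timing log stores offsets in the same time base as the cast events. It maps each log offset through the shifts derived from the cast's own event gaps. The timing-log pass in the narrate.sh diff instead measures gaps between consecutive log entries, which may remove a different amount of time than the cast pass does.

```python
from bisect import bisect_right

MAX_IDLE = 2.0  # seconds; the cap named in the commit message


def compress(timestamps, max_idle=MAX_IDLE):
    """Return new event timestamps with every idle gap capped at max_idle."""
    out, shift, prev = [], 0.0, 0.0
    for ts in timestamps:
        gap = ts - prev
        if gap > max_idle:
            shift += gap - max_idle  # accumulate the idle time removed so far
        out.append(round(ts - shift, 6))
        prev = ts
    return out


def realign(offset, timestamps, compressed, max_idle=MAX_IDLE):
    """Map an original-time offset (e.g. a scene marker) into compressed time.

    The offset is placed relative to the last cast event at or before it,
    and the distance past that event is capped like any other idle gap.
    """
    i = bisect_right(timestamps, offset) - 1
    if i < 0:  # offset precedes the first event
        return round(min(offset, max_idle), 2)
    return round(compressed[i] + min(offset - timestamps[i], max_idle), 2)


if __name__ == "__main__":
    original = [0.5, 1.0, 11.0, 11.5, 20.0]  # made-up event times
    squeezed = compress(original)
    for name, offset in [("spark_profile", 11.2), ("spark_validate", 15.0)]:
        print(name, realign(offset, original, squeezed))
```

Because `realign` needs the original timestamps, a script using this approach would have to keep them in memory (or in a copy of the cast) before overwriting the cast file.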

4 files changed

Lines changed: 309 additions & 30 deletions

examples/SCREENCAST_SCRIPT.md

Lines changed: 23 additions & 21 deletions
@@ -149,48 +149,50 @@ workflows.
 
 ---
 
-### ACT 8: Spark Session (Section 8)
+### ACT 8: CLI Authoring Commands (Scene 13)
+
+> CLI commands add a column, modify its type, rename with alias, set a domain type, and remove a validation expectation.
+
+**NARRATOR:** CLI commands for schema authoring. Add a column, modify its
+type, rename it with alias preservation. Assign domain types from the
+built-in registry. And manage validation expectations — all without
+touching YAML directly.
+
+---
+
+### ACT 9a: Spark Session (Section 14)
 
 > Spark 4.0.1 session is created. A DataFrame with 5 claims is displayed.
 
-**NARRATOR:** Now we enter PySpark territory. We create a local Spark
-session — the same factory function works on Databricks, it auto-detects
-the environment. We create a sample DataFrame with five claims, including
-one with a NULL amount.
+**NARRATOR:** Now we enter PySpark territory. Creating a Spark session
+and a sample DataFrame with five claims.
 
 ---
 
-### ACT 9: Profiling (Section 9)
+### ACT 9b: Profiling (Section 9)
 
 > SparkToUmfMapper infers column types from the DataFrame.
 
-**NARRATOR:** SparkToUmfMapper goes the other direction — from a Spark
-DataFrame back to UMF. It infers column names, types, and nullability.
-Useful for onboarding existing tables that don't have a UMF spec yet.
+**NARRATOR:** SparkToUmfMapper infers a UMF schema from the DataFrame.
+Column names, types, and nullability, all detected automatically.
 
 ---
 
-### ACT 10: Validation (Section 10)
+### ACT 9c: Validation (Section 10)
 
 > One validation error: claim_amount has the wrong data type.
 
 **NARRATOR:** TableValidator checks the DataFrame against the UMF spec.
-It found one issue — claim_amount is a double in Spark but DECIMAL in the
-spec. This is exactly the kind of type drift that causes silent data
-corruption in pipelines. The validator returns a structured error DataFrame
-you can write to a monitoring table.
+It catches type drift that causes silent data corruption.
 
 ---
 
-### ACT 11: Sample Data Generation (Section 11)
+### ACT 9d: Sample Data Generation (Section 11)
 
-> Split-format UMF is prepared. 100 rows of claims and providers are generated.
+> 100 rows of claims and providers are generated from UMF specs.
 
-**NARRATOR:** Finally, sample data generation. We save the UMF specs in
-split format — the git-friendly directory structure — and generate 100 rows
-for each table. The generator respects column types, nullable rules, and
-produces realistic values. Provider NPIs are 10 digits. State codes are
-real states. Foreign keys are coherent across tables.
+**NARRATOR:** Sample data generation from UMF specs. 100 rows per table,
+respecting types, nullable rules, and domain constraints.
 
 ---
 

examples/narrate.sh

Lines changed: 92 additions & 3 deletions
@@ -42,7 +42,14 @@ gen_clip "gx" "Generate a full Great Expectations suite deterministically
 gen_clip "prompts" "Generate structured prompts for LLMs. Documentation prompts. Validation rule prompts. All the column metadata and domain context is included automatically."
 gen_clip "diff" "Schema evolution tracking. Modify a table and see exactly what changed. Added columns. Modified descriptions."
 gen_clip "sql_plan" "Generate full SQL execution plans from UMF metadata. Joins, column derivations, survivorship logic, aggregations. All computed automatically from the schema relationships."
-gen_clip "spark" "Now the PySpark features. Starting a Spark session. Creating DataFrames. Profiling schemas. Validating data against UMF specs. And generating sample data. All from the same UMF metadata."
+gen_clip "context" "Context-aware validation. Different nullable rules for each Line of Business, all from one YAML file."
+gen_clip "compat" "Schema evolution safety. Check backward compatibility before deploying changes."
+gen_clip "excel" "Export to Excel for domain experts. Import their edits back with no data loss."
+gen_clip "cli" "CLI commands for schema authoring. Add a column, modify its type, rename it with alias preservation. Assign domain types from the built-in registry. And manage validation expectations — all without touching YAML directly."
+gen_clip "spark_session" "Now we enter PySpark territory. Creating a Spark session and a sample DataFrame with five claims."
+gen_clip "spark_profile" "SparkToUmfMapper infers a UMF schema from the DataFrame. Column names, types, and nullability, all detected automatically."
+gen_clip "spark_validate" "TableValidator checks the DataFrame against the UMF spec. It catches type drift that causes silent data corruption."
+gen_clip "spark_sample" "Sample data generation from UMF specs. 100 rows per table, respecting types, nullable rules, and domain constraints."
 gen_clip "close" "That's tablespec. Define once. Use everywhere."
 
 echo
@@ -61,7 +68,14 @@ CLIP_gx=${CLIP_DUR[gx]}
 CLIP_prompts=${CLIP_DUR[prompts]}
 CLIP_diff=${CLIP_DUR[diff]}
 CLIP_sql_plan=${CLIP_DUR[sql_plan]}
-CLIP_spark=${CLIP_DUR[spark]}
+CLIP_context=${CLIP_DUR[context]}
+CLIP_compat=${CLIP_DUR[compat]}
+CLIP_excel=${CLIP_DUR[excel]}
+CLIP_cli=${CLIP_DUR[cli]}
+CLIP_spark_session=${CLIP_DUR[spark_session]}
+CLIP_spark_profile=${CLIP_DUR[spark_profile]}
+CLIP_spark_validate=${CLIP_DUR[spark_validate]}
+CLIP_spark_sample=${CLIP_DUR[spark_sample]}
 CLIP_close=${CLIP_DUR[close]}
 EOF
 
@@ -91,14 +105,89 @@ print(f'{last_ts:.0f}')
 echo " Recorded: ${CAST} (${CAST_DUR}s)"
 echo
 
+# ─── Step 2.5: Compress idle gaps in cast file ──────────────────
+
+echo "Step 2.5: Compressing idle gaps (max 2s)..."
+
+python3 -c "
+import json
+MAX_IDLE = 2.0
+CAST = '$CAST'
+with open(CAST) as f:
+    lines = f.readlines()
+header = lines[0]
+events = [json.loads(l) for l in lines[1:]]
+shift = 0
+prev_orig = 0
+for evt in events:
+    orig_ts = evt[0]
+    gap = orig_ts - prev_orig
+    if gap > MAX_IDLE:
+        shift += gap - MAX_IDLE
+    evt[0] = round(orig_ts - shift, 6)
+    prev_orig = orig_ts
+with open(CAST, 'w') as f:
+    f.write(header)
+    for evt in events:
+        f.write(json.dumps(evt) + '\n')
+print(f' Compressed: removed {shift:.1f}s of idle time')
+last_ts = events[-1][0] if events else 0
+print(f' New duration: {last_ts:.0f}s')
+"
+
+# Apply same compression to timing log
+python3 -c "
+import json
+MAX_IDLE = 2.0
+CAST = '$CAST'
+TIMING_LOG = '/tmp/screencast_timing.log'
+
+# Rebuild the same shift table from the cast
+with open(CAST) as f:
+    lines = f.readlines()
+# The cast is already compressed, but we need the original timestamps
+# to compute shifts. Re-read to get compressed timestamps.
+# Instead, compress the timing log offsets using the same algorithm.
+
+with open(TIMING_LOG) as f:
+    tlines = f.readlines()
+start_epoch = tlines[0].strip()
+
+# Collect all offsets
+entries = []
+for line in tlines[1:]:
+    parts = line.strip().split()
+    if len(parts) == 2:
+        entries.append((parts[0], float(parts[1])))
+
+# Apply same gap compression
+shift = 0
+prev_orig = 0
+compressed_entries = []
+for name, offset in entries:
+    gap = offset - prev_orig
+    if gap > MAX_IDLE:
+        shift += gap - MAX_IDLE
+    compressed_entries.append((name, round(offset - shift, 2)))
+    prev_orig = offset
+
+# Write back
+with open(TIMING_LOG, 'w') as f:
+    f.write(start_epoch + '\n')
+    for name, offset in compressed_entries:
+        f.write(f'{name} {offset}\n')
+print(' Timing log compressed to match')
+"
+echo
+
 # ─── Step 4: Convert to GIF and MP4 ─────────────────────────────
 
 echo "Step 3: Converting to GIF and MP4..."
 
 GIF="examples/tablespec-demo.gif"
 MP4="examples/tablespec-demo.mp4"
 
-agg --font-family "JetBrains Mono" \
+agg --font-family "JetBrainsMono Nerd Font" \
     --font-size 16 \
     --theme asciinema \
     "$CAST" "$GIF" 2>/dev/null

examples/scene.py

Lines changed: 138 additions & 0 deletions
@@ -12,6 +12,9 @@
 gx Great Expectations baseline
 prompts LLM prompt generation
 diff UMF diffing & change detection
+context Context-aware nullable expectations
+compat Compatibility checking
+excel Excel round-trip
 spark PySpark sections 8-11 (session, profile, validate, sample data)
 """
 
@@ -191,6 +194,82 @@ def scene_diff():
         print(f" {c.description()}")
 
 
+def scene_context():
+    from tablespec import BaselineExpectationGenerator
+
+    umf_with_context = {
+        "table_name": "Enrollments",
+        "context_column": "LOB",
+        "columns": [
+            {"name": "member_id", "data_type": "VARCHAR", "nullable": {"MD": False, "MP": True, "ME": False}},
+            {"name": "LOB", "data_type": "VARCHAR"},
+        ],
+    }
+
+    gen = BaselineExpectationGenerator()
+    exps = gen.generate_baseline_expectations(umf_with_context, include_structural=False)
+    row_cond = [e for e in exps if "row_condition" in e.get("kwargs", {})]
+
+    print(f"{len(exps)} expectations generated ({len(row_cond)} context-aware):\n")
+    for exp in row_cond:
+        col = exp["kwargs"].get("column", "")
+        cond = exp["kwargs"]["row_condition"]
+        print(f" {exp['type']}")
+        print(f" column={col} row_condition={cond}")
+
+    print(f"\nDifferent LOBs get different nullable rules — from one YAML.")
+
+
+def scene_compat():
+    from tablespec import UMFColumn, check_compatibility, load_umf_from_yaml
+
+    claims = load_umf_from_yaml(str(CLAIMS_YAML))
+    modified = deepcopy(claims)
+    modified.columns[1].data_type = "INTEGER"  # Narrowing DECIMAL -> INTEGER
+    modified.columns.append(
+        UMFColumn(name="diagnosis_code", data_type="VARCHAR", description="ICD-10 code")
+    )
+
+    report = check_compatibility(claims, modified)
+
+    print(f"Backward compatible: {report.is_backward_compatible}")
+    print(f"Forward compatible: {report.is_forward_compatible}")
+    print(f"Issues found: {len(report.issues)}\n")
+    for issue in report.issues:
+        print(f" [{issue.severity:8s}] {issue.component}: {issue.description}")
+
+
+def scene_excel():
+    from tablespec import UMFToExcelConverter, load_umf_from_yaml
+
+    claims = load_umf_from_yaml(str(CLAIMS_YAML))
+
+    with tempfile.NamedTemporaryFile(suffix=".xlsx", delete=False) as f:
+        excel_path = Path(f.name)
+
+    try:
+        exporter = UMFToExcelConverter()
+        workbook = exporter.convert(claims)
+        workbook.save(str(excel_path))
+        size = excel_path.stat().st_size
+
+        print(f"Exported to Excel: {excel_path.name} ({size:,} bytes)")
+        print(f"Sheets: {workbook.sheetnames}")
+
+        cols_sheet = workbook["Columns"]
+        headers = [cell.value for cell in cols_sheet[1] if cell.value]
+        nullable_headers = [h for h in headers if h.startswith("Nullable")]
+        print(f"Nullable columns in Excel: {nullable_headers}")
+
+        # Show a few rows from the Columns sheet
+        print(f"\nColumns sheet preview:")
+        for row in cols_sheet.iter_rows(min_row=1, max_row=4, values_only=True):
+            vals = [str(v) if v is not None else "" for v in row[:5]]
+            print(f" {' | '.join(vals)}")
+    finally:
+        excel_path.unlink(missing_ok=True)
+
+
 def scene_spark():
     import time
 
@@ -228,6 +307,7 @@ def scene_spark():
         Row(claim_id="CLM-005", claim_amount=4100.00, provider_id="PRV002"),
     ])
     claims_df.show()
+    print("###MARK:spark_profile###", flush=True)
 
     # --- Section 9: Profile ---
     print(f"{'=' * 60}")
@@ -238,6 +318,7 @@ def scene_spark():
     inferred = mapper.map_dataframe_to_umf(claims_df, table_name="InferredClaims")
     for col in inferred["columns"]:
         print(f" {col['name']:20s} -> {col['data_type']:10s} nullable={col['nullable']}")
+    print("###MARK:spark_validate###", flush=True)
 
     # --- Section 10: Validate ---
     print(f"\n{'=' * 60}")
@@ -257,6 +338,7 @@ def scene_spark():
     error_df.select("error_type", "column_name", "error_message").show(truncate=60)
     print("(Expected: Spark infers double, UMF spec says DECIMAL)")
     umf_path.unlink(missing_ok=True)
+    print("###MARK:spark_sample###", flush=True)
 
     # --- Section 11: Sample Data ---
     print(f"{'=' * 60}")
@@ -416,6 +498,58 @@ def scene_sql_plan():
         print(f"... ({len(lines)} total lines)")
 
 
+def scene_cli():
+    """Demonstrate the CLI mutation commands."""
+    import subprocess
+
+    from tablespec import load_umf_from_yaml
+
+    with tempfile.NamedTemporaryFile(
+        mode="w", suffix=".json", delete=False, dir=tempfile.gettempdir()
+    ) as f:
+        umf = load_umf_from_yaml(str(CLAIMS_YAML))
+        d = umf.model_dump(mode="json", exclude_none=True)
+        f.write(json.dumps(d, indent=2))
+        umf_path = f.name
+
+    def run_cmd(args: list[str]) -> None:
+        cmd = ["uv", "run", "tablespec", *args]
+        print(f"$ tablespec {' '.join(args)}")
+        result = subprocess.run(cmd, capture_output=True, text=True)
+        if result.stdout.strip():
+            print(result.stdout.strip())
+        if result.stderr.strip() and result.returncode != 0:
+            print(result.stderr.strip())
+        print()
+
+    print("--- Column Mutations ---\n")
+    run_cmd(["column-add", umf_path, "--name", "service_date", "--type", "DATE", "--description", "Date of service"])
+    run_cmd(["column-modify", umf_path, "--name", "service_date", "--type", "DATETIME"])
+    run_cmd(["column-rename", umf_path, "--from", "service_date", "--to", "svc_dt", "--keep-alias"])
+
+    print("--- Domain Assignment ---\n")
+    run_cmd(["domains-set", umf_path, "--column", "provider_id", "--type", "npi"])
+
+    print("--- Validation Management ---\n")
+    run_cmd(["validation-remove", umf_path, "--type", "expect_column_values_to_not_be_null", "--column", "claim_id"])
+
+    # Show final state
+    d = json.loads(Path(umf_path).read_text())
+    print("Final columns:")
+    for col in d["columns"]:
+        dt = col.get("domain_type", "")
+        alias = col.get("aliases", [])
+        extras = []
+        if dt:
+            extras.append(f"domain={dt}")
+        if alias:
+            extras.append(f"aliases={alias}")
+        extra_str = f" ({', '.join(extras)})" if extras else ""
+        print(f" {col['name']:20s} {col['data_type']:10s}{extra_str}")
+
+    Path(umf_path).unlink(missing_ok=True)
+
+
 # ─── Dispatch ─────────────────────────────────────────────────────
 
 SCENES = {
@@ -427,8 +561,12 @@ def scene_sql_plan():
     "gx": scene_gx,
     "prompts": scene_prompts,
     "diff": scene_diff,
+    "context": scene_context,
+    "compat": scene_compat,
+    "excel": scene_excel,
     "sql_plan": scene_sql_plan,
     "spark": scene_spark,
+    "cli": scene_cli,
 }
 
 if __name__ == "__main__":
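
The `###MARK:<name>###` sentinels that scene.py now prints are consumed by screencast.sh, which is not part of this diff. As a rough illustration of the idea only, the sketch below splits captured output into sub-scene segments at each sentinel. The function name `split_on_markers` and the choice of `spark_session` as the label for output before the first marker are assumptions made for this example.

```python
import re

# Matches a whole line of the form ###MARK:spark_profile###
MARK_RE = re.compile(r"^###MARK:(\w+)###$")


def split_on_markers(lines, first="spark_session"):
    """Group output lines by the most recent sentinel seen.

    Lines before any sentinel go under `first` (an assumed label).
    Sentinel lines themselves are dropped from the output.
    """
    segments = {first: []}
    current = first
    for line in lines:
        match = MARK_RE.match(line.strip())
        if match:
            current = match.group(1)
            segments.setdefault(current, [])
        else:
            segments[current].append(line)
    return segments


if __name__ == "__main__":
    captured = [
        "claims_df rows...",
        "###MARK:spark_profile###",
        "claim_id -> VARCHAR",
        "###MARK:spark_validate###",
        "1 validation error",
    ]
    for name, body in split_on_markers(captured).items():
        print(name, body)
```

A recorder would more likely note the elapsed time at each sentinel than collect the lines, but the matching step is the same.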
