Commit 74bdb48

easel and claude committed
Fix documentation: public API, import consistency, MkDocs split, sample files
- Remove _-prefixed functions and internal constants from __all__, keeping only the public API surface (65 exports, organized with section comments)
- Add 10 missing exports to __init__.py (Excel, UMFLoader, SampleData, DomainTypes, UMFDiff, ChangelogGenerator) so README can use top-level imports
- Slim README from 777 to 130 lines (install + quickstart + docs links)
- Move detailed feature docs to 11 MkDocs guide pages + development page
- Update mkdocs.yml nav with full User Guide section
- Ship examples/schema.yaml and examples/providers.yaml as runnable samples
- Fix FEAT-006 and US-008 to reference public API names

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent dceccaf commit 74bdb48

20 files changed

Lines changed: 787 additions & 722 deletions

README.md

Lines changed: 8 additions & 652 deletions
Large diffs are not rendered by default.

docs/development.md

Lines changed: 115 additions & 0 deletions
@@ -0,0 +1,115 @@
# Development

## Setup

```bash
# Clone repository
git clone <repository-url>
cd tablespec

# Install with development dependencies
uv sync --all-extras

# Install with Spark support
uv sync --extra spark
```

## Running Tests

```bash
# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=src/tablespec --cov-report=html

# Run specific test file
uv run pytest tests/unit/test_gx_baseline.py
```

## Project Structure

```
src/tablespec/
├── __init__.py # Public API exports
├── cli.py # Typer CLI (validate, info, convert, excel, domains)
├── models/
│ ├── umf.py # Pydantic UMF models
│ ├── changelog.py # Changelog entry models
│ └── pipeline.py # Pipeline configuration models
├── schemas/
│ ├── generators.py # Schema generation (SQL, PySpark, JSON)
│ ├── umf.schema.json # JSON Schema for UMF validation
│ ├── gx_expectation_suite.schema.json
│ ├── expectation_categories.json
│ └── expectation_parameters.json
├── type_mappings.py # Type system conversions
├── date_formats.py # Date/datetime format definitions
├── naming.py # Naming utilities (to_spark_identifier, position_sort_key)
├── naming_validator.py # Column naming convention validation
├── gx_baseline.py # GX baseline expectation generation
├── gx_constraint_extractor.py # Extract constraints from GX suites
├── gx_schema_validator.py # Schema validation with GX
├── gx_wrapper.py # GX utility wrapper
├── excel_converter.py # Bidirectional Excel <-> UMF conversion
├── excel_import_git.py # Git-integrated Excel import with atomic commits
├── umf_loader.py # Split/JSON format loader with auto-detection
├── umf_diff.py # UMF version diffing
├── umf_change_applier.py # Atomic change application for per-change commits
├── umf_validator.py # UMF structural validation
├── changelog_generator.py # Git-based changelog generation
├── changelog_diff_parser.py # YAML diff parsing for change detection
├── changelog_formatter.py # Changelog output formatting
├── inference/
│ └── domain_types.py # Domain type registry and inference engine
├── sample_data/
│ ├── engine.py # Main sample data generation engine
│ ├── config.py # Generation configuration
│ ├── generators.py # Healthcare-specific data generators
│ ├── column_value_generator.py # Per-column value generation
│ ├── constraint_handlers.py # Validation constraint handling
│ ├── foreign_keys.py # FK relationship-aware generation
│ ├── graph.py # Dependency graph for generation order
│ ├── filename_generator.py # Filename pattern generation
│ ├── date_processing.py # Date format handling
│ ├── registry.py # Key registry for uniqueness
│ └── validation.py # Validation rule processing
├── quality/
│ ├── baseline_service.py # Baseline capture and comparison
│ ├── baseline_storage.py # Baseline persistence
│ ├── executor.py # Quality check execution
│ └── storage.py # Quality result storage
├── profiling/
│ ├── types.py # Profiling result types
│ ├── spark_mapper.py # Spark DataFrame -> UMF (requires PySpark)
│ └── deequ_mapper.py # Deequ profile -> UMF
├── prompts/
│ ├── documentation.py # Documentation enrichment prompts
│ ├── validation.py # Table-level validation rule prompts
│ ├── validation_per_column.py # Per-column validation prompts
│ ├── column_validation.py # Column-specific validation prompts
│ ├── relationship.py # Relationship detection prompts
│ ├── survivorship.py # Survivorship logic prompts
│ ├── filename_pattern.py # Filename pattern prompts
│ ├── expectation_guide.py # GX expectation reference
│ └── utils.py # Prompt utilities
├── formatting/
│ ├── constants.py # Formatting constants
│ └── yaml_formatter.py # YAML output formatting
├── validation/
│ ├── gx_processor.py # GX expectation processing
│ ├── table_validator.py # Table validation engine (requires PySpark)
│ └── custom_gx_expectations.py # Custom GX expectation types
├── casting_utils.py # Type casting utilities
├── completeness_validator.py # Data completeness validation
├── dependency_resolver.py # Module dependency resolution
├── format_utils.py # Format conversion utilities
├── merge.py # Table merge with survivorship (requires PySpark)
├── relationship_validator.py # FK relationship validation
├── spark_factory.py # SparkSession factory (requires PySpark)
├── survivorship_display.py # Survivorship rule display
├── sync_baseline.py # Baseline synchronization
├── output_formatting.py # Output display formatting
├── validator.py # Pipeline-level validation orchestration
└── domain_types.yaml # Domain type registry definitions
```

docs/guide/change-management.md

Lines changed: 30 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,30 @@
# Change Management

Detect differences between UMF versions and generate structured changelogs. `UMFDiff` compares two UMF objects to identify column, validation, metadata, and relationship changes. Integrates with git for commit-level change history.

## Diffing UMF Versions

```python
from tablespec import UMFDiff, UMF, load_umf_from_yaml

old_umf = load_umf_from_yaml("v1/schema.yaml")
new_umf = load_umf_from_yaml("v2/schema.yaml")

diff = UMFDiff(old_umf, new_umf)
column_changes = diff.get_column_changes()

for change in column_changes:
    print(change.description())
    # "Add column diagnosis_code"
    # "Modify column claim_amount: data_type changed from INTEGER to DECIMAL"
```

## Generating Changelogs from Git History

```python
from tablespec import ChangelogGenerator

# Generate changelog from git history
generator = ChangelogGenerator(repo_path=".")
changelog = generator.generate(table_path="tables/medical_claims/")
```
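The column-level comparison described under "Diffing UMF Versions" can be illustrated in plain Python. This is only a sketch of the idea, not tablespec's implementation: `diff_columns` and the `{name: {attribute: value}}` column shape are assumptions made for this example, and the message wording simply mirrors the sample output shown in that section.

```python
def diff_columns(old_cols: dict, new_cols: dict) -> list:
    """Describe added, removed, and modified columns between two versions.

    Hypothetical helper: columns are modelled as {name: {attribute: value}}.
    """
    changes = []
    # Columns present only in the new version were added.
    for name in new_cols:
        if name not in old_cols:
            changes.append(f"Add column {name}")
    # Columns present only in the old version were removed.
    for name in old_cols:
        if name not in new_cols:
            changes.append(f"Remove column {name}")
    # Columns present in both versions are compared attribute by attribute.
    for name, new_attrs in new_cols.items():
        old_attrs = old_cols.get(name)
        if old_attrs is None:
            continue
        for key in sorted(set(old_attrs) | set(new_attrs)):
            before, after = old_attrs.get(key), new_attrs.get(key)
            if before != after:
                changes.append(
                    f"Modify column {name}: {key} changed from {before} to {after}"
                )
    return changes


old = {
    "claim_id": {"data_type": "STRING"},
    "claim_amount": {"data_type": "INTEGER"},
}
new = {
    "claim_id": {"data_type": "STRING"},
    "claim_amount": {"data_type": "DECIMAL"},
    "diagnosis_code": {"data_type": "STRING"},
}

for line in diff_columns(old, new):
    print(line)
```

Reporting additions and removals before attribute-level modifications keeps the output grouped by kind of change; a real diff would presumably also cover validation rules, metadata, and relationships, as the guide text states.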

docs/guide/cli.md

Lines changed: 34 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,34 @@
# CLI

The `tablespec` command provides schema management, conversion, and validation from the terminal. Requires `typer` and `rich` (included in default dependencies).

## Commands

```bash
# Validate a UMF schema (single table or entire pipeline directory)
tablespec validate tables/outreach_list/

# Display schema summary
tablespec info tables/outreach_list/

# Convert between split and JSON formats
tablespec convert outreach_list.json tables/outreach_list/

# Batch convert a directory of UMF files
tablespec batch-convert tables/ output/ --format split

# Export UMF to Excel for domain expert review
tablespec export-excel tables/medical_claims/ claims.xlsx

# Import edited Excel back to UMF (split format)
tablespec import-excel claims.xlsx tables/medical_claims/

# List all registered domain types
tablespec domains-list

# Show details of a specific domain type
tablespec domains-show us_state_code

# Infer domain type for a column
tablespec domains-infer --column state --description "State code abbreviation"
```

docs/guide/domain-inference.md

Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,22 @@
# Domain Type Inference

Automatic detection of semantic domain types from column names, descriptions, and sample values. Uses a YAML-based registry of domain types (e.g., `us_state_code`, `email`, `phone_number`, `npi`, `ssn`).

## Usage

```python
from tablespec import DomainTypeInference, DomainTypeRegistry

# List available domain types
registry = DomainTypeRegistry()
print(registry.list_domain_types())

# Infer domain type for a column
inference = DomainTypeInference()
domain_type, confidence = inference.infer_domain_type(
    "member_state_code",
    description="State where member resides",
    sample_values=["CA", "NY", "TX"],
)
# domain_type="us_state_code", confidence=0.95
```
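To make the idea of combining column names and sample values into a confidence score concrete, here is a toy sketch. It does not reflect how tablespec's registry or inference engine actually works: the two-entry `REGISTRY`, the regex patterns, and the 50/50 weighting between name and value evidence are all assumptions chosen for illustration.

```python
import re

# Hypothetical mini-registry: each domain type has a pattern for column
# names and a pattern that every sample value is expected to match.
REGISTRY = {
    "us_state_code": {
        "name": re.compile(r"state"),
        "value": re.compile(r"[A-Z]{2}"),
    },
    "email": {
        "name": re.compile(r"e?mail"),
        "value": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    },
}


def infer_domain_type(column_name, sample_values=()):
    """Return (domain_type, confidence), or (None, 0.0) when nothing matches."""
    best, best_score = None, 0.0
    for domain, patterns in REGISTRY.items():
        # Evidence 1: does the column name look like this domain type?
        name_hit = bool(patterns["name"].search(column_name.lower()))
        # Evidence 2: what fraction of sample values match the value pattern?
        if sample_values:
            matched = sum(
                1 for value in sample_values if patterns["value"].fullmatch(value)
            )
            value_ratio = matched / len(sample_values)
        else:
            value_ratio = 0.0
        # Arbitrary illustrative weighting: half name, half values.
        score = 0.5 * name_hit + 0.5 * value_ratio
        if score > best_score:
            best, best_score = domain, score
    return best, best_score


print(infer_domain_type("member_state_code", ["CA", "NY", "TX"]))
print(infer_domain_type("contact_email"))
```

Scoring each candidate and keeping the best one means a column with a suggestive name but no samples still gets a low-confidence guess, which a caller can accept or reject with a threshold.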

docs/guide/excel.md

Lines changed: 24 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,24 @@
# Excel Conversion

tablespec provides round-trip conversion between UMF and Excel for non-technical domain experts. Excel workbooks include data validation dropdowns, helper columns, and instructions.

## Export UMF to Excel

```python
from tablespec import UMFToExcelConverter, UMFLoader

loader = UMFLoader()
umf = loader.load("tables/medical_claims/")
converter = UMFToExcelConverter()
workbook = converter.convert(umf)
workbook.save("medical_claims.xlsx")
```

## Import Excel back to UMF

```python
from tablespec import ExcelToUMFConverter

importer = ExcelToUMFConverter()
umf, metadata = importer.convert("medical_claims.xlsx")
```

docs/guide/great-expectations.md

Lines changed: 63 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,63 @@
# Great Expectations Integration

tablespec integrates with Great Expectations for baseline expectation generation, constraint extraction, and UMF-to-GX mapping.

## Baseline Expectation Generation

Generate deterministic expectations from UMF metadata:

```python
from tablespec import BaselineExpectationGenerator, load_umf_from_yaml

# Load UMF
umf = load_umf_from_yaml("examples/schema.yaml")
umf_dict = umf.model_dump()

# Generate baseline expectations
generator = BaselineExpectationGenerator()
expectations = generator.generate_baseline_expectations(
    umf_dict,
    include_structural=True
)

# Expectations include:
# - Column existence
# - Column types
# - Nullability
# - Length constraints
# - Column count and order
```

## Constraint Extraction

Extract existing Great Expectations suite into UMF format:

```python
from tablespec import GXConstraintExtractor

extractor = GXConstraintExtractor()

# Extract from GX checkpoint JSON
validation_rules = extractor.extract_from_checkpoint(
    checkpoint_path="checkpoints/my_checkpoint.json"
)

# Add to UMF
umf.validation_rules = validation_rules
```

## UMF to Great Expectations Mapping

Map UMF models to GX format:

```python
from tablespec import UmfToGxMapper

mapper = UmfToGxMapper()

# Convert column definitions
gx_columns = mapper.map_columns(umf.columns)

# Convert validation rules
gx_expectations = mapper.map_validation_rules(umf.validation_rules)
```
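The "deterministic expectations from UMF metadata" idea in the Baseline Expectation Generation section can be sketched without either library installed. The following is an illustration only, not what `BaselineExpectationGenerator` does internally: the column keys `name`, `nullable`, and `length` are assumed field names, and the `expectation_type`/`kwargs` dictionary layout follows the older Great Expectations suite JSON style. The expectation names themselves are standard GX expectations.

```python
def baseline_expectations(umf: dict) -> list:
    """Derive simple expectation configs from column metadata.

    Hypothetical sketch: assumes each column is a dict with a "name" and
    optional "nullable" (default True) and "length" keys.
    """
    columns = umf.get("columns", [])
    expectations = [
        # Structural check: columns must appear in the declared order.
        {
            "expectation_type": "expect_table_columns_to_match_ordered_list",
            "kwargs": {"column_list": [col["name"] for col in columns]},
        }
    ]
    for col in columns:
        name = col["name"]
        expectations.append(
            {"expectation_type": "expect_column_to_exist", "kwargs": {"column": name}}
        )
        # Non-nullable columns get a not-null expectation.
        if not col.get("nullable", True):
            expectations.append(
                {
                    "expectation_type": "expect_column_values_to_not_be_null",
                    "kwargs": {"column": name},
                }
            )
        # A declared length becomes an upper bound on value length.
        if col.get("length") is not None:
            expectations.append(
                {
                    "expectation_type": "expect_column_value_lengths_to_be_between",
                    "kwargs": {"column": name, "max_value": col["length"]},
                }
            )
    return expectations


example_umf = {
    "columns": [
        {"name": "claim_id", "nullable": False, "length": 12},
        {"name": "note"},
    ]
}
for expectation in baseline_expectations(example_umf):
    print(expectation["expectation_type"], expectation["kwargs"])
```

Because the output depends only on the metadata, running the function twice on the same input yields the same list, which is what makes such a baseline safe to regenerate.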

docs/guide/llm-prompts.md

Lines changed: 39 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,39 @@
# LLM Prompt Generation

tablespec generates structured prompts for LLM-based enrichment of UMF schemas.

## Available Prompt Generators

```python
from pathlib import Path
from tablespec import (
    generate_documentation_prompt,
    generate_validation_prompt,
    generate_relationship_prompt,
    generate_survivorship_prompt
)

umf_dict = umf.model_dump()

# Generate documentation prompt
doc_prompt = generate_documentation_prompt(umf_dict)
# Asks LLM to enhance table and column descriptions

# Generate validation rules prompt
validation_prompt = generate_validation_prompt(umf_dict)
# Asks LLM to suggest validation rules (uniqueness, ranges, formats)

# Generate relationship prompt (uses UMF directory paths)
relationship_prompt = generate_relationship_prompt(
    Path("tables/medical_claims"),
    Path("tables")
)
# Asks LLM to identify foreign key relationships

# Generate survivorship prompt (uses table name and UMF directory)
survivorship_prompt = generate_survivorship_prompt(
    "Medical_Claims",
    Path("tables/medical_claims")
)
# Asks LLM to suggest survivorship/merge logic for deduplication
```
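As a rough picture of what "a structured prompt built from a UMF dictionary" might look like, the sketch below assembles a documentation prompt by hand. It is not the output of `generate_documentation_prompt`; the function name `build_documentation_prompt`, the dictionary keys (`table_name`, `columns`, `name`, `data_type`, `description`), and the prompt wording are all assumptions for illustration.

```python
def build_documentation_prompt(umf: dict) -> str:
    """Render table and column metadata into a plain-text prompt.

    Hypothetical sketch; the real prompt templates may differ entirely.
    """
    lines = [
        f"Table: {umf['table_name']}",
        "Improve the description of the table and of each column below.",
        "",
        "Columns:",
    ]
    for col in umf.get("columns", []):
        # Flag missing descriptions explicitly so the model knows to write one.
        description = col.get("description") or "(no description)"
        data_type = col.get("data_type", "unknown")
        lines.append(f"- {col['name']} ({data_type}): {description}")
    return "\n".join(lines)


example_umf = {
    "table_name": "Medical_Claims",
    "columns": [
        {"name": "claim_id", "data_type": "STRING", "description": "Claim identifier"},
        {"name": "claim_amount", "data_type": "DECIMAL"},
    ],
}
print(build_documentation_prompt(example_umf))
```

Listing every column with its type and current description gives the model the same context a human reviewer would have, and makes gaps (columns without descriptions) easy to spot in the prompt itself.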

docs/guide/profiling.md

Lines changed: 35 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,35 @@
# Profiling Integration

tablespec converts profiling results from Spark DataFrames and Deequ into UMF format.

## Spark DataFrame Profiling

```python
from tablespec import SparkToUmfMapper # Requires tablespec[spark]
from tablespec import save_umf_to_yaml
from pyspark.sql import DataFrame

# Profile Spark DataFrame
mapper = SparkToUmfMapper()
umf = mapper.create_umf_from_dataframe(
    df=spark_df,
    table_name="Medical_Claims",
    source_file="claims.parquet"
)

# UMF includes inferred types, nullability, and sample values
save_umf_to_yaml(umf, "medical_claims.yaml")
```

## Deequ Profiling

```python
from tablespec import DeequToUmfMapper

# Convert Deequ profile to UMF
mapper = DeequToUmfMapper()
umf = mapper.create_umf_from_profile(
    profile_json="deequ_profile.json",
    table_name="Medical_Claims"
)
```
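The type-inference part of Spark profiling can be pictured as a lookup from Spark type strings to schema types. The sketch below works on `(column, type)` pairs shaped like the output of PySpark's `DataFrame.dtypes`, so it runs without Spark installed. It is an assumption-laden illustration rather than `SparkToUmfMapper`'s logic: the target type names in `SPARK_TO_UMF`, the fallback to `STRING`, and the `columns_from_dtypes` helper are all invented for this example.

```python
# Hypothetical mapping from Spark simple type strings to schema type names.
SPARK_TO_UMF = {
    "string": "STRING",
    "int": "INTEGER",
    "bigint": "INTEGER",
    "double": "DECIMAL",
    "decimal": "DECIMAL",
    "boolean": "BOOLEAN",
    "date": "DATE",
    "timestamp": "DATETIME",
}


def columns_from_dtypes(dtypes):
    """Convert (name, spark_type) pairs into column definitions."""
    columns = []
    for name, spark_type in dtypes:
        # Parameterised types such as "decimal(10,2)" map by their base name.
        base = spark_type.split("(", 1)[0]
        # Unknown types fall back to STRING in this sketch.
        columns.append({"name": name, "data_type": SPARK_TO_UMF.get(base, "STRING")})
    return columns


# Shaped like spark_df.dtypes for a small claims table.
dtypes = [
    ("claim_id", "string"),
    ("claim_amount", "decimal(10,2)"),
    ("service_date", "date"),
]
for column in columns_from_dtypes(dtypes):
    print(column)
```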

docs/guide/sample-data.md

Lines changed: 18 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,18 @@
# Sample Data Generation

Generate realistic, constraint-aware sample data from UMF specifications. Supports healthcare-specific generators (SSN, NPI, drug codes), foreign key relationship graphs for referential integrity, and CSV/JSON output.

## Usage

```python
from tablespec import SampleDataGenerator, GenerationConfig

config = GenerationConfig(record_count=100, seed=42)
generator = SampleDataGenerator(
    input_dir="tables/",
    output_dir="sample_output/",
    config=config,
)
generator.generate()
# Produces CSV files in sample_output/ with realistic, relationship-aware data
```
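Two ideas from this page, a fixed seed for reproducibility and a foreign key graph that decides generation order, can be shown with the standard library alone. This is a simplified sketch, not the `SampleDataGenerator` engine: the `TABLES` layout, the `generate_sample_data` function, and the key format are assumptions made for the example.

```python
import random
from graphlib import TopologicalSorter

# Hypothetical table specs: a primary key column plus {fk_column: parent_table}.
TABLES = {
    "claims": {"key": "claim_id", "foreign_keys": {"member_id": "members"}},
    "members": {"key": "member_id", "foreign_keys": {}},
}


def generate_sample_data(tables, record_count, seed):
    """Generate rows per table so that every foreign key points at a real parent row."""
    rng = random.Random(seed)  # a fixed seed makes the output reproducible
    # Map each table to the set of tables it depends on (its parents).
    graph = {
        name: set(spec["foreign_keys"].values()) for name, spec in tables.items()
    }
    data = {}
    # static_order() yields parents before the tables that reference them.
    for name in TopologicalSorter(graph).static_order():
        spec = tables[name]
        rows = []
        for i in range(1, record_count + 1):
            row = {spec["key"]: f"{name[0].upper()}{i:04d}"}
            for fk_column, parent in spec["foreign_keys"].items():
                parent_key = tables[parent]["key"]
                # Pick an existing parent row to preserve referential integrity.
                row[fk_column] = rng.choice(data[parent])[parent_key]
            rows.append(row)
        data[name] = rows
    return data


sample = generate_sample_data(TABLES, record_count=5, seed=42)
print(sample["claims"][0])
```

Generating in topological order is what allows child rows to draw their foreign key values from parents that already exist; a cyclic relationship would raise `graphlib.CycleError` and need separate handling.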
