DocumentDrivenDX
diff --git a/README.md b/README.md
Lines changed: 8 additions & 652 deletions
diff --git a/docs/development.md b/docs/development.md
Lines changed: 115 additions & 0 deletions
diff --git a/docs/guide/change-management.md b/docs/guide/change-management.md
Lines changed: 30 additions & 0 deletions
diff --git a/docs/guide/cli.md b/docs/guide/cli.md
Lines changed: 34 additions & 0 deletions
diff --git a/docs/guide/domain-inference.md b/docs/guide/domain-inference.md
Lines changed: 22 additions & 0 deletions
diff --git a/docs/guide/excel.md b/docs/guide/excel.md
Lines changed: 24 additions & 0 deletions
diff --git a/docs/guide/great-expectations.md b/docs/guide/great-expectations.md
Lines changed: 63 additions & 0 deletions
diff --git a/docs/guide/llm-prompts.md b/docs/guide/llm-prompts.md
Lines changed: 39 additions & 0 deletions
diff --git a/docs/guide/profiling.md b/docs/guide/profiling.md
Lines changed: 35 additions & 0 deletions
diff --git a/docs/guide/sample-data.md b/docs/guide/sample-data.md
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,115 @@
+# Development
+
+## Setup
+
+```bash
+# Clone repository
+git clone <repository-url>
+cd tablespec
+
+# Install with development dependencies
+uv sync --all-extras
+
+# Or install only the Spark extra
+uv sync --extra spark
+```
+
+## Running Tests
+
+```bash
+# Run all tests
+uv run pytest
+
+# Run with coverage
+uv run pytest --cov=src/tablespec --cov-report=html
+
+# Run specific test file
+uv run pytest tests/unit/test_gx_baseline.py
+```
+
+## Project Structure
+
+```
+src/tablespec/
+├── __init__.py                  # Public API exports
+├── cli.py                       # Typer CLI (validate, info, convert, excel, domains)
+├── models/
+│   ├── umf.py                   # Pydantic UMF models
+│   ├── changelog.py             # Changelog entry models
+│   └── pipeline.py              # Pipeline configuration models
+├── schemas/
+│   ├── generators.py            # Schema generation (SQL, PySpark, JSON)
+│   ├── umf.schema.json          # JSON Schema for UMF validation
+│   ├── gx_expectation_suite.schema.json
+│   ├── expectation_categories.json
+│   └── expectation_parameters.json
+├── type_mappings.py             # Type system conversions
+├── date_formats.py              # Date/datetime format definitions
+├── naming.py                    # Naming utilities (to_spark_identifier, position_sort_key)
+├── naming_validator.py          # Column naming convention validation
+├── gx_baseline.py               # GX baseline expectation generation
+├── gx_constraint_extractor.py   # Extract constraints from GX suites
+├── gx_schema_validator.py       # Schema validation with GX
+├── gx_wrapper.py                # GX utility wrapper
+├── excel_converter.py           # Bidirectional Excel <-> UMF conversion
+├── excel_import_git.py          # Git-integrated Excel import with atomic commits
+├── umf_loader.py                # Split/JSON format loader with auto-detection
+├── umf_diff.py                  # UMF version diffing
+├── umf_change_applier.py        # Atomic change application for per-change commits
+├── umf_validator.py             # UMF structural validation
+├── changelog_generator.py       # Git-based changelog generation
+├── changelog_diff_parser.py     # YAML diff parsing for change detection
+├── changelog_formatter.py       # Changelog output formatting
+├── inference/
+│   └── domain_types.py          # Domain type registry and inference engine
+├── sample_data/
+│   ├── engine.py                # Main sample data generation engine
+│   ├── config.py                # Generation configuration
+│   ├── generators.py            # Healthcare-specific data generators
+│   ├── column_value_generator.py # Per-column value generation
+│   ├── constraint_handlers.py   # Validation constraint handling
+│   ├── foreign_keys.py          # FK relationship-aware generation
+│   ├── graph.py                 # Dependency graph for generation order
+│   ├── filename_generator.py    # Filename pattern generation
+│   ├── date_processing.py       # Date format handling
+│   ├── registry.py              # Key registry for uniqueness
+│   └── validation.py            # Validation rule processing
+├── quality/
+│   ├── baseline_service.py      # Baseline capture and comparison
+│   ├── baseline_storage.py      # Baseline persistence
+│   ├── executor.py              # Quality check execution
+│   └── storage.py               # Quality result storage
+├── profiling/
+│   ├── types.py                 # Profiling result types
+│   ├── spark_mapper.py          # Spark DataFrame -> UMF (requires PySpark)
+│   └── deequ_mapper.py          # Deequ profile -> UMF
+├── prompts/
+│   ├── documentation.py         # Documentation enrichment prompts
+│   ├── validation.py            # Table-level validation rule prompts
+│   ├── validation_per_column.py # Per-column validation prompts
+│   ├── column_validation.py     # Column-specific validation prompts
+│   ├── relationship.py          # Relationship detection prompts
+│   ├── survivorship.py          # Survivorship logic prompts
+│   ├── filename_pattern.py      # Filename pattern prompts
+│   ├── expectation_guide.py     # GX expectation reference
+│   └── utils.py                 # Prompt utilities
+├── formatting/
+│   ├── constants.py             # Formatting constants
+│   └── yaml_formatter.py        # YAML output formatting
+├── validation/
+│   ├── gx_processor.py          # GX expectation processing
+│   ├── table_validator.py       # Table validation engine (requires PySpark)
+│   └── custom_gx_expectations.py # Custom GX expectation types
+├── casting_utils.py             # Type casting utilities
+├── completeness_validator.py    # Data completeness validation
+├── dependency_resolver.py       # Module dependency resolution
+├── format_utils.py              # Format conversion utilities
+├── merge.py                     # Table merge with survivorship (requires PySpark)
+├── relationship_validator.py    # FK relationship validation
+├── spark_factory.py             # SparkSession factory (requires PySpark)
+├── survivorship_display.py      # Survivorship rule display
+├── sync_baseline.py             # Baseline synchronization
+├── output_formatting.py         # Output display formatting
+├── validator.py                 # Pipeline-level validation orchestration
+└── domain_types.yaml            # Domain type registry definitions
+```
@@ -0,0 +1,30 @@
+# Change Management
+
+Detect differences between UMF versions and generate structured changelogs. `UMFDiff` compares two UMF objects to identify column, validation, metadata, and relationship changes. tablespec also integrates with Git to provide commit-level change history.
+
+## Diffing UMF Versions
+
+```python
+from tablespec import UMFDiff, load_umf_from_yaml
+
+old_umf = load_umf_from_yaml("v1/schema.yaml")
+new_umf = load_umf_from_yaml("v2/schema.yaml")
+
+diff = UMFDiff(old_umf, new_umf)
+column_changes = diff.get_column_changes()
+
+for change in column_changes:
+    print(change.description())
+    # "Add column diagnosis_code"
+    # "Modify column claim_amount: data_type changed from INTEGER to DECIMAL"
+```
+
+## Generating Changelogs from Git History
+
+```python
+from tablespec import ChangelogGenerator
+
+# Generate changelog from git history
+generator = ChangelogGenerator(repo_path=".")
+changelog = generator.generate(table_path="tables/medical_claims/")
+```
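To make the kind of comparison described for `UMFDiff` more concrete, here is a minimal, self-contained sketch of a column-level diff over plain dictionaries. It is not tablespec's implementation: the `diff_columns` helper and the dictionary layout (column name mapped to its attributes) are assumptions made purely for illustration.

```python
from typing import Any


def diff_columns(
    old: dict[str, dict[str, Any]],
    new: dict[str, dict[str, Any]],
) -> list[str]:
    """Describe column-level changes between two schema versions.

    Hypothetical helper: each argument maps a column name to its attributes.
    """
    changes: list[str] = []

    # Columns present only in the new version were added.
    for name in new:
        if name not in old:
            changes.append(f"Add column {name}")

    # Columns present only in the old version were removed.
    for name in old:
        if name not in new:
            changes.append(f"Remove column {name}")

    # Columns present in both versions may have modified attributes.
    for name in old:
        if name not in new:
            continue
        for attr in sorted(set(old[name]) | set(new[name])):
            before = old[name].get(attr)
            after = new[name].get(attr)
            if before != after:
                changes.append(
                    f"Modify column {name}: {attr} changed from {before} to {after}"
                )
    return changes


old_columns = {
    "claim_id": {"data_type": "STRING"},
    "claim_amount": {"data_type": "INTEGER"},
}
new_columns = {
    "claim_id": {"data_type": "STRING"},
    "claim_amount": {"data_type": "DECIMAL"},
    "diagnosis_code": {"data_type": "STRING"},
}

for description in diff_columns(old_columns, new_columns):
    print(description)
```

Reporting additions, removals, and modifications in separate passes keeps the output order predictable, which helps when the descriptions feed a changelog.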
@@ -0,0 +1,34 @@
+# CLI
+
+The `tablespec` command provides schema management, conversion, and validation from the terminal. It requires `typer` and `rich`, both of which are included in the default dependencies.
+
+## Commands
+
+```bash
+# Validate a UMF schema (single table or entire pipeline directory)
+tablespec validate tables/outreach_list/
+
+# Display schema summary
+tablespec info tables/outreach_list/
+
+# Convert between split and JSON formats
+tablespec convert outreach_list.json tables/outreach_list/
+
+# Batch convert a directory of UMF files
+tablespec batch-convert tables/ output/ --format split
+
+# Export UMF to Excel for domain expert review
+tablespec export-excel tables/medical_claims/ claims.xlsx
+
+# Import edited Excel back to UMF (split format)
+tablespec import-excel claims.xlsx tables/medical_claims/
+
+# List all registered domain types
+tablespec domains-list
+
+# Show details of a specific domain type
+tablespec domains-show us_state_code
+
+# Infer domain type for a column
+tablespec domains-infer --column state --description "State code abbreviation"
+```
@@ -0,0 +1,22 @@
+# Domain Type Inference
+
+tablespec automatically detects semantic domain types from column names, descriptions, and sample values. It uses a YAML-based registry of domain types (e.g., `us_state_code`, `email`, `phone_number`, `npi`, `ssn`).
+
+## Usage
+
+```python
+from tablespec import DomainTypeInference, DomainTypeRegistry
+
+# List available domain types
+registry = DomainTypeRegistry()
+print(registry.list_domain_types())
+
+# Infer domain type for a column
+inference = DomainTypeInference()
+domain_type, confidence = inference.infer_domain_type(
+    "member_state_code",
+    description="State where member resides",
+    sample_values=["CA", "NY", "TX"],
+)
+# domain_type="us_state_code", confidence=0.95
+```
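As a rough illustration of how name and sample-value evidence can be combined into a confidence score, the sketch below scores a column against a tiny in-memory registry. Everything here is an assumption for illustration: the `REGISTRY` contents, the regular expressions, and the equal weights are hypothetical and do not reflect tablespec's actual registry or scoring.

```python
import re
from typing import Optional

# Hypothetical mini-registry; the real registry is defined in YAML.
REGISTRY = {
    "us_state_code": {
        "name_pattern": re.compile(r"state", re.IGNORECASE),
        "value_pattern": re.compile(r"[A-Z]{2}"),
    },
    "email": {
        "name_pattern": re.compile(r"e-?mail", re.IGNORECASE),
        "value_pattern": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    },
}

# Arbitrary weights chosen for this sketch.
NAME_WEIGHT = 0.5
VALUE_WEIGHT = 0.5


def infer_domain_type(
    column_name: str,
    sample_values: Optional[list[str]] = None,
) -> tuple[Optional[str], float]:
    """Return the best-scoring domain type and its score in [0, 1]."""
    best_type: Optional[str] = None
    best_score = 0.0
    for domain_type, rules in REGISTRY.items():
        score = 0.0
        # Evidence 1: the column name matches the domain's name pattern.
        if rules["name_pattern"].search(column_name):
            score += NAME_WEIGHT
        # Evidence 2: the fraction of sample values matching the value pattern.
        if sample_values:
            matches = sum(
                1 for value in sample_values if rules["value_pattern"].fullmatch(value)
            )
            score += VALUE_WEIGHT * matches / len(sample_values)
        if score > best_score:
            best_type, best_score = domain_type, score
    return best_type, best_score


print(infer_domain_type("member_state_code", ["CA", "NY", "TX"]))
```

A column that matches no pattern yields `(None, 0.0)`, so callers can treat a missing domain type as "no inference" rather than as an error.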
@@ -0,0 +1,24 @@
+# Excel Conversion
+
+tablespec provides round-trip conversion between UMF and Excel for non-technical domain experts. Excel workbooks include data validation dropdowns, helper columns, and instructions.
+
+## Export UMF to Excel
+
+```python
+from tablespec import UMFToExcelConverter, UMFLoader
+
+loader = UMFLoader()
+umf = loader.load("tables/medical_claims/")
+converter = UMFToExcelConverter()
+workbook = converter.convert(umf)
+workbook.save("medical_claims.xlsx")
+```
+
+## Import Excel back to UMF
+
+```python
+from tablespec import ExcelToUMFConverter
+
+importer = ExcelToUMFConverter()
+umf, metadata = importer.convert("medical_claims.xlsx")
+```
@@ -0,0 +1,63 @@
+# Great Expectations Integration
+
+tablespec integrates with Great Expectations for baseline expectation generation, constraint extraction, and UMF-to-GX mapping.
+
+## Baseline Expectation Generation
+
+Generate deterministic expectations from UMF metadata:
+
+```python
+from tablespec import BaselineExpectationGenerator, load_umf_from_yaml
+
+# Load UMF
+umf = load_umf_from_yaml("examples/schema.yaml")
+umf_dict = umf.model_dump()
+
+# Generate baseline expectations
+generator = BaselineExpectationGenerator()
+expectations = generator.generate_baseline_expectations(
+    umf_dict,
+    include_structural=True
+)
+
+# Expectations include:
+# - Column existence
+# - Column types
+# - Nullability
+# - Length constraints
+# - Column count and order
+```
+
+## Constraint Extraction
+
+Extract constraints from an existing Great Expectations suite into UMF validation rules:
+
+```python
+from tablespec import GXConstraintExtractor
+
+extractor = GXConstraintExtractor()
+
+# Extract from GX checkpoint JSON
+validation_rules = extractor.extract_from_checkpoint(
+    checkpoint_path="checkpoints/my_checkpoint.json"
+)
+
+# Add to UMF
+umf.validation_rules = validation_rules
+```
+
+## UMF to Great Expectations Mapping
+
+Map UMF models to GX format:
+
+```python
+from tablespec import UmfToGxMapper
+
+mapper = UmfToGxMapper()
+
+# Convert column definitions
+gx_columns = mapper.map_columns(umf.columns)
+
+# Convert validation rules
+gx_expectations = mapper.map_validation_rules(umf.validation_rules)
+```
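The structural expectations listed under baseline generation (existence, nullability, length, column count and order) can be derived mechanically from column metadata. The sketch below shows one way to do that with plain dictionaries. It is an illustration only, not the output format of `BaselineExpectationGenerator`: the column keys (`name`, `nullable`, `max_length`) and the `{"type": ..., "kwargs": ...}` layout are assumptions, while the expectation type names are standard Great Expectations names.

```python
from typing import Any


def baseline_expectations(columns: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Derive structural expectations from column metadata (illustrative only)."""
    names = [col["name"] for col in columns]

    # Table-level expectations: column count and column order.
    expectations: list[dict[str, Any]] = [
        {
            "type": "expect_table_column_count_to_equal",
            "kwargs": {"value": len(names)},
        },
        {
            "type": "expect_table_columns_to_match_ordered_list",
            "kwargs": {"column_list": names},
        },
    ]

    for col in columns:
        name = col["name"]
        # Every declared column must exist.
        expectations.append(
            {"type": "expect_column_to_exist", "kwargs": {"column": name}}
        )
        # Non-nullable columns must not contain nulls.
        if not col.get("nullable", True):
            expectations.append(
                {
                    "type": "expect_column_values_to_not_be_null",
                    "kwargs": {"column": name},
                }
            )
        # Columns with a declared maximum length get a length bound.
        if col.get("max_length") is not None:
            expectations.append(
                {
                    "type": "expect_column_value_lengths_to_be_between",
                    "kwargs": {"column": name, "max_value": col["max_length"]},
                }
            )
    return expectations


example_columns = [
    {"name": "claim_id", "nullable": False, "max_length": 20},
    {"name": "notes"},
]
for expectation in baseline_expectations(example_columns):
    print(expectation["type"], expectation["kwargs"])
```

Because the output depends only on the metadata, running the function twice on the same input gives the same expectations, which is what makes a baseline "deterministic".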
@@ -0,0 +1,39 @@
+# LLM Prompt Generation
+
+tablespec generates structured prompts for LLM-based enrichment of UMF schemas.
+
+## Available Prompt Generators
+
+```python
+from pathlib import Path
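+from tablespec import (
+    generate_documentation_prompt,
+    generate_validation_prompt,
+    generate_relationship_prompt,
+    generate_survivorship_prompt,
+    load_umf_from_yaml,
+)
+
+# Load a UMF schema and convert it to a plain dict for the prompt generators
+umf = load_umf_from_yaml("examples/schema.yaml")
+umf_dict = umf.model_dump()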
+
+# Generate documentation prompt
+doc_prompt = generate_documentation_prompt(umf_dict)
+# Asks LLM to enhance table and column descriptions
+
+# Generate validation rules prompt
+validation_prompt = generate_validation_prompt(umf_dict)
+# Asks LLM to suggest validation rules (uniqueness, ranges, formats)
+
+# Generate relationship prompt (uses UMF directory paths)
+relationship_prompt = generate_relationship_prompt(
+    Path("tables/medical_claims"),
+    Path("tables")
+)
+# Asks LLM to identify foreign key relationships
+
+# Generate survivorship prompt (uses table name and UMF directory)
+survivorship_prompt = generate_survivorship_prompt(
+    "Medical_Claims",
+    Path("tables/medical_claims")
+)
+# Asks LLM to suggest survivorship/merge logic for deduplication
+```
@@ -0,0 +1,35 @@
+# Profiling Integration
+
+tablespec converts profiling results from Spark DataFrames and Deequ into UMF format.
+
+## Spark DataFrame Profiling
+
+```python
+from tablespec import SparkToUmfMapper  # Requires tablespec[spark]
+from tablespec import save_umf_to_yaml
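+from pyspark.sql import SparkSession
+
+# Load the data to profile (any existing DataFrame works)
+spark = SparkSession.builder.getOrCreate()
+spark_df = spark.read.parquet("claims.parquet")
+
+# Profile Spark DataFrame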
+mapper = SparkToUmfMapper()
+umf = mapper.create_umf_from_dataframe(
+    df=spark_df,
+    table_name="Medical_Claims",
+    source_file="claims.parquet"
+)
+
+# UMF includes inferred types, nullability, and sample values
+save_umf_to_yaml(umf, "medical_claims.yaml")
+```
+
+## Deequ Profiling
+
+```python
+from tablespec import DeequToUmfMapper
+
+# Convert Deequ profile to UMF
+mapper = DeequToUmfMapper()
+umf = mapper.create_umf_from_profile(
+    profile_json="deequ_profile.json",
+    table_name="Medical_Claims"
+)
+```
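For readers without a Spark or Deequ environment, the following pure-Python sketch illustrates what "inferred types, nullability, and sample values" means for a small set of rows. It is a simplified stand-in, not how `SparkToUmfMapper` or `DeequToUmfMapper` work: the `profile_rows` helper and its output keys are hypothetical, and it reports Python type names rather than UMF types.

```python
from typing import Any


def profile_rows(
    rows: list[dict[str, Any]],
    sample_size: int = 3,
) -> dict[str, dict[str, Any]]:
    """Infer a type name, nullability, and sample values per column."""
    profile: dict[str, dict[str, Any]] = {}
    column_names = sorted({name for row in rows for name in row})
    for name in column_names:
        # A missing key is treated the same as an explicit None.
        values = [row.get(name) for row in rows]
        non_null = [value for value in values if value is not None]
        type_names = sorted({type(value).__name__ for value in non_null})
        profile[name] = {
            # Report the type only when exactly one type was observed.
            "data_type": type_names[0] if len(type_names) == 1 else "unknown",
            "nullable": len(non_null) < len(values),
            "sample_values": non_null[:sample_size],
        }
    return profile


example_rows = [
    {"claim_id": "C1", "amount": 10.5},
    {"claim_id": "C2", "amount": None},
    {"claim_id": "C3"},
]
for column, stats in profile_rows(example_rows).items():
    print(column, stats)
```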
@@ -0,0 +1,18 @@
+# Sample Data Generation
+
+Generate realistic, constraint-aware sample data from UMF specifications. The generator supports healthcare-specific value generators (SSN, NPI, drug codes), foreign key relationship graphs for referential integrity, and CSV/JSON output.
+
+## Usage
+
+```python
+from tablespec import SampleDataGenerator, GenerationConfig
+
+config = GenerationConfig(record_count=100, seed=42)
+generator = SampleDataGenerator(
+    input_dir="tables/",
+    output_dir="sample_output/",
+    config=config,
+)
+generator.generate()
+# Produces CSV files in sample_output/ with realistic, relationship-aware data
+```
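Two ideas in this guide, a fixed `seed` and relationship-aware generation, can be sketched with the standard library alone. The example below is an assumption-laden illustration rather than tablespec's engine: the `FOREIGN_KEYS` layout, the `generate` function, and the `<parent>_id` column naming are all hypothetical. It orders tables with `graphlib.TopologicalSorter` (Python 3.9+) so that parent tables are populated before the tables that reference them.

```python
import random
from graphlib import TopologicalSorter

# Hypothetical layout: each table lists the parent tables it references.
FOREIGN_KEYS = {
    "members": [],
    "claims": ["members"],
}


def generate(record_count: int, seed: int) -> dict[str, list[dict[str, int]]]:
    """Generate rows per table, drawing foreign keys from existing parent rows."""
    rng = random.Random(seed)  # a fixed seed makes the output reproducible
    data: dict[str, list[dict[str, int]]] = {}

    # static_order() yields parents before the tables that depend on them.
    for table in TopologicalSorter(FOREIGN_KEYS).static_order():
        rows = []
        for row_id in range(1, record_count + 1):
            row = {"id": row_id}
            for parent in FOREIGN_KEYS[table]:
                # Pick an existing parent key so the reference always resolves.
                row[f"{parent}_id"] = rng.choice(data[parent])["id"]
            rows.append(row)
        data[table] = rows
    return data


sample = generate(record_count=5, seed=42)
print(list(sample))
print(sample["claims"][0])
```

Using a dedicated `random.Random` instance, instead of the module-level functions, keeps the generator's randomness isolated from any other code that touches the global random state.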