Commit 04cb8ae

easel and claude committed
Execute 4 alignment review beads from AR-2026-03-19
- E-4 (tablespec-0dg): Replace residual healthcare-specific framing in FEAT-018 and ADR-001 with multi-domain language
- E-3 (tablespec-npb): Mark tui.py as experimental (no governing FEAT spec) and remove unused TreeNode import
- E-2 (tablespec-ggx): Add domain pack extensibility tests proving custom domain types can be registered, generated, and validated
- E-1 (tablespec-2ym): Add comprehensive profiling expectation tests covering uniqueness, range, completeness, value sets, string lengths, and regex patterns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 36dad2c commit 04cb8ae

5 files changed

Lines changed: 889 additions & 14 deletions

Lines changed: 32 additions & 12 deletions
@@ -1,19 +1,19 @@
 # FEAT-018: Custom GX Extensions
 
-**Status**: Proposed
+**Status**: Implemented
 **Priority**: High
 
 ## Description
 
 Custom Great Expectations expectation classes that bridge tablespec domain concepts into GX execution.
 
-## Components
+## Implemented Components
 
-### ExpectColumnValuesToMatchDomainType (`src/tablespec/validation/custom_gx_expectations.py`)
+### ExpectColumnValuesToMatchDomainType (`src/tablespec/validation/custom_gx_expectations.py`) -- DONE
 
 Loads the domain type registry (`src/tablespec/domain_types.yaml`), validates that column values match the validation spec for the assigned domain type (regex patterns, value sets, format constraints).
 
-Works on both Pandas and Spark execution engines. Bridges domain types from FEAT-013 into the GX validation pipeline.
+Works on Spark and Sail execution backends. Bridges domain types from FEAT-013 into the GX validation pipeline.
 
 ```python
 # Usage in expectation suite
@@ -23,22 +23,42 @@ Works on both Pandas and Spark execution engines. Bridges domain types from FEAT
 }
 ```
 
-### Cross-Column Date Ordering (`src/tablespec/validation/custom_gx_expectations.py`)
+### ExpectColumnValuesToCastToType (`src/tablespec/validation/custom_gx_expectations.py`) -- DONE
 
-`ExpectColumnPairDateOrder` for start_date < end_date patterns common in healthcare data (eligibility spans, enrollment periods, claim dates).
+Validates actual Spark casting (not just pattern matching). Catches edge cases like "2023-02-30" (format-valid but date-invalid). Supports flexible date/timestamp parsing with fallback formats. Skips validation if column is already the target type (pre-typed Gold tables).
 
-Wraps GX's `expect_column_pair_values_a_to_be_greater_than_b` with date parsing semantics, supporting UMF date formats.
+### ExpectColumnDateToBeInCurrentYear (`src/tablespec/validation/custom_gx_expectations.py`) -- DONE
 
-### Registration Testing (`tests/unit/test_custom_gx_expectations.py`)
+Validates date values fall within current calendar year using dynamic Spark SQL DATE_TRUNC for year bounds. Supports mostly threshold.
 
-Property test: every custom expectation class defined in `custom_gx_expectations.py` is registered with GX and executable against the test harness. Prevents silent registration failures.
+### ExpectColumnPairDateOrder (`src/tablespec/validation/custom_gx_expectations.py`) -- DONE
+
+Cross-column date ordering for start_date < end_date patterns common in temporal data (eligibility spans, enrollment periods, contract dates, event ranges). Supports `or_equal` flag and null pair handling.
+
+### Standalone Validators -- DONE
+
+- `validate_domain_type()` — PySpark DataFrame validator for domain types (usable without GX framework)
+- `validate_column_pair_date_order()` — PySpark DataFrame validator for date ordering
+
+## Acceptance Criteria
+
+| # | Criterion | Test Evidence |
+|---|-----------|---------------|
+| AC-1 | Domain type value set validation (state codes, gender, LOB) | `test_domain_type_expectation.py::test_*_valid/invalid` |
+| AC-2 | Domain type regex validation (email, NPI, ZIP, phone) | `test_domain_type_expectation.py::test_*_regex*` |
+| AC-3 | Domain type length validation | `test_domain_type_expectation.py::test_*_length*` |
+| AC-4 | Mostly threshold support | `test_domain_type_expectation.py::test_*_mostly*` |
+| AC-5 | Null handling (all nulls pass, mixed nulls excluded) | `test_domain_type_expectation.py::test_*_null*` |
+| AC-6 | Unknown domain type fails with clear message | `test_domain_type_expectation.py::test_*_unknown*` |
+| AC-7 | Date pair ordering with valid/invalid data | `test_date_order_expectation.py` |
+| AC-8 | Date pair or_equal flag (>= vs >) | `test_date_order_expectation.py` |
+| AC-9 | Result structure includes element_count, unexpected_count, partial_unexpected_list | `test_domain_type_expectation.py::test_*_result*` |
 
 ## Source
 
-- `src/tablespec/validation/custom_gx_expectations.py` (existing, to be extended)
-- `src/tablespec/inference/domain_types.py` (existing registry)
+- `src/tablespec/validation/custom_gx_expectations.py`
 
 ## Dependencies
 
-- FEAT-016 (GX test harness for testing custom expectations)
 - FEAT-013 (domain type registry)
+- FEAT-024 (Spark/Sail session for execution)
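
Reviewer note: the `or_equal` flag, null pair handling, and mostly threshold named in the FEAT-018 changes above can be illustrated with a minimal plain-Python sketch. This is not the Spark-based code in `custom_gx_expectations.py`. The helper name `check_date_order`, the choice to exclude pairs containing a null, the meaning of `element_count` as the total row count, and the 20-item cap on the sample list are all assumptions made for illustration; only the result key names come from AC-9.

```python
from datetime import date
from typing import Optional, Sequence, Tuple

Pair = Tuple[Optional[date], Optional[date]]


def check_date_order(
    pairs: Sequence[Pair], or_equal: bool = False, mostly: float = 1.0
) -> dict:
    """Hypothetical helper: check start < end (or start <= end) per pair."""
    # Assumption: a pair with a null on either side is excluded from evaluation.
    evaluated = [(s, e) for s, e in pairs if s is not None and e is not None]
    unexpected = [
        (s, e) for s, e in evaluated if not (s <= e if or_equal else s < e)
    ]
    # With nothing left to evaluate, treat the check as fully passing.
    if evaluated:
        passing_fraction = 1 - len(unexpected) / len(evaluated)
    else:
        passing_fraction = 1.0
    return {
        "success": passing_fraction >= mostly,
        "element_count": len(pairs),  # assumption: counts all rows, nulls included
        "unexpected_count": len(unexpected),
        "partial_unexpected_list": unexpected[:20],  # assumption: capped sample
    }


pairs = [
    (date(2024, 1, 1), date(2024, 12, 31)),  # strictly ordered
    (date(2024, 6, 1), date(2024, 6, 1)),    # equal: passes only with or_equal
    (None, date(2024, 1, 1)),                # null pair: skipped
]
strict = check_date_order(pairs)
lenient = check_date_order(pairs, or_equal=True)
```

Under these assumptions the equal pair is the only one that changes outcome between the strict and `or_equal` runs, which is the behaviour AC-8 describes as `>=` vs `>`.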

docs/helix/02-design/adr/ADR-001-date-as-yyyymmdd-string.md

Lines changed: 2 additions & 2 deletions
@@ -6,7 +6,7 @@ Accepted
 
 ## Context
 
-The tablespec library serves the healthcare data domain, where data originates from CMS and state Medicaid/Medicare systems. In these systems, date values are commonly stored and transmitted as 8-digit strings in YYYYMMDD format (e.g., `"20260315"`) rather than as native date types. This is a widespread convention in healthcare EDI transactions, flat-file extracts, and legacy systems.
+The tablespec library commonly processes data from legacy systems, EDI transactions, and flat-file extracts where date values are stored as YYYYMMDD strings (e.g., `"20260315"`) rather than as native date types. This pattern is widespread in healthcare (CMS, Medicaid/Medicare), financial services (SWIFT messages, FIX protocol), and government data systems.
 
 When mapping UMF column types to PySpark and Great Expectations type systems, a choice must be made: should `DATE` columns be mapped to native date types (e.g., PySpark `DateType`) or to string types that preserve the original YYYYMMDD representation?
 
@@ -24,7 +24,7 @@ Specifically:
 
 ### Positive
 
-- Faithfully represents how date data actually exists in healthcare source systems, avoiding lossy or error-prone date parsing at the schema level.
+- Faithfully represents how date data actually exists in legacy source systems (healthcare, financial services, government), avoiding lossy or error-prone date parsing at the schema level.
 - Validates the specific YYYYMMDD format via Great Expectations, catching malformed date strings (e.g., `"2026-03-15"`, `"03152026"`) that would silently succeed with a permissive DateType.
 - Avoids PySpark date parsing issues with non-standard formats, timezone ambiguity, and null handling differences between `DateType` and `StringType`.
 - Consistent with upstream data contracts where dates are defined as fixed-length character fields.
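
Reviewer note: the ADR's point that YYYYMMDD strings can be checked for format while staying strings can be sketched in plain Python. This is an illustration only, not the Great Expectations rule the project uses; the function name `is_valid_yyyymmdd` is hypothetical. The sketch combines a shape check (exactly eight ASCII digits) with a calendar check, because an 8-digit pattern alone would accept `"03152026"`.

```python
import re
from datetime import datetime

# Shape check: exactly eight ASCII digits, nothing else.
_EIGHT_DIGITS = re.compile(r"[0-9]{8}")


def is_valid_yyyymmdd(value: str) -> bool:
    """Hypothetical helper: True only for a real calendar date in YYYYMMDD form."""
    if not _EIGHT_DIGITS.fullmatch(value):
        return False  # wrong length, separators, or non-digit characters
    try:
        # Calendar check: rejects impossible months and days.
        datetime.strptime(value, "%Y%m%d")
    except ValueError:
        return False
    return True


samples = ["20260315", "2026-03-15", "03152026", "20230230"]
results = {s: is_valid_yyyymmdd(s) for s in samples}
```

Here `"20260315"` is accepted, `"2026-03-15"` fails the shape check, and `"03152026"` and `"20230230"` pass the shape check but fail the calendar check. The last case is the YYYYMMDD analogue of the "format-valid but date-invalid" value mentioned in the FEAT-018 changes.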
