Execute 4 alignment review beads from AR-2026-03-19
- E-4 (tablespec-0dg): Replace residual healthcare-specific framing in
FEAT-018 and ADR-001 with multi-domain language
- E-3 (tablespec-npb): Mark tui.py as experimental (no governing FEAT
spec) and remove unused TreeNode import
- E-2 (tablespec-ggx): Add domain pack extensibility tests proving
custom domain types can be registered, generated, and validated
- E-1 (tablespec-2ym): Add comprehensive profiling expectation tests
covering uniqueness, range, completeness, value sets, string lengths,
and regex patterns
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Loads the domain type registry (`src/tablespec/domain_types.yaml`), validates that column values match the validation spec for the assigned domain type (regex patterns, value sets, format constraints).
-Works on both Pandas and Spark execution engines. Bridges domain types from FEAT-013 into the GX validation pipeline.
+Works on Spark and Sail execution backends. Bridges domain types from FEAT-013 into the GX validation pipeline.
```python
# Usage in expectation suite
@@ -23,22 +23,42 @@ Works on both Pandas and Spark execution engines. Bridges domain types from FEAT
}
```
-### Cross-Column Date Ordering (`src/tablespec/validation/custom_gx_expectations.py`)
`ExpectColumnPairDateOrder` for start_date < end_date patterns common in healthcare data (eligibility spans, enrollment periods, claim dates).
+Validates actual Spark casting (not just pattern matching). Catches edge cases like "2023-02-30" (format-valid but date-invalid). Supports flexible date/timestamp parsing with fallback formats. Skips validation if column is already the target type (pre-typed Gold tables).
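The difference between pattern matching and an actual cast can be illustrated without Spark. The sketch below is a plain-Python analogue, not the tablespec implementation: `looks_like_date`, `parses_as_date`, and the fallback format list are hypothetical, and `datetime.strptime` stands in for the Spark cast.

```python
import re
from datetime import datetime

# Hypothetical helpers for illustration only; the real expectation casts in Spark.
ISO_DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")


def looks_like_date(value: str) -> bool:
    """Pattern-only check: accepts anything shaped like YYYY-MM-DD."""
    return ISO_DATE_SHAPE.fullmatch(value) is not None


def parses_as_date(value: str, formats=("%Y-%m-%d", "%Y%m%d")) -> bool:
    """Parse check with fallback formats (assumed list): True only for real calendar dates."""
    for fmt in formats:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue  # try the next fallback format
    return False


# "2023-02-30" has the right shape but is not a real date.
print(looks_like_date("2023-02-30"), parses_as_date("2023-02-30"))  # True False
```

A value that passes the shape check but fails the parse is the kind of row the description says a cast-based check catches.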
-Wraps GX's `expect_column_pair_values_a_to_be_greater_than_b` with date parsing semantics, supporting UMF date formats.
Validates date values fall within current calendar year using dynamic Spark SQL DATE_TRUNC for year bounds. Supports mostly threshold.
-Property test: every custom expectation class defined in `custom_gx_expectations.py` is registered with GX and executable against the test harness. Prevents silent registration failures.
Cross-column date ordering for start_date < end_date patterns common in temporal data (eligibility spans, enrollment periods, contract dates, event ranges). Supports `or_equal` flag and null pair handling.
+
+### Standalone Validators -- DONE
+
+- `validate_domain_type()` — PySpark DataFrame validator for domain types (usable without GX framework)
+- `validate_column_pair_date_order()` — PySpark DataFrame validator for date ordering
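The cross-column ordering rule (start_date < end_date, with an `or_equal` flag and null pair handling) can be sketched in plain Python. This is not the PySpark validator listed above: `date_pair_in_order` and `failing_pairs` are hypothetical names, and treating a pair with a null on either side as "skipped" is an assumption, since the diff does not say how null pairs are handled.

```python
from datetime import date
from typing import Iterable, List, Optional, Tuple

DatePair = Tuple[Optional[date], Optional[date]]


def date_pair_in_order(
    start: Optional[date], end: Optional[date], or_equal: bool = False
) -> Optional[bool]:
    """Return True/False for the ordering check, or None when either value is null.

    Assumption: null pairs are skipped rather than counted as failures.
    """
    if start is None or end is None:
        return None
    return start <= end if or_equal else start < end


def failing_pairs(pairs: Iterable[DatePair], or_equal: bool = False) -> List[DatePair]:
    """Collect only the pairs that definitely violate the ordering."""
    return [p for p in pairs if date_pair_in_order(p[0], p[1], or_equal) is False]


spans = [
    (date(2026, 1, 1), date(2026, 12, 31)),  # ordered
    (date(2026, 3, 15), date(2026, 3, 15)),  # equal: fails unless or_equal=True
    (date(2026, 6, 1), date(2026, 5, 1)),    # reversed
    (date(2026, 1, 1), None),                # null pair: skipped (assumed)
]
print(len(failing_pairs(spans)), len(failing_pairs(spans, or_equal=True)))  # 2 1
```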
+
+## Acceptance Criteria
+
+| # | Criterion | Test Evidence |
+|---|-----------|---------------|
+| AC-1 | Domain type value set validation (state codes, gender, LOB) | `test_domain_type_expectation.py::test_*_valid/invalid` |
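The domain type check described at the top of this diff (column values matched against regex patterns or value sets from a registry) might look roughly like the sketch below. The `DOMAIN_TYPES` dict, its keys, and `invalid_values` are hypothetical; the actual schema of `domain_types.yaml` is not shown in this diff, and skipping nulls is an assumption.

```python
import re
from typing import Iterable, List, Optional

# Hypothetical registry entries standing in for domain_types.yaml.
DOMAIN_TYPES = {
    "us_state_code": {"value_set": {"CA", "NY", "TX"}},
    "zip_code": {"regex": r"\d{5}(-\d{4})?"},
}


def invalid_values(values: Iterable[Optional[str]], domain_type: str) -> List[str]:
    """Return the non-null values that violate the spec for the given domain type."""
    spec = DOMAIN_TYPES[domain_type]
    bad = []
    for value in values:
        if value is None:
            continue  # assumption: nulls are left to a separate completeness check
        if "value_set" in spec and value not in spec["value_set"]:
            bad.append(value)
        elif "regex" in spec and re.fullmatch(spec["regex"], value) is None:
            bad.append(value)
    return bad


print(invalid_values(["CA", "ZZ", None], "us_state_code"))      # ['ZZ']
print(invalid_values(["12345", "12345-6789", "1234"], "zip_code"))  # ['1234']
```

Registering a custom domain type in this sketch is just adding another key to the dict, which is the kind of extensibility the E-2 tests in the commit message are meant to prove for the real registry.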
docs/helix/02-design/adr/ADR-001-date-as-yyyymmdd-string.md (2 additions & 2 deletions)
@@ -6,7 +6,7 @@ Accepted
## Context
-The tablespec library serves the healthcare data domain, where data originates from CMS and state Medicaid/Medicare systems. In these systems, date values are commonly stored and transmitted as 8-digit strings in YYYYMMDD format (e.g., `"20260315"`) rather than as native date types. This is a widespread convention in healthcare EDI transactions, flat-file extracts, and legacy systems.
+The tablespec library commonly processes data from legacy systems, EDI transactions, and flat-file extracts where date values are stored as YYYYMMDD strings (e.g., `"20260315"`) rather than as native date types. This pattern is widespread in healthcare (CMS, Medicaid/Medicare), financial services (SWIFT messages, FIX protocol), and government data systems.
When mapping UMF column types to PySpark and Great Expectations type systems, a choice must be made: should `DATE` columns be mapped to native date types (e.g., PySpark `DateType`) or to string types that preserve the original YYYYMMDD representation?
@@ -24,7 +24,7 @@ Specifically:
### Positive
-- Faithfully represents how date data actually exists in healthcare source systems, avoiding lossy or error-prone date parsing at the schema level.
+- Faithfully represents how date data actually exists in legacy source systems (healthcare, financial services, government), avoiding lossy or error-prone date parsing at the schema level.
- Validates the specific YYYYMMDD format via Great Expectations, catching malformed date strings (e.g., `"2026-03-15"`, `"03152026"`) that would silently succeed with a permissive DateType.
- Avoids PySpark date parsing issues with non-standard formats, timezone ambiguity, and null handling differences between `DateType` and `StringType`.
- Consistent with upstream data contracts where dates are defined as fixed-length character fields.
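To make the format-validation point concrete, here is a minimal sketch of a YYYYMMDD string check using a plain Python regex as a stand-in for the Great Expectations validation. The pattern and the `is_yyyymmdd` name are assumptions for illustration, not what tablespec ships. The pattern constrains month and day ranges only, so it does not verify calendar validity (for example, it accepts `"20260230"`).

```python
import re

# Assumed pattern: 4-digit year, month 01-12, day 01-31, no separators.
YYYYMMDD = re.compile(r"\d{4}(0[1-9]|1[0-2])(0[1-9]|[12]\d|3[01])")


def is_yyyymmdd(value: str) -> bool:
    """Format-level check for an 8-character YYYYMMDD date string."""
    return YYYYMMDD.fullmatch(value) is not None


for candidate in ("20260315", "2026-03-15", "03152026"):
    print(candidate, is_yyyymmdd(candidate))
# 20260315 True
# 2026-03-15 False
# 03152026 False
```

Note that `"03152026"` is eight digits and is rejected only because its month position reads `20`; a bare eight-digit pattern would let it through.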