Skip to content

Commit cd93433

Browse files
committed
feat(pyspark): introduce package with codegen pipeline and CLI
Introduce overture-schema-pyspark, a runtime PySpark validation package whose per-feature expression modules and conformance tests are generated from the same Pydantic models that define the schema, along with an `overture-validate` CLI. Runtime (overture-schema-pyspark/src/overture/schema/pyspark/): - check.py — Check, CheckShape, FeatureValidation dataclasses. - schema_check.py — write-first comparison of Spark schemas against an expected StructType, with structural type matching and SchemaMismatch reporting. - validate.py — public API: validate_feature(), evaluate_checks(), explain_errors(). The explain stage UNPIVOTs per-row check results into one row per violation, preserving all input columns for downstream join-back. - cli.py — `overture-validate <parquet-or-directory>` runs the validation pipeline against a path of GeoParquet files. Output is one row per violation: feature ID, theme/type, failing field, check name, offending value. Single-pass evaluation keeps memory bounded for arbitrarily large inputs. - expressions/ — shared runtime utilities (constraint_expressions, column_patterns, _schema_structs). Per-feature expression modules live under expressions/overture/ and are added by the codegen in a follow-up commit. - tests/_support/ — conformance test infrastructure (scenarios, harness, helpers, mutations). The harness builds one DataFrame per feature, applies all scenarios as deterministic-UUID-tagged rows, runs validation once, and indexes violations back to scenario IDs — O(checks) rather than O(checks * scenarios). CLI filtering options: --theme <theme> limit to one theme --feature <feature> limit to one feature type --skip-schema-check run only constraint checks (no schema comparison) --count-only print violation counts per check rather than rows --suppress <key> suppress specific (feature, field, check) triples per a YAML config Codegen pipeline (overture-schema-codegen/src/.../pyspark/): FeatureSpec | constraint_dispatch.py map constraints to descriptors | check_builder.py walk FieldSpec -> CheckNode IR; resolve array nesting, variant gating | schema_builder.py FieldSpec -> SchemaField list (StructType source) | renderer.py CheckNode -> per-feature expression module test_renderer.py CheckNode -> per-feature conformance test module synthetic.py FeatureSpec -> BASE_ROW + invalid values | pipeline.py orchestrate, return GeneratedModule list The dispatch tables map every supported constraint (Ge/Gt/Le/Lt/ Interval, MinLen/MaxLen, StrippedConstraint, PatternConstraint, UniqueItemsConstraint, GeometryTypeConstraint, JsonPointerConstraint, RequireAnyOfConstraint, RadioGroupConstraint, RequireIfConstraint, ForbidIfConstraint, MinFieldsSetConstraint), NewType (Country- CodeAlpha2, LinearlyReferencedRange, RegionCode), and base type (HttpUrl, EmailStr) to constraint_expressions check functions. Discriminated unions (segment is the canonical hard case) split into per-arm test files. The codegen handles arm splitting via generate_arm_rows in synthetic.py and _filter_field_nodes_for_arm in test_renderer.py. The Makefile gains a `generate-pyspark` target and gates `check` on it so a stale generation surfaces immediately. The CLI is exposed as a `[project.scripts]` entry point so `overture-validate` becomes available after `pip install` / `uv sync`. Signed-off-by: Seth Fitzsimmons <seth@mojodna.net>
1 parent 8b4e0f5 commit cd93433

116 files changed

Lines changed: 21265 additions & 2891 deletions

File tree

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

Makefile

Lines changed: 16 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
.PHONY: default uv-sync check test-all test test-only docformat doctest doctest-only mypy mypy-only lint-only update-baselines
1+
.PHONY: default uv-sync clean-pyspark generate-pyspark check test-all test test-only docformat doctest doctest-only mypy mypy-only lint-only update-baselines
22

33
TESTMON ?= --testmon
44

@@ -7,9 +7,22 @@ default: test-all
77
install: uv-sync
88

99
uv-sync:
10-
@uv sync --all-packages 2> /dev/null
10+
@uv sync --all-packages --all-extras 2> /dev/null
1111

12-
check: uv-sync
12+
PYSPARK_EXPRESSIONS := packages/overture-schema-pyspark/src/overture/schema/pyspark/expressions/generated
13+
PYSPARK_GENERATED_TESTS := packages/overture-schema-pyspark/tests/generated
14+
15+
clean-pyspark:
16+
@rm -rf $(PYSPARK_EXPRESSIONS) $(PYSPARK_GENERATED_TESTS)
17+
18+
generate-pyspark: uv-sync clean-pyspark
19+
@uv run overture-codegen generate --format pyspark \
20+
--output-dir $(PYSPARK_EXPRESSIONS) \
21+
--test-output-dir $(PYSPARK_GENERATED_TESTS)
22+
@uv run ruff check --fix --quiet $(PYSPARK_EXPRESSIONS) $(PYSPARK_GENERATED_TESTS)
23+
@uv run ruff format --quiet $(PYSPARK_EXPRESSIONS) $(PYSPARK_GENERATED_TESTS)
24+
25+
check: uv-sync generate-pyspark
1326
@$(MAKE) -j test-only doctest-only lint-only mypy-only
1427

1528
test-all: uv-sync

packages/overture-schema-codegen/docs/design.md

Lines changed: 252 additions & 74 deletions
Large diffs are not rendered by default.

packages/overture-schema-codegen/docs/walkthrough.md

Lines changed: 212 additions & 222 deletions
Large diffs are not rendered by default.

packages/overture-schema-codegen/pyproject.toml

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -20,9 +20,25 @@ name = "overture-schema-codegen"
2020
overture-codegen = "overture.schema.codegen.cli:main"
2121

2222
[tool.uv.sources]
23+
overture-schema-addresses-theme = { workspace = true }
24+
overture-schema-base-theme = { workspace = true }
25+
overture-schema-buildings-theme = { workspace = true }
2326
overture-schema-cli = { workspace = true }
2427
overture-schema-common = { workspace = true }
28+
overture-schema-divisions-theme = { workspace = true }
29+
overture-schema-places-theme = { workspace = true }
2530
overture-schema-system = { workspace = true }
31+
overture-schema-transportation-theme = { workspace = true }
32+
33+
[dependency-groups]
34+
test = [
35+
"overture-schema-addresses-theme",
36+
"overture-schema-base-theme",
37+
"overture-schema-buildings-theme",
38+
"overture-schema-divisions-theme",
39+
"overture-schema-places-theme",
40+
"overture-schema-transportation-theme",
41+
]
2642

2743
[tool.hatch.version]
2844
path = "src/overture/schema/codegen/__about__.py"

packages/overture-schema-codegen/src/overture/schema/codegen/cli.py

Lines changed: 45 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -17,6 +17,7 @@
1717
FeatureSpec,
1818
is_model_class,
1919
is_union_alias,
20+
partitions_from_tags,
2021
)
2122
from .extraction.union_extraction import extract_union
2223
from .layout.module_layout import (
@@ -26,12 +27,13 @@
2627
entry_point_module,
2728
)
2829
from .markdown.pipeline import generate_markdown_pages
30+
from .pyspark.pipeline import generate_pyspark_modules
2931

3032
log = logging.getLogger(__name__)
3133

3234
__all__ = ["cli"]
3335

34-
_OUTPUT_FORMATS = ("markdown",)
36+
_OUTPUT_FORMATS = ("markdown", "pyspark")
3537

3638
_FEATURE_FRONTMATTER = "---\nsidebar_position: 1\n---\n\n"
3739

@@ -84,21 +86,29 @@ def list_models() -> None:
8486
"--output-dir",
8587
type=click.Path(path_type=Path),
8688
default=None,
87-
help="Write output to directory (default: stdout)",
89+
help="Write output files directly into this directory (default: stdout). "
90+
"For pyspark, writes expression modules (*.py) and a _registry.py. "
91+
"For markdown, writes theme subdirectories.",
92+
)
93+
@click.option(
94+
"--test-output-dir",
95+
type=click.Path(path_type=Path),
96+
default=None,
97+
help="Write test modules (test_*.py) into this directory (pyspark only).",
8898
)
8999
def generate(
90100
output_format: str,
91101
tags: tuple[str, ...],
92102
filters: tuple[str, ...],
93103
excludes: tuple[str, ...],
94104
output_dir: Path | None,
105+
test_output_dir: Path | None,
95106
) -> None:
96107
"""Generate code/docs from discovered models."""
97-
all_models = discover_models()
108+
if output_format != "pyspark" and test_output_dir is not None:
109+
raise click.UsageError("--test-output-dir is only valid with --format pyspark")
98110

99-
# Schema root from ALL entry points (before tag filters).
100-
module_paths = [entry_point_module(k.entry_point) for k in all_models]
101-
schema_root = compute_schema_root(module_paths)
111+
all_models = discover_models()
102112

103113
models = filter_models(all_models, build_selector(tags, filters, excludes))
104114

@@ -107,18 +117,27 @@ def generate(
107117

108118
feature_specs: list[FeatureSpec] = []
109119
for key, entry in models.items():
120+
partitions = partitions_from_tags(key.tags)
110121
if is_model_class(entry):
111-
feature_specs.append(extract_model(entry, entry_point=key.entry_point))
122+
feature_specs.append(
123+
extract_model(entry, entry_point=key.entry_point, partitions=partitions)
124+
)
112125
elif is_union_alias(entry):
113126
feature_specs.append(
114127
extract_union(
115128
entry_point_class(key.entry_point),
116129
entry,
117130
entry_point=key.entry_point,
131+
partitions=partitions,
118132
)
119133
)
120134

121-
_generate_markdown(feature_specs, schema_root, output_dir)
135+
if output_format == "pyspark":
136+
_generate_pyspark(feature_specs, output_dir, test_output_dir)
137+
else:
138+
module_paths = [entry_point_module(k.entry_point) for k in all_models]
139+
schema_root = compute_schema_root(module_paths)
140+
_generate_markdown(feature_specs, schema_root, output_dir)
122141

123142

124143
def _generate_markdown(
@@ -141,6 +160,24 @@ def _generate_markdown(
141160
_write_category_files(output_dir, all_paths, feature_paths)
142161

143162

163+
def _generate_pyspark(
164+
feature_specs: list[FeatureSpec],
165+
output_dir: Path | None,
166+
test_output_dir: Path | None = None,
167+
) -> None:
168+
"""Generate PySpark validation modules.
169+
170+
Output is syntactically valid Python; we assume a code formatter runs
171+
over the written directories afterwards to match existing conventions.
172+
"""
173+
modules = generate_pyspark_modules(feature_specs)
174+
for mod in modules.source:
175+
_write_output(mod.content, output_dir, mod.path)
176+
if test_output_dir is not None:
177+
for mod in modules.test:
178+
_write_output(mod.content, test_output_dir, mod.path)
179+
180+
144181
def _ancestor_dirs(paths: set[PurePosixPath]) -> set[PurePosixPath]:
145182
"""Collect all ancestor directories for a set of file paths."""
146183
dirs: set[PurePosixPath] = set()

packages/overture-schema-codegen/src/overture/schema/codegen/extraction/case_conversion.py

Lines changed: 0 additions & 41 deletions
This file was deleted.
Lines changed: 172 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,172 @@
1+
"""Tree-shaped IR for model field types.
2+
3+
`FieldShape` is a discriminated union -- `Primitive`, `LiteralScalar`,
4+
`AnyScalar`, `ModelRef`, `UnionRef`, `ArrayOf`, `MapOf`, `NewTypeShape`
5+
-- nested to describe arbitrary list / dict / NewType wrapping. Each
6+
variant carries its own constraints (where meaningful), and walkers
7+
encounter each constraint at the layer it targets.
8+
9+
The three terminal scalar variants (`Primitive`, `LiteralScalar`,
10+
`AnyScalar`) are grouped under the `Scalar` type alias for consumers
11+
that only need to ask "is this a leaf?".
12+
13+
`NewTypeShape` wraps an inner shape, so its position relative to
14+
`ArrayOf` is structural: `NewTypeShape(inner=ArrayOf(...))` is a
15+
NewType over `list[X]`, while `ArrayOf(element=NewTypeShape(...))`
16+
is a list of NewType-wrapped values. Consumers pattern-match on
17+
shape to distinguish the two.
18+
"""
19+
20+
from __future__ import annotations
21+
22+
from dataclasses import dataclass
23+
from typing import TYPE_CHECKING, TypeAlias
24+
25+
if TYPE_CHECKING:
26+
from .specs import ModelSpec, UnionSpec
27+
28+
__all__ = [
29+
"AnyScalar",
30+
"ArrayOf",
31+
"ConstraintSource",
32+
"FieldShape",
33+
"LiteralScalar",
34+
"MapOf",
35+
"ModelRef",
36+
"NewTypeShape",
37+
"Primitive",
38+
"Scalar",
39+
"UnionRef",
40+
]
41+
42+
43+
@dataclass(frozen=True, slots=True)
44+
class ConstraintSource:
45+
"""A constraint paired with the NewType that contributed it.
46+
47+
`source_ref` and `source_name` identify the NewType that declared
48+
the constraint; both are `None` for constraints contributed directly
49+
on a field annotation rather than through a NewType. `constraint`
50+
is the raw metadata object from `Annotated[..., constraint]`.
51+
"""
52+
53+
source_ref: object | None
54+
source_name: str | None
55+
constraint: object
56+
57+
58+
@dataclass(frozen=True, slots=True)
59+
class Primitive:
60+
"""Terminal type with a registry lookup key.
61+
62+
Covers primitives (`int32`, `str`), enums, Pydantic built-ins
63+
(`HttpUrl`, `EmailStr`), and `BaseModel` subclasses that weren't
64+
resolved to a `ModelRef` (e.g. when no `model_resolver` was
65+
supplied).
66+
"""
67+
68+
base_type: str
69+
source_type: type | None = None
70+
constraints: tuple[ConstraintSource, ...] = ()
71+
72+
73+
@dataclass(frozen=True, slots=True)
74+
class LiteralScalar:
75+
"""`Literal[X, ...]` terminal."""
76+
77+
values: tuple[object, ...]
78+
constraints: tuple[ConstraintSource, ...] = ()
79+
80+
81+
@dataclass(frozen=True, slots=True)
82+
class AnyScalar:
83+
"""`typing.Any` terminal."""
84+
85+
constraints: tuple[ConstraintSource, ...] = ()
86+
87+
88+
Scalar: TypeAlias = Primitive | LiteralScalar | AnyScalar
89+
"""Terminal shape: a value that doesn't wrap another shape.
90+
91+
Consumers that just need "is this a leaf?" check `isinstance(x, Scalar)`;
92+
consumers that need terminal-specific data narrow to a variant.
93+
"""
94+
95+
96+
@dataclass(frozen=True, slots=True)
97+
class ModelRef:
98+
"""Reference to a Pydantic sub-model.
99+
100+
`starts_cycle` marks the back-edge of a cycle in the model graph;
101+
consumers that recurse into models must stop at cycle starts.
102+
"""
103+
104+
model: ModelSpec
105+
starts_cycle: bool = False
106+
107+
108+
@dataclass(frozen=True, slots=True)
109+
class UnionRef:
110+
"""Reference to a discriminated union of models."""
111+
112+
union: UnionSpec
113+
114+
115+
@dataclass(frozen=True, slots=True)
116+
class ArrayOf:
117+
"""Sequence of values sharing a single element shape.
118+
119+
Nested arrays are nested `ArrayOf` instances; there is no numeric
120+
depth field. `constraints` carries array-level validation rules
121+
(length, uniqueness). Per-element constraints live on `element`
122+
and its descendants.
123+
"""
124+
125+
element: FieldShape
126+
constraints: tuple[ConstraintSource, ...] = ()
127+
128+
129+
@dataclass(frozen=True, slots=True)
130+
class MapOf:
131+
"""Mapping from a key shape to a value shape.
132+
133+
`constraints` carries map-level validation rules. Per-key and
134+
per-value constraints live on `key` / `value` respectively.
135+
"""
136+
137+
key: FieldShape
138+
value: FieldShape
139+
constraints: tuple[ConstraintSource, ...] = ()
140+
141+
142+
@dataclass(frozen=True, slots=True)
143+
class NewTypeShape:
144+
"""A NewType wrapper around an inner shape.
145+
146+
Position relative to other wrappers is meaningful:
147+
`NewTypeShape(inner=ArrayOf(...))` is a NewType over `list[X]`;
148+
`ArrayOf(element=NewTypeShape(...))` is a list of NewType-wrapped
149+
values. Consumers distinguish the two by pattern, not a numeric
150+
offset.
151+
152+
Constraints contributed by the NewType chain attach to the
153+
`Scalar` / `ArrayOf` / `MapOf` layer they target, not to the
154+
wrapper itself. `name` and `ref` identify the NewType for linking
155+
without owning constraint state.
156+
"""
157+
158+
name: str
159+
ref: object
160+
inner: FieldShape
161+
162+
163+
FieldShape: TypeAlias = (
164+
Primitive
165+
| LiteralScalar
166+
| AnyScalar
167+
| ModelRef
168+
| UnionRef
169+
| ArrayOf
170+
| MapOf
171+
| NewTypeShape
172+
)

0 commit comments

Comments
 (0)