Commit f21eded
v2: oneof schema, typed builders, GraphFrames-style stub delivery
Implements Stage 1 of the protobuf architecture review (see deequ repo
docs/adr/0001..0005). The wire format moves from a string-discriminator
ConstraintMessage/AnalyzerMessage to typed `oneof` arms — one per Check
or Analyzer builder method — with reused `…Spec` payload submessages.
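As a rough illustration of the change described above, the sketch below models a message with one typed arm per builder method in plain Python. It is an analogy only, not the generated protobuf code: `CompletenessSpec`, `PatternSpec`, the arm names, and `which_oneof` are hypothetical stand-ins for whatever the real schema defines.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CompletenessSpec:
    """Hypothetical payload for an isComplete-style constraint."""
    column: str


@dataclass
class PatternSpec:
    """Hypothetical payload for a hasPattern-style constraint."""
    column: str
    pattern: str


@dataclass
class ConstraintMessage:
    """Stand-in for a message whose `body` oneof has one arm per builder."""
    is_complete: Optional[CompletenessSpec] = None
    has_pattern: Optional[PatternSpec] = None

    def which_oneof(self) -> Optional[str]:
        # Returns the name of the arm that is set, or None if no arm is set.
        set_arms = [name for name, value in vars(self).items() if value is not None]
        if len(set_arms) > 1:
            # Real protobuf clears the previous arm instead of raising;
            # this sketch raises to stay short.
            raise ValueError(f"oneof allows a single arm, got {set_arms}")
        return set_arms[0] if set_arms else None


# The builder populates exactly one typed arm; no string discriminator is needed.
msg = ConstraintMessage(has_pattern=PatternSpec(column="email", pattern=r".+@.+"))
arm = msg.which_oneof()
```

The receiver can branch on the arm name and read a typed payload, rather than parsing a constraint-type string and guessing which generic fields apply.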
Schema-driven builder rewrites:
- pydeequ/v2/predicates.py — Predicate uses CompareOp enum; field
renamed `operator` → `op`.
- pydeequ/v2/checks.py — each Check builder method (isComplete,
hasPattern, isContainedIn, isLessThan, …) populates the matching
oneof arm directly. The old _add_constraint(constraint_type, …)
shape-shifter is gone.
- pydeequ/v2/analyzers.py — same pattern. Compliance gets its own
ComplianceAnalyzerSpec (instance/predicate, no longer mislabeled as
column/pattern). Pair-column constraints use named column_a/column_b.
- pydeequ/v2/suggestions.py — Rules enum values map to ConstraintRuleSet
proto enum values directly. testset_seed=0 is now a legal user choice
(proto3 `optional`).
- pydeequ/v2/profiles.py — `optional` low_cardinality_histogram_threshold.
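The `testset_seed=0` point can be illustrated with a small plain-Python sketch. Without field presence, an unset integer reads as 0, so a consumer cannot tell "unset" from "explicitly 0"; with proto3 `optional` (checked via `HasField` in generated Python code) it can. Here `None` stands in for "unset", and `DEFAULT_SEED` and both function names are hypothetical, not taken from pydeequ.

```python
from typing import Optional

DEFAULT_SEED = 42  # hypothetical fallback, not taken from pydeequ


def resolve_seed_implicit(seed: int) -> int:
    # Without presence tracking, 0 is indistinguishable from "not provided",
    # so a user-supplied 0 is silently replaced by the default.
    return seed if seed != 0 else DEFAULT_SEED


def resolve_seed_optional(seed: Optional[int]) -> int:
    # With presence tracking (None means unset here), 0 survives as a real choice.
    return DEFAULT_SEED if seed is None else seed


implicit_result = resolve_seed_implicit(0)      # user's 0 is lost
optional_result = resolve_seed_optional(0)      # user's 0 is kept
unset_result = resolve_seed_optional(None)      # default applies only when unset
```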
Stub delivery (graphframes pattern, per ADR-0005):
- pydeequ/v2/proto/deequ_connect_pb2.py and _pb2.pyi are checked in.
Refresh by running `DEEQU_PROTO_PATH=… python scripts/regen_proto.py`
whenever the deequ schema changes; intermediate .proto is gitignored.
- scripts/regen_proto.py is a developer convenience, not a build hook.
- pyproject.toml gains grpcio-tools as a dev dependency for the regen
script.
- .github/workflows/base.yml gains a (currently disabled) drift-check
step that will activate once the matching deequ JAR ships .proto under
META-INF/protobuf/.
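A drift check of the kind mentioned above could, in its simplest form, compare the checked-in stub against a freshly regenerated one. The sketch below is an assumption about how such a step might work, not the contents of base.yml or regen_proto.py; the file names and helper functions are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def stubs_have_drifted(checked_in: Path, regenerated: Path) -> bool:
    """True when the regenerated stub differs from the checked-in one."""
    return file_digest(checked_in) != file_digest(regenerated)


# Demonstration with throwaway files standing in for generated stubs.
with tempfile.TemporaryDirectory() as tmp:
    committed = Path(tmp) / "committed_pb2.py"
    fresh = Path(tmp) / "fresh_pb2.py"

    committed.write_text("# generated stub, schema A\n")
    fresh.write_text("# generated stub, schema A\n")
    same_schema_drift = stubs_have_drifted(committed, fresh)

    fresh.write_text("# generated stub, schema B\n")
    changed_schema_drift = stubs_have_drifted(committed, fresh)
```

A byte-level comparison is strict: output from a different protoc version would also count as drift, which may or may not be what a CI step wants.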
Cross-cutting:
- WIRE_FORMAT_VERSION constant + field dropped from runners and proto
__init__.py per ADR-0005 (Spark Connect's RelationPlugin contract uses
the protobuf type URL as the version discriminator).
- tests/v2/test_unit.py rewritten to assert msg.WhichOneof("body")
shape; 11 unit tests all pass.
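To make the type-URL point concrete, the sketch below dispatches on the fully qualified message name carried in a type URL instead of on a separate version field. It is plain Python with hypothetical message names and handlers; it does not reproduce the RelationPlugin API, and declining unknown types by returning `None` is an assumption of this sketch.

```python
from typing import Callable, Dict, Optional


def type_name(type_url: str) -> str:
    # A type URL ends in "/<fully.qualified.MessageName>"; keep the last segment.
    return type_url.rsplit("/", 1)[-1]


# Hypothetical registry: a schema revision that changes the package or message
# name changes the type URL, so no extra version constant is needed.
HANDLERS: Dict[str, Callable[[bytes], str]] = {
    "example.v2.VerificationRelation": lambda payload: "handled by v2",
}


def dispatch(type_url: str, payload: bytes) -> Optional[str]:
    handler = HANDLERS.get(type_name(type_url))
    # Unknown types are declined rather than guessed at.
    return handler(payload) if handler else None


known = dispatch("type.googleapis.com/example.v2.VerificationRelation", b"")
unknown = dispatch("type.googleapis.com/example.v1.VerificationRelation", b"")
```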
Verified via end-to-end smoke test (Spark 3.5 Connect server + freshly
built deequ JAR + tutorials/data_quality_example_v2.py): 9 analyzers, 12
constraints (1 expected failure on duplicate detection), 10-column
profile, 19 constraint suggestions all produce correct DataFrames.
1 parent 63af152 commit f21eded
15 files changed
Lines changed: 1144 additions & 1876 deletions
File tree
- .github/workflows
- pydeequ/v2
- proto
- scripts
- tests/v2
- tutorials