
Commit f21eded

v2: oneof schema, typed builders, GraphFrames-style stub delivery
Implements Stage 1 of the protobuf architecture review (see deequ repo docs/adr/0001..0005). The wire format moves from a string-discriminator ConstraintMessage/AnalyzerMessage to typed `oneof` arms, one per Check or Analyzer builder method, with reused `…Spec` payload submessages.

Schema-driven builder rewrites:
- pydeequ/v2/predicates.py: Predicate uses CompareOp enum; field renamed `operator` → `op`.
- pydeequ/v2/checks.py: each Check builder method (isComplete, hasPattern, isContainedIn, isLessThan, …) populates the matching oneof arm directly. The old _add_constraint(constraint_type, …) shape-shifter is gone.
- pydeequ/v2/analyzers.py: same pattern. Compliance gets its own ComplianceAnalyzerSpec (instance/predicate, no longer mislabeled as column/pattern). Pair-column constraints use named column_a/column_b.
- pydeequ/v2/suggestions.py: Rules enum values map to ConstraintRuleSet proto enum values directly. testset_seed=0 is now a legal user choice (proto3 `optional`).
- pydeequ/v2/profiles.py: `optional` low_cardinality_histogram_threshold.

Stub delivery (graphframes pattern, per ADR-0005):
- pydeequ/v2/proto/deequ_connect_pb2.py and _pb2.pyi are checked in. Refresh by running `DEEQU_PROTO_PATH=… python scripts/regen_proto.py` whenever the deequ schema changes; intermediate .proto is gitignored.
- scripts/regen_proto.py is a developer convenience, not a build hook.
- pyproject.toml gains grpcio-tools as a dev dependency for the regen script.
- .github/workflows/base.yml gains a (currently disabled) drift-check step that will activate once the matching deequ JAR ships .proto under META-INF/protobuf/.

Cross-cutting:
- WIRE_FORMAT_VERSION constant + field dropped from runners and proto __init__.py per ADR-0005 (Spark Connect's RelationPlugin contract uses the protobuf type URL as the version discriminator).
- tests/v2/test_unit.py rewritten to assert msg.WhichOneof("body") shape; 11 unit tests all pass.
Verified via end-to-end smoke test (Spark 3.5 Connect server + freshly built deequ JAR + tutorials/data_quality_example_v2.py): 9 analyzers, 12 constraints (1 expected failure on duplicate detection), 10-column profile, 19 constraint suggestions all produce correct DataFrames.
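
For readers without the generated stubs at hand, the builder change described above can be approximated in plain Python. This is only an analogy, not the checked-in protobuf classes: the `IsCompleteSpec`/`HasPatternSpec` fields, the arm names `is_complete`/`has_pattern`, the `which_oneof` helper and the `SuggestionOptions` class are all hypothetical stand-ins chosen for illustration. It shows each builder method filling its own typed arm, and why presence tracking makes a seed of 0 distinguishable from "unset".

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


# Hypothetical payload submessages; the real `...Spec` fields may differ.
@dataclass(frozen=True)
class IsCompleteSpec:
    column: str


@dataclass(frozen=True)
class HasPatternSpec:
    column: str
    pattern: str


# Stand-in for a proto `oneof body`: a constraint holds exactly one arm.
Body = Union[IsCompleteSpec, HasPatternSpec]
_ARM_NAMES = {IsCompleteSpec: "is_complete", HasPatternSpec: "has_pattern"}


@dataclass(frozen=True)
class ConstraintMessage:
    body: Body

    def which_oneof(self) -> str:
        """Mimic protobuf's WhichOneof("body") by naming the populated arm."""
        return _ARM_NAMES[type(self.body)]


@dataclass
class Check:
    constraints: List[ConstraintMessage] = field(default_factory=list)

    # Each builder method fills its own arm; no string discriminator is passed.
    def isComplete(self, column: str) -> "Check":
        self.constraints.append(ConstraintMessage(IsCompleteSpec(column)))
        return self

    def hasPattern(self, column: str, pattern: str) -> "Check":
        self.constraints.append(ConstraintMessage(HasPatternSpec(column, pattern)))
        return self


@dataclass(frozen=True)
class SuggestionOptions:
    # None means "unset"; 0 is a real seed. This mirrors proto3 `optional`,
    # which tracks presence instead of treating 0 as "not provided".
    testset_seed: Optional[int] = None

    def has_testset_seed(self) -> bool:
        return self.testset_seed is not None


check = Check().isComplete("id").hasPattern("email", r".+@.+")
arms = [c.which_oneof() for c in check.constraints]
```

With real generated classes the equivalent assertion would go through `msg.WhichOneof("body")`, as the rewritten unit tests do; the sketch only mirrors that shape.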
1 parent 63af152 commit f21eded

15 files changed

Lines changed: 1144 additions & 1876 deletions

.github/workflows/base.yml

Lines changed: 15 additions & 0 deletions
@@ -35,6 +35,10 @@ jobs:
           echo "SPARK_HOME=$PWD/spark-3.5.0-bin-hadoop3" >> $GITHUB_ENV
 
       - name: Download Deequ JAR
+        # The pinned JAR must match the schema in pydeequ/v2/proto/.
+        # When the schema changes, both the deequ JAR (built from the
+        # corresponding deequ branch) and this URL need to update in the
+        # same PR pair (per ADR-0004 in the deequ repo).
         run: |
           curl -L -o deequ_2.12-2.1.0b-spark-3.5.jar \
             https://github.com/awslabs/python-deequ/releases/download/v2.0.0b1/deequ_2.12-2.1.0b-spark-3.5.jar
@@ -46,6 +50,17 @@ jobs:
           poetry install
           poetry add "pyspark[connect]==3.5.0"
 
+      - name: Verify checked-in proto stubs are not stale
+        # ADR-0005 drift safeguard: regenerate the stubs from the JAR's
+        # bundled .proto and assert the diff is empty.
+        # NOTE: this requires a JAR that ships META-INF/protobuf/deequ_connect.proto
+        # — only deequ JARs built from the protobuf-stage1 branch onward do.
+        # Until the matching JAR is released, this step is skipped.
+        if: false
+        run: |
+          DEEQU_JAR_PATH=$PWD/deequ_2.12-2.1.0b-spark-3.5.jar poetry run python scripts/regen_proto.py
+          git diff --exit-code pydeequ/v2/proto/
+
       - name: Run V2 unit tests
         run: |
           poetry run pytest tests/v2/test_unit.py -v
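
The drift check in the workflow step above boils down to two operations: pull the bundled .proto out of the JAR, then compare regenerated stubs against the checked-in ones. The contents of scripts/regen_proto.py are not shown in this commit view, so the following is a rough standard-library sketch of those two operations rather than the script itself. The function names `extract_proto` and `stale_stubs` are hypothetical, the stub file names are inferred from the commit message, and the protoc invocation that would sit between the two steps is deliberately left out.

```python
import zipfile
from pathlib import Path
from typing import List

# Entry path taken from the workflow comment; a JAR is an ordinary zip archive.
PROTO_ENTRY = "META-INF/protobuf/deequ_connect.proto"
# Stub file names inferred from the commit message (assumption).
STUB_NAMES = ("deequ_connect_pb2.py", "deequ_connect_pb2.pyi")


def extract_proto(jar_path: Path, out_dir: Path) -> Path:
    """Copy the bundled .proto out of the JAR into out_dir and return its path."""
    with zipfile.ZipFile(jar_path) as jar:
        try:
            data = jar.read(PROTO_ENTRY)
        except KeyError:
            # Older JARs do not ship the schema, which is why the CI step is disabled.
            raise FileNotFoundError(f"{jar_path} has no {PROTO_ENTRY}") from None
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / "deequ_connect.proto"
    target.write_bytes(data)
    return target


def stale_stubs(checked_in: Path, regenerated: Path) -> List[str]:
    """Return the stub names whose checked-in bytes differ from a fresh regen."""
    stale = []
    for name in STUB_NAMES:
        old, new = checked_in / name, regenerated / name
        if not old.exists() or not new.exists() or old.read_bytes() != new.read_bytes():
            stale.append(name)
    return stale
```

In CI the comparison is done by `git diff --exit-code` instead, which also catches stray files; a byte comparison like `stale_stubs` is just one way to run the same check locally without touching the working tree.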

.gitignore

Lines changed: 8 additions & 1 deletion
@@ -148,5 +148,12 @@ dmypy.json
 # Cython debug symbols
 cython_debug/
 
-# DS_STORE
+# Note: pydeequ/v2/proto/deequ_connect_pb2.py and _pb2.pyi are
+# CHECKED IN per ADR-0005 (graphframes pattern). Run scripts/regen_proto.py
+# to refresh them when the deequ schema changes.
+# The intermediate .proto extracted by regen_proto.py is NOT checked in —
+# the canonical schema lives in the deequ repo.
+pydeequ/v2/proto/deequ_connect.proto
+
+# DS_STORE
 .DS_Store
