V2 rewrite (beta): Support Spark Connect #254
Open
chenliu0831 wants to merge 17 commits into
- Fix NameError in tests/conftest.py setup_pyspark shim
- Drop duplicate PyDeequSession entry in __getattr__ tuple
- Widen pyspark pin to >=3.5.0,<4.0.0; remove empty extras table
- Bump black target_version to py39 to match python requirement
- Raise ValueError on Check.where() before any constraint
- Replace fixed sleep with port-wait for Spark Connect server in CI
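For reviewers unfamiliar with the port-wait change in the last item, here is a minimal sketch of the idea: poll the server's TCP port until it accepts a connection instead of sleeping for a fixed time. The function name `wait_for_port` and its parameters are illustrative assumptions, not the code used in the CI workflow.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Poll until a TCP connection to (host, port) succeeds or `timeout` elapses.

    Hypothetical helper for illustration; the CI workflow may do this differently.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly, then retry.
            time.sleep(interval)
    return False
```

In CI this could be called as `wait_for_port("localhost", 15002)`, assuming the server listens on Spark Connect's default port. Compared with a fixed sleep, the job proceeds as soon as the server is up and fails explicitly if it never starts.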
# Conflicts:
#   .github/workflows/base.yml
#   poetry.lock
#   pyproject.toml
Implements Stage 1 of the protobuf architecture review (see deequ repo
docs/adr/0001..0005). The wire format moves from a string-discriminator
ConstraintMessage/AnalyzerMessage to typed `oneof` arms — one per Check
or Analyzer builder method — with reused `…Spec` payload submessages.
Schema-driven builder rewrites:
- pydeequ/v2/predicates.py — Predicate uses CompareOp enum; field
renamed `operator` → `op`.
- pydeequ/v2/checks.py — each Check builder method (isComplete,
hasPattern, isContainedIn, isLessThan, …) populates the matching
oneof arm directly. The old _add_constraint(constraint_type, …)
shape-shifter is gone.
- pydeequ/v2/analyzers.py — same pattern. Compliance gets its own
ComplianceAnalyzerSpec (instance/predicate, no longer mislabeled as
column/pattern). Pair-column constraints use named column_a/column_b.
- pydeequ/v2/suggestions.py — Rules enum values map to ConstraintRuleSet
proto enum values directly. testset_seed=0 is now a legal user choice
(proto3 `optional`).
- pydeequ/v2/profiles.py — `optional` low_cardinality_histogram_threshold.
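The `optional` changes above matter because a plain proto3 scalar cannot distinguish "unset" from its zero value. A minimal sketch of that distinction, using a plain dataclass as a stand-in: `SuggestionOptions`, `has_seed`, and `describe_seed` are hypothetical names, not the generated protobuf classes, and the "server default" wording is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SuggestionOptions:
    """Hypothetical stand-in for a message with an `optional` scalar field."""

    # None models "field not set"; any int, including 0, models an explicit value.
    testset_seed: Optional[int] = None

    def has_seed(self) -> bool:
        # Plays the role of protobuf's field-presence check.
        return self.testset_seed is not None


def describe_seed(opts: SuggestionOptions) -> str:
    """Report whether a seed was explicitly chosen."""
    if opts.has_seed():
        return f"explicit seed {opts.testset_seed}"
    return "no seed set"


print(describe_seed(SuggestionOptions(testset_seed=0)))
print(describe_seed(SuggestionOptions()))
```

Without presence tracking, both calls would look identical on the wire, so a user-supplied seed of 0 would be silently treated as "not set".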
Stub delivery (graphframes pattern, per ADR-0005):
- pydeequ/v2/proto/deequ_connect_pb2.py and _pb2.pyi are checked in.
Refresh by running `DEEQU_PROTO_PATH=… python scripts/regen_proto.py`
whenever the deequ schema changes; intermediate .proto is gitignored.
- scripts/regen_proto.py is a developer convenience, not a build hook.
- pyproject.toml gains grpcio-tools as a dev dependency for the regen
script.
- .github/workflows/base.yml gains a (currently disabled) drift-check
step that will activate once the matching deequ JAR ships .proto under
META-INF/protobuf/.
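One way such a drift check could work, sketched under assumptions: regenerate the stub into a scratch location and compare it byte-for-byte with the checked-in copy. The helper names below are hypothetical, and the actual (currently disabled) workflow step may take a different approach.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def stubs_drifted(checked_in: Path, regenerated: Path) -> bool:
    """True if the checked-in stub differs from a freshly regenerated one.

    Hypothetical helper: both paths are supplied by the caller, e.g. the
    committed deequ_connect_pb2.py and a copy regenerated into a temp dir.
    """
    return file_digest(checked_in) != file_digest(regenerated)
```

A byte-level comparison is strict: the generated output can change with the protoc or grpcio-tools version, so a check like this would likely need the generator version pinned.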
Cross-cutting:
- WIRE_FORMAT_VERSION constant + field dropped from runners and proto
__init__.py per ADR-0005 (Spark Connect's RelationPlugin contract uses
the protobuf type URL as the version discriminator).
- tests/v2/test_unit.py rewritten to assert msg.WhichOneof("body")
shape; 11 unit tests all pass.
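To illustrate the typed-`oneof` shape these tests assert on, here is a self-contained sketch that uses plain dataclasses rather than the generated stubs. The arm names (`is_complete`, `has_pattern`), the spec classes, and the builder functions are hypothetical; only the oneof group name `body` and the `WhichOneof` call come from this commit message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CompletenessSpec:
    column: str


@dataclass
class PatternSpec:
    column: str
    pattern: str


@dataclass
class ConstraintMessage:
    """Hypothetical stand-in for a message whose `body` oneof has two arms."""

    is_complete: Optional[CompletenessSpec] = None
    has_pattern: Optional[PatternSpec] = None

    def WhichOneof(self, group: str) -> Optional[str]:
        # Mirrors the protobuf method: returns the name of the set arm, or None.
        if group != "body":
            raise ValueError(f"unknown oneof group: {group}")
        arms = [n for n in ("is_complete", "has_pattern") if getattr(self, n) is not None]
        if len(arms) > 1:
            # Real protobuf clears the previous arm instead; this stand-in just rejects it.
            raise ValueError("at most one arm of a oneof may be set")
        return arms[0] if arms else None


def is_complete(column: str) -> ConstraintMessage:
    """Builder that populates its own arm directly, with no string discriminator."""
    return ConstraintMessage(is_complete=CompletenessSpec(column=column))


def has_pattern(column: str, pattern: str) -> ConstraintMessage:
    return ConstraintMessage(has_pattern=PatternSpec(column=column, pattern=pattern))


msg = has_pattern("email", r".+@.+")
assert msg.WhichOneof("body") == "has_pattern"
```

Because each builder fills exactly one typed arm, a test can check the message shape with a single `WhichOneof("body")` comparison instead of matching on a string discriminator.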
Verified via end-to-end smoke test (Spark 3.5 Connect server + freshly
built deequ JAR + tutorials/data_quality_example_v2.py): 9 analyzers, 12
constraints (1 expected failure on duplicate detection), 10-column
profile, 19 constraint suggestions all produce correct DataFrames.
Issue #, if available:
Description of changes:
This PR introduces PyDeequ 2.0 beta, a major release that replaces the Py4J-based architecture with Spark Connect for client-server communication.
The Deequ-side change will be opened separately; the proto file here is copied for review purposes. For ease of testing, I created a pre-release, https://github.com/awslabs/python-deequ/releases/tag/v2.0.0b1, to host the JARs and wheels.
Motivation
The legacy PyDeequ relied on Py4J to bridge Python and the JVM, which had several limitations:
Spark Connect (introduced in Spark 3.4) provides a clean gRPC-based protocol that solves these issues.
Code Changes
New `pydeequ/v2/` module with Spark Connect implementation:
- `checks.py` - Check and constraint builders
- `analyzers.py` - Analyzer classes
- `predicates.py` - Serializable predicates (`eq`, `gte`, `between`, etc.)
- `verification.py` - VerificationSuite and AnalysisRunner
- `proto/` - Protobuf definitions and generated code

New test suite in `tests/v2/`:
- `test_unit.py` - Unit tests (no Spark required)
- `test_analyzers.py` - Analyzer integration tests
- `test_checks.py` - Check constraint tests
- `test_e2e_spark_connect.py` - End-to-end tests

Updated documentation:
API Changes
Testing
For more details, see https://github.com/awslabs/python-deequ/blob/v2_rewrite/README.md.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.