Commit 5c16f22

author
Peter Liu
committed
Update readme
1 parent ef51c1c commit 5c16f22

1 file changed

Lines changed: 28 additions & 13 deletions

README.md

@@ -24,25 +24,40 @@ PyDeequ 2.0 introduces a new multi-engine architecture with **DuckDB** and **Spa
 ### Architecture
 
 ```mermaid
-flowchart LR
+flowchart TB
 subgraph CLIENT["Python Client"]
-A["Python Code"] --> B["Protobuf<br/>Serialization"]
+A["pydeequ.connect()"] --> B["Engine Auto-Detection"]
 end
-B -- gRPC --> C["Spark Connect (gRPC)"]
-subgraph SERVER["Spark Connect Server"]
-D["DeequRelationPlugin"] --> E["Deequ Core"] --> F["Spark DataFrame API"] --> G["(Data)"]
+
+B --> C{Connection Type}
+
+C -->|DuckDB| D["DuckDBEngine"]
+C -->|SparkSession| E["SparkEngine"]
+
+subgraph DUCKDB["DuckDB Backend (Local)"]
+D --> F["SQL Operators"] --> G["DuckDB"] --> H["Local Files<br/>Parquet/CSV"]
 end
-G --> H["Results"] -- gRPC --> I["Python DataFrame"]
-%% Styling for compactness and distinction
-classDef code fill:#C8F2FB,stroke:#35a7c2,color:#13505B,font-weight:bold;
-class A code;
+
+subgraph SPARK["Spark Connect Backend (Distributed)"]
+E --> I["Protobuf"] -- gRPC --> J["Spark Connect Server"]
+J --> K["DeequRelationPlugin"] --> L["Deequ Core"] --> M["Data Lake"]
+end
+
+H --> N["Results"]
+M --> N
+N --> O["MetricResult / ConstraintResult / ColumnProfile"]
+
+classDef duckdb fill:#FFF4CC,stroke:#E6B800,color:#806600;
+classDef spark fill:#CCE5FF,stroke:#0066CC,color:#003366;
+class D,F,G,H duckdb;
+class E,I,J,K,L,M spark;
 ```
 
 **How it works:**
-1. **Client Side**: PyDeequ 2.0 builds checks and analyzers as Protobuf messages
-2. **Transport**: Messages are sent via gRPC to the Spark Connect server
-3. **Server Side**: The `DeequRelationPlugin` deserializes messages and executes Deequ operations
-4. **Results**: Verification results are returned as a Spark DataFrame
+- **Auto-detection**: `pydeequ.connect()` inspects the connection type and creates the appropriate engine
+- **DuckDB path**: Direct SQL execution in-process, no JVM required
+- **Spark path**: Protobuf serialization over gRPC to Spark Connect server with Deequ plugin
+- **Unified results**: Both engines return the same `MetricResult`, `ConstraintResult`, and `ColumnProfile` types
 
 ### Feature Support Matrix
 
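Reviewer note: the auto-detection and unified-results bullets added under **How it works** amount to a type-based dispatch followed by a shared result type. The sketch below illustrates that idea only. It is not PyDeequ's implementation and does not import PyDeequ, DuckDB, or Spark. The names `FakeDuckDBConnection`, `FakeSparkSession`, `compute_size`, and the fields of `MetricResult` are hypothetical stand-ins. Only the names `connect`, `DuckDBEngine`, `SparkEngine`, and `MetricResult` come from the README diff.

```python
from dataclasses import dataclass
from typing import Sequence


class FakeDuckDBConnection:
    """Hypothetical stand-in for a DuckDB connection object."""


class FakeSparkSession:
    """Hypothetical stand-in for a Spark Connect session object."""


@dataclass
class MetricResult:
    # Illustrative fields only; the README does not show the real fields.
    name: str
    value: float


class DuckDBEngine:
    """Stand-in for the local, in-process backend."""

    def __init__(self, connection: FakeDuckDBConnection) -> None:
        self.connection = connection

    def compute_size(self, rows: Sequence[object]) -> MetricResult:
        # A real engine would run SQL here; this sketch just counts rows.
        return MetricResult(name="Size", value=float(len(rows)))


class SparkEngine:
    """Stand-in for the distributed backend."""

    def __init__(self, session: FakeSparkSession) -> None:
        self.session = session

    def compute_size(self, rows: Sequence[object]) -> MetricResult:
        # A real engine would serialize a request and send it over gRPC.
        return MetricResult(name="Size", value=float(len(rows)))


def connect(connection: object):
    """Pick an engine from the type of the connection object (assumed logic)."""
    if isinstance(connection, FakeDuckDBConnection):
        return DuckDBEngine(connection)
    if isinstance(connection, FakeSparkSession):
        return SparkEngine(connection)
    raise TypeError(f"Unsupported connection type: {type(connection).__name__}")


local = connect(FakeDuckDBConnection())
remote = connect(FakeSparkSession())
local_result = local.compute_size([1, 2, 3])
remote_result = remote.compute_size([1, 2, 3])
```

Because both stand-in engines return the same `MetricResult` type, calling code does not need to know which backend produced a value. That is the property the "Unified results" bullet describes.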
