
Commit 289cdde

suryaiyer95 and claude committed
test: add comprehensive data-diff test suite and research documentation
- Add 519 integration tests (516 pass + 3 xfail) across 120 test classes
- Tests cover: DuckDB, Postgres, cross-warehouse, all 6 algorithms
- Edge cases: NULL semantics, numeric precision, reserved keywords, composite keys
- Add Docker Compose for Postgres 16 test environment
- Add 28 research documents (themes A-Z) covering data validation landscape

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 4020efc commit 289cdde

33 files changed

Lines changed: 59269 additions & 0 deletions

docs/research/SYNTHESIS.md

Lines changed: 428 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 87 additions & 0 deletions
# Iteration 1: Data Validation Landscape Research

_Date: 2026-03-12_

## Key Findings

### Cost Optimization Strategy Hierarchy (cheapest → most expensive)

1. **Metadata-only** — `SHOW TABLES`, `read_metadata()`, `iceberg_snapshots()`, `HASH_AGG`, Delta `DESCRIBE DETAIL` — near-zero cost
2. **Sketch comparison** — `HLL_COMBINE`, `MINHASH`, T-Digest state comparison — O(1) after accumulation
3. **Partition/filter-scoped** — DMFs on specific date ranges, dbt `state:modified+`, incremental Dataplex scans — cost proportional to change volume
4. **Statistical sampling** — `TABLESAMPLE (10000 ROWS)`, reservoir/stratified sampling — fixed cost regardless of table size
5. **CDC-based incremental** — Delta CDF, Hudi incremental, Iceberg snapshot diff — cost proportional to change volume
6. **Full table** — `COUNT(*)`, `HASH_AGG` without index, full distribution checks — cost proportional to table size

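
A minimal sketch of the cheapest-first idea, run locally in DuckDB: check row counts, then an order-independent hash fingerprint, and only escalate to a full diff when a cheap tier disagrees. The table and column names and the `bit_xor(hash(...))` fingerprint are illustrative assumptions, not an existing reladiff API (note that XOR fingerprints cancel duplicate rows in pairs).

```python
# Cheapest-first cascade: row counts, then an order-independent fingerprint,
# then (only if needed) a full diff. Table and column names are illustrative.
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE src AS SELECT range AS id, range % 7 AS val FROM range(1000)")
con.execute("CREATE TABLE dst AS SELECT range AS id, range % 7 AS val FROM range(1000)")

def scalar(sql: str):
    return con.execute(sql).fetchone()[0]

# XOR of per-row hashes: cheap, single scan, independent of row order.
fingerprint_sql = "SELECT bit_xor(hash(concat_ws('|', id, val))) FROM {t}"

if scalar("SELECT count(*) FROM src") != scalar("SELECT count(*) FROM dst"):
    print("row counts differ -> run full diff")
elif scalar(fingerprint_sql.format(t="src")) != scalar(fingerprint_sql.format(t="dst")):
    print("fingerprints differ -> run full diff")
else:
    print("counts and fingerprints match -> skip full diff")
```
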
### Probabilistic Data Structures

- **HyperLogLog**: Every major warehouse has native HLL (cardinality estimation with ~1.6% error, 100x cheaper than `COUNT(DISTINCT)`)
  - Snowflake: `HLL()`, `HLL_ACCUMULATE()`, `HLL_COMBINE()`, `HLL_EXPORT()`/`HLL_IMPORT()`
  - ClickHouse: `uniqHLL12`, `uniqCombined`
  - Pattern: pre-accumulate HLL states per partition during ingestion; at validation time compare combined states
- **MinHash / Jaccard similarity**: Snowflake `MINHASH()` + `APPROXIMATE_SIMILARITY()` — compute signatures on both tables, estimate row overlap. 0.99+ = ~99% match. Natural replacement for `pt-table-checksum` in cloud warehouses.
- **T-Digest for distribution comparison**: Accumulate per-partition percentile sketches, compare p50/p90/p99 across source/target.
- **HASH_AGG for table fingerprinting**: Snowflake's `HASH_AGG(*)` — single 64-bit hash of all rows. Fastest possible "did anything change?" check.

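
A rough sketch of sketch-based cardinality comparison, using DuckDB's `approx_count_distinct()` as a local stand-in for warehouse-native HLL functions. The table names, key column, and 2% tolerance are assumptions; the tolerance absorbs sketch error but also hides small real gaps.

```python
# Compare approximate distinct counts on both sides before paying for a full diff.
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE src AS SELECT range AS user_id FROM range(100000)")
con.execute("CREATE TABLE dst AS SELECT range AS user_id FROM range(99500)")

def approx_distinct(table: str, column: str) -> int:
    return con.execute(f"SELECT approx_count_distinct({column}) FROM {table}").fetchone()[0]

src_ndv = approx_distinct("src", "user_id")
dst_ndv = approx_distinct("dst", "user_id")

tolerance = 0.02  # allow ~2% relative difference before escalating
relative_gap = abs(src_ndv - dst_ndv) / max(src_ndv, 1)
print(f"src≈{src_ndv}, dst≈{dst_ndv}, gap={relative_gap:.2%}",
      "-> escalate to full diff" if relative_gap > tolerance else "-> within tolerance")
```
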
### Native Warehouse Sampling

- **Snowflake**: `BERNOULLI (N ROWS)` (exact), `SYSTEM (p PERCENT)` (block-level), `SEED(n)` for deterministic sampling
- **BigQuery**: `TABLESAMPLE SYSTEM (n PERCENT)` — block-based, costs only the sampled fraction
- **Databricks**: `TABLESAMPLE (n PERCENT) REPEATABLE(seed)` — deterministic re-runs
- **Stratified sampling**: `ROW_NUMBER() OVER (PARTITION BY strata ORDER BY RANDOM()) <= N` — guarantees proportional representation (sketched below)

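
A small sketch of the stratified-sampling pattern from the last bullet, run locally in DuckDB. The `orders` table, the `region` strata column, and the per-stratum cap of 100 rows are hypothetical.

```python
# Stratified sampling via ROW_NUMBER() over each stratum, ordered randomly.
import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE orders AS
    SELECT range AS order_id,
           CASE range % 3 WHEN 0 THEN 'us' WHEN 1 THEN 'eu' ELSE 'apac' END AS region,
           (range * 37) % 500 AS amount
    FROM range(10000)
""")

rows = con.execute("""
    SELECT order_id, region, amount
    FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY region ORDER BY RANDOM()) AS rn
        FROM orders
    )
    WHERE rn <= 100          -- cap of 100 rows per stratum
""").fetchall()
print(len(rows), "rows sampled across", len({r[1] for r in rows}), "strata")
```
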
### Metadata-Based Validation (Zero Data Scan)

- **Parquet/Iceberg/Delta file statistics**: Per-file or per-row-group column stats (min, max, null_count, distinct_count) in metadata files
- **Iceberg Puffin files**: HLL NDV sketches stored alongside data — answer `COUNT(DISTINCT)` from metadata
- **Snowflake**: `SHOW TABLES` / `information_schema.tables` has `row_count`, `bytes`, `last_altered` — no warehouse credits consumed

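
A sketch of the zero-scan idea using pyarrow to read per-row-group statistics straight from a Parquet footer. The file is written locally just to make the example self-contained; in practice `distinct_count` is often absent.

```python
# Read column min/max/null_count from the Parquet footer only -- no data scan.
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small example file, then inspect its footer statistics.
pq.write_table(
    pa.table({"id": list(range(1000)), "amount": [i % 50 for i in range(1000)]}),
    "orders.parquet",
)

meta = pq.ParquetFile("orders.parquet").metadata
print("rows:", meta.num_rows, "row groups:", meta.num_row_groups)

for rg in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        chunk = meta.row_group(rg).column(col)
        s = chunk.statistics
        if s is not None:
            print(chunk.path_in_schema, "min:", s.min, "max:", s.max, "nulls:", s.null_count)
```
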
### CDC / Incremental Validation

- **Delta Lake Change Data Feed**: `table_changes('t', startVersion, endVersion)` returns only changed rows
- **Apache Hudi**: `_hoodie_commit_time` column for time-bounded incremental reads
- **dbt Slim CI**: the `state:modified+` selector builds/tests only changed models and their downstream deps

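
A generic sketch of incremental validation using a timestamp watermark rather than an actual CDF or Hudi read: only rows touched since the last validated watermark are compared. The tables, the `updated_at` column, and the stored watermark value are assumptions.

```python
# Watermark-bounded diff: compare only rows updated since the last validation.
import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE src AS
    SELECT range AS id, range % 5 AS val,
           TIMESTAMP '2026-03-01' + INTERVAL 1 DAY * (range % 10) AS updated_at
    FROM range(1000)
""")
con.execute("CREATE TABLE dst AS SELECT * FROM src")
con.execute("UPDATE dst SET val = val + 1 WHERE id = 999")  # simulate a drifted row

last_watermark = "2026-03-08 00:00:00"  # e.g. persisted from the previous run

mismatches = con.execute(f"""
    SELECT coalesce(s.id, d.id) AS id
    FROM src s
    FULL OUTER JOIN dst d USING (id)
    WHERE coalesce(s.updated_at, d.updated_at) >= TIMESTAMP '{last_watermark}'
      AND s.val IS DISTINCT FROM d.val
""").fetchall()
print(f"{len(mismatches)} mismatching row(s) updated since {last_watermark}")
```
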
### Cross-Database Validation Tools

- **Google DVT**: Uses Ibis for query abstraction → 15+ dialects. Three validation types: column aggregates, row-level hash joins, schema comparison. Partition strategy with parallel execution.
- **Ibis**: Same Python expression compiles to each backend's native SQL — natural for cross-database validation
- **Fugue**: Pandas validation functions execute on Spark/DuckDB/Ray/Polars unchanged
- **SQLGlot AST diff**: `sqlglot.diff(expr1, expr2)` for semantic SQL comparison across 31 dialects

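
A brief sketch of SQLGlot for cross-dialect work: transpiling one validation query into two dialects, then diffing two parsed queries at the AST level. The example queries are made up.

```python
# Transpile one logical query to two dialects, then diff two queries semantically.
from sqlglot import parse_one, transpile
from sqlglot.diff import Keep, diff

query = "SELECT user_id, COUNT(*) AS n FROM orders GROUP BY user_id"
print(transpile(query, read="duckdb", write="snowflake")[0])
print(transpile(query, read="duckdb", write="bigquery")[0])

# Casing/formatting differences produce only Keep nodes; the extra WHERE clause
# below shows up as structural edits.
a = parse_one("select user_id, count(*) as n from orders group by user_id")
b = parse_one("SELECT user_id, COUNT(*) AS n FROM orders WHERE user_id > 0 GROUP BY user_id")
edits = [e for e in diff(a, b) if not isinstance(e, Keep)]
print(f"{len(edits)} structural edit(s) between the two queries")
```
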
### DuckDB as Local Validation Engine

- `ATTACH ... AS pg (TYPE POSTGRES)` / `ATTACH ... AS mysql (TYPE MYSQL)` for direct multi-database querying
- `httpfs` for direct S3/GCS Parquet/Iceberg/Delta reads — local validation, only S3 egress costs
- **ADBC**: Arrow zero-copy columnar transfer, 20-50x faster than ODBC for bulk retrieval

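
A sketch of DuckDB multi-attach for a cross-database comparison without moving data through the client. It assumes a reachable Postgres instance; the DSN, schema, and table names are placeholders.

```python
# Attach Postgres inside DuckDB and compare tables directly in SQL.
import duckdb

con = duckdb.connect()
con.execute("INSTALL postgres")
con.execute("LOAD postgres")
con.execute("ATTACH 'dbname=analytics host=localhost user=reader' AS pg (TYPE POSTGRES)")

# A local stand-in for the "other side" of the comparison.
con.execute("CREATE TABLE local_orders AS SELECT range AS id FROM range(100)")

only_in_pg = con.execute("""
    SELECT id FROM pg.public.orders
    EXCEPT
    SELECT id FROM local_orders
""").fetchall()
print(f"{len(only_in_pg)} id(s) present in Postgres but missing locally")
```
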
### Statistical Distribution Validation

- **Evidently AI**: 20+ statistical tests (PSI, KS-test, Jensen-Shannon, Wasserstein) — treat source as "reference", target as "current"
- **dbt-expectations**: `expect_table_aggregation_to_equal_other_table` with configurable `tolerance_percent`
- **Whylogs**: Mergeable streaming statistical profiles (~KB) — compare profiles across systems without data transfer

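
A small sketch of distribution comparison outside any specific tool: a two-sample KS test and a simple PSI computed over synthetic samples that stand in for source ("reference") and target ("current") query results.

```python
# KS-test and PSI between a reference sample and a slightly shifted current sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
source = rng.normal(loc=100, scale=15, size=50_000)   # "reference"
target = rng.normal(loc=102, scale=15, size=50_000)   # "current" (slightly shifted)

# Kolmogorov-Smirnov: max distance between the two empirical CDFs.
ks_stat, p_value = stats.ks_2samp(source, target)

# Population Stability Index over 10 quantile bins of the source.
bins = np.quantile(source, np.linspace(0, 1, 11))
bins[0], bins[-1] = -np.inf, np.inf                    # catch out-of-range target values
expected = np.histogram(source, bins=bins)[0] / len(source)
actual = np.histogram(target, bins=bins)[0] / len(target)
psi = np.sum((actual - expected) * np.log((actual + 1e-6) / (expected + 1e-6)))

print(f"KS stat={ks_stat:.4f} (p={p_value:.3g}), PSI={psi:.4f}")
# Common rule of thumb: PSI < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift.
```
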
### PR-Scoped Validation

- **Recce**: PR-review-focused — lineage diff, Profile Diff, Value Diff, Top-K Diff between dev and prod environments

## Implications for Reladiff Engine

### Already implemented

- JoinDiff (FULL OUTER JOIN)
- HashDiff (bisection with checksums)
- Profile (column statistics)
- Cascade (progressive count → profile → content)
- Per-table WHERE clauses
- Numeric/timestamp tolerance

### Potential additions (priority order)

1. **HASH_AGG fingerprint** as a fast pre-check before full diff (near-zero cost)
2. **Sampling mode** — `TABLESAMPLE` or `LIMIT` with `ORDER BY RANDOM()` for a quick confidence check (see the sketch below)
3. **HLL-based cardinality comparison** — approximate distinct counts without full scan
4. **Distribution comparison** — KS-test or percentile comparison using aggregate queries
5. **Incremental validation** — only diff rows changed since last validation (requires timestamp column)
6. **DuckDB multi-attach** for cross-database diffing without data movement
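
A sketch of what a sampling mode (item 2) could look like: sample keys from the source with `ORDER BY RANDOM()`, then compare only those rows on both sides for a quick confidence estimate. The tables, key column, and sample size are hypothetical.

```python
# Key-sampled spot check: estimate the mismatch rate from a random key sample.
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE src AS SELECT range AS id, range % 97 AS val FROM range(100000)")
con.execute("CREATE TABLE dst AS SELECT range AS id, range % 97 AS val FROM range(100000)")
con.execute("UPDATE dst SET val = -1 WHERE id % 100 = 0")   # plant ~1% differences

sample_size = 1000
con.execute(f"""
    CREATE TEMP TABLE sample_keys AS
    SELECT id FROM src ORDER BY RANDOM() LIMIT {sample_size}
""")

mismatches = con.execute("""
    SELECT count(*)
    FROM sample_keys k
    JOIN src s ON s.id = k.id
    JOIN dst d ON d.id = k.id
    WHERE s.val IS DISTINCT FROM d.val
""").fetchone()[0]
print(f"{mismatches}/{sample_size} sampled rows differ "
      f"(estimated mismatch rate ≈ {mismatches / sample_size:.2%})")
```
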
