# Iteration 1: Data Validation Landscape Research

_Date: 2026-03-12_

## Key Findings

### Cost Optimization Strategy Hierarchy (cheapest → most expensive)

1. **Metadata-only** — `SHOW TABLES`, `read_metadata()`, `iceberg_snapshots()`, `HASH_AGG`, Delta `DESCRIBE DETAIL` — near-zero cost
2. **Sketch comparison** — `HLL_COMBINE`, `MINHASH`, T-Digest state comparison — O(1) after accumulation
3. **Partition/filter-scoped** — DMFs on specific date ranges, dbt `state:modified+`, incremental Dataplex scans — cost proportional to the scoped slice
4. **Statistical sampling** — `TABLESAMPLE (10000 ROWS)`, reservoir/stratified sampling — fixed cost regardless of table size
5. **CDC-based incremental** — Delta CDF, Hudi incremental, Iceberg snapshot diff — cost proportional to change volume
6. **Full table** — `COUNT(*)`, `HASH_AGG` without index, full distribution checks — cost proportional to table size

### Probabilistic Data Structures

- **HyperLogLog**: Every major warehouse has native HLL support (~1.6% error on cardinality estimates, ~100x cheaper than `COUNT(DISTINCT)`)
  - Snowflake: `HLL()`, `HLL_ACCUMULATE()`, `HLL_COMBINE()`, `HLL_EXPORT()`/`HLL_IMPORT()`
  - ClickHouse: `uniqHLL12`, `uniqCombined`
  - Pattern: pre-accumulate HLL states per partition during ingestion; at validation time, compare the combined states
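
The accumulate/combine pattern can be sketched in plain Python. This toy HyperLogLog (class name, register count, and hash choice are illustrative, not any warehouse's implementation) shows why per-partition states merge losslessly:

```python
import hashlib
import math

class HLL:
    """Toy HyperLogLog with mergeable states, mimicking the
    HLL_ACCUMULATE / HLL_COMBINE pattern (illustrative only)."""

    def __init__(self, p=12):                 # 2^12 = 4096 registers, ~1.6% error
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        h = int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)              # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)
        # rank = position of the first set bit in the remaining 64-p bits
        self.registers[idx] = max(self.registers[idx],
                                  (64 - self.p) - rest.bit_length() + 1)

    def merge(self, other):                   # register-wise max = HLL_COMBINE
        out = HLL(self.p)
        out.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return out

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:     # small-range (linear counting) correction
            return self.m * math.log(self.m / zeros)
        return raw

# Two partitions with overlapping keys: true distinct count of the union is 80_000.
part_a, part_b = HLL(), HLL()
for v in range(50_000):
    part_a.add(v)
for v in range(30_000, 80_000):
    part_b.add(v)
union_estimate = part_a.merge(part_b).estimate()
```

Because the merge is a register-wise max, pre-accumulated partition states can be combined at validation time without rescanning any data.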

- **MinHash / Jaccard similarity**: Snowflake `MINHASH()` + `APPROXIMATE_SIMILARITY()` — compute signatures on both tables and estimate row overlap; 0.99+ similarity ≈ a 99% row match. A natural replacement for `pt-table-checksum` in cloud warehouses.
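
The idea behind those functions can be sketched in pure Python (function names and hashing are illustrative; Snowflake's actual implementation differs): the fraction of matching signature slots estimates the Jaccard similarity of two row sets.

```python
import hashlib

def minhash_signature(rows, k=256):
    """One minimum per salted hash function (illustrative MinHash)."""
    sig = [float("inf")] * k
    for row in rows:
        for i in range(k):
            h = int.from_bytes(
                hashlib.blake2b(f"{i}:{row}".encode(), digest_size=8).digest(), "big")
            if h < sig[i]:
                sig[i] = h
    return sig

def approx_jaccard(sig_a, sig_b):
    # Fraction of matching slots estimates |A ∩ B| / |A ∪ B|.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Two "tables" sharing 500 of 1500 distinct rows: true Jaccard = 1/3.
sig_src = minhash_signature(range(1_000))
sig_tgt = minhash_signature(range(500, 1_500))
similarity = approx_jaccard(sig_src, sig_tgt)
```

With k=256 hash functions the standard error is roughly sqrt(J(1-J)/256) ≈ 0.03, which is why similarities near 0.99 are distinguishable from real divergence.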

- **T-Digest for distribution comparison**: Accumulate per-partition percentile sketches, then compare p50/p90/p99 across source and target.

- **HASH_AGG for table fingerprinting**: Snowflake's `HASH_AGG(*)` — a single 64-bit, order-independent hash of all rows. The fastest possible "did anything change?" check.
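
The order-independence can be mimicked with any commutative combiner over row hashes; a minimal sketch (not Snowflake's actual hash function):

```python
import hashlib

def table_fingerprint(rows):
    """Order-independent fingerprint in the spirit of HASH_AGG(*):
    sum 64-bit row hashes modulo 2^64, so row order cannot matter."""
    total = 0
    for row in rows:
        h = hashlib.blake2b(repr(row).encode(), digest_size=8).digest()
        total = (total + int.from_bytes(h, "big")) & 0xFFFFFFFFFFFFFFFF
    return total

source = [(1, "a"), (2, "b"), (3, "c")]
target = [(3, "c"), (1, "a"), (2, "b")]     # same rows, different order
drifted = [(1, "a"), (2, "b"), (3, "X")]    # one changed value
```

Equal fingerprints strongly suggest (but, being a hash, do not prove) identical content; unequal fingerprints prove a difference, which is what makes this a good pre-check.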

### Native Warehouse Sampling

- **Snowflake**: `SAMPLE BERNOULLI (p)` (row-level), `SAMPLE SYSTEM (p)` (block-level), `SAMPLE (n ROWS)` (fixed size), `SEED(n)` for deterministic re-runs
- **BigQuery**: `TABLESAMPLE SYSTEM (n PERCENT)` — block-based, bills only the sampled fraction
- **Databricks**: `TABLESAMPLE (n PERCENT) REPEATABLE(seed)` — deterministic re-runs
- **Stratified sampling**: `ROW_NUMBER() OVER (PARTITION BY strata ORDER BY RANDOM()) <= N` — guarantees every stratum is represented
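
A Python sketch of that stratified pattern, with a seeded RNG standing in for `ORDER BY RANDOM()` (function name and signature are illustrative):

```python
import random
from collections import defaultdict

def stratified_sample(rows, strata_key, n_per_stratum, seed=42):
    """Shuffle within each stratum deterministically, keep the first N
    of each — the ROW_NUMBER()-over-strata pattern in miniature."""
    rng = random.Random(seed)                  # seeded → reproducible re-runs
    groups = defaultdict(list)
    for row in rows:
        groups[strata_key(row)].append(row)
    sample = []
    for stratum in sorted(groups):             # stable stratum order
        picked = groups[stratum][:]
        rng.shuffle(picked)                    # stands in for ORDER BY RANDOM()
        sample.extend(picked[:n_per_stratum])  # ROW_NUMBER() <= N
    return sample

rows = [(i, "US" if i % 3 else "EU") for i in range(1_000)]
sample = stratified_sample(rows, strata_key=lambda r: r[1], n_per_stratum=50)
```

The fixed seed matters for validation: re-running the check samples the same rows, so a diff between runs reflects data change, not sampling noise.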

### Metadata-Based Validation (Zero Data Scan)

- **Parquet/Iceberg/Delta file statistics**: Per-file or per-row-group column stats (min, max, null_count, distinct_count) in metadata files
- **Iceberg Puffin files**: NDV sketches (Apache DataSketches theta) stored alongside data — answer `COUNT(DISTINCT)` from metadata
- **Snowflake**: `SHOW TABLES` / `information_schema.tables` exposes `row_count`, `bytes`, `last_altered` — no warehouse credits consumed

### CDC / Incremental Validation

- **Delta Lake Change Data Feed**: `table_changes('t', startVersion, endVersion)` returns only changed rows
- **Apache Hudi**: `_hoodie_commit_time` column enables time-bounded incremental reads
- **dbt Slim CI**: `--select state:modified+` (with `--state <artifacts-dir>`) builds/tests only changed models and their downstream dependencies

### Cross-Database Validation Tools

- **Google DVT**: Uses Ibis for query abstraction → 15+ dialects. Three validation types: column aggregates, row-level hash joins, schema comparison. Supports partitioned validations with parallel execution.
- **Ibis**: The same Python expression compiles to each backend's native SQL — a natural fit for cross-database validation
- **Fugue**: Pandas validation functions execute unchanged on Spark/DuckDB/Ray/Polars
- **SQLGlot AST diff**: `sqlglot.diff(expr1, expr2)` for semantic SQL comparison across 31 dialects

### DuckDB as Local Validation Engine

- `ATTACH ... AS pg (TYPE POSTGRES)` / `ATTACH ... AS mysql (TYPE MYSQL)` for direct multi-database querying
- `httpfs` for direct S3/GCS Parquet/Iceberg/Delta reads — validation runs locally, paying only S3 egress
- **ADBC**: Arrow zero-copy columnar transfer, 20–50x faster than ODBC for bulk retrieval

### Statistical Distribution Validation

- **Evidently AI**: 20+ statistical tests (PSI, KS test, Jensen-Shannon, Wasserstein) — treat the source as "reference" and the target as "current"
- **dbt-expectations**: `expect_table_aggregation_to_equal_other_table` with a configurable `tolerance_percent`
- **whylogs**: Mergeable streaming statistical profiles (~KB each) — compare profiles across systems without moving data
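
PSI is simple enough to compute directly from two samples. A minimal sketch (the bin edges here are an arbitrary illustrative choice; Evidently picks its own binning), with the common rule of thumb that PSI > 0.2 indicates significant drift:

```python
import math

def psi(reference, current, breakpoints):
    """Population Stability Index between two samples over shared bins."""
    def fractions(values):
        counts = [0] * (len(breakpoints) + 1)
        for v in values:
            counts[sum(v > b for b in breakpoints)] += 1   # bin index for v
        return [max(c / len(values), 1e-6) for c in counts]  # avoid log(0)
    ref, cur = fractions(reference), fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

reference = [i % 100 for i in range(10_000)]       # uniform over 0..99
shifted = [(i % 100) + 30 for i in range(10_000)]  # same shape, shifted by 30
edges = [20, 40, 60, 80]
psi_same = psi(reference, reference, edges)
psi_drift = psi(reference, shifted, edges)
```

Because PSI only needs per-bin counts, both sides can be computed as cheap aggregate queries in each warehouse and compared locally, with no row transfer.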

### PR-Scoped Validation

- **Recce**: PR-review-focused — Lineage Diff, Profile Diff, Value Diff, and Top-K Diff between dev and prod environments

## Implications for Reladiff Engine

### Already implemented
- JoinDiff (FULL OUTER JOIN)
- HashDiff (bisection with checksums)
- Profile (column statistics)
- Cascade (progressive count → profile → content)
- Per-table WHERE clauses
- Numeric/timestamp tolerance

### Potential additions (priority order)
1. **HASH_AGG fingerprint** as a fast pre-check before a full diff (near-zero cost)
2. **Sampling mode** — `TABLESAMPLE`, or `LIMIT` with `ORDER BY RANDOM()`, for a quick confidence check
3. **HLL-based cardinality comparison** — approximate distinct counts without a full scan
4. **Distribution comparison** — KS test or percentile comparison using aggregate queries
5. **Incremental validation** — only diff rows changed since the last validation (requires a timestamp column)
6. **DuckDB multi-attach** for cross-database diffing without data movement
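
These additions all feed one pattern: run the cheapest check first and escalate only on disagreement. A hypothetical driver (function name and stage callables are illustrative, not Reladiff's API):

```python
def cascade_validate(checks):
    """Run validation stages cheapest-first; stop at the first failure.
    `checks` is an ordered list of (stage_name, zero-arg callable
    returning True when the stage finds no mismatch)."""
    for name, check in checks:
        if not check():
            return f"mismatch at stage: {name}"
    return "tables match"

# Toy stages standing in for a HASH_AGG pre-check, COUNT(*), and a sampled diff.
result = cascade_validate([
    ("fingerprint", lambda: False),   # cheap pre-check already disagrees
    ("row_count", lambda: True),
    ("sampled_diff", lambda: True),
])
```

Short-circuiting here is the whole point of the cost hierarchy above: most mismatches are caught by a near-zero-cost stage, and the expensive full diff runs only when everything cheaper agrees.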