---
name: data-validate
description: Compare data between two tables across any warehouses using progressive validation — row counts, column profiles, segment checksums, and row-level drill-down.
---

# Data Validate

## Requirements
**Agent:** data-diff or migrator (requires sql_execute on both source and target)
**Tools used:** sql_execute, warehouse_list, warehouse_test, schema_inspect, read, glob

Cross-database data validation using a progressive, multi-level approach. Each level provides increasing confidence with increasing query cost — stop as soon as you have enough evidence.
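
The control flow is an early-stop loop over the levels. A minimal sketch in Python, where the `check_*` callables are hypothetical helpers (not part of the toolset) that each run one level's SQL on both warehouses:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class LevelResult:
    passed: bool
    detail: str = ""

def validate(checks: list[Callable[[], LevelResult]]) -> str:
    """Run validation levels in order, stopping at the first failure."""
    for n, check in enumerate(checks, start=1):
        result = check()
        if not result.passed:
            return f"FAIL at Level {n}: {result.detail}"
    return f"PASS ({len(checks)} levels run, confidence HIGH)"

# Usage: validate([check_row_counts, check_profiles, check_checksums])
# where each hypothetical helper wraps the queries shown below.
```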

## Validation Levels

### Level 1: Row Count (seconds, near-zero cost)
Compare total row counts between source and target. If counts match exactly, proceed to Level 2. If they differ, report the delta immediately — no deeper checks needed.

```sql
-- Run on source warehouse
SELECT COUNT(*) AS row_count FROM {source_table} [WHERE ...]

-- Run on target warehouse
SELECT COUNT(*) AS row_count FROM {target_table} [WHERE ...]
```

### Level 2: Column Profile (seconds, low cost)
For each column, compare aggregate statistics. This catches type coercion bugs, NULL handling differences, and truncation issues without scanning every row.

```sql
SELECT
  COUNT(*) AS total_rows,
  COUNT({col}) AS non_null_count,
  COUNT(DISTINCT {col}) AS distinct_count,
  MIN({col}) AS min_val,
  MAX({col}) AS max_val,
  -- Numeric columns only:
  AVG(CAST({col} AS DOUBLE)) AS avg_val,
  SUM(CAST({col} AS DOUBLE)) AS sum_val
FROM {table} [WHERE ...]
```

Run this for each column, or at minimum for the key columns plus any columns the user cares about. Compare the results side by side:

```
Column Profile Comparison
=========================
Column          | Source          | Target          | Match
----------------|-----------------|-----------------|------
total_rows      | 1,234,567       | 1,234,567       | OK
user_id.distinct| 500,000         | 500,000         | OK
email.nulls     | 0               | 1,204           | MISMATCH
amount.sum      | 45,678,901.23   | 45,678,901.23   | OK
amount.avg      | 37.01           | 37.01           | OK
created_at.min  | 2020-01-01      | 2020-01-01      | OK
created_at.max  | 2024-12-31      | 2024-12-31      | OK
```

If all profiles match, the tables are equivalent with high confidence. Report and stop.
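
When comparing, exact equality is right for counts and min/max, but `sum` and `avg` can legitimately differ in the last few bits across engines (summation order after `CAST AS DOUBLE` is not deterministic). A minimal comparison sketch, assuming each side's profile has been flattened into a `{stat_name: value}` dict (an illustrative shape, not a tool contract):

```python
import math

def compare_profiles(source: dict, target: dict, rel_tol: float = 1e-9) -> list[str]:
    """Compare flattened profile stats, e.g. {"amount.sum": 45678901.23}.
    Counts and min/max must match exactly; floating-point aggregates get
    a relative tolerance to absorb cross-engine summation-order noise."""
    mismatches = []
    for stat in source.keys() | target.keys():
        src, tgt = source.get(stat), target.get(stat)
        if isinstance(src, float) and isinstance(tgt, float):
            ok = math.isclose(src, tgt, rel_tol=rel_tol)
        else:
            ok = src == tgt
        if not ok:
            mismatches.append(f"{stat}: source={src!r} target={tgt!r}")
    return sorted(mismatches)
```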

### Level 3: Segment Checksums (moderate cost)
If profiles match but the user wants stronger guarantees, or if you need to locate WHERE the differences are, split the key space into segments and compare checksums.

Requires: a sortable key column (integer PK, timestamp, etc.).

```sql
-- Get key range
SELECT MIN({key_col}) AS min_key, MAX({key_col}) AS max_key FROM {table}

-- Segment checksum (dialect-specific hash aggregation)
-- Snowflake:
SELECT
  FLOOR(({key_col} - {min_key}) * {num_buckets} / ({max_key} - {min_key} + 1)) AS bucket,
  COUNT(*) AS cnt,
  BITXOR_AGG(HASH({columns})) AS checksum
FROM {table}
WHERE {key_col} >= {min_key} AND {key_col} <= {max_key}
GROUP BY bucket ORDER BY bucket

-- Postgres:
SELECT
  FLOOR(({key_col} - {min_key}) * {num_buckets} / ({max_key} - {min_key} + 1)) AS bucket,
  COUNT(*) AS cnt,
  BIT_XOR(('x' || SUBSTR(MD5(CONCAT({columns}::text)), 1, 12))::bit(48)::bigint) AS checksum
FROM {table}
WHERE {key_col} >= {min_key} AND {key_col} <= {max_key}
GROUP BY bucket ORDER BY bucket

-- BigQuery:
SELECT
  CAST(FLOOR(({key_col} - {min_key}) * {num_buckets} / ({max_key} - {min_key} + 1)) AS INT64) AS bucket,
  COUNT(*) AS cnt,
  BIT_XOR(FARM_FINGERPRINT(CONCAT({columns}))) AS checksum
FROM {table}
WHERE {key_col} >= {min_key} AND {key_col} <= {max_key}
GROUP BY bucket ORDER BY bucket

-- DuckDB:
SELECT
  FLOOR(({key_col} - {min_key}) * {num_buckets} / ({max_key} - {min_key} + 1)) AS bucket,
  COUNT(*) AS cnt,
  BIT_XOR(md5_number_lower64(CONCAT({columns}::text))) AS checksum
FROM {table}
WHERE {key_col} >= {min_key} AND {key_col} <= {max_key}
GROUP BY bucket ORDER BY bucket
```

Compare bucket by bucket. Matching checksums mean the data in that segment is identical (up to hash collisions); mismatched buckets narrow down where the differences live. The sketch below turns mismatched buckets back into key ranges.
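
A minimal sketch, assuming `source_rows` and `target_rows` hold the `(bucket, cnt, checksum)` tuples fetched above; the returned exclusive bounds plug straight into Level 4's `{segment_min}`/`{segment_max}`:

```python
def mismatched_ranges(source_rows, target_rows, min_key, max_key, num_buckets=32):
    """Compare (bucket, cnt, checksum) rows from both sides and return
    (bucket, lo, hi) for every segment that differs, inverting the
    bucket formula from the SQL above. hi is an exclusive bound."""
    src = {b: (c, h) for b, c, h in source_rows}
    tgt = {b: (c, h) for b, c, h in target_rows}
    width = (max_key - min_key + 1) / num_buckets
    return [
        (b, min_key + int(b * width), min_key + int((b + 1) * width))
        for b in sorted(set(src) | set(tgt))
        if src.get(b) != tgt.get(b)  # a bucket missing on one side also counts
    ]

# Worked example: min_key=1, max_key=1_000_000, 32 buckets gives segments
# of ~31,250 keys; a mismatch in bucket 5 maps to keys 156,251..187,500.
```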

### Level 4: Row-Level Diff (targeted, on mismatched segments only)
For any mismatched segments from Level 3, download the actual rows and diff them locally. Only fetch rows in the mismatched key range.

```sql
SELECT {key_col}, {columns}
FROM {table}
WHERE {key_col} >= {segment_min} AND {key_col} < {segment_max}
ORDER BY {key_col}
```

Compare row by row. Report additions, deletions, and value changes.
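
A minimal in-memory sketch of that comparison, assuming the key column is the first field of each row and the mismatched segment fits in memory (for very large segments, a sorted streaming merge would be safer):

```python
def diff_rows(source_rows, target_rows):
    """Diff two row sets keyed by the first column. Returns rows present
    only in target (added), rows present only in source (removed), and
    (source, target) pairs whose non-key values changed."""
    src = {row[0]: row for row in source_rows}
    tgt = {row[0]: row for row in target_rows}
    added = [tgt[k] for k in sorted(tgt.keys() - src.keys())]
    removed = [src[k] for k in sorted(src.keys() - tgt.keys())]
    changed = [(src[k], tgt[k]) for k in sorted(src.keys() & tgt.keys())
               if src[k] != tgt[k]]
    return added, removed, changed
```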

## Workflow

1. **Identify source and target** — Ask the user or infer from context:
   - Which warehouse connections? (use `warehouse_list` to show available)
   - Which tables to compare?
   - Any WHERE clause filters? (date range, partition, etc.)
   - Which columns matter? (all, or specific subset)

2. **Verify connectivity** — Run `warehouse_test` on both connections.

3. **Inspect schemas** — Use `schema_inspect` on both tables. Compare column names, types, and nullability. Flag any schema differences before proceeding (e.g., VARCHAR(100) vs VARCHAR(256), INT vs BIGINT); a comparison sketch follows the report template below.

4. **Run Level 1** — Row counts. If mismatched, report and ask if the user wants to drill deeper.

5. **Run Level 2** — Column profiles. Compare side by side. If all match, report high-confidence equivalence. If mismatches are found, highlight which columns differ and by how much.

6. **Run Level 3** (if needed) — Segment checksums. Use 32 buckets by default. Report which segments match and which differ.

7. **Run Level 4** (if needed) — Fetch rows from mismatched segments. Show the actual diff rows (additions/deletions/changes).

8. **Report** — Always produce a structured summary:

```
Data Validation Report
======================
Source: snowflake://analytics.public.orders
Target: bigquery://project.dataset.orders
Filter: created_at >= '2024-01-01'
Status: PASS | FAIL | PARTIAL

Level 1 — Row Count: PASS (1,234,567 rows both sides)
Level 2 — Profile: PASS (all 12 columns match)
Level 3 — Checksum: PASS (32/32 segments match)
Level 4 — Row Diff: SKIPPED (not needed)

Confidence: HIGH
```
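
For step 3's schema comparison, a minimal sketch, assuming the `schema_inspect` results have first been normalized into `{column: (type, nullable)}` dicts (that shape is an assumption for illustration, not the tool's documented output):

```python
def diff_schemas(source_cols: dict, target_cols: dict) -> list[str]:
    """Report columns missing on either side plus type/nullability drift,
    given {column: (type, nullable)} dicts for both tables."""
    issues = []
    for name in sorted(source_cols.keys() | target_cols.keys()):
        src, tgt = source_cols.get(name), target_cols.get(name)
        if src is None:
            issues.append(f"{name}: missing in source")
        elif tgt is None:
            issues.append(f"{name}: missing in target")
        elif src != tgt:
            issues.append(f"{name}: source {src} vs target {tgt}")
    return issues
```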

## Dialect-Specific Notes

**Hash functions by dialect:**
| Dialect     | Row Hash              | Aggregation     |
|-------------|-----------------------|-----------------|
| Snowflake   | `HASH(cols)`          | `BITXOR_AGG`    |
| Postgres    | `MD5(CONCAT(cols))`   | `BIT_XOR`       |
| BigQuery    | `FARM_FINGERPRINT`    | `BIT_XOR`       |
| DuckDB      | `md5_number_lower64`  | `BIT_XOR`       |
| Databricks  | `xxhash64(cols)`      | `BIT_XOR`       |
| MySQL       | `MD5(CONCAT(cols))`   | `BIT_XOR`       |
| ClickHouse  | `cityHash64(cols)`    | `groupBitXor`   |

Note that Postgres only added the `BIT_XOR` aggregate in version 14; on older servers, fetch per-bucket row hashes and XOR them client-side instead.

**Cross-database checksum comparison**: When source and target use different dialects, checksums won't match even for identical data (different hash functions). In that case, skip Level 3: either go directly from Level 2 to Level 4 if needed, or download sorted rows from both sides and compare them locally (the Level 4 diff sketch works for this too).

## Usage

- `/data-validate` — Start interactive validation (will ask for source/target)
- `/data-validate orders` — Validate the `orders` table across connected warehouses
- `/data-validate snowflake.orders bigquery.orders` — Explicit source and target
- `/data-validate --level 2` — Stop at profile level (skip checksums)
- `/data-validate --columns id,amount,created_at` — Only validate specific columns

Use the tools: `sql_execute`, `warehouse_list`, `warehouse_test`, `schema_inspect`, `read`, `glob`.