---
name: data-parity
description: Validate that two tables or query results are identical — or diagnose exactly how they differ. Discover schema, identify keys, profile cheaply, then diff. Use for migration validation, ETL regression, and query refactor verification.
---

# Data Parity (Table Diff)

## Output Style

**Report facts only. No editorializing.**
- Show counts, changed values, missing rows, new rows — that's it.
- Do NOT explain why row-level diffing is valuable, why COUNT(*) is insufficient, or pitch the tool.
- Do NOT add "the dangerous one", "this is exactly why", "this matters" style commentary.
- The user asked for a diff result, not a lecture.

## Requirements
**Agent:** any
**Tools used:** `sql_query` (for schema discovery), `data_diff`

## When to Use This Skill

**Use when the user wants to:**
- Confirm two tables contain the same data after a migration
- Find rows added, deleted, or modified between source and target
- Validate that a dbt model produces the same output as the old query
- Run regression checks after a pipeline change

**Do NOT use for:**
- Schema comparison (column names, types) — check DDL instead
- Performance benchmarking — this runs SELECT queries

---

## The `data_diff` Tool

`data_diff` takes table names and key columns. It generates SQL, routes it through the specified warehouse connections, and reports differences. It **does not discover schema** — you must provide key columns and relevant comparison columns.

**Key parameters:**
- `source` — table name (`orders`, `db.schema.orders`) or full SELECT/WITH query
- `target` — table name or SELECT query
- `key_columns` — primary key(s) uniquely identifying each row (required)
- `source_warehouse` — connection name for source
- `target_warehouse` — connection name for target (omit = same as source)
- `extra_columns` — columns to compare beyond keys (omit = compare all)
- `algorithm` — `auto`, `joindiff`, `hashdiff`, `profile`, `cascade`
- `where_clause` — filter applied to both tables

> **CRITICAL — Algorithm choice:**
> - If `source_warehouse` ≠ `target_warehouse` → **always use `hashdiff`** (or `auto`).
> - `joindiff` runs a single SQL JOIN on ONE connection — it physically cannot see the other table.
>   Using `joindiff` across different servers always reports 0 differences (both sides look identical).
> - When in doubt, use `algorithm="auto"` — it picks `joindiff` for same-warehouse and `hashdiff` for cross-warehouse automatically.

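
For intuition on why `hashdiff` works cross-warehouse: hashdiff-style tools run an ordinary aggregate on each connection independently, compare the resulting checksums, and bisect only the key ranges that disagree. The sketch below illustrates that per-side query shape — it is an assumption about the approach, not the tool's actual generated SQL; the bucket size, hash, and aggregation functions are warehouse-specific (this one is Postgres-flavored), and the `orders` columns are from the examples in this document.

```sql
-- Sketch only: the kind of per-bucket checksum a hashdiff pass computes.
-- Run independently on source and target connections — no cross-server
-- JOIN needed. Buckets whose hashes differ are bisected further until
-- individual mismatched rows are isolated.
SELECT
  FLOOR(order_id / 100000) AS key_bucket,   -- bisection bucket
  COUNT(*)                 AS bucket_rows,
  MD5(STRING_AGG(order_id::text || '|' || amount::text, ',' ORDER BY order_id)) AS bucket_hash
FROM orders
GROUP BY 1
ORDER BY 1
```

Because each side only ever queries itself, this works between servers that cannot reach each other — the situation where `joindiff` silently fails.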
---

## Workflow

The key principle: **the LLM does the identification work using SQL tools first, then calls data_diff with informed parameters.**

### Step 1: Inspect the tables

Before calling `data_diff`, use `sql_query` to understand what you're comparing:

```sql
-- Get columns and types
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = 'public' AND table_name = 'orders'
ORDER BY ordinal_position
```

For ClickHouse:
```sql
DESCRIBE TABLE source_db.events
```

For Snowflake:
```sql
SHOW COLUMNS IN TABLE orders
```

**Look for:**
- Columns that look like primary keys (named `id`, `*_id`, `*_key`, `uuid`)
- Columns with `NOT NULL` constraints
- Whether there are composite keys

### Step 2: Identify the key columns

If the primary key isn't obvious from the schema, run a cardinality check:

```sql
SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT order_id) AS distinct_order_id,
  COUNT(*) - COUNT(order_id) AS null_order_id,
  COUNT(DISTINCT customer_id) AS distinct_customer_id,
  COUNT(DISTINCT created_at) AS distinct_created_at
FROM orders
```

**A good key column:** `distinct_count = total_rows` (fully unique) and `null_count = 0` (the `COUNT(*) - COUNT(col)` term above).

If no single column is unique, find a composite key:
```sql
SELECT order_id, line_item_id, COUNT(*) AS cnt
FROM order_lines
GROUP BY order_id, line_item_id
HAVING COUNT(*) > 1
LIMIT 5
```
If this returns 0 rows, `(order_id, line_item_id)` is a valid composite key.

### Step 3: Estimate table size

```sql
SELECT COUNT(*) FROM orders
```

Use this to choose the algorithm:
- **< 1M rows**: `joindiff` (same DB) or `hashdiff` (cross-DB) — either is fine
- **1M–100M rows**: `hashdiff` or `cascade`
- **> 100M rows**: `hashdiff` with a `where_clause` date filter to validate a recent window first

### Step 4: Profile first for unknown tables

If you don't know what to expect (first-time validation, unfamiliar pipeline), start cheap:

```
data_diff(
  source="orders",
  target="orders_migrated",
  key_columns=["order_id"],
  source_warehouse="postgres_prod",
  target_warehouse="snowflake_dw",
  algorithm="profile"
)
```

Profile output tells you:
- Row count on each side (mismatch = load completeness problem)
- Which columns have null count differences (mismatch = NULL handling bug)
- Min/max divergence per column (mismatch = value transformation bug)
- Which columns match exactly (safe to skip in row-level diff)

**Interpret profile to narrow the diff:**
```
Column Profile Comparison

  ✓ order_id: match
  ✓ customer_id: match
  ✗ amount: DIFFER ← source min=10.00, target min=10.01 — rounding issue?
  ✗ status: DIFFER ← source nulls=0, target nulls=47 — NULL mapping bug?
  ✓ created_at: match
```
→ Only diff `amount` and `status` in the next step.

### Step 5: Run targeted row-level diff

```
data_diff(
  source="orders",
  target="orders_migrated",
  key_columns=["order_id"],
  extra_columns=["amount", "status"],  // only the columns profile said differ
  source_warehouse="postgres_prod",
  target_warehouse="snowflake_dw",
  algorithm="hashdiff"
)
```

---

## Algorithm Selection

| Algorithm | When to use |
|-----------|-------------|
| `profile` | First pass — column stats (count, min, max, nulls). No row scan. |
| `joindiff` | Same database — single FULL OUTER JOIN query. Fast. |
| `hashdiff` | Cross-database, or large tables — bisection with checksums. Scales. |
| `cascade` | Auto-escalate: profile → hashdiff on diverging columns. |
| `auto` | `joindiff` if same warehouse, `hashdiff` if cross-database. |

**JoinDiff constraint:** Both tables must be on the **same database connection**. If source and target are on different servers, JoinDiff will always report 0 diffs (it only sees one side). Use `hashdiff` or `auto` for cross-database.
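
As a mental model, the single query a joindiff pass issues looks roughly like the sketch below — an assumption about the shape, not the tool's actual generated SQL, reusing the `orders`/`orders_migrated` tables and `amount`/`status` columns from the examples above:

```sql
-- Sketch of a joindiff-style comparison: one FULL OUTER JOIN on ONE
-- connection. Both tables must be visible to that connection — which is
-- why joindiff cannot work across servers.
SELECT
  COALESCE(s.order_id, t.order_id) AS order_id,
  CASE
    WHEN t.order_id IS NULL THEN 'only_in_source'
    WHEN s.order_id IS NULL THEN 'only_in_target'
    WHEN s.amount IS DISTINCT FROM t.amount
      OR s.status IS DISTINCT FROM t.status THEN 'updated'
    ELSE 'identical'
  END AS diff_status
FROM orders AS s
FULL OUTER JOIN orders_migrated AS t USING (order_id)
```

`IS DISTINCT FROM` (rather than `<>`) is what makes NULL-vs-value changes count as differences; warehouses without it need a NULL-safe equality workaround.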

---

## Output Interpretation

### IDENTICAL
```
✓ Tables are IDENTICAL
  Rows checked: 1,000,000
```
→ Migration validated; no further diffing needed.

### DIFFER — Diagnose by pattern

```
✗ Tables DIFFER

  Only in source: 2   → rows deleted in target (ETL missed deletes)
  Only in target: 2   → rows added to target (dedup issue or new data)
  Updated rows:   3   → values changed (transform bug, type casting, rounding)
  Identical rows: 15
```

| Pattern | Root cause hypothesis |
|---------|----------------------|
| `only_in_source > 0`, `only_in_target = 0` | ETL dropped rows — check filters, incremental logic |
| `only_in_source = 0`, `only_in_target > 0` | Target has extra rows — check dedup or wrong join |
| `updated_rows > 0`, row counts match | Silent value corruption — check transforms, type casts |
| Row count differs | Load completeness issue — check ETL watermarks |

Sample diffs point to the specific key + column + old→new value:
```
key={"order_id":"4"} col=amount: 300.00 → 305.00
```
Use this to query the source systems directly and trace the discrepancy.
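
For the sample diff above, the trace is just a point lookup on each side (key and column names taken from the sample; adjust quoting to the key's actual type):

```sql
-- Run against the source connection, then repeat against the target
-- (e.g. the migrated table) and compare the raw values.
SELECT order_id, amount
FROM orders
WHERE order_id = '4'
```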

---

## Usage Examples

### Full workflow: unknown migration
```
// 1. Discover schema
sql_query("SELECT column_name, data_type FROM information_schema.columns WHERE table_name='orders'", warehouse="postgres_prod")

// 2. Check row count
sql_query("SELECT COUNT(*), COUNT(DISTINCT order_id) FROM orders", warehouse="postgres_prod")

// 3. Profile to find which columns differ
data_diff(source="orders", target="orders", key_columns=["order_id"],
          source_warehouse="postgres_prod", target_warehouse="snowflake_dw", algorithm="profile")

// 4. Row-level diff on diverging columns only
data_diff(source="orders", target="orders", key_columns=["order_id"],
          extra_columns=["amount", "status"],
          source_warehouse="postgres_prod", target_warehouse="snowflake_dw", algorithm="hashdiff")
```

### Same-database query refactor
```
data_diff(
  source="SELECT id, amount, status FROM orders WHERE region = 'us-east'",
  target="SELECT id, amount, status FROM orders_v2 WHERE region = 'us-east'",
  key_columns=["id"]
)
```

### Large table — filter to recent window first
```
data_diff(
  source="fact_events",
  target="fact_events_v2",
  key_columns=["event_id"],
  where_clause="event_date >= '2024-01-01'",
  algorithm="hashdiff"
)
```

### ClickHouse — always qualify with database.table
```
data_diff(
  source="source_db.events",
  target="target_db.events",
  key_columns=["event_id"],
  source_warehouse="clickhouse_source",
  target_warehouse="clickhouse_target",
  algorithm="hashdiff"
)
```

---

## Common Mistakes

**Calling data_diff without knowing the key**
→ Run `sql_query` to check cardinality first. A bad key gives meaningless results.

**Using joindiff for cross-database tables**
→ JoinDiff runs one SQL query on one connection. It can't see the other table. Use `hashdiff` or `auto`.

**Diffing a 1B-row table without a date filter**
→ Add `where_clause` to scope to recent data. Validate a window first, then expand.

**Ignoring profile output and jumping to full diff**
→ Profile is free. It tells you which columns actually differ so you can avoid scanning all columns across all rows.

**Forgetting to check row counts before diffing**
→ If source has 1M rows and target has 900K, row-level diff is misleading. Fix the load completeness issue first.