
Commit 7909e55

feat: add data-parity cross-database table comparison
- Add DataParity engine integration via native Rust bindings
- Add data-diff tool for LLM agent (profile, joindiff, hashdiff, cascade, auto)
- Add ClickHouse driver support
- Add data-parity skill: profile-first workflow, algorithm selection guide, CRITICAL warning that joindiff cannot run cross-database (always returns 0 diffs), output style rules (facts only, no editorializing)
- Gitignore .altimate-code/ (credentials) and *.node (platform binaries)
1 parent abcaa1d commit 7909e55

10 files changed: +907 −0 lines changed

.gitignore

Lines changed: 6 additions & 0 deletions
```diff
@@ -28,6 +28,12 @@ target
 # Commit message scratch files
 .github/meta/
 
+# Local connections config (may contain credentials)
+.altimate-code/
+
+# Pre-built native binaries (platform-specific, not for source control)
+packages/opencode/*.node
+
 # Local dev files
 opencode-dev
 logs/
```
Lines changed: 290 additions & 0 deletions
---
name: data-parity
description: Validate that two tables or query results are identical — or diagnose exactly how they differ. Discover schema, identify keys, profile cheaply, then diff. Use for migration validation, ETL regression, and query refactor verification.
---

# Data Parity (Table Diff)

## Output Style

**Report facts only. No editorializing.**
- Show counts, changed values, missing rows, new rows — that's it.
- Do NOT explain why row-level diffing is valuable, why COUNT(*) is insufficient, or pitch the tool.
- Do NOT add "the dangerous one", "this is exactly why", "this matters" style commentary.
- The user asked for a diff result, not a lecture.

## Requirements

**Agent:** any
**Tools used:** `sql_query` (for schema discovery), `data_diff`

## When to Use This Skill

**Use when the user wants to:**
- Confirm two tables contain the same data after a migration
- Find rows added, deleted, or modified between source and target
- Validate that a dbt model produces the same output as the old query
- Run regression checks after a pipeline change

**Do NOT use for:**
- Schema comparison (column names, types) — check DDL instead
- Performance benchmarking — this runs SELECT queries
---

## The `data_diff` Tool

`data_diff` takes table names and key columns. It generates SQL, routes it through the specified warehouse connections, and reports differences. It **does not discover schema** — you must provide key columns and relevant comparison columns.

**Key parameters:**
- `source` — table name (`orders`, `db.schema.orders`) or full SELECT/WITH query
- `target` — table name or SELECT query
- `key_columns` — primary key(s) uniquely identifying each row (required)
- `source_warehouse` — connection name for source
- `target_warehouse` — connection name for target (omit = same as source)
- `extra_columns` — columns to compare beyond keys (omit = compare all)
- `algorithm` — `auto`, `joindiff`, `hashdiff`, `profile`, `cascade`
- `where_clause` — filter applied to both tables

> **CRITICAL — Algorithm choice:**
> - If `source_warehouse` ≠ `target_warehouse`, **always use `hashdiff`** (or `auto`).
> - `joindiff` runs a single SQL JOIN on ONE connection — it physically cannot see the other table.
>   Using `joindiff` across different servers always reports 0 differences (both sides look identical).
> - When in doubt, use `algorithm="auto"` — it picks `joindiff` for same-warehouse and `hashdiff` for cross-warehouse automatically.
---

## Workflow

The key principle: **the LLM does the identification work using SQL tools first, then calls data_diff with informed parameters.**

### Step 1: Inspect the tables

Before calling `data_diff`, use `sql_query` to understand what you're comparing:

```sql
-- Get columns and types
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = 'public' AND table_name = 'orders'
ORDER BY ordinal_position
```

For ClickHouse:
```sql
DESCRIBE TABLE source_db.events
```

For Snowflake:
```sql
SHOW COLUMNS IN TABLE orders
```

**Look for:**
- Columns that look like primary keys (named `id`, `*_id`, `*_key`, `uuid`)
- Columns with `NOT NULL` constraints
- Whether there are composite keys

### Step 2: Identify the key columns

If the primary key isn't obvious from the schema, run a cardinality check:

```sql
SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT order_id) AS distinct_order_id,
  COUNT(DISTINCT customer_id) AS distinct_customer_id,
  COUNT(DISTINCT created_at) AS distinct_created_at
FROM orders
```
**A good key column:** `distinct_count = total_rows` (fully unique) and no NULLs (`COUNT(col) = COUNT(*)`, since `COUNT` skips NULLs).

If no single column is unique, find a composite key:
```sql
SELECT order_id, line_item_id, COUNT(*) AS cnt
FROM order_lines
GROUP BY order_id, line_item_id
HAVING COUNT(*) > 1
LIMIT 5
```
If this returns 0 rows, `(order_id, line_item_id)` is a valid composite key.
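
The same uniqueness-and-no-NULLs check can be sketched in code. A minimal sketch: `is_valid_key` and its dict-shaped rows are hypothetical illustrations, not part of the `data_diff` API.

```python
# Sketch: validate a candidate (possibly composite) key over fetched rows.
# A valid key is unique across rows and contains no NULLs (None here).
def is_valid_key(rows, key_cols):
    seen = set()
    for row in rows:  # each row: dict of column name -> value
        key = tuple(row[c] for c in key_cols)
        if None in key or key in seen:
            return False  # NULL in the key, or a duplicate key value
        seen.add(key)
    return True
```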

### Step 3: Estimate table size

```sql
SELECT COUNT(*) FROM orders
```

Use this to choose the algorithm:
- **< 1M rows**: `joindiff` (same DB) or `hashdiff` (cross-DB) — either is fine
- **1M–100M rows**: `hashdiff` or `cascade`
- **> 100M rows**: `hashdiff` with a `where_clause` date filter to validate a recent window first
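The thresholds above reduce to a small decision function. An illustrative sketch only, using this guide's cutoffs; the tool's real `auto` logic lives inside the engine:

```python
# Sketch of the size/topology guidance above (thresholds from this guide, not the engine).
def pick_algorithm(row_count, cross_db, has_date_filter=False):
    if cross_db:
        # joindiff is ruled out cross-database; very large tables need scoping first
        if row_count > 100_000_000 and not has_date_filter:
            return "hashdiff + where_clause"
        return "hashdiff"
    if row_count < 1_000_000:
        return "joindiff"  # same DB, small table: a single JOIN is fine
    return "hashdiff"  # 1M+ rows: checksum bisection (or cascade) scales better
```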

### Step 4: Profile first for unknown tables

If you don't know what to expect (first-time validation, unfamiliar pipeline), start cheap:

```
data_diff(
  source="orders",
  target="orders_migrated",
  key_columns=["order_id"],
  source_warehouse="postgres_prod",
  target_warehouse="snowflake_dw",
  algorithm="profile"
)
```

Profile output tells you:
- Row count on each side (mismatch = load completeness problem)
- Which columns have null count differences (mismatch = NULL handling bug)
- Min/max divergence per column (mismatch = value transformation bug)
- Which columns match exactly (safe to skip in row-level diff)

**Interpret profile to narrow the diff:**
```
Column Profile Comparison

✓ order_id: match
✓ customer_id: match
✗ amount: DIFFER ← source min=10.00, target min=10.01 — rounding issue?
✗ status: DIFFER ← source nulls=0, target nulls=47 — NULL mapping bug?
✓ created_at: match
```
→ Only diff `amount` and `status` in the next step.
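The narrowing step can be mechanical. A minimal sketch, assuming a profile result reshaped as a column-to-status mapping (the real output is formatted text, so this mapping is an assumption):

```python
# Sketch: keep only the diverging columns for the follow-up row-level diff.
def columns_to_diff(profile):
    # profile: hypothetical mapping of column name -> "match" | "differ"
    return [col for col, status in profile.items() if status == "differ"]
```

Feeding the result to `extra_columns` keeps the row-level diff scoped to what profile already flagged.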

### Step 5: Run targeted row-level diff

```
data_diff(
  source="orders",
  target="orders_migrated",
  key_columns=["order_id"],
  extra_columns=["amount", "status"],  // only the columns profile said differ
  source_warehouse="postgres_prod",
  target_warehouse="snowflake_dw",
  algorithm="hashdiff"
)
```

---

## Algorithm Selection

| Algorithm | When to use |
|-----------|-------------|
| `profile` | First pass — column stats (count, min, max, nulls). No row scan. |
| `joindiff` | Same database — single FULL OUTER JOIN query. Fast. |
| `hashdiff` | Cross-database, or large tables — bisection with checksums. Scales. |
| `cascade` | Auto-escalate: profile → hashdiff on diverging columns. |
| `auto` | JoinDiff if same warehouse, HashDiff if cross-database. |

**JoinDiff constraint:** Both tables must be on the **same database connection**. If source and target are on different servers, JoinDiff will always report 0 diffs (it only sees one side). Use `hashdiff` or `auto` for cross-database.
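
The "bisection with checksums" behind `hashdiff` can be illustrated with a toy version: checksum a key range on both sides, and recurse only into ranges whose checksums disagree. A minimal sketch over in-memory `(key, value)` pairs, not the engine's actual implementation (which computes checksums in SQL on each warehouse):

```python
import hashlib

def checksum(rows):
    # One checksum per key range; only checksums cross the network, not rows.
    h = hashlib.md5()
    for key, value in rows:
        h.update(f"{key}:{value}".encode())
    return h.hexdigest()

def hashdiff(source, target):
    # source/target: key-sorted lists of (key, value) pairs standing in for SELECT results.
    if checksum(source) == checksum(target):
        return []  # whole range identical: no per-row work needed
    keys = sorted({k for k, _ in source} | {k for k, _ in target})
    if len(keys) <= 1 or len(source) + len(target) <= 4:
        return sorted(set(source) ^ set(target))  # small range: compare rows directly
    mid = keys[len(keys) // 2]
    lo = lambda rows: [r for r in rows if r[0] < mid]
    hi = lambda rows: [r for r in rows if r[0] >= mid]
    return hashdiff(lo(source), lo(target)) + hashdiff(hi(source), hi(target))
```

Matching ranges are pruned after one checksum comparison each, which is why the approach scales to tables too large to transfer.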

---

## Output Interpretation

### IDENTICAL
```
✓ Tables are IDENTICAL
Rows checked: 1,000,000
```
→ Migration validated. Data is identical.

### DIFFER — Diagnose by pattern

```
✗ Tables DIFFER

Only in source: 2   → rows deleted in target (ETL missed deletes)
Only in target: 2   → rows added to target (dedup issue or new data)
Updated rows:   3   → values changed (transform bug, type casting, rounding)
Identical rows: 15
```

| Pattern | Root cause hypothesis |
|---------|----------------------|
| `only_in_source > 0`, `only_in_target = 0` | ETL dropped rows — check filters, incremental logic |
| `only_in_source = 0`, `only_in_target > 0` | Target has extra rows — check dedup or wrong join |
| `updated_rows > 0`, row counts match | Silent value corruption — check transforms, type casts |
| Row count differs | Load completeness issue — check ETL watermarks |
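The table maps directly onto the counters in the diff summary. A hedged sketch of that mapping (the function and its phrasing are this guide's, not tool output, and real triage still needs human judgment):

```python
# Sketch: map diff counters to the root-cause hypotheses in the table above.
def diagnose(only_in_source, only_in_target, updated_rows):
    if only_in_source and only_in_target:
        return "load completeness issue - check ETL watermarks"
    if only_in_source:
        return "ETL dropped rows - check filters, incremental logic"
    if only_in_target:
        return "target has extra rows - check dedup or wrong join"
    if updated_rows:
        return "silent value corruption - check transforms, type casts"
    return "tables identical"
```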

Sample diffs point to the specific key + column + old→new value:
```
key={"order_id":"4"} col=amount: 300.00 → 305.00
```
Use this to query the source systems directly and trace the discrepancy.

---

## Usage Examples

### Full workflow: unknown migration
```
// 1. Discover schema
sql_query("SELECT column_name, data_type FROM information_schema.columns WHERE table_name='orders'", warehouse="postgres_prod")

// 2. Check row count
sql_query("SELECT COUNT(*), COUNT(DISTINCT order_id) FROM orders", warehouse="postgres_prod")

// 3. Profile to find which columns differ
data_diff(source="orders", target="orders", key_columns=["order_id"],
          source_warehouse="postgres_prod", target_warehouse="snowflake_dw", algorithm="profile")

// 4. Row-level diff on diverging columns only
data_diff(source="orders", target="orders", key_columns=["order_id"],
          extra_columns=["amount", "status"],
          source_warehouse="postgres_prod", target_warehouse="snowflake_dw", algorithm="hashdiff")
```

### Same-database query refactor
```
data_diff(
  source="SELECT id, amount, status FROM orders WHERE region = 'us-east'",
  target="SELECT id, amount, status FROM orders_v2 WHERE region = 'us-east'",
  key_columns=["id"]
)
```

### Large table — filter to recent window first
```
data_diff(
  source="fact_events",
  target="fact_events_v2",
  key_columns=["event_id"],
  where_clause="event_date >= '2024-01-01'",
  algorithm="hashdiff"
)
```

### ClickHouse — always qualify with database.table
```
data_diff(
  source="source_db.events",
  target="target_db.events",
  key_columns=["event_id"],
  source_warehouse="clickhouse_source",
  target_warehouse="clickhouse_target",
  algorithm="hashdiff"
)
```

---
## Common Mistakes

**Calling data_diff without knowing the key**
→ Run `sql_query` to check cardinality first. A bad key gives meaningless results.

**Using joindiff for cross-database tables**
→ JoinDiff runs one SQL query on one connection. It can't see the other table. Use `hashdiff` or `auto`.

**Diffing a 1B row table without a date filter**
→ Add `where_clause` to scope to recent data. Validate a window first, then expand.

**Ignoring profile output and jumping to full diff**
→ Profile is free. It tells you which columns actually differ so you can avoid scanning all columns across all rows.

**Forgetting to check row counts before diffing**
→ If source has 1M rows and target has 900K, row-level diff is misleading. Fix the load completeness issue first.
