
Commit d2d39fe

suryaiyer95 and claude committed
feat: add data-diff agent mode and /data-validate skill
- New `data-diff` primary agent mode for cross-database data validation with progressive checks: row counts → column profiles → segment checksums → row-level diffs
- New `/data-validate` skill with dialect-specific SQL templates for Snowflake, Postgres, BigQuery, DuckDB, Databricks, ClickHouse, MySQL
- Prompt covers 4 validation levels, cross-database checksum awareness, and structured PASS/FAIL reporting
- Added `/data-validate` to migrator and validator skill lists so both modes can invoke it

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent f668a67 commit d2d39fe

5 files changed

Lines changed: 329 additions & 0 deletions


Lines changed: 183 additions & 0 deletions
@@ -0,0 +1,183 @@
---
name: data-validate
description: Compare data between two tables across any warehouses using progressive validation — row counts, column profiles, segment checksums, and row-level drill-down.
---

# Data Validate

## Requirements
**Agent:** data-diff or migrator (requires sql_execute on both source and target)
**Tools used:** sql_execute, warehouse_list, warehouse_test, schema_inspect, read, glob

Cross-database data validation using a progressive, multi-level approach. Each level provides increasing confidence with increasing query cost — stop as soon as you have enough evidence.
## Validation Levels

### Level 1: Row Count (seconds, near-zero cost)
Compare total row counts between source and target. If counts match exactly, proceed to Level 2. If they differ, report the delta immediately — no deeper checks needed.

```sql
-- Run on source warehouse
SELECT COUNT(*) AS row_count FROM {source_table} [WHERE ...]

-- Run on target warehouse
SELECT COUNT(*) AS row_count FROM {target_table} [WHERE ...]
```
### Level 2: Column Profile (seconds, low cost)
For each column, compare aggregate statistics. This catches type coercion bugs, NULL handling differences, and truncation issues without scanning every row.

```sql
SELECT
  COUNT(*) AS total_rows,
  COUNT({col}) AS non_null_count,
  COUNT(DISTINCT {col}) AS distinct_count,
  MIN({col}) AS min_val,
  MAX({col}) AS max_val,
  -- Numeric columns only:
  AVG(CAST({col} AS DOUBLE)) AS avg_val,
  SUM(CAST({col} AS DOUBLE)) AS sum_val
FROM {table} [WHERE ...]
```

Run this for each column (or the key columns + any columns the user cares about). Compare results side by side:

```
Column Profile Comparison
=========================
Column          | Source          | Target          | Match
----------------|-----------------|-----------------|------
total_rows      | 1,234,567       | 1,234,567       | OK
user_id.distinct| 500,000         | 500,000         | OK
email.nulls     | 0               | 1,204           | MISMATCH
amount.sum      | 45,678,901.23   | 45,678,901.23   | OK
amount.avg      | 37.01           | 37.01           | OK
created_at.min  | 2020-01-01      | 2020-01-01      | OK
created_at.max  | 2024-12-31      | 2024-12-31      | OK
```

If all profiles match, tables are equivalent with high confidence. Report and stop.
### Level 3: Segment Checksums (moderate cost)
If profiles match but the user wants stronger guarantees, or if you need to locate WHERE the differences are, split the key space into segments and compare checksums.

Requires: a sortable key column (integer PK, timestamp, etc.)

```sql
-- Get key range
SELECT MIN({key_col}) AS min_key, MAX({key_col}) AS max_key FROM {table}

-- Segment checksum (dialect-specific hash aggregation)
-- Snowflake:
SELECT
  FLOOR(({key_col} - {min_key}) * {num_buckets} / ({max_key} - {min_key} + 1)) AS bucket,
  COUNT(*) AS cnt,
  BITXOR_AGG(HASH({columns})) AS checksum
FROM {table}
WHERE {key_col} >= {min_key} AND {key_col} <= {max_key}
GROUP BY bucket ORDER BY bucket

-- Postgres:
SELECT
  FLOOR(({key_col} - {min_key}) * {num_buckets} / ({max_key} - {min_key} + 1)) AS bucket,
  COUNT(*) AS cnt,
  BIT_XOR(('x' || SUBSTR(MD5(CONCAT({columns}::text)), 1, 12))::bit(48)::bigint) AS checksum
FROM {table}
WHERE {key_col} >= {min_key} AND {key_col} <= {max_key}
GROUP BY bucket ORDER BY bucket

-- BigQuery:
SELECT
  CAST(FLOOR(({key_col} - {min_key}) * {num_buckets} / ({max_key} - {min_key} + 1)) AS INT64) AS bucket,
  COUNT(*) AS cnt,
  BIT_XOR(FARM_FINGERPRINT(CONCAT({columns}))) AS checksum
FROM {table}
WHERE {key_col} >= {min_key} AND {key_col} <= {max_key}
GROUP BY bucket ORDER BY bucket

-- DuckDB:
SELECT
  FLOOR(({key_col} - {min_key}) * {num_buckets} / ({max_key} - {min_key} + 1)) AS bucket,
  COUNT(*) AS cnt,
  BIT_XOR(md5_number_lower64(CONCAT({columns}::text))) AS checksum
FROM {table}
WHERE {key_col} >= {min_key} AND {key_col} <= {max_key}
GROUP BY bucket ORDER BY bucket
```

Compare bucket-by-bucket. Matching checksums = identical data in that segment. Mismatched buckets narrow down where differences live.
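
An illustrative layout for that comparison, in the same style as the profile table above (counts and statuses are placeholders):

```
Segment Checksum Comparison (32 buckets)
========================================
Bucket | Source cnt | Target cnt | Checksum
-------|------------|------------|----------
0      | 38,580     | 38,580     | OK
1      | 38,581     | 38,581     | OK
2      | 38,579     | 38,571     | MISMATCH
...    | ...        | ...        | OK
```

Only the key range behind the mismatched bucket needs the Level 4 drill-down.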
109+
110+
### Level 4: Row-Level Diff (targeted, on mismatched segments only)
111+
For any mismatched segments from Level 3, download the actual rows and diff them locally. Only fetch rows in the mismatched key range.
112+
113+
```sql
114+
SELECT {key_col}, {columns}
115+
FROM {table}
116+
WHERE {key_col} >= {segment_min} AND {key_col} < {segment_max}
117+
ORDER BY {key_col}
118+
```
119+
120+
Compare row by row. Report additions, deletions, and value changes.
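
When source and target happen to live in the same warehouse, a single query can classify the differences instead of a local diff — a minimal sketch, assuming the dialect supports `FULL OUTER JOIN` and `IS DISTINCT FROM`:

```sql
-- Same-warehouse shortcut (sketch): classify row-level differences in one pass
SELECT
  COALESCE(s.{key_col}, t.{key_col}) AS key_val,
  CASE
    WHEN t.{key_col} IS NULL THEN 'missing_in_target'
    WHEN s.{key_col} IS NULL THEN 'missing_in_source'
    ELSE 'value_mismatch'
  END AS diff_type
FROM {source_table} s
FULL OUTER JOIN {target_table} t ON s.{key_col} = t.{key_col}
WHERE COALESCE(s.{key_col}, t.{key_col}) >= {segment_min}
  AND COALESCE(s.{key_col}, t.{key_col}) < {segment_max}
  AND (s.{key_col} IS NULL
       OR t.{key_col} IS NULL
       OR s.{col} IS DISTINCT FROM t.{col})  -- repeat per compared column
ORDER BY key_val
```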
121+
122+
## Workflow
123+
124+
1. **Identify source and target** — Ask the user or infer from context:
125+
- Which warehouse connections? (use `warehouse_list` to show available)
126+
- Which tables to compare?
127+
- Any WHERE clause filters? (date range, partition, etc.)
128+
- Which columns matter? (all, or specific subset)
129+
130+
2. **Verify connectivity** — Run `warehouse_test` on both connections.
131+
132+
3. **Inspect schemas** — Use `schema_inspect` on both tables. Compare column names, types, and nullability. Flag any schema differences before proceeding (e.g., VARCHAR(100) vs VARCHAR(256), INT vs BIGINT).
133+
134+
4. **Run Level 1** — Row counts. If mismatched, report and ask if user wants to drill deeper.
135+
136+
5. **Run Level 2** — Column profiles. Compare side by side. If all match, report high-confidence equivalence. If mismatches found, highlight which columns differ and by how much.
137+
138+
6. **Run Level 3** (if needed) — Segment checksums. Use 32 buckets by default. Report which segments match and which differ.
139+
140+
7. **Run Level 4** (if needed) — Fetch rows from mismatched segments. Show the actual diff rows (additions/deletions/changes).
141+
142+
8. **Report** — Always produce a structured summary:
143+
144+
```
145+
Data Validation Report
146+
======================
147+
Source: snowflake://analytics.public.orders
148+
Target: bigquery://project.dataset.orders
149+
Filter: created_at >= '2024-01-01'
150+
Status: PASS | FAIL | PARTIAL
151+
152+
Level 1 — Row Count: PASS (1,234,567 rows both sides)
153+
Level 2 — Profile: PASS (all 12 columns match)
154+
Level 3 — Checksum: PASS (32/32 segments match)
155+
Level 4 — Row Diff: SKIPPED (not needed)
156+
157+
Confidence: HIGH
158+
```

## Dialect-Specific Notes

**Hash functions by dialect:**

| Dialect     | Row Hash              | Aggregation     |
|-------------|-----------------------|-----------------|
| Snowflake   | `HASH(cols)`          | `BITXOR_AGG`    |
| Postgres    | `MD5(CONCAT(cols))`   | `BIT_XOR`       |
| BigQuery    | `FARM_FINGERPRINT`    | `BIT_XOR`       |
| DuckDB      | `md5_number_lower64`  | `BIT_XOR`       |
| Databricks  | `xxhash64(cols)`      | `BIT_XOR`       |
| MySQL       | `MD5(CONCAT(cols))`   | `BIT_XOR`       |
| ClickHouse  | `cityHash64(cols)`    | `groupBitXor`   |
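
The same bucket pattern extends to the remaining dialects in the table. A hedged sketch for Databricks and ClickHouse, using the hash and aggregation functions listed above (verify exact syntax against your engine version):

```sql
-- Databricks (Spark SQL) — sketch, not verified:
SELECT
  FLOOR(({key_col} - {min_key}) * {num_buckets} / ({max_key} - {min_key} + 1)) AS bucket,
  COUNT(*) AS cnt,
  BIT_XOR(xxhash64({columns})) AS checksum
FROM {table}
WHERE {key_col} >= {min_key} AND {key_col} <= {max_key}
GROUP BY bucket ORDER BY bucket

-- ClickHouse — sketch, not verified:
SELECT
  FLOOR(({key_col} - {min_key}) * {num_buckets} / ({max_key} - {min_key} + 1)) AS bucket,
  COUNT(*) AS cnt,
  groupBitXor(cityHash64({columns})) AS checksum
FROM {table}
WHERE {key_col} >= {min_key} AND {key_col} <= {max_key}
GROUP BY bucket ORDER BY bucket
```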

**Cross-database checksum comparison**: When source and target use different dialects, checksums won't match even for identical data (different hash functions). In this case, skip Level 3 and go directly from Level 2 to Level 4 if needed, OR download sorted rows from both sides and compare locally.

## Usage

- `/data-validate` — Start interactive validation (will ask for source/target)
- `/data-validate orders` — Validate the `orders` table across connected warehouses
- `/data-validate snowflake.orders bigquery.orders` — Explicit source and target
- `/data-validate --level 2` — Stop at profile level (skip checksums)
- `/data-validate --columns id,amount,created_at` — Only validate specific columns

Use the tools: `sql_execute`, `warehouse_list`, `warehouse_test`, `schema_inspect`, `read`, `glob`.

packages/opencode/src/agent/agent.ts

Lines changed: 32 additions & 0 deletions
@@ -19,6 +19,7 @@ import PROMPT_ANALYST from "../altimate/prompts/analyst.txt"
 import PROMPT_VALIDATOR from "../altimate/prompts/validator.txt"
 import PROMPT_MIGRATOR from "../altimate/prompts/migrator.txt"
 import PROMPT_EXECUTIVE from "../altimate/prompts/executive.txt"
+import PROMPT_DATA_DIFF from "../altimate/prompts/data-diff.txt"
 // altimate_change end
 import { PermissionNext } from "@/permission/next"
 import { mergeDeep, pipe, sortBy, values } from "remeda"
@@ -221,6 +222,37 @@ export namespace Agent {
       mode: "primary",
       native: true,
     },
+    "data-diff": {
+      name: "data-diff",
+      description: "Cross-database data validation. Compare tables across warehouses using progressive checks: row counts, column profiles, segment checksums, and row-level diffs.",
+      prompt: PROMPT_DATA_DIFF,
+      options: {},
+      permission: PermissionNext.merge(
+        defaults,
+        PermissionNext.fromConfig({
+          sql_execute: "allow", sql_validate: "allow", sql_analyze: "allow",
+          sql_translate: "allow", sql_optimize: "allow", lineage_check: "allow",
+          warehouse_list: "allow", warehouse_test: "allow", warehouse_discover: "allow",
+          schema_inspect: "allow", schema_index: "allow", schema_search: "allow",
+          schema_cache_status: "allow", sql_explain: "allow", sql_format: "allow",
+          sql_fix: "allow", sql_autocomplete: "allow", sql_diff: "allow",
+          finops_query_history: "allow", finops_analyze_credits: "allow",
+          finops_expensive_queries: "allow", finops_warehouse_advice: "allow",
+          finops_unused_resources: "allow", finops_role_grants: "allow",
+          finops_role_hierarchy: "allow", finops_user_roles: "allow",
+          schema_detect_pii: "allow", schema_tags: "allow", schema_tags_list: "allow",
+          altimate_core_validate: "allow", altimate_core_lint: "allow",
+          altimate_core_safety: "allow", altimate_core_transpile: "allow",
+          altimate_core_check: "allow",
+          read: "allow", write: "allow", edit: "allow",
+          grep: "allow", glob: "allow", bash: "allow",
+          question: "allow",
+        }),
+        user,
+      ),
+      mode: "primary",
+      native: true,
+    },
     // altimate_change end
     plan: {
       name: "plan",
packages/opencode/src/altimate/prompts/data-diff.txt

Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
You are altimate-code in data-diff mode — a cross-database data validation agent.

Your purpose is to compare data between two tables (same database or different warehouses) and determine whether they contain the same data. You use a progressive validation approach: cheap checks first, expensive checks only when needed.

## CRITICAL: Always Use Built-in Tools

NEVER use bash, pip install, or raw Python scripts to query databases. You have dedicated tools:

### Step 0: Discover connections
Call `warehouse_list` (no parameters) to see all configured warehouses.
Call `warehouse_test` with `name: "js"` to verify a connection works.

### Step 1: Inspect schemas
Call `schema_inspect` with `table: "DATA_DIFF_TEST.customers_identical_source"` and `warehouse: "js"` to see columns and types.

### Step 2: Execute SQL
Call `sql_execute` with:
- `query`: the SQL string
- `warehouse`: connection name (e.g. "js" for Snowflake, "test_duckdb" for DuckDB)
- `limit`: max rows (default 100, increase for row-level diffs)

Example: `sql_execute(query: "SELECT COUNT(*) FROM JS.DATA_DIFF_TEST.customers_identical_source", warehouse: "js")`

NEVER fall back to bash or Python for SQL execution. If sql_execute fails, report the error — do not try to work around it.
## Validation Protocol

Always follow this progressive approach — stop as soon as you have a definitive answer:

### Level 1: Row Count (run FIRST, always)
Use sql_execute to run `SELECT COUNT(*) AS row_count FROM {table}` on both tables via their respective warehouse connections. If counts differ, report the delta immediately.

Example:
```
sql_execute(query: "SELECT COUNT(*) AS cnt FROM JS.DATA_DIFF_TEST.customers_count_source", warehouse: "js")
sql_execute(query: "SELECT COUNT(*) AS cnt FROM JS.DATA_DIFF_TEST.customers_count_target", warehouse: "js")
```
### Level 2: Column Profile
For each column, use sql_execute to compare aggregates:
```sql
SELECT
  COUNT(*) AS total_rows,
  COUNT(col) AS non_null,
  COUNT(DISTINCT col) AS distinct_count,
  MIN(col) AS min_val,
  MAX(col) AS max_val,
  AVG(col::DOUBLE) AS avg_val, -- numeric only
  SUM(col::DOUBLE) AS sum_val  -- numeric only
FROM table
```

Run this on both source and target via sql_execute. Present results side by side.
### Level 3: Segment Checksums (same-dialect only)
Split the key space into 32 buckets and compare hash checksums per bucket.
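
First obtain {min} and {max} with a key-range query, as in the /data-validate skill (a minimal sketch, assuming an integer `id` key):
```sql
SELECT MIN(id) AS min_key, MAX(id) AS max_key FROM table
```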

Snowflake:
```sql
SELECT
  FLOOR((id - {min}) * 32 / ({max} - {min} + 1)) AS bucket,
  COUNT(*) AS cnt,
  BITXOR_AGG(HASH(col1, col2, col3)) AS checksum
FROM table WHERE id >= {min} AND id <= {max}
GROUP BY bucket ORDER BY bucket
```

DuckDB:
```sql
SELECT
  FLOOR((id - {min}) * 32 / ({max} - {min} + 1)) AS bucket,
  COUNT(*) AS cnt,
  BIT_XOR(md5_number_lower64(CONCAT(col1::text, col2::text))) AS checksum
FROM table WHERE id >= {min} AND id <= {max}
GROUP BY bucket ORDER BY bucket
```

When source and target use DIFFERENT dialects (e.g. Snowflake vs DuckDB), SKIP this level — hash functions differ so checksums won't match even for identical data. Go directly to Level 4.

### Level 4: Row-Level Diff (targeted)
For mismatched segments, fetch actual rows via sql_execute with a higher limit:
```
sql_execute(query: "SELECT * FROM table WHERE id >= {seg_min} AND id < {seg_max} ORDER BY id", warehouse: "js", limit: 1000)
```

Compare rows and report additions, deletions, and value changes.
## Output Format

Every validation MUST end with a structured summary:
```
Data Validation Report
======================
Source: {warehouse}.{table}
Target: {warehouse}.{table}
Status: PASS | FAIL

Level 1 — Row Count: PASS (1,234 rows both sides) | FAIL (1,234 vs 1,184)
Level 2 — Profile: PASS (all columns match) | FAIL (3 columns differ)
Level 3 — Checksum: PASS (32/32 match) | FAIL (3/32 differ) | SKIPPED
Level 4 — Row Diff: X rows differ | SKIPPED

Confidence: HIGH | MEDIUM | LOW
```

## Key Principles

1. **Cheapest check first** — COUNT(*) takes milliseconds. Don't skip it.
2. **Cross-database awareness** — When warehouses use different dialects, skip Level 3 checksums.
3. **Show your work** — Display the SQL and results at each level.
4. **Use tools, not bash** — sql_execute for queries, schema_inspect for schemas, warehouse_list for connections.
5. **Respect cost** — Use WHERE filters when available. Don't full-scan TB tables without consent.

packages/opencode/src/altimate/prompts/migrator.txt

Lines changed: 1 addition & 0 deletions
@@ -23,6 +23,7 @@ When migrating:

 ## Available Skills
 You have access to these skills that users can invoke with /:
+- /data-validate — Progressive data validation: row counts → profiles → checksums → row diff
 - /sql-translate — Cross-dialect SQL translation with warnings
 - /lineage-diff — Compare column lineage between SQL versions
 - /query-optimize — Query optimization with anti-pattern detection

packages/opencode/src/altimate/prompts/validator.txt

Lines changed: 1 addition & 0 deletions
@@ -95,6 +95,7 @@ Report the checklist with pass/fail/skip status for each item.
 - read, grep, glob — File reading

 ## Skills Available (read-only — these produce analysis, not file changes)
+- /data-validate — Progressive data validation: row counts → profiles → checksums → row diff
 - /lineage-diff — Compare column lineage between SQL versions
 - /cost-report — Snowflake cost analysis with optimization suggestions
 - /query-optimize — Query optimization with anti-pattern detection
