---
name: data-validate
description: Compare data between two tables across any warehouses using progressive validation — row counts, column profiles, segment checksums, and row-level drill-down.
---

# Data Validate

## Requirements
**Agent:** data-diff or migrator (requires sql_execute on both source and target)
**Tools used:** sql_execute, warehouse_list, warehouse_test, schema_inspect, read, glob

Cross-database data validation using a progressive, multi-level approach. Each level provides increasing confidence with increasing query cost — stop as soon as you have enough evidence.
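
The control flow is an early-stop loop over the levels. A minimal sketch in Python, where the `check_*` callables are hypothetical helpers (not part of the toolset) that each run one level's SQL on both warehouses:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class LevelResult:
    passed: bool
    detail: str = ""

def validate(checks: list[Callable[[], LevelResult]]) -> str:
    """Run validation levels in order, stopping at the first failure."""
    for n, check in enumerate(checks, start=1):
        result = check()
        if not result.passed:
            return f"FAIL at Level {n}: {result.detail}"
    return f"PASS ({len(checks)} levels run, confidence HIGH)"

# Usage: validate([check_row_counts, check_profiles, check_checksums])
# where each hypothetical helper wraps the queries shown below.
```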

## Validation Levels

### Level 1: Row Count (seconds, near-zero cost)
Compare total row counts between source and target. If counts match exactly, proceed to Level 2. If they differ, report the delta immediately — no deeper checks needed.

```sql
-- Run on source warehouse
SELECT COUNT(*) AS row_count FROM {source_table} [WHERE ...]

-- Run on target warehouse
SELECT COUNT(*) AS row_count FROM {target_table} [WHERE ...]
```

### Level 2: Column Profile (seconds, low cost)
For each column, compare aggregate statistics. This catches type coercion bugs, NULL handling differences, and truncation issues without scanning every row.

```sql
SELECT
  COUNT(*) AS total_rows,
  COUNT({col}) AS non_null_count,
  COUNT(DISTINCT {col}) AS distinct_count,
  MIN({col}) AS min_val,
  MAX({col}) AS max_val,
  -- Numeric columns only:
  AVG(CAST({col} AS DOUBLE)) AS avg_val,
  SUM(CAST({col} AS DOUBLE)) AS sum_val
FROM {table} [WHERE ...]
```

Run this for each column, or at minimum for the key columns plus any columns the user cares about. Compare the results side by side:

```
Column Profile Comparison
=========================
Column          | Source          | Target          | Match
----------------|-----------------|-----------------|------
total_rows      | 1,234,567       | 1,234,567       | OK
user_id.distinct| 500,000         | 500,000         | OK
email.nulls     | 0               | 1,204           | MISMATCH
amount.sum      | 45,678,901.23   | 45,678,901.23   | OK
amount.avg      | 37.01           | 37.01           | OK
created_at.min  | 2020-01-01      | 2020-01-01      | OK
created_at.max  | 2024-12-31      | 2024-12-31      | OK
```

If all profiles match, the tables are equivalent with high confidence. Report and stop.
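
When comparing, exact equality is right for counts and min/max, but `sum` and `avg` can legitimately differ in the last few bits across engines (summation order after `CAST AS DOUBLE` is not deterministic). A minimal comparison sketch, assuming each side's profile has been flattened into a `{stat_name: value}` dict (an illustrative shape, not a tool contract):

```python
import math

def compare_profiles(source: dict, target: dict, rel_tol: float = 1e-9) -> list[str]:
    """Compare flattened profile stats, e.g. {"amount.sum": 45678901.23}.
    Counts and min/max must match exactly; floating-point aggregates get
    a relative tolerance to absorb cross-engine summation-order noise."""
    mismatches = []
    for stat in source.keys() | target.keys():
        src, tgt = source.get(stat), target.get(stat)
        if isinstance(src, float) and isinstance(tgt, float):
            ok = math.isclose(src, tgt, rel_tol=rel_tol)
        else:
            ok = src == tgt
        if not ok:
            mismatches.append(f"{stat}: source={src!r} target={tgt!r}")
    return sorted(mismatches)
```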

### Level 3: Segment Checksums (moderate cost)
If profiles match but the user wants stronger guarantees, or if you need to locate WHERE the differences are, split the key space into segments and compare checksums.

Requires: a sortable key column (integer PK, timestamp, etc.).

```sql
-- Get key range
SELECT MIN({key_col}) AS min_key, MAX({key_col}) AS max_key FROM {table}

-- Segment checksum (dialect-specific hash aggregation)
-- Snowflake:
SELECT
  FLOOR(({key_col} - {min_key}) * {num_buckets} / ({max_key} - {min_key} + 1)) AS bucket,
  COUNT(*) AS cnt,
  BITXOR_AGG(HASH({columns})) AS checksum
FROM {table}
WHERE {key_col} >= {min_key} AND {key_col} <= {max_key}
GROUP BY bucket ORDER BY bucket

-- Postgres:
SELECT
  FLOOR(({key_col} - {min_key}) * {num_buckets} / ({max_key} - {min_key} + 1)) AS bucket,
  COUNT(*) AS cnt,
  BIT_XOR(('x' || SUBSTR(MD5(CONCAT({columns}::text)), 1, 12))::bit(48)::bigint) AS checksum
FROM {table}
WHERE {key_col} >= {min_key} AND {key_col} <= {max_key}
GROUP BY bucket ORDER BY bucket

-- BigQuery:
SELECT
  CAST(FLOOR(({key_col} - {min_key}) * {num_buckets} / ({max_key} - {min_key} + 1)) AS INT64) AS bucket,
  COUNT(*) AS cnt,
  BIT_XOR(FARM_FINGERPRINT(CONCAT({columns}))) AS checksum
FROM {table}
WHERE {key_col} >= {min_key} AND {key_col} <= {max_key}
GROUP BY bucket ORDER BY bucket

-- DuckDB:
SELECT
  FLOOR(({key_col} - {min_key}) * {num_buckets} / ({max_key} - {min_key} + 1)) AS bucket,
  COUNT(*) AS cnt,
  BIT_XOR(md5_number_lower64(CONCAT({columns}::text))) AS checksum
FROM {table}
WHERE {key_col} >= {min_key} AND {key_col} <= {max_key}
GROUP BY bucket ORDER BY bucket
```

Compare bucket by bucket. Matching checksums mean the data in that segment is identical (up to hash collisions); mismatched buckets narrow down where the differences live. The sketch below turns mismatched buckets back into key ranges.
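
A minimal sketch, assuming `source_rows` and `target_rows` hold the `(bucket, cnt, checksum)` tuples fetched above; the returned exclusive bounds plug straight into Level 4's `{segment_min}`/`{segment_max}`:

```python
def mismatched_ranges(source_rows, target_rows, min_key, max_key, num_buckets=32):
    """Compare (bucket, cnt, checksum) rows from both sides and return
    (bucket, lo, hi) for every segment that differs, inverting the
    bucket formula from the SQL above. hi is an exclusive bound."""
    src = {b: (c, h) for b, c, h in source_rows}
    tgt = {b: (c, h) for b, c, h in target_rows}
    width = (max_key - min_key + 1) / num_buckets
    return [
        (b, min_key + int(b * width), min_key + int((b + 1) * width))
        for b in sorted(set(src) | set(tgt))
        if src.get(b) != tgt.get(b)  # a bucket missing on one side also counts
    ]

# Worked example: min_key=1, max_key=1_000_000, 32 buckets gives segments
# of ~31,250 keys; a mismatch in bucket 5 maps to keys 156,251..187,500.
```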

### Level 4: Row-Level Diff (targeted, on mismatched segments only)
For any mismatched segments from Level 3, download the actual rows and diff them locally. Only fetch rows in the mismatched key range.

```sql
SELECT {key_col}, {columns}
FROM {table}
WHERE {key_col} >= {segment_min} AND {key_col} < {segment_max}
ORDER BY {key_col}
```

Compare row by row. Report additions, deletions, and value changes.
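
A minimal in-memory sketch of that comparison, assuming the key column is the first field of each row and the mismatched segment fits in memory (for very large segments, a sorted streaming merge would be safer):

```python
def diff_rows(source_rows, target_rows):
    """Diff two row sets keyed by the first column. Returns rows present
    only in target (added), rows present only in source (removed), and
    (source, target) pairs whose non-key values changed."""
    src = {row[0]: row for row in source_rows}
    tgt = {row[0]: row for row in target_rows}
    added = [tgt[k] for k in sorted(tgt.keys() - src.keys())]
    removed = [src[k] for k in sorted(src.keys() - tgt.keys())]
    changed = [(src[k], tgt[k]) for k in sorted(src.keys() & tgt.keys())
               if src[k] != tgt[k]]
    return added, removed, changed
```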

## Workflow

1. **Identify source and target** — Ask the user or infer from context:
   - Which warehouse connections? (use `warehouse_list` to show available)
   - Which tables to compare?
   - Any WHERE clause filters? (date range, partition, etc.)
   - Which columns matter? (all, or specific subset)

2. **Verify connectivity** — Run `warehouse_test` on both connections.

3. **Inspect schemas** — Use `schema_inspect` on both tables. Compare column names, types, and nullability. Flag any schema differences before proceeding (e.g., VARCHAR(100) vs VARCHAR(256), INT vs BIGINT); a comparison sketch follows the report template below.

4. **Run Level 1** — Row counts. If mismatched, report and ask if the user wants to drill deeper.

5. **Run Level 2** — Column profiles. Compare side by side. If all match, report high-confidence equivalence. If mismatches are found, highlight which columns differ and by how much.

6. **Run Level 3** (if needed) — Segment checksums. Use 32 buckets by default. Report which segments match and which differ.

7. **Run Level 4** (if needed) — Fetch rows from mismatched segments. Show the actual diff rows (additions/deletions/changes).

8. **Report** — Always produce a structured summary:

```
Data Validation Report
======================
Source: snowflake://analytics.public.orders
Target: bigquery://project.dataset.orders
Filter: created_at >= '2024-01-01'
Status: PASS | FAIL | PARTIAL

Level 1 — Row Count: PASS (1,234,567 rows both sides)
Level 2 — Profile: PASS (all 12 columns match)
Level 3 — Checksum: PASS (32/32 segments match)
Level 4 — Row Diff: SKIPPED (not needed)

Confidence: HIGH
```
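
For step 3's schema comparison, a minimal sketch, assuming the `schema_inspect` results have first been normalized into `{column: (type, nullable)}` dicts (that shape is an assumption for illustration, not the tool's documented output):

```python
def diff_schemas(source_cols: dict, target_cols: dict) -> list[str]:
    """Report columns missing on either side plus type/nullability drift,
    given {column: (type, nullable)} dicts for both tables."""
    issues = []
    for name in sorted(source_cols.keys() | target_cols.keys()):
        src, tgt = source_cols.get(name), target_cols.get(name)
        if src is None:
            issues.append(f"{name}: missing in source")
        elif tgt is None:
            issues.append(f"{name}: missing in target")
        elif src != tgt:
            issues.append(f"{name}: source {src} vs target {tgt}")
    return issues
```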

## Dialect-Specific Notes

**Hash functions by dialect:**
| Dialect     | Row Hash              | Aggregation     |
|-------------|-----------------------|-----------------|
| Snowflake   | `HASH(cols)`          | `BITXOR_AGG`    |
| Postgres    | `MD5(CONCAT(cols))`   | `BIT_XOR`       |
| BigQuery    | `FARM_FINGERPRINT`    | `BIT_XOR`       |
| DuckDB      | `md5_number_lower64`  | `BIT_XOR`       |
| Databricks  | `xxhash64(cols)`      | `BIT_XOR`       |
| MySQL       | `MD5(CONCAT(cols))`   | `BIT_XOR`       |
| ClickHouse  | `cityHash64(cols)`    | `groupBitXor`   |

Note that Postgres only added the `BIT_XOR` aggregate in version 14; on older servers, fetch per-bucket row hashes and XOR them client-side instead.

**Cross-database checksum comparison**: When source and target use different dialects, checksums won't match even for identical data (different hash functions). In that case, skip Level 3: either go directly from Level 2 to Level 4 if needed, or download sorted rows from both sides and compare them locally (the Level 4 diff sketch works for this too).

## Usage

- `/data-validate` — Start interactive validation (will ask for source/target)
- `/data-validate orders` — Validate the `orders` table across connected warehouses
- `/data-validate snowflake.orders bigquery.orders` — Explicit source and target
- `/data-validate --level 2` — Stop at profile level (skip checksums)
- `/data-validate --columns id,amount,created_at` — Only validate specific columns

Use the tools: `sql_execute`, `warehouse_list`, `warehouse_test`, `schema_inspect`, `read`, `glob`.