Commit c438bb0
feat: data-parity skill — TypeScript orchestrator, ClickHouse driver, partition support (#493)
* feat: add data-parity cross-database table comparison
- Add DataParity engine integration via native Rust bindings
- Add data-diff tool for LLM agent (profile, joindiff, hashdiff, cascade, auto)
- Add ClickHouse driver support
- Add data-parity skill: profile-first workflow, algorithm selection guide,
CRITICAL warning that joindiff cannot run cross-database (always returns 0 diffs),
output style rules (facts only, no editorializing)
- Gitignore .altimate-code/ (credentials) and *.node (platform binaries)
* feat: add partition support to data_diff
Split large tables by a date or numeric column before diffing.
Each partition is diffed independently then results are aggregated.
New params:
- partition_column: column to split on (date or numeric)
- partition_granularity: day | week | month | year (for dates)
- partition_bucket_size: bucket width for numeric columns
New output field:
- partition_results: per-partition breakdown (identical / differ / error)
Dialect-aware SQL: Postgres, Snowflake, BigQuery, ClickHouse, MySQL.
Skill updated with partition guidance and examples.
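As an illustration of the numeric partition mode, the sketch below shows one way a bucket predicate could be built from `partition_bucket_size`. The helper names and the half-open range convention (`>= start AND < end`) are assumptions made for this example, not taken from the commit.

```typescript
// Hypothetical helpers; names and the half-open range convention are assumptions.

// Lower bound of the bucket that contains `value`.
function bucketStartFor(value: number, bucketSize: number): number {
  if (!Number.isFinite(bucketSize) || bucketSize <= 0) {
    throw new Error("partition_bucket_size must be a positive number");
  }
  return Math.floor(value / bucketSize) * bucketSize;
}

// WHERE predicate selecting exactly one bucket. The column is expected to be
// quoted already for the target dialect.
function numericBucketWhere(quotedColumn: string, start: number, bucketSize: number): string {
  const end = start + bucketSize;
  return `${quotedColumn} >= ${start} AND ${quotedColumn} < ${end}`;
}

const exampleStart = bucketStartFor(1234, 500);
console.log(numericBucketWhere('"order_id"', exampleStart, 500));
// prints: "order_id" >= 1000 AND "order_id" < 1500
```

A half-open range keeps adjacent buckets from overlapping, so a row on a boundary is counted in one partition only.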
* feat: add categorical partition mode (string, enum, boolean)
When partition_column is set without partition_granularity or
partition_bucket_size, groups by raw DISTINCT values. Works for
any non-date, non-numeric column: status, region, country, etc.
WHERE clause uses equality: col = 'value' with proper escaping.
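A minimal sketch of the equality predicate for one categorical partition value follows. The function names are hypothetical, and the `IS NULL` branch is an assumption, since the commit message does not say how NULL partition values are handled.

```typescript
// Hypothetical helpers illustrating col = 'value' with quote escaping.

// Standard SQL escaping: double every single quote inside the literal.
// Note: dialects that treat backslash as an escape character (for example
// MySQL in its default mode) may need additional handling.
function sqlStringLiteral(value: string): string {
  return "'" + value.replace(/'/g, "''") + "'";
}

// Predicate for one distinct value of the partition column. The column is
// expected to be quoted already. NULL handling is an assumption.
function categoricalWhere(quotedColumn: string, value: string | null): string {
  return value === null
    ? `${quotedColumn} IS NULL`
    : `${quotedColumn} = ${sqlStringLiteral(value)}`;
}

console.log(categoricalWhere('"region"', "Côte d'Ivoire"));
// prints: "region" = 'Côte d''Ivoire'
```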
* fix: correct outcome shape handling in extractStats and formatOutcome
Rust serializes ReladiffOutcome with serde tag 'mode', producing:
{mode: 'diff', diff_rows: [...], stats: {rows_table1, rows_table2, exclusive_table1, exclusive_table2, updated, unchanged}}
Previous code checked for {Match: {...}} / {Diff: {...}} shapes that
never matched, causing partitioned diff to report all partitions as
'identical' with 0 rows.
- extractStats(): check outcome.mode === 'diff', read from stats fields
- mergeOutcomes(): aggregate mode-based outcomes correctly
- summarize()/formatOutcome(): display mode-based shape with correct labels
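The mode-tagged shape described above can be modelled with a type guard, sketched below. The field names come from the commit message. The guard, the merge function body, and the decision to throw on an unrecognised shape are assumptions for illustration and may differ from the actual `mergeOutcomes()`.

```typescript
// Field names follow the commit message; everything else is a hypothetical sketch.
interface DiffStats {
  rows_table1: number;
  rows_table2: number;
  exclusive_table1: number;
  exclusive_table2: number;
  updated: number;
  unchanged: number;
}

interface DiffOutcome {
  mode: "diff";
  diff_rows: unknown[];
  stats: DiffStats;
}

const ZERO_STATS: DiffStats = {
  rows_table1: 0, rows_table2: 0,
  exclusive_table1: 0, exclusive_table2: 0,
  updated: 0, unchanged: 0,
};

// Accept only the mode-tagged shape; {Match: ...} / {Diff: ...} objects fail here.
function isDiffOutcome(value: unknown): value is DiffOutcome {
  if (typeof value !== "object" || value === null) return false;
  const v = value as { mode?: unknown; stats?: unknown; diff_rows?: unknown };
  return v.mode === "diff" && typeof v.stats === "object" && v.stats !== null
    && Array.isArray(v.diff_rows);
}

// Sum per-partition stats. Unknown shapes raise an error instead of being
// counted as zero rows, which would look like an identical partition.
function mergeOutcomes(outcomes: unknown[]): DiffOutcome {
  const merged: DiffOutcome = { mode: "diff", diff_rows: [], stats: { ...ZERO_STATS } };
  for (const outcome of outcomes) {
    if (!isDiffOutcome(outcome)) {
      throw new Error("unrecognised outcome shape: " + JSON.stringify(outcome));
    }
    merged.diff_rows.push(...outcome.diff_rows);
    for (const key of Object.keys(ZERO_STATS) as (keyof DiffStats)[]) {
      merged.stats[key] += outcome.stats[key];
    }
  }
  return merged;
}
```

Failing loudly on an unexpected shape is a deliberate choice in this sketch: the bug described above came from a shape check that quietly never matched.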
* feat: rewrite data-parity skill with interactive, plan-first workflow
Key changes based on feedback:
- Always generate TODO plan before any tool is called
- Enforce data_diff tool usage (never manual EXCEPT/JOIN SQL)
- Add PK discovery + explicit user confirmation step
- Profile pass is now mandatory before row-level diff
- Ask user before expensive row-level diff on large tables:
  - <100K rows: proceed automatically
  - 100K-10M rows: ask with where_clause option
  - >10M rows: offer window/partition/full choices
- Document partition modes (date/numeric/categorical) with examples
- Add warehouse_list as first step to confirm connections
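The row-count thresholds in the list above are instructions in the skill prompt rather than code. Purely as an illustration, they could be expressed as the decision function below. The function name, the return labels, and the handling of the exact boundary values are assumptions.

```typescript
// Hypothetical encoding of the skill's row-count policy.
type DiffGate = "proceed" | "ask_with_where_clause" | "offer_window_partition_full";

function gateForRowCount(rowCount: number): DiffGate {
  if (rowCount < 100_000) return "proceed";                     // <100K rows
  if (rowCount <= 10_000_000) return "ask_with_where_clause";   // 100K-10M rows
  return "offer_window_partition_full";                         // >10M rows
}

console.log(gateForRowCount(50_000));      // prints: proceed
console.log(gateForRowCount(2_000_000));   // prints: ask_with_where_clause
console.log(gateForRowCount(50_000_000));  // prints: offer_window_partition_full
```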
* fix: auto-discover extra_columns and exclude audit/timestamp columns from data diff
The Rust engine only compares columns explicitly listed in extra_columns.
When omitted, it was silently reporting all key-matched rows as 'identical'
even when non-key values differed — a false positive bug.
Changes:
- Auto-discover columns from information_schema when extra_columns is omitted
and source is a plain table name (not a SQL query)
- Exclude audit/timestamp columns (updated_at, created_at, inserted_at,
modified_at, _fivetran_*, _airbyte_*, publisher_last_updated_*, etc.)
from comparison by default since they typically differ due to ETL timing
- Report excluded columns in tool output so users know what was skipped
- Fix misleading tool description that said 'Omit to compare all columns'
- Update SKILL.md with critical guidance on extra_columns behavior
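A rough sketch of the name-pattern exclusion is shown below. The patterns mirror the examples listed above. The regular expressions, the function name, and the return shape are assumptions, and the real list is longer (the commit message ends it with "etc.").

```typescript
// Hypothetical pattern list built from the examples in the commit message.
const AUDIT_NAME_PATTERNS: RegExp[] = [
  /^(updated|created|inserted|modified)_at$/i,
  /^_fivetran_/i,
  /^_airbyte_/i,
  /^publisher_last_updated_/i,
];

// Split discovered columns into those to compare and those skipped by default.
// Key columns are dropped from both lists because they are used for matching.
function selectExtraColumns(
  allColumns: string[],
  keyColumns: string[],
): { compared: string[]; excluded: string[] } {
  const keys = new Set(keyColumns.map((c) => c.toLowerCase()));
  const compared: string[] = [];
  const excluded: string[] = [];
  for (const column of allColumns) {
    if (keys.has(column.toLowerCase())) continue;
    const isAudit = AUDIT_NAME_PATTERNS.some((pattern) => pattern.test(column));
    (isAudit ? excluded : compared).push(column);
  }
  return { compared, excluded };
}

console.log(selectExtraColumns(["id", "amount", "updated_at", "_fivetran_synced"], ["id"]));
// prints: { compared: [ 'amount' ], excluded: [ 'updated_at', '_fivetran_synced' ] }
```

Returning the excluded list alongside the compared list makes it straightforward to report skipped columns in the tool output.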
* fix: add `noLimit` option to driver `execute()` to prevent silent result truncation
All drivers default to `LIMIT 1001` on SELECT queries and post-truncate to
1000 rows. This silently drops rows when the data-diff engine needs complete
result sets — a FULL OUTER JOIN returning >1000 diff rows would be truncated,
causing the engine to undercount differences.
- Add `ExecuteOptions { noLimit?: boolean }` to the `Connector` interface
- When `noLimit: true`, set `effectiveLimit = 0` (falsy) so the existing
LIMIT injection guard is skipped, and add `effectiveLimit > 0` to the
truncation check so rows aren't sliced to zero
- Update all 12 drivers: postgres, clickhouse, snowflake, bigquery, mysql,
redshift, databricks, duckdb, oracle, sqlserver, sqlite, mongodb
- Pass `{ noLimit: true }` from `data-diff.ts` `executeQuery()`
Interactive SQL callers are unaffected — they continue to get the default
1000-row limit. Only the data-diff pipeline opts out.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
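The `noLimit` logic described above can be sketched roughly as follows. `ExecuteOptions` and `effectiveLimit` are named in the commit message. The function names, the SELECT and LIMIT detection, and the default of 1000 as a parameter are assumptions for illustration, not any driver's actual code.

```typescript
interface ExecuteOptions {
  noLimit?: boolean;
}

// Decide the limit and, where applicable, append LIMIT (limit + 1) so the
// caller can tell whether more rows existed than it will return.
function applyLimit(
  sql: string,
  options: ExecuteOptions = {},
  defaultLimit = 1000,
): { sql: string; effectiveLimit: number } {
  const effectiveLimit = options.noLimit ? 0 : defaultLimit;
  const isSelect = /^\s*select\b/i.test(sql);   // assumed detection rule
  const hasLimit = /\blimit\s+\d+/i.test(sql);  // assumed detection rule
  const limited = effectiveLimit > 0 && isSelect && !hasLimit
    ? `${sql} LIMIT ${effectiveLimit + 1}`
    : sql;
  return { sql: limited, effectiveLimit };
}

// Post-truncate only when a positive limit is in force; with effectiveLimit = 0
// the rows pass through untouched instead of being sliced to zero.
function truncateRows<T>(rows: T[], effectiveLimit: number): { rows: T[]; truncated: boolean } {
  const truncated = effectiveLimit > 0 && rows.length > effectiveLimit;
  return { rows: truncated ? rows.slice(0, effectiveLimit) : rows, truncated };
}

console.log(applyLimit("SELECT * FROM t").sql);                    // prints: SELECT * FROM t LIMIT 1001
console.log(applyLimit("SELECT * FROM t", { noLimit: true }).sql); // prints: SELECT * FROM t
```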
* feat: detect auto-timestamp defaults from database catalog and confirm exclusions with user
Column exclusion now has two layers:
1. Name-pattern matching (existing) — updated_at, created_at, _fivetran_synced, etc.
2. Schema-level default detection (new) — queries column_default for NOW(),
CURRENT_TIMESTAMP, GETDATE(), SYSDATE, SYSTIMESTAMP, etc.
Covers PostgreSQL, MySQL, Snowflake, SQL Server, Oracle, ClickHouse, DuckDB,
SQLite, and Redshift in a single round-trip (no extra query).
The skill prompt now instructs the agent to present detected auto-timestamp
columns to the user and ask for confirmation before excluding them, since
migrations should preserve timestamps while ETL replication regenerates them.
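The schema-level layer amounts to a pattern check on each column's default expression. The sketch below covers only the functions named above. The regular expression and function name are assumptions, and the real detector covers more variants (the commit message ends its list with "etc.").

```typescript
// Hypothetical detector for auto-timestamp defaults read from column_default.
const AUTO_TIMESTAMP_DEFAULT = /\b(now|current_timestamp|getdate|sysdate|systimestamp)\b/i;

function isAutoTimestampDefault(columnDefault: string | null | undefined): boolean {
  if (!columnDefault) return false; // no default expression at all
  return AUTO_TIMESTAMP_DEFAULT.test(columnDefault);
}

console.log(isAutoTimestampDefault("now()"));             // prints: true
console.log(isAutoTimestampDefault("CURRENT_TIMESTAMP")); // prints: true
console.log(isAutoTimestampDefault("'pending'::text"));   // prints: false
```

A match only flags the column as a candidate. As described above, the agent still asks the user to confirm before excluding it.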
* fix: address code review findings in data-diff orchestrator
- `buildColumnDiscoverySQL`: escape single quotes in all interpolated table
name parts to prevent SQL injection via crafted source/target names
- `dateTruncExpr`: add Oracle case (`TRUNC(col, 'UNIT')`) — Oracle does not
have `DATE_TRUNC`, date-partitioned diffs on Oracle tables previously failed
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: address code review security and correctness findings
- Apply esc() to Oracle and SQLite paths in buildColumnDiscoverySQL
(SQL injection via table name was unpatched in these dialects)
- Quote identifiers in resolveTableSources to prevent injection via
table names containing semicolons or special characters
- Surface SQL execution errors before feeding empty rows to the engine
(silent false "match" when warehouse is unreachable is now an error)
- Fix Oracle TRUNC() format model map: 'WEEK' → 'IW' (ISO week)
('WEEK' throws ORA-01800 on all Oracle versions)
- Quote partition column identifier in buildPartitionWhereClause
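For the Oracle path, a granularity-to-format-model map along the following lines would produce valid `TRUNC()` calls. The map and function names are hypothetical. The `'IW'` value comes from the fix above, and the `'DD'` value for day follows a later fix recorded further down in this same commit message.

```typescript
// Hypothetical map from partition_granularity to Oracle TRUNC() format models.
type Granularity = "day" | "week" | "month" | "year";

const ORACLE_TRUNC_FORMAT: Record<Granularity, string> = {
  day: "DD",
  week: "IW",   // ISO week; 'WEEK' is not a valid format model
  month: "MM",
  year: "YYYY",
};

// Oracle has no DATE_TRUNC, so the expression uses TRUNC(col, 'FMT').
// The column is expected to be quoted already.
function oracleTruncExpr(quotedColumn: string, granularity: Granularity): string {
  return `TRUNC(${quotedColumn}, '${ORACLE_TRUNC_FORMAT[granularity]}')`;
}

console.log(oracleTruncExpr('"ORDER_DATE"', "week"));
// prints: TRUNC("ORDER_DATE", 'IW')
```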
* fix: resolve simulation suite failures — object stringification, error propagation, and test mock formats
- `altimate-core-column-lineage`: fix `[object Object]` in `column_dict` output when source entries are `{ source_table, source_column }` objects instead of strings
- `schema-inspect`: propagate `{ success: false, error }` dispatcher responses to `metadata.error` instead of silently returning empty schema
- `sql-analyze`: guard against null/undefined result from dispatcher to prevent "undefined" literal in output
- `lineage-check`: guard against null/undefined result from dispatcher to prevent "undefined" literal in output
- `simulation-suite.test.ts`: fix `sql-translate` mock format — data fields must be flat (not wrapped in `data: {}`), add `source_dialect`/`target_dialect` to mock so assertions pass
- `simulation-suite.test.ts`: fix `dbt-manifest` mock format — unwrap `data: {}` so `model_count` and `models` are accessible at top level
Simulation suite: 695/839 → 839/839 (100%)
* refactor: remove existing-tool improvements — scope to data-diff only
* refactor: revert .gitignore changes — scope to data-diff only
* fix: silence @clickhouse/client internal stderr logger to prevent TUI corruption
The @clickhouse/client package enables ERROR-level logging by default and writes
`[ERROR][@clickhouse/client][Connection]` lines directly to stderr on auth/query
failures. These raw writes corrupt the terminal TUI rendering.
Set `log: { level: 127 }` (ClickHouseLogLevel.OFF) when creating the client —
consistent with how Snowflake (`logLevel: 'OFF'`) and Databricks (no-op logger)
already suppress their SDK loggers for the same reason.
* fix: SQL injection hardening, target partition discovery, and local pack script
- Validate table names before interpolating into DESCRIBE/SHOW COLUMNS for
ClickHouse and Snowflake — reject names with non-alphanumeric characters to
prevent SQL injection; also quote parts with dialect-appropriate delimiters
- Discover partition values from BOTH source and target tables and union the
results — previously only source was queried, silently missing rows that
existed only in target-side partitions
- Add script/pack-local.ts: mirrors publish.ts but stops before npm publish;
injects local altimate-core tarballs from /tmp/altimate-local-dist/ for
local end-to-end testing
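The target-side discovery fix comes down to taking the union of the partition values found on each side. A minimal sketch, assuming both discovery queries return their values as strings (the function name is hypothetical):

```typescript
// Hypothetical helper: de-duplicated, sorted union of partition values
// discovered in the source and target tables.
function unionPartitionValues(sourceValues: string[], targetValues: string[]): string[] {
  return Array.from(new Set([...sourceValues, ...targetValues])).sort();
}

console.log(unionPartitionValues(["2024-01", "2024-02"], ["2024-02", "2024-03"]));
// prints: [ '2024-01', '2024-02', '2024-03' ]
```

In this example, `2024-03` exists only on the target side and would never be diffed if partition values were read from the source alone.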
* feat: add Step 9 result presentation guidelines to data-parity skill
Require that every diff result summary surfaces:
- Exact scope (tables + warehouses compared)
- Filters and time period applied (or explicitly states none)
- Key columns used and how they were confirmed
- Columns compared and excluded, with reasons (auto-timestamp, user request)
- Algorithm used
Includes example full result summary and guidance for identical results —
emphasising that bare numbers without context are meaningless to the user.
* fix: use correct outcome format for empty/fallback partition results
The partitioned diff returned `{ Match: { row_count: 0, algorithm: 'partitioned' } }`
when no partition values were found or all partitions failed. This format lacks
`mode: 'diff'`, so `formatOutcome` fell through to raw JSON.stringify instead
of producing clean output.
Use the standard Rust engine format:
`{ mode: 'diff', stats: {...}, diff_rows: [] }`
* chore: remove pack-local.ts — dev-only utility, not part of the feature
* feat: add data-parity skill to builder prompt with table and SQL query comparison modes
* fix: address code review findings — Oracle TRUNC, dialect-aware quoting, query+partition guard
- Oracle day granularity: 'DDD' (day-of-year) → 'DD' (day-of-month)
- Add `quoteIdentForDialect()` helper: MySQL/ClickHouse use backticks,
TSQL/Fabric use brackets, others use ANSI double-quotes
- `buildPartitionDiscoverySQL` and `buildPartitionWhereClause` now use
dialect-aware quoting instead of hardcoded double-quotes
- `runPartitionedDiff` rejects SQL queries as source/target with a clear
error — partitioning requires table names to discover column values
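A rough sketch of what a helper like `quoteIdentForDialect()` might look like is given below, based only on the delimiter rules listed above. The dialect strings and the escaping of embedded delimiters by doubling are assumptions, and the escaping should be verified for each dialect before use.

```typescript
// Sketch only: delimiter choice follows the commit message; the dialect keys
// and the delimiter-doubling escape rule are assumptions.
function quoteIdentForDialect(identifier: string, dialect: string): string {
  switch (dialect.toLowerCase()) {
    case "mysql":
    case "clickhouse":
      return "`" + identifier.replace(/`/g, "``") + "`";
    case "tsql":
    case "fabric":
      return "[" + identifier.replace(/\]/g, "]]") + "]";
    default:
      // ANSI double quotes for the remaining dialects.
      return '"' + identifier.replace(/"/g, '""') + '"';
  }
}

console.log(quoteIdentForDialect("order date", "mysql"));    // prints: `order date`
console.log(quoteIdentForDialect("order date", "tsql"));     // prints: [order date]
console.log(quoteIdentForDialect("order date", "postgres")); // prints: "order date"
```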
* fix: pin `duckdb` to 1.4.4 to prevent bun runtime timeout
- Pin `duckdb` from `^1.0.0` to exact `1.4.4` in `packages/drivers`
- Add `duckdb: 1.4.4` to root `package.json` for workspace resolution
- Update `bun.lock`
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Revert "fix: pin `duckdb` to 1.4.4 to prevent bun runtime timeout"
This reverts commit b2cf288.
---------
Co-authored-by: Aditya Pandey <aditya.p@altimate.ai>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent adaebe0 commit c438bb0
File tree
20 files changed, +1732 -57 lines changed
- .opencode/skills/data-parity
- packages
- drivers/src
- opencode
- src
- altimate
- native
- connections
- prompts
- tools
- tool
- test/altimate