
fix: data-diff orchestrator, DuckDB bun compat, noLimit, and skill #615

Merged
suryaiyer95 merged 13 commits into feat/data-parity-skill-improvements from fix/data-parity-review-fixes
Apr 2, 2026

Conversation

@suryaiyer95
Contributor

Summary

  • DuckDB bun compatibility: Replace async callback constructor with synchronous form — bun's runtime never fires native addon async callbacks, causing a 2s timeout on every DuckDB connection attempt
  • noLimit execute option: Add ExecuteOptions.noLimit to all drivers so data-diff pipelines can fetch complete result sets without silent truncation
  • data.diff dispatcher: Full runDataDiff orchestrator in connections/data-diff.ts — drives the cooperative DataParitySession state machine, handles auto-discovery of extra_columns, CTE injection for SQL queries, and partition diffing (see the sketch after this list)
  • DataDiffTool: New tool registered in the tool registry, exposed to the LLM
  • Data-parity skill: .opencode/skills/data-parity/SKILL.md with interactive, plan-first workflow guidance
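
A rough sketch of how the runDataDiff dispatcher might tie these pieces together. Only `runDataDiff`, `DataParitySession`, `executeQuery`, and the `noLimit` flag appear in this PR; the session API, helper functions, and option shape below are assumptions.

```ts
// Hypothetical skeleton only; helpers are declared here so the sketch type-checks.
declare function isPlainTableName(source: string): boolean;
declare function discoverColumns(table: string): Promise<string[]>;
declare function executeQuery(sql: string, opts: { noLimit?: boolean }): Promise<unknown[]>;
declare class DataParitySession {
  constructor(opts: object);
  done(): boolean;
  nextQuery(): { id: string; sql: string };
  feed(id: string, rows: unknown[]): void;
  result(): unknown;
}

interface DataDiffOptions {
  source: string;            // plain table name or a SQL query
  target: string;
  primaryKeys: string[];
  extraColumns?: string[];   // auto-discovered from information_schema when omitted
}

async function runDataDiff(opts: DataDiffOptions) {
  // Auto-discover non-key columns when the caller omits extra_columns and the
  // source is a plain table name rather than a SQL query.
  const extraColumns =
    opts.extraColumns ??
    (isPlainTableName(opts.source) ? await discoverColumns(opts.source) : []);

  // Wrap raw SQL sources in a CTE so the engine always sees a selectable relation.
  const sourceSql = isPlainTableName(opts.source)
    ? opts.source
    : `WITH __src AS (${opts.source}) SELECT * FROM __src`;

  // Drive the cooperative state machine: the engine asks for rows, we execute the
  // SQL with noLimit so nothing is truncated, and feed the results back.
  const session = new DataParitySession({ ...opts, extraColumns, source: sourceSql });
  while (!session.done()) {
    const step = session.nextQuery();
    const rows = await executeQuery(step.sql, { noLimit: true });
    session.feed(step.id, rows);
  }
  return session.result();
}
```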

Test plan

  • DuckDB :memory: connection opens without timeout in bun
  • data_diff tool completes against local postgres (pg_tpch) — detects row differences
  • data_diff tool completes with DuckDB source
  • noLimit flag returns untruncated rows for diff pipelines
  • Auto-discovery of extra_columns works for plain table names

🤖 Generated with Claude Code

suryaiyer95 and others added 12 commits on March 30, 2026 at 19:00

- Add DataParity engine integration via native Rust bindings
- Add data-diff tool for LLM agent (profile, joindiff, hashdiff, cascade, auto)
- Add ClickHouse driver support
- Add data-parity skill: profile-first workflow, algorithm selection guide,
  CRITICAL warning that joindiff cannot run cross-database (always returns 0 diffs),
  output style rules (facts only, no editorializing)
- Gitignore .altimate-code/ (credentials) and *.node (platform binaries)

Split large tables by a date or numeric column before diffing.
Each partition is diffed independently then results are aggregated.

New params:
- partition_column: column to split on (date or numeric)
- partition_granularity: day | week | month | year (for dates)
- partition_bucket_size: bucket width for numeric columns

New output field:
- partition_results: per-partition breakdown (identical / differ / error)

Dialect-aware SQL: Postgres, Snowflake, BigQuery, ClickHouse, MySQL.

Skill updated with partition guidance and examples.
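
As an illustration, the dialect-aware truncation could look something like the sketch below. Only the dialect list and granularities come from the commit; the helper's body here is an approximation.

```ts
// Approximate dialect-aware date bucketing; not the exact implementation.
type Granularity = "day" | "week" | "month" | "year";

function dateTruncExpr(dialect: string, column: string, granularity: Granularity): string {
  switch (dialect) {
    case "postgres":
    case "snowflake":
      return `DATE_TRUNC('${granularity}', ${column})`;
    case "bigquery":
      return `DATE_TRUNC(${column}, ${granularity.toUpperCase()})`;
    case "clickhouse":
      return `date_trunc('${granularity}', ${column})`;
    case "mysql":
      // MySQL has no DATE_TRUNC; approximate the common buckets.
      if (granularity === "day") return `DATE(${column})`;
      if (granularity === "month") return `DATE_FORMAT(${column}, '%Y-%m-01')`;
      if (granularity === "year") return `DATE_FORMAT(${column}, '%Y-01-01')`;
      return `YEARWEEK(${column}, 3)`; // ISO week number as a bucket key
    default:
      throw new Error(`Unsupported dialect for date partitioning: ${dialect}`);
  }
}
```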
When partition_column is set without partition_granularity or
partition_bucket_size, groups by raw DISTINCT values. Works for
any non-date, non-numeric column: status, region, country, etc.

WHERE clause uses equality: col = 'value' with proper escaping.
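
For example, the categorical predicate could be built roughly like this. `buildPartitionWhereClause` is named later in this PR, but the body and the double-quote identifier style shown are assumptions.

```ts
// Rough sketch: escape the value, quote the identifier, compare with equality.
function buildPartitionWhereClause(column: string, value: string): string {
  const escapedValue = value.replace(/'/g, "''"); // double any single quotes in the literal
  return `"${column}" = '${escapedValue}'`;
}

// buildPartitionWhereClause("status", "shipped")  ->  "status" = 'shipped'
```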
Rust serializes ReladiffOutcome with serde tag 'mode', producing:
  {mode: 'diff', diff_rows: [...], stats: {rows_table1, rows_table2, exclusive_table1, exclusive_table2, updated, unchanged}}

Previous code checked for {Match: {...}} / {Diff: {...}} shapes that
never matched, causing partitioned diff to report all partitions as
'identical' with 0 rows.

- extractStats(): check outcome.mode === 'diff', read from stats fields
- mergeOutcomes(): aggregate mode-based outcomes correctly
- summarize()/formatOutcome(): display mode-based shape with correct labels
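
In TypeScript terms, the tagged shape and the corrected extraction look roughly like the sketch below. The "diff" fields are taken from the commit message; the "match" variant name and the return shape of `extractStats` are assumptions.

```ts
type ReladiffOutcome =
  | { mode: "match" } // assumed name for the no-differences variant
  | {
      mode: "diff";
      diff_rows: unknown[];
      stats: {
        rows_table1: number;
        rows_table2: number;
        exclusive_table1: number;
        exclusive_table2: number;
        updated: number;
        unchanged: number;
      };
    };

function extractStats(outcome: ReladiffOutcome) {
  // The old code looked for {Match: {...}} / {Diff: {...}}, which never matched
  // the serde-tagged output, so every partition appeared identical.
  if (outcome.mode !== "diff") return { differing: 0, identical: true };

  const { exclusive_table1, exclusive_table2, updated, unchanged } = outcome.stats;
  const differing = exclusive_table1 + exclusive_table2 + updated;
  return { differing, identical: differing === 0, unchanged };
}
```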
Key changes based on feedback:
- Always generate TODO plan before any tool is called
- Enforce data_diff tool usage (never manual EXCEPT/JOIN SQL)
- Add PK discovery + explicit user confirmation step
- Profile pass is now mandatory before row-level diff
- Ask user before expensive row-level diff on large tables:
  - <100K rows: proceed automatically
  - 100K-10M rows: ask with where_clause option
  - >10M rows: offer window/partition/full choices
- Document partition modes (date/numeric/categorical) with examples
- Add warehouse_list as first step to confirm connections

…from data diff

The Rust engine only compares columns explicitly listed in extra_columns.
When omitted, it was silently reporting all key-matched rows as 'identical'
even when non-key values differed — a false positive bug.

Changes:
- Auto-discover columns from information_schema when extra_columns is omitted
  and source is a plain table name (not a SQL query)
- Exclude audit/timestamp columns (updated_at, created_at, inserted_at,
  modified_at, _fivetran_*, _airbyte_*, publisher_last_updated_*, etc.)
  from comparison by default since they typically differ due to ETL timing
- Report excluded columns in tool output so users know what was skipped
- Fix misleading tool description that said 'Omit to compare all columns'
- Update SKILL.md with critical guidance on extra_columns behavior
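
A simplified sketch of that discovery path follows. The exclusion patterns echo the commit; the query, helper name, and return shape are assumptions.

```ts
// Audit/timestamp columns excluded from comparison by default (names from the commit).
const AUDIT_COLUMN_PATTERNS = [
  /^updated_at$/, /^created_at$/, /^inserted_at$/, /^modified_at$/,
  /^_fivetran_/, /^_airbyte_/, /^publisher_last_updated_/,
];

declare function executeQuery(
  sql: string,
  opts: { noLimit?: boolean },
): Promise<Array<{ column_name: string }>>;

async function discoverExtraColumns(table: string, primaryKeys: string[]) {
  const safeTable = table.replace(/'/g, "''"); // escape quotes in the table name
  const rows = await executeQuery(
    `SELECT column_name FROM information_schema.columns WHERE table_name = '${safeTable}'`,
    { noLimit: true },
  );
  const all = rows.map((r) => r.column_name);
  const excluded = all.filter((c) => AUDIT_COLUMN_PATTERNS.some((p) => p.test(c.toLowerCase())));
  const extraColumns = all.filter((c) => !primaryKeys.includes(c) && !excluded.includes(c));
  return { extraColumns, excluded }; // excluded columns are reported back in the tool output
}
```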
…ult truncation

All drivers default to `LIMIT 1001` on SELECT queries and post-truncate to
1000 rows. This silently drops rows when the data-diff engine needs complete
result sets — a FULL OUTER JOIN returning >1000 diff rows would be truncated,
causing the engine to undercount differences.

- Add `ExecuteOptions { noLimit?: boolean }` to the `Connector` interface
- When `noLimit: true`, set `effectiveLimit = 0` (falsy) so the existing
  LIMIT injection guard is skipped, and add `effectiveLimit > 0` to the
  truncation check so rows aren't sliced to zero
- Update all 12 drivers: postgres, clickhouse, snowflake, bigquery, mysql,
  redshift, databricks, duckdb, oracle, sqlserver, sqlite, mongodb
- Pass `{ noLimit: true }` from `data-diff.ts` `executeQuery()`

Interactive SQL callers are unaffected — they continue to get the default
1000-row limit. Only the data-diff pipeline opts out.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
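
Conceptually, the per-driver change amounts to something like the sketch below. Only `noLimit`, `effectiveLimit`, and the 1001/1000 behavior come from the commit; the helper names are assumptions.

```ts
const DEFAULT_ROW_LIMIT = 1000;

interface ExecuteOptions {
  noLimit?: boolean;
}

function applyLimit(sql: string, opts?: ExecuteOptions) {
  // 0 is falsy, so the existing LIMIT-injection guard is skipped when noLimit is set.
  const effectiveLimit = opts?.noLimit ? 0 : DEFAULT_ROW_LIMIT;
  const finalSql = effectiveLimit ? `${sql} LIMIT ${effectiveLimit + 1}` : sql; // LIMIT 1001 by default
  return { finalSql, effectiveLimit };
}

function truncateRows<T>(rows: T[], effectiveLimit: number): T[] {
  // Only slice when a positive limit is in effect, so noLimit callers keep every row.
  return effectiveLimit > 0 ? rows.slice(0, effectiveLimit) : rows;
}
```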
…m exclusions with user

Column exclusion now has two layers:
1. Name-pattern matching (existing) — updated_at, created_at, _fivetran_synced, etc.
2. Schema-level default detection (new) — queries column_default for NOW(),
   CURRENT_TIMESTAMP, GETDATE(), SYSDATE, SYSTIMESTAMP, etc.

Covers PostgreSQL, MySQL, Snowflake, SQL Server, Oracle, ClickHouse, DuckDB,
SQLite, and Redshift in a single round-trip (no extra query).

The skill prompt now instructs the agent to present detected auto-timestamp
columns to the user and ask for confirmation before excluding them, since
migrations should preserve timestamps while ETL replication regenerates them.
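
The second layer might boil down to a check like this; the regex and helper name are assumptions, while the default expressions are the ones listed in the commit.

```ts
// Defaults that indicate an auto-populated timestamp column across the listed dialects.
const AUTO_TIMESTAMP_DEFAULTS = /NOW\(\)|CURRENT_TIMESTAMP|GETDATE\(\)|SYSDATE|SYSTIMESTAMP/i;

function isAutoTimestampColumn(columnDefault: string | null | undefined): boolean {
  return !!columnDefault && AUTO_TIMESTAMP_DEFAULTS.test(columnDefault);
}

// Columns flagged this way are presented to the user for confirmation rather than
// excluded silently, since migrations should preserve their values.
```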
- `buildColumnDiscoverySQL`: escape single quotes in all interpolated table
  name parts to prevent SQL injection via crafted source/target names
- `dateTruncExpr`: add Oracle case (`TRUNC(col, 'UNIT')`) — Oracle does not
  have `DATE_TRUNC`, date-partitioned diffs on Oracle tables previously failed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Apply esc() to Oracle and SQLite paths in buildColumnDiscoverySQL
  (SQL injection via table name was unpatched in these dialects)
- Quote identifiers in resolveTableSources to prevent injection via
  table names containing semicolons or special characters
- Surface SQL execution errors before feeding empty rows to the engine
  (silent false "match" when warehouse is unreachable is now an error)
- Fix Oracle TRUNC() format model map: 'WEEK' → 'IW' (ISO week)
  ('WEEK' throws ORA-01800 on all Oracle versions)
- Quote partition column identifier in buildPartitionWhereClause
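
For illustration, the escaping helper and the Oracle format-model map could look like this; the map entries other than 'IW' are assumptions.

```ts
// Double single quotes inside interpolated names so crafted table names cannot
// break out of the string literal.
const esc = (s: string) => s.replace(/'/g, "''");

// Oracle has no DATE_TRUNC; TRUNC(col, fmt) with a format model is used instead.
const ORACLE_TRUNC_FORMAT: Record<string, string> = {
  day: "DD",
  week: "IW", // ISO week; 'WEEK' is not a valid format model
  month: "MM",
  year: "YYYY",
};

const oracleDateTrunc = (column: string, granularity: string) =>
  `TRUNC(${column}, '${ORACLE_TRUNC_FORMAT[granularity]}')`;
```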
…r propagation, and test mock formats

- `altimate-core-column-lineage`: fix `[object Object]` in `column_dict` output when source entries are `{ source_table, source_column }` objects instead of strings
- `schema-inspect`: propagate `{ success: false, error }` dispatcher responses to `metadata.error` instead of silently returning empty schema
- `sql-analyze`: guard against null/undefined result from dispatcher to prevent "undefined" literal in output
- `lineage-check`: guard against null/undefined result from dispatcher to prevent "undefined" literal in output
- `simulation-suite.test.ts`: fix `sql-translate` mock format — data fields must be flat (not wrapped in `data: {}`), add `source_dialect`/`target_dialect` to mock so assertions pass
- `simulation-suite.test.ts`: fix `dbt-manifest` mock format — unwrap `data: {}` so `model_count` and `models` are accessible at top level

Simulation suite: 695/839 → 839/839 (100%)
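
The `[object Object]` fix in the first bullet amounts to normalizing source entries that may be either strings or structured objects, roughly as sketched here; the field names come from the bullet, everything else is assumed.

```ts
type LineageSource = string | { source_table: string; source_column: string };

function formatSource(entry: LineageSource): string {
  // Interpolating the object directly produced "[object Object]" in column_dict.
  return typeof entry === "string" ? entry : `${entry.source_table}.${entry.source_column}`;
}
```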
Bun's runtime never fires native addon async callbacks, so the async
`new duckdb.Database(path, opts, callback)` form would hit the 2-second
timeout fallback on every connection attempt.

Switch to the synchronous constructor form `new duckdb.Database(path)` /
`new duckdb.Database(path, opts)` which throws on error and completes
immediately in both Node and bun runtimes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
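
Assuming the standard duckdb Node.js bindings, the change is roughly the following; the wrapper function is illustrative.

```ts
import duckdb from "duckdb";

// The synchronous constructor throws on failure and returns immediately, so no
// callback and no timeout fallback are needed under Node or bun.
function openDatabase(path: string, opts?: Record<string, string>) {
  return opts ? new duckdb.Database(path, opts) : new duckdb.Database(path);
}

const db = openDatabase(":memory:");
```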

The async callback form with 2s fallback was already working correctly
at e3df5a4. The timeout was caused by a missing duckdb .node binary,
not a bun incompatibility.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
suryaiyer95 merged commit 38daa6d into feat/data-parity-skill-improvements on Apr 2, 2026
4 checks passed