fix: auto-discover extra_columns and exclude audit/timestamp columns from data diff
The Rust engine only compares columns explicitly listed in extra_columns.
When omitted, it was silently reporting all key-matched rows as 'identical'
even when non-key values differed — a false positive bug.
Changes:
- Auto-discover columns from information_schema when extra_columns is omitted
  and source is a plain table name (not a SQL query)
- Exclude audit/timestamp columns (updated_at, created_at, inserted_at,
  modified_at, _fivetran_*, _airbyte_*, publisher_last_updated_*, etc.)
  from comparison by default since they typically differ due to ETL timing
- Report excluded columns in tool output so users know what was skipped
- Fix misleading tool description that said 'Omit to compare all columns'
- Update SKILL.md with critical guidance on extra_columns behavior
File: .opencode/skills/data-parity/SKILL.md (15 additions, 0 deletions)
@@ -256,6 +256,18 @@ Output includes aggregate diff + per-partition breakdown showing which group has
---
## CRITICAL: `extra_columns` Behavior
The Rust engine **only compares columns listed in `extra_columns`**. If the list is empty, it compares key existence only — rows that match on key but differ in values will be silently reported as "identical". This is the most common source of false positives.
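To make the false positive concrete, here is a toy Python model of the behavior described above. It is not the Rust engine; `diff_rows` and the sample rows are hypothetical and exist only to illustrate why an empty column list reports every key-matched row as identical.

```python
def diff_rows(source, target, key, extra_columns):
    """Classify rows by key, comparing only the columns in extra_columns.

    Hypothetical stand-in for the engine's comparison, for illustration only.
    """
    target_by_key = {row[key]: row for row in target}
    result = {"identical": [], "different": [], "missing_in_target": []}
    for row in source:
        other = target_by_key.get(row[key])
        if other is None:
            result["missing_in_target"].append(row[key])
        # all() over an empty list is True, so with no extra_columns
        # every key match is classified as identical.
        elif all(row[col] == other[col] for col in extra_columns):
            result["identical"].append(row[key])
        else:
            result["different"].append(row[key])
    return result


source = [{"id": 1, "amount": 100}, {"id": 2, "amount": 250}]
target = [{"id": 1, "amount": 100}, {"id": 2, "amount": 999}]

key_only = diff_rows(source, target, "id", [])           # id 2 looks identical
with_cols = diff_rows(source, target, "id", ["amount"])  # id 2 is flagged
```

In this sketch, row 2 only shows up as different once `amount` is listed in `extra_columns`.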
**Auto-discovery (default for table names):** When `extra_columns` is omitted and the source is a plain table name, `data_diff` auto-discovers all non-key columns from `information_schema` and excludes audit/timestamp columns (like `updated_at`, `created_at`, `inserted_at`, `modified_at`, `publisher_last_updated_epoch_ms`, ETL metadata columns like `_fivetran_synced`, etc.). The output will list which columns were auto-excluded.
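The exclusion step can be sketched roughly as follows. This is a minimal Python illustration, not the tool's implementation: `split_columns` is a hypothetical helper, the name and prefix lists contain only the patterns named in this change (the real list may be longer), and case-insensitive matching is an assumption.

```python
# Patterns taken from the commit description; the actual list may include more.
AUDIT_EXACT = {"updated_at", "created_at", "inserted_at", "modified_at"}
AUDIT_PREFIXES = ("_fivetran_", "_airbyte_", "publisher_last_updated_")


def split_columns(all_columns, key_columns):
    """Split discovered columns into (compared, excluded), dropping key columns.

    Hypothetical helper; case-insensitive matching is an assumption.
    """
    keys = {k.lower() for k in key_columns}
    compared, excluded = [], []
    for col in all_columns:
        name = col.lower()
        if name in keys:
            continue  # key columns are matched on, not compared
        if name in AUDIT_EXACT or name.startswith(AUDIT_PREFIXES):
            excluded.append(col)
        else:
            compared.append(col)
    return compared, excluded


compared, excluded = split_columns(
    ["id", "amount", "status", "updated_at",
     "_fivetran_synced", "publisher_last_updated_epoch_ms"],
    ["id"],
)
```

Returning the excluded list alongside the compared list mirrors the requirement that the output reports which columns were skipped.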
**SQL queries:** When source is a SQL query (not a table name), auto-discovery cannot work. You **must** provide `extra_columns` explicitly. If you don't, only key-level matching occurs.
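A caller could guard against this with a pre-flight check along these lines. The sketch is in Python and entirely hypothetical: the identifier regex is one possible way to tell a table name from a query, not necessarily how `data_diff` decides, and raising an error is a design choice of the sketch (the tool itself falls back to key-level matching).

```python
import re

# Assumption: a "plain table name" is up to three dot-separated identifiers,
# e.g. orders, analytics.orders, prod.analytics.orders.
_TABLE_NAME = re.compile(r"^[A-Za-z_]\w*(\.[A-Za-z_]\w*){0,2}$")


def is_plain_table_name(source):
    """Return True when source looks like a table name rather than a query."""
    return bool(_TABLE_NAME.match(source.strip()))


def check_extra_columns(source, extra_columns):
    """Report how the compared columns would be determined for this source."""
    if extra_columns:
        return "explicit"
    if is_plain_table_name(source):
        return "auto-discover"
    # Fail loudly instead of silently comparing keys only.
    raise ValueError("source is a SQL query: list extra_columns explicitly")
```

For example, `check_extra_columns("analytics.orders", None)` selects auto-discovery, while a `SELECT ...` source with no `extra_columns` raises instead of producing a key-only comparison.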
**When to override auto-exclusion:** If the user specifically wants to compare audit columns (e.g., verifying that `created_at` was preserved during migration), pass those columns explicitly in `extra_columns`.
---
## Common Mistakes
**Writing manual diff SQL instead of calling data_diff**
@@ -272,3 +284,6 @@ Output includes aggregate diff + per-partition breakdown showing which group has
**Running full diff on a billion-row table without asking**
→ Always ask the user before expensive operations. Offer filtering and partition options.
**Omitting extra_columns when source is a SQL query**
→ Auto-discovery only works for table names. For SQL queries, always list the columns to compare explicitly.