feat: detect auto-timestamp defaults from database catalog and confirm exclusions with user
Column exclusion now has two layers:
1. Name-pattern matching (existing) — updated_at, created_at, _fivetran_synced, etc.
2. Schema-level default detection (new) — queries column_default for NOW(),
CURRENT_TIMESTAMP, GETDATE(), SYSDATE, SYSTIMESTAMP, etc.
Covers PostgreSQL, MySQL, Snowflake, SQL Server, Oracle, ClickHouse, DuckDB,
SQLite, and Redshift in a single round-trip (no extra query).
The skill prompt now instructs the agent to present detected auto-timestamp
columns to the user and ask for confirmation before excluding them, since
migrations should preserve timestamps while ETL replication regenerates them.
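
A minimal sketch of the two exclusion layers described above, assuming the catalog query has already returned each column's name and default expression. The function name `exclusion_reason` and both regular expressions are hypothetical, built only from the examples listed in this commit message; the actual implementation may match differently.

```python
import re

# Layer 1 (hypothetical pattern): audit/ETL column names from the commit message.
NAME_PATTERN = re.compile(
    r"^(created_at|updated_at|inserted_at|modified_at"
    r"|_fivetran_synced|_airbyte_extracted_at)$",
    re.IGNORECASE,
)

# Layer 2 (hypothetical pattern): auto-generating timestamp defaults.
DEFAULT_PATTERN = re.compile(
    r"\b(now|current_timestamp|getdate|sysdatetime|sysdate|systimestamp)\b",
    re.IGNORECASE,
)


def exclusion_reason(name, column_default):
    """Return why a column is an exclusion candidate, or None to keep it."""
    if NAME_PATTERN.match(name):
        return "name pattern"
    if column_default and DEFAULT_PATTERN.search(column_default):
        return "auto-timestamp default"
    return None
```

For example, `exclusion_reason("row_written", "now()")` is flagged by the second layer even though the name follows no convention, while `exclusion_reason("amount", "0")` returns `None`.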
+- MySQL/MariaDB: `CURRENT_TIMESTAMP` (in default or EXTRA)
+- Snowflake: `CURRENT_TIMESTAMP()`, `SYSDATE()`
+- SQL Server: `getdate()`, `sysdatetime()`
+- Oracle: `SYSDATE`, `SYSTIMESTAMP`
+
+These columns auto-generate values on INSERT, so they inherently differ between source and target due to write timing, not because of actual data discrepancies. **Collect them for confirmation in Step 4.**
+
 If no obvious PK, run a cardinality check:

```sql
@@ -101,7 +119,33 @@ Do not proceed to diff until the user confirms or corrects.
+If you detected any columns with auto-generating timestamp defaults in Step 2, **present them to the user and ask for confirmation** before excluding them.
+
+**Example prompt when auto-timestamp columns are found:**
+
+> "I found **3 columns** with auto-generating timestamp defaults that will inherently differ between source and target (due to when each row was written, not actual data differences):
+>
+> | Column | Default | Reason to exclude |
+> |--------|---------|-------------------|
+> | `created_at` | `DEFAULT now()` | Set on insert: reflects when this copy was written |
+> | `updated_at` | `DEFAULT now()` | Set on insert: reflects when this copy was written |
+> Should I **exclude** these from the comparison? Or do you want to include any of them (e.g., if you're verifying that `created_at` was preserved during migration)?"
+
+**If user confirms exclusion:** Omit those columns from `extra_columns` when calling `data_diff`.
+
+**If user wants to include some:** Add them explicitly to `extra_columns`.
+
+**If no auto-timestamp columns were detected:** Skip this step and proceed to Step 5.
+
+> **Why ask?** In migration validation, `created_at` should often be *identical* between source and target (it was migrated, not regenerated). But in ETL replication, `created_at` is freshly generated on each side and *should* differ. Only the user knows which case applies.
+
+---
+
+## Step 5: Check Row Counts
```sql
 SELECT COUNT(*) FROM orders -- run on both source and target
@@ -114,7 +158,7 @@ Use counts to:
 ---

-## Step 5: Column-Level Profile (Always Run This First)
+## Step 6: Column-Level Profile (Always Run This First)

 Profile is cheap: it runs aggregates, not row scans. **Always run profile before row-level diff.**
@@ -148,7 +192,7 @@ Column Profile Comparison
 ---

-## Step 6: Ask Before Running Row-Level Diff on Large Tables
+## Step 7: Ask Before Running Row-Level Diff on Large Tables

 After profiling, check row count and **ask the user** before proceeding:
@@ -166,7 +210,7 @@ After profiling, check row count and **ask the user** before proceeding:
 ---

-## Step 7: Run Targeted Row-Level Diff
+## Step 8: Run Targeted Row-Level Diff

 Use only the columns that the profile said differ. This is faster and produces cleaner output.
@@ -260,7 +304,12 @@ Output includes aggregate diff + per-partition breakdown showing which group has
 The Rust engine **only compares columns listed in `extra_columns`**. If the list is empty, it compares key existence only, so rows that match on key but differ in values will be silently reported as "identical". This is the most common source of false positives.

-**Auto-discovery (default for table names):** When `extra_columns` is omitted and the source is a plain table name, `data_diff` auto-discovers all non-key columns from `information_schema` and excludes audit/timestamp columns (like `updated_at`, `created_at`, `inserted_at`, `modified_at`, `publisher_last_updated_epoch_ms`, ETL metadata columns like `_fivetran_synced`, etc.). The output will list which columns were auto-excluded.
+**Auto-discovery (default for table names):** When `extra_columns` is omitted and the source is a plain table name, `data_diff` auto-discovers all non-key columns from the database catalog and excludes columns using two detection layers:
+
+1. **Name-pattern matching**: columns named like `updated_at`, `created_at`, `inserted_at`, `modified_at`, `publisher_last_updated_epoch_ms`, ETL metadata columns like `_fivetran_synced`, `_airbyte_extracted_at`, etc.
+2. **Schema-level default detection**: columns with auto-generating timestamp defaults (`DEFAULT NOW()`, `DEFAULT CURRENT_TIMESTAMP`, `GETDATE()`, `SYSDATE()`, `SYSTIMESTAMP`, etc.), detected directly from the database catalog. This catches columns that don't follow naming conventions but still auto-generate values on INSERT. Works across PostgreSQL, MySQL, Snowflake, SQL Server, Oracle, ClickHouse, DuckDB, SQLite, and Redshift.
+
+The output lists which columns were auto-excluded and why.

 **SQL queries:** When source is a SQL query (not a table name), auto-discovery cannot work. You **must** provide `extra_columns` explicitly. If you don't, only key-level matching occurs.
@@ -287,3 +336,6 @@ The Rust engine **only compares columns listed in `extra_columns`**. If the list
 **Omitting extra_columns when source is a SQL query**
 → Auto-discovery only works for table names. For SQL queries, always list the columns to compare explicitly.
+
+**Silently excluding auto-timestamp columns without asking the user**
+→ Always present detected auto-timestamp columns (Step 4) and get explicit confirmation. In migration scenarios, `created_at` should be *identical*; excluding it silently hides real bugs.
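
The confirmation step in the diff above (Step 4) could be sketched as follows. This is an illustration under stated assumptions, not the skill's actual logic: `build_extra_columns` and its parameters are hypothetical names, and the returned list is only what one might pass as `extra_columns` after the user has answered.

```python
def build_extra_columns(non_key_columns, auto_timestamp_columns, keep=()):
    """Return the columns to compare after the user has confirmed exclusions.

    non_key_columns: all candidate columns, in catalog order.
    auto_timestamp_columns: columns detected as auto-generating timestamps.
    keep: detected columns the user asked to include anyway
          (for example, created_at when validating a migration).
    """
    excluded = set(auto_timestamp_columns) - set(keep)
    # Preserve the original column order so the output stays predictable.
    return [col for col in non_key_columns if col not in excluded]


columns = ["amount", "status", "created_at", "updated_at"]
detected = ["created_at", "updated_at"]

# ETL replication: the user confirms that both columns should be excluded.
etl_columns = build_extra_columns(columns, detected)

# Migration validation: the user wants created_at compared as well.
migration_columns = build_extra_columns(columns, detected, keep=["created_at"])
```

Keeping the user's answer as an explicit `keep` argument means nothing is excluded silently: every detected column is either dropped by confirmation or compared by request.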