Commit e3df5a4

aidtyasuryaiyer95 authored and committed
feat: detect auto-timestamp defaults from database catalog and confirm exclusions with user
Column exclusion now has two layers:

1. Name-pattern matching (existing) — updated_at, created_at, _fivetran_synced, etc.
2. Schema-level default detection (new) — queries column_default for NOW(), CURRENT_TIMESTAMP, GETDATE(), SYSDATE, SYSTIMESTAMP, etc. Covers PostgreSQL, MySQL, Snowflake, SQL Server, Oracle, ClickHouse, DuckDB, SQLite, and Redshift in a single round-trip (no extra query).

The skill prompt now instructs the agent to present detected auto-timestamp columns to the user and ask for confirmation before excluding them, since migrations should preserve timestamps while ETL replication regenerates them.
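The two layers described in the commit message can be sketched roughly as follows. This is an illustrative subset, not the actual implementation: the real pattern lists in `data-diff.ts` are longer, and `shouldExcludeColumn` is a hypothetical name for this sketch.

```typescript
// Layer 1: name-pattern matching on the column name itself (illustrative subset).
const NAME_PATTERNS: RegExp[] = [/^updated_at$/i, /^created_at$/i, /^_fivetran_synced$/i]

// Layer 2: schema-level detection of time-generating functions in column_default.
const DEFAULT_PATTERNS: RegExp[] = [
  /\bnow\s*\(\)/i,          // PostgreSQL, DuckDB, Redshift
  /\bcurrent_timestamp\b/i, // standard SQL, MySQL, Snowflake
  /\bgetdate\s*\(\)/i,      // SQL Server
  /\bsysdate\b/i,           // Oracle (also covers Snowflake's SYSDATE())
  /\bsystimestamp\b/i,      // Oracle
]

// A column is excluded if either layer matches.
function shouldExcludeColumn(name: string, defaultExpr: string | null): boolean {
  if (NAME_PATTERNS.some((p) => p.test(name))) return true
  if (defaultExpr !== null && DEFAULT_PATTERNS.some((p) => p.test(defaultExpr))) return true
  return false
}
```

A column like `load_ts DEFAULT now()` escapes layer 1 but is caught by layer 2, which is exactly the gap this commit closes.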
1 parent 4131ea9 commit e3df5a4

File tree: 3 files changed (+210 −38 lines)

.opencode/skills/data-parity/SKILL.md

Lines changed: 66 additions & 14 deletions
@@ -12,13 +12,14 @@ description: Validate that two tables or query results are identical — or diag
 ```
 Here's my plan:
 1. [ ] List available warehouse connections
-2. [ ] Inspect schema and discover primary key candidates
+2. [ ] Inspect schema, discover primary key candidates, and detect auto-timestamp columns
 3. [ ] Confirm primary keys with you
-4. [ ] Check row counts on both sides
-5. [ ] Run column-level profile (cheap — no row scan)
-6. [ ] Ask whether to proceed with row-level diff (may be expensive for large tables)
-7. [ ] Run targeted row-level diff on diverging columns only
-8. [ ] Report findings
+4. [ ] Confirm which auto-timestamp columns to exclude
+5. [ ] Check row counts on both sides
+6. [ ] Run column-level profile (cheap — no row scan)
+7. [ ] Ask whether to proceed with row-level diff (may be expensive for large tables)
+8. [ ] Run targeted row-level diff on diverging columns only
+9. [ ] Report findings
 ```

 Update each item to `[x]` as you complete it. This plan should be visible before any tool is called.
@@ -45,13 +46,13 @@ Use `warehouse_list` to show the user what connections are available and which w

 ---

-## Step 2: Inspect Schema and Discover Primary Keys
+## Step 2: Inspect Schema, Discover Primary Keys, and Detect Auto-Timestamp Columns

-Use `sql_query` to get columns and identify key candidates:
+Use `sql_query` to get columns, defaults, and identify key candidates:

 ```sql
 -- Postgres / Redshift / DuckDB
-SELECT column_name, data_type, is_nullable
+SELECT column_name, data_type, is_nullable, column_default
 FROM information_schema.columns
 WHERE table_schema = 'public' AND table_name = 'orders'
 ORDER BY ordinal_position
@@ -62,13 +63,30 @@ ORDER BY ordinal_position
 SHOW COLUMNS IN TABLE orders
 ```

+```sql
+-- MySQL / MariaDB (also fetch EXTRA for ON UPDATE detection)
+SELECT column_name, data_type, is_nullable, column_default, extra
+FROM information_schema.columns
+WHERE table_schema = 'mydb' AND table_name = 'orders'
+ORDER BY ordinal_position
+```
+
 ```sql
 -- ClickHouse
 DESCRIBE TABLE source_db.events
 ```

 **Look for:** columns named `id`, `*_id`, `*_key`, `uuid`, or with `NOT NULL` + unique index.

+**Also look for auto-timestamp columns** — any column whose `column_default` contains a time-generating function:
+- PostgreSQL/DuckDB/Redshift: `now()`, `CURRENT_TIMESTAMP`, `clock_timestamp()`
+- MySQL/MariaDB: `CURRENT_TIMESTAMP` (in default or EXTRA)
+- Snowflake: `CURRENT_TIMESTAMP()`, `SYSDATE()`
+- SQL Server: `getdate()`, `sysdatetime()`
+- Oracle: `SYSDATE`, `SYSTIMESTAMP`
+
+These columns auto-generate values on INSERT, so they inherently differ between source and target due to write timing — not because of actual data discrepancies. **Collect them for confirmation in Step 4.**
+
 If no obvious PK, run a cardinality check:

 ```sql
@@ -101,7 +119,33 @@ Do not proceed to diff until the user confirms or corrects.

 ---

-## Step 4: Check Row Counts
+## Step 4: Confirm Auto-Timestamp Column Exclusions
+
+If you detected any columns with auto-generating timestamp defaults in Step 2, **present them to the user and ask for confirmation** before excluding them.
+
+**Example prompt when auto-timestamp columns are found:**
+
+> "I found **3 columns** with auto-generating timestamp defaults that will inherently differ between source and target (due to when each row was written, not actual data differences):
+>
+> | Column | Default | Reason to exclude |
+> |--------|---------|-------------------|
+> | `created_at` | `DEFAULT now()` | Set on insert — reflects when this copy was written |
+> | `updated_at` | `DEFAULT now()` | Set on insert — reflects when this copy was written |
+> | `_loaded_at` | `DEFAULT CURRENT_TIMESTAMP` | ETL load timestamp |
+>
+> Should I **exclude** these from the comparison? Or do you want to include any of them (e.g., if you're verifying that `created_at` was preserved during migration)?"
+
+**If user confirms exclusion:** Omit those columns from `extra_columns` when calling `data_diff`.
+
+**If user wants to include some:** Add them explicitly to `extra_columns`.
+
+**If no auto-timestamp columns were detected:** Skip this step and proceed to Step 5.
+
+> **Why ask?** In migration validation, `created_at` should often be *identical* between source and target (it was migrated, not regenerated). But in ETL replication, `created_at` is freshly generated on each side and *should* differ. Only the user knows which case applies.
+
+---
+
+## Step 5: Check Row Counts

 ```sql
 SELECT COUNT(*) FROM orders -- run on both source and target
@@ -114,7 +158,7 @@ Use counts to:

 ---

-## Step 5: Column-Level Profile (Always Run This First)
+## Step 6: Column-Level Profile (Always Run This First)

 Profile is cheap — it runs aggregates, not row scans. **Always run profile before row-level diff.**

@@ -148,7 +192,7 @@ Column Profile Comparison

 ---

-## Step 6: Ask Before Running Row-Level Diff on Large Tables
+## Step 7: Ask Before Running Row-Level Diff on Large Tables

 After profiling, check row count and **ask the user** before proceeding:

@@ -166,7 +210,7 @@ After profiling, check row count and **ask the user** before proceeding:

 ---

-## Step 7: Run Targeted Row-Level Diff
+## Step 8: Run Targeted Row-Level Diff

 Use only the columns that the profile said differ. This is faster and produces cleaner output.

@@ -260,7 +304,12 @@ Output includes aggregate diff + per-partition breakdown showing which group has

 The Rust engine **only compares columns listed in `extra_columns`**. If the list is empty, it compares key existence only — rows that match on key but differ in values will be silently reported as "identical". This is the most common source of false positives.

-**Auto-discovery (default for table names):** When `extra_columns` is omitted and the source is a plain table name, `data_diff` auto-discovers all non-key columns from `information_schema` and excludes audit/timestamp columns (like `updated_at`, `created_at`, `inserted_at`, `modified_at`, `publisher_last_updated_epoch_ms`, ETL metadata columns like `_fivetran_synced`, etc.). The output will list which columns were auto-excluded.
+**Auto-discovery (default for table names):** When `extra_columns` is omitted and the source is a plain table name, `data_diff` auto-discovers all non-key columns from the database catalog and excludes columns using two detection layers:
+
+1. **Name-pattern matching** — columns named like `updated_at`, `created_at`, `inserted_at`, `modified_at`, `publisher_last_updated_epoch_ms`, ETL metadata columns like `_fivetran_synced`, `_airbyte_extracted_at`, etc.
+2. **Schema-level default detection** — columns with auto-generating timestamp defaults (`DEFAULT NOW()`, `DEFAULT CURRENT_TIMESTAMP`, `GETDATE()`, `SYSDATE()`, `SYSTIMESTAMP`, etc.), detected directly from the database catalog. This catches columns that don't follow naming conventions but still auto-generate values on INSERT. Works across PostgreSQL, MySQL, Snowflake, SQL Server, Oracle, ClickHouse, DuckDB, SQLite, and Redshift.
+
+The output lists which columns were auto-excluded and why.

 **SQL queries:** When source is a SQL query (not a table name), auto-discovery cannot work. You **must** provide `extra_columns` explicitly. If you don't, only key-level matching occurs.

@@ -287,3 +336,6 @@ The Rust engine **only compares columns listed in `extra_columns`**. If the list

 **Omitting extra_columns when source is a SQL query**
 → Auto-discovery only works for table names. For SQL queries, always list the columns to compare explicitly.
+
+**Silently excluding auto-timestamp columns without asking the user**
+→ Always present detected auto-timestamp columns (Step 4) and get explicit confirmation. In migration scenarios, `created_at` should be *identical* — excluding it silently hides real bugs.
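The Step 4 flow ends with a user decision. One hypothetical way an agent could fold that decision back into the `extra_columns` list (the helper name and signature are assumptions for this sketch, not part of the skill or the tool):

```typescript
// Hypothetical helper: start from all comparable columns, drop the detected
// auto-timestamp columns, but keep any the user explicitly asked to include.
function applyExclusions(allColumns: string[], detected: string[], userKeeps: string[]): string[] {
  const keep = new Set(userKeeps.map((c) => c.toLowerCase()))
  const drop = new Set(
    detected.filter((c) => !keep.has(c.toLowerCase())).map((c) => c.toLowerCase()),
  )
  return allColumns.filter((c) => !drop.has(c.toLowerCase()))
}
```

In the migration scenario from the "Why ask?" note, the user keeps `created_at`, so it survives the filter and gets compared.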

packages/opencode/src/altimate/native/connections/data-diff.ts

Lines changed: 140 additions & 21 deletions
@@ -147,8 +147,70 @@ function isAuditColumn(columnName: string): boolean {
   return AUDIT_COLUMN_PATTERNS.some((pattern) => pattern.test(columnName))
 }

+// ---------------------------------------------------------------------------
+// Auto-timestamp default detection (schema-level)
+// ---------------------------------------------------------------------------
+
+/**
+ * Patterns that detect auto-generated timestamp/date defaults in column_default
+ * expressions. These functions produce the current time when a row is inserted
+ * (or updated), meaning the column value will inherently differ between source
+ * and target — not because of actual data discrepancies, but because of when
+ * each copy was written.
+ *
+ * Covers: PostgreSQL, MySQL/MariaDB, Snowflake, SQL Server, Oracle,
+ * ClickHouse, DuckDB, SQLite, Redshift, BigQuery, Databricks.
+ */
+const AUTO_TIMESTAMP_DEFAULT_PATTERNS = [
+  // PostgreSQL, DuckDB, Redshift
+  /\bnow\s*\(\)/i,
+  /\bclock_timestamp\s*\(\)/i,
+  /\bstatement_timestamp\s*\(\)/i,
+  /\btransaction_timestamp\s*\(\)/i,
+  /\blocaltimestamp\b/i,
+  // Standard SQL — used by most dialects
+  /\bcurrent_timestamp\b/i,
+  // MySQL / MariaDB — "ON UPDATE CURRENT_TIMESTAMP" in the EXTRA column
+  /\bon\s+update\s+current_timestamp/i,
+  // Snowflake
+  /\bsysdate\s*\(\)/i,
+  // SQL Server
+  /\bgetdate\s*\(\)/i,
+  /\bsysdatetime\s*\(\)/i,
+  /\bsysutcdatetime\s*\(\)/i,
+  /\bsysdatetimeoffset\s*\(\)/i,
+  // Oracle
+  /\bSYSDATE\b/i,
+  /\bSYSTIMESTAMP\b/i,
+  // ClickHouse
+  /\btoday\s*\(\)/i,
+  // SQLite
+  /\bdatetime\s*\(\s*'now'/i,
+]
+
+/**
+ * Check whether a column_default expression contains an auto-generating
+ * timestamp function. Also matches expressions that *contain* these functions
+ * (e.g. `(now() + '1 mon'::interval)`).
+ */
+function isAutoTimestampDefault(defaultExpr: string | null): boolean {
+  if (!defaultExpr) return false
+  return AUTO_TIMESTAMP_DEFAULT_PATTERNS.some((pattern) => pattern.test(defaultExpr))
+}
+
+// ---------------------------------------------------------------------------
+// Column discovery (names + defaults) — dialect-aware
+// ---------------------------------------------------------------------------
+
+interface ColumnInfo {
+  name: string
+  defaultExpr: string | null
+}
+
 /**
- * Build a query to discover column names for a table, appropriate for the dialect.
+ * Build a query to discover column names and default expressions for a table.
+ * Returns both pieces of information in a single round-trip so we can detect
+ * auto-timestamp defaults without an extra query.
  */
 function buildColumnDiscoverySQL(tableName: string, dialect: string): string {
   // Parse schema.table or db.schema.table
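As a quick check of the containment behavior the doc comment above mentions, two of the patterns can be exercised in isolation. This reproduces a subset of `AUTO_TIMESTAMP_DEFAULT_PATTERNS` for the example; the full list is in the diff above.

```typescript
// Two patterns reproduced from the list above. RegExp.test() scans anywhere
// in the string, so a default that merely *contains* the function still matches.
const patterns: RegExp[] = [/\bnow\s*\(\)/i, /\bcurrent_timestamp\b/i]

function matchesAutoTimestamp(defaultExpr: string | null): boolean {
  if (!defaultExpr) return false
  return patterns.some((p) => p.test(defaultExpr))
}
```

This is why `DEFAULT (now() + '1 mon'::interval)` is treated as auto-generating even though the default is an expression, not a bare function call.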
@@ -168,33 +230,85 @@ function buildColumnDiscoverySQL(tableName: string, dialect: string): string {

   switch (dialect) {
     case "clickhouse":
+      // Returns: name, type, default_type, default_expression, ...
       return `DESCRIBE TABLE ${tableName}`
     case "snowflake":
+      // Returns: table_name, schema_name, column_name, data_type, null?, default, ...
       return `SHOW COLUMNS IN TABLE ${tableName}`
+    case "mysql":
+    case "mariadb": {
+      // MySQL puts "on update CURRENT_TIMESTAMP" in the EXTRA column, not column_default
+      const conditions = [tableFilter]
+      if (schemaFilter) conditions.push(schemaFilter)
+      return `SELECT column_name, column_default, extra FROM information_schema.columns WHERE ${conditions.join(" AND ")} ORDER BY ordinal_position`
+    }
+    case "oracle": {
+      // Oracle uses ALL_TAB_COLUMNS (no information_schema)
+      const oracleTable = parts[parts.length - 1]
+      const conditions = [`TABLE_NAME = '${oracleTable.toUpperCase()}'`]
+      if (parts.length >= 2) {
+        conditions.push(`OWNER = '${parts[parts.length - 2].toUpperCase()}'`)
+      }
+      return `SELECT COLUMN_NAME, DATA_DEFAULT FROM ALL_TAB_COLUMNS WHERE ${conditions.join(" AND ")} ORDER BY COLUMN_ID`
+    }
+    case "sqlite": {
+      // PRAGMA table_info returns: cid, name, type, notnull, dflt_value, pk
+      const table = parts[parts.length - 1]
+      return `PRAGMA table_info('${table}')`
+    }
     default: {
-      // Postgres, MySQL, Redshift, DuckDB, etc. — use information_schema
+      // Postgres, Redshift, DuckDB, SQL Server, BigQuery, Databricks, etc.
       const conditions = [tableFilter]
       if (schemaFilter) conditions.push(schemaFilter)
-      return `SELECT column_name FROM information_schema.columns WHERE ${conditions.join(" AND ")} ORDER BY ordinal_position`
+      return `SELECT column_name, column_default FROM information_schema.columns WHERE ${conditions.join(" AND ")} ORDER BY ordinal_position`
     }
   }
 }

 /**
- * Parse column names from the discovery query result, handling dialect differences.
+ * Parse column info (name + default expression) from the discovery query result,
+ * handling dialect-specific output formats.
  */
-function parseColumnNames(rows: (string | null)[][], dialect: string): string[] {
+function parseColumnInfo(rows: (string | null)[][], dialect: string): ColumnInfo[] {
   switch (dialect) {
     case "clickhouse":
-      // DESCRIBE returns: name, type, default_type, default_expression, ...
-      return rows.map((r) => r[0] ?? "").filter(Boolean)
+      // DESCRIBE: name[0], type[1], default_type[2], default_expression[3], ...
+      return rows.map((r) => ({
+        name: r[0] ?? "",
+        defaultExpr: r[3] ?? null,
+      })).filter((c) => c.name)
     case "snowflake":
-      // SHOW COLUMNS returns: table_name, schema_name, column_name, data_type, ...
-      // column_name is at index 2
-      return rows.map((r) => r[2] ?? "").filter(Boolean)
+      // SHOW COLUMNS: table_name[0], schema_name[1], column_name[2], data_type[3], null?[4], default[5], ...
+      return rows.map((r) => ({
+        name: r[2] ?? "",
+        defaultExpr: r[5] ?? null,
+      })).filter((c) => c.name)
+    case "oracle":
+      // ALL_TAB_COLUMNS: COLUMN_NAME[0], DATA_DEFAULT[1]
+      return rows.map((r) => ({
+        name: r[0] ?? "",
+        defaultExpr: r[1] ?? null,
+      })).filter((c) => c.name)
+    case "sqlite":
+      // PRAGMA table_info: cid[0], name[1], type[2], notnull[3], dflt_value[4], pk[5]
+      return rows.map((r) => ({
+        name: r[1] ?? "",
+        defaultExpr: r[4] ?? null,
+      })).filter((c) => c.name)
+    case "mysql":
+    case "mariadb":
+      // column_name[0], column_default[1], extra[2]
+      // Merge default + extra — MySQL puts "on update CURRENT_TIMESTAMP" in extra
+      return rows.map((r) => ({
+        name: r[0] ?? "",
+        defaultExpr: [r[1], r[2]].filter(Boolean).join(" ") || null,
+      })).filter((c) => c.name)
     default:
-      // information_schema returns: column_name
-      return rows.map((r) => r[0] ?? "").filter(Boolean)
+      // Postgres, Redshift, DuckDB, SQL Server, BigQuery: column_name[0], column_default[1]
+      return rows.map((r) => ({
+        name: r[0] ?? "",
+        defaultExpr: r[1] ?? null,
+      })).filter((c) => c.name)
   }
 }

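A condensed sketch of the dialect dispatch in `buildColumnDiscoverySQL`, reduced to three branches and without the schema-filter and name-parsing handling of the real function (so the function name here is the sketch's, not the source's):

```typescript
// Minimal dialect dispatch: each branch returns a single query that yields
// both the column name and its default expression in one round-trip.
function discoverySQL(table: string, dialect: string): string {
  switch (dialect) {
    case "sqlite":
      // PRAGMA table_info yields dflt_value at index 4
      return `PRAGMA table_info('${table}')`
    case "clickhouse":
      // DESCRIBE yields default_expression at index 3
      return `DESCRIBE TABLE ${table}`
    default:
      // information_schema path used by Postgres, Redshift, DuckDB, etc.
      return `SELECT column_name, column_default FROM information_schema.columns WHERE table_name = '${table}' ORDER BY ordinal_position`
  }
}
```

The single-round-trip property is the point: defaults come back alongside names, so auto-timestamp detection adds no extra catalog query.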
@@ -204,8 +318,13 @@ function parseColumnNames(rows: (string | null)[][], dialect: string): string[]
  * When the caller omits `extra_columns`, we query the source table's schema to
  * find all columns, then exclude:
  * 1. Key columns (already used for matching)
- * 2. Audit/timestamp columns (updated_at, created_at, etc.) that typically
- *    differ between source and target due to ETL timing
+ * 2. Audit/timestamp columns matched by name pattern (updated_at, created_at, etc.)
+ * 3. Columns with auto-generating timestamp defaults (DEFAULT NOW(), CURRENT_TIMESTAMP,
+ *    GETDATE(), SYSDATE, etc.) — detected from the database catalog
+ *
+ * The schema-level default detection (layer 3) catches columns that don't follow
+ * naming conventions but still auto-generate values on INSERT — these inherently
+ * differ between source and target due to when each copy was written.
  *
  * Returns the list of columns to compare, or undefined if discovery fails
  * (in which case the engine falls back to key-only comparison).
@@ -222,20 +341,20 @@ async function discoverExtraColumns(
   try {
     const sql = buildColumnDiscoverySQL(tableName, dialect)
     const rows = await executeQuery(sql, warehouseName)
-    const allColumns = parseColumnNames(rows, dialect)
+    const columnInfos = parseColumnInfo(rows, dialect)

-    if (allColumns.length === 0) return undefined
+    if (columnInfos.length === 0) return undefined

     const keySet = new Set(keyColumns.map((k) => k.toLowerCase()))
     const extraColumns: string[] = []
     const excludedAudit: string[] = []

-    for (const col of allColumns) {
-      if (keySet.has(col.toLowerCase())) continue
-      if (isAuditColumn(col)) {
-        excludedAudit.push(col)
+    for (const col of columnInfos) {
+      if (keySet.has(col.name.toLowerCase())) continue
+      if (isAuditColumn(col.name) || isAutoTimestampDefault(col.defaultExpr)) {
+        excludedAudit.push(col.name)
       } else {
-        extraColumns.push(col)
+        extraColumns.push(col.name)
       }
     }

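Stripped of I/O, the loop in `discoverExtraColumns` reduces to a pure partition over column metadata. A minimal standalone version (the regexes here are simplified stand-ins for `isAuditColumn` and `isAutoTimestampDefault`):

```typescript
interface Col { name: string; defaultExpr: string | null }

// Partition non-key columns into "compare" vs "exclude" using the two layers:
// name patterns first, then auto-generating defaults from the catalog.
function splitColumns(cols: Col[], keyColumns: string[]) {
  const keySet = new Set(keyColumns.map((k) => k.toLowerCase()))
  const isAuditName = (n: string) => /(^|_)(created|updated|inserted|modified)_at$/i.test(n)
  const isAutoDefault = (d: string | null) =>
    d !== null && (/\bnow\s*\(\)/i.test(d) || /\bcurrent_timestamp\b/i.test(d))
  const extra: string[] = []
  const excluded: string[] = []
  for (const c of cols) {
    if (keySet.has(c.name.toLowerCase())) continue
    if (isAuditName(c.name) || isAutoDefault(c.defaultExpr)) excluded.push(c.name)
    else extra.push(c.name)
  }
  return { extra, excluded }
}
```

Note how `load_ts` (no audit-style name) is excluded purely by its `CURRENT_TIMESTAMP` default, the case layer 2 exists for.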
packages/opencode/src/altimate/tools/data-diff.ts

Lines changed: 4 additions & 3 deletions
@@ -38,7 +38,8 @@ export const DataDiffTool = Tool.define("data_diff", {
     .describe(
       "Columns to compare beyond the key columns. " +
       "IMPORTANT: If omitted AND source is a plain table name, columns are auto-discovered from the schema " +
-      "(excluding key columns and audit/timestamp columns like updated_at, created_at, inserted_at, modified_at). " +
+      "(excluding key columns, audit/timestamp columns matched by name like updated_at/created_at, " +
+      "and columns with auto-generating timestamp defaults like DEFAULT NOW()/CURRENT_TIMESTAMP/GETDATE()/SYSDATE). " +
       "If omitted AND source is a SQL query, ONLY key columns are compared — value changes in non-key columns will NOT be detected. " +
       "Always provide explicit extra_columns when comparing SQL queries to ensure value-level comparison."
     ),
@@ -117,10 +118,10 @@ export const DataDiffTool = Tool.define("data_diff", {
       output += formatPartitionResults(result.partition_results, args.partition_column!)
     }

-    // Report auto-excluded audit columns so the LLM and user know what was skipped
+    // Report auto-excluded columns so the LLM and user know what was skipped
     const excluded = (result as any).excluded_audit_columns as string[] | undefined
     if (excluded && excluded.length > 0) {
-      output += `\n\n  Note: ${excluded.length} audit/timestamp column${excluded.length === 1 ? "" : "s"} auto-excluded from comparison: ${excluded.join(", ")}`
+      output += `\n\n  Note: ${excluded.length} column${excluded.length === 1 ? "" : "s"} auto-excluded from comparison (audit name patterns + auto-timestamp defaults like NOW()/CURRENT_TIMESTAMP): ${excluded.join(", ")}`
     }

     return {