Commit e3df5a4

aidtyasuryaiyer95 authored and committed
feat: detect auto-timestamp defaults from database catalog and confirm exclusions with user
Column exclusion now has two layers:

1. Name-pattern matching (existing) — updated_at, created_at, _fivetran_synced, etc.
2. Schema-level default detection (new) — queries column_default for NOW(), CURRENT_TIMESTAMP, GETDATE(), SYSDATE, SYSTIMESTAMP, etc. Covers PostgreSQL, MySQL, Snowflake, SQL Server, Oracle, ClickHouse, DuckDB, SQLite, and Redshift in a single round-trip (no extra query).

The skill prompt now instructs the agent to present detected auto-timestamp columns to the user and ask for confirmation before excluding them, since migrations should preserve timestamps while ETL replication regenerates them.
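The two layers described in the commit message can be sketched roughly as follows. This is an illustrative subset, not the actual implementation: the real pattern lists in `data-diff.ts` are longer, and `shouldExcludeColumn` is a hypothetical name for this sketch.

```typescript
// Layer 1: name-pattern matching on the column name itself (illustrative subset).
const NAME_PATTERNS: RegExp[] = [/^updated_at$/i, /^created_at$/i, /^_fivetran_synced$/i]

// Layer 2: schema-level detection of time-generating functions in column_default.
const DEFAULT_PATTERNS: RegExp[] = [
  /\bnow\s*\(\)/i,          // PostgreSQL, DuckDB, Redshift
  /\bcurrent_timestamp\b/i, // standard SQL, MySQL, Snowflake
  /\bgetdate\s*\(\)/i,      // SQL Server
  /\bsysdate\b/i,           // Oracle (also covers Snowflake's SYSDATE())
  /\bsystimestamp\b/i,      // Oracle
]

// A column is excluded if either layer matches.
function shouldExcludeColumn(name: string, defaultExpr: string | null): boolean {
  if (NAME_PATTERNS.some((p) => p.test(name))) return true
  if (defaultExpr !== null && DEFAULT_PATTERNS.some((p) => p.test(defaultExpr))) return true
  return false
}
```

A column like `load_ts DEFAULT now()` escapes layer 1 but is caught by layer 2, which is exactly the gap this commit closes.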
1 parent 4131ea9 commit e3df5a4

File tree: 3 files changed (+210 −38 lines)

.opencode/skills/data-parity/SKILL.md

Lines changed: 66 additions & 14 deletions
@@ -12,13 +12,14 @@ description: Validate that two tables or query results are identical — or diag
 ```
 Here's my plan:
 1. [ ] List available warehouse connections
-2. [ ] Inspect schema and discover primary key candidates
+2. [ ] Inspect schema, discover primary key candidates, and detect auto-timestamp columns
 3. [ ] Confirm primary keys with you
-4. [ ] Check row counts on both sides
-5. [ ] Run column-level profile (cheap — no row scan)
-6. [ ] Ask whether to proceed with row-level diff (may be expensive for large tables)
-7. [ ] Run targeted row-level diff on diverging columns only
-8. [ ] Report findings
+4. [ ] Confirm which auto-timestamp columns to exclude
+5. [ ] Check row counts on both sides
+6. [ ] Run column-level profile (cheap — no row scan)
+7. [ ] Ask whether to proceed with row-level diff (may be expensive for large tables)
+8. [ ] Run targeted row-level diff on diverging columns only
+9. [ ] Report findings
 ```

 Update each item to `[x]` as you complete it. This plan should be visible before any tool is called.
@@ -45,13 +46,13 @@ Use `warehouse_list` to show the user what connections are available and which w

 ---

-## Step 2: Inspect Schema and Discover Primary Keys
+## Step 2: Inspect Schema, Discover Primary Keys, and Detect Auto-Timestamp Columns

-Use `sql_query` to get columns and identify key candidates:
+Use `sql_query` to get columns, defaults, and identify key candidates:

 ```sql
 -- Postgres / Redshift / DuckDB
-SELECT column_name, data_type, is_nullable
+SELECT column_name, data_type, is_nullable, column_default
 FROM information_schema.columns
 WHERE table_schema = 'public' AND table_name = 'orders'
 ORDER BY ordinal_position
@@ -62,13 +63,30 @@ ORDER BY ordinal_position
 SHOW COLUMNS IN TABLE orders
 ```

+```sql
+-- MySQL / MariaDB (also fetch EXTRA for ON UPDATE detection)
+SELECT column_name, data_type, is_nullable, column_default, extra
+FROM information_schema.columns
+WHERE table_schema = 'mydb' AND table_name = 'orders'
+ORDER BY ordinal_position
+```
+
 ```sql
 -- ClickHouse
 DESCRIBE TABLE source_db.events
 ```

 **Look for:** columns named `id`, `*_id`, `*_key`, `uuid`, or with `NOT NULL` + unique index.

+**Also look for auto-timestamp columns** — any column whose `column_default` contains a time-generating function:
+- PostgreSQL/DuckDB/Redshift: `now()`, `CURRENT_TIMESTAMP`, `clock_timestamp()`
+- MySQL/MariaDB: `CURRENT_TIMESTAMP` (in default or EXTRA)
+- Snowflake: `CURRENT_TIMESTAMP()`, `SYSDATE()`
+- SQL Server: `getdate()`, `sysdatetime()`
+- Oracle: `SYSDATE`, `SYSTIMESTAMP`
+
+These columns auto-generate values on INSERT, so they inherently differ between source and target due to write timing — not because of actual data discrepancies. **Collect them for confirmation in Step 4.**
+
 If no obvious PK, run a cardinality check:

 ```sql
@@ -101,7 +119,33 @@ Do not proceed to diff until the user confirms or corrects.

 ---

-## Step 4: Check Row Counts
+## Step 4: Confirm Auto-Timestamp Column Exclusions
+
+If you detected any columns with auto-generating timestamp defaults in Step 2, **present them to the user and ask for confirmation** before excluding them.
+
+**Example prompt when auto-timestamp columns are found:**
+
+> "I found **3 columns** with auto-generating timestamp defaults that will inherently differ between source and target (due to when each row was written, not actual data differences):
+>
+> | Column | Default | Reason to exclude |
+> |--------|---------|-------------------|
+> | `created_at` | `DEFAULT now()` | Set on insert — reflects when this copy was written |
+> | `updated_at` | `DEFAULT now()` | Set on insert — reflects when this copy was written |
+> | `_loaded_at` | `DEFAULT CURRENT_TIMESTAMP` | ETL load timestamp |
+>
+> Should I **exclude** these from the comparison? Or do you want to include any of them (e.g., if you're verifying that `created_at` was preserved during migration)?"
+
+**If user confirms exclusion:** Omit those columns from `extra_columns` when calling `data_diff`.
+
+**If user wants to include some:** Add them explicitly to `extra_columns`.
+
+**If no auto-timestamp columns were detected:** Skip this step and proceed to Step 5.
+
+> **Why ask?** In migration validation, `created_at` should often be *identical* between source and target (it was migrated, not regenerated). But in ETL replication, `created_at` is freshly generated on each side and *should* differ. Only the user knows which case applies.
+
+---
+
+## Step 5: Check Row Counts

 ```sql
 SELECT COUNT(*) FROM orders -- run on both source and target
@@ -114,7 +158,7 @@ Use counts to:

 ---

-## Step 5: Column-Level Profile (Always Run This First)
+## Step 6: Column-Level Profile (Always Run This First)

 Profile is cheap — it runs aggregates, not row scans. **Always run profile before row-level diff.**

@@ -148,7 +192,7 @@ Column Profile Comparison

 ---

-## Step 6: Ask Before Running Row-Level Diff on Large Tables
+## Step 7: Ask Before Running Row-Level Diff on Large Tables

 After profiling, check row count and **ask the user** before proceeding:

@@ -166,7 +210,7 @@ After profiling, check row count and **ask the user** before proceeding:

 ---

-## Step 7: Run Targeted Row-Level Diff
+## Step 8: Run Targeted Row-Level Diff

 Use only the columns that the profile said differ. This is faster and produces cleaner output.

@@ -260,7 +304,12 @@ Output includes aggregate diff + per-partition breakdown showing which group has

 The Rust engine **only compares columns listed in `extra_columns`**. If the list is empty, it compares key existence only — rows that match on key but differ in values will be silently reported as "identical". This is the most common source of false positives.

-**Auto-discovery (default for table names):** When `extra_columns` is omitted and the source is a plain table name, `data_diff` auto-discovers all non-key columns from `information_schema` and excludes audit/timestamp columns (like `updated_at`, `created_at`, `inserted_at`, `modified_at`, `publisher_last_updated_epoch_ms`, ETL metadata columns like `_fivetran_synced`, etc.). The output will list which columns were auto-excluded.
+**Auto-discovery (default for table names):** When `extra_columns` is omitted and the source is a plain table name, `data_diff` auto-discovers all non-key columns from the database catalog and excludes columns using two detection layers:
+
+1. **Name-pattern matching** — columns named like `updated_at`, `created_at`, `inserted_at`, `modified_at`, `publisher_last_updated_epoch_ms`, ETL metadata columns like `_fivetran_synced`, `_airbyte_extracted_at`, etc.
+2. **Schema-level default detection** — columns with auto-generating timestamp defaults (`DEFAULT NOW()`, `DEFAULT CURRENT_TIMESTAMP`, `GETDATE()`, `SYSDATE()`, `SYSTIMESTAMP`, etc.), detected directly from the database catalog. This catches columns that don't follow naming conventions but still auto-generate values on INSERT. Works across PostgreSQL, MySQL, Snowflake, SQL Server, Oracle, ClickHouse, DuckDB, SQLite, and Redshift.
+
+The output lists which columns were auto-excluded and why.

 **SQL queries:** When source is a SQL query (not a table name), auto-discovery cannot work. You **must** provide `extra_columns` explicitly. If you don't, only key-level matching occurs.

@@ -287,3 +336,6 @@ The Rust engine **only compares columns listed in `extra_columns`**. If the list

 **Omitting extra_columns when source is a SQL query**
 → Auto-discovery only works for table names. For SQL queries, always list the columns to compare explicitly.
+
+**Silently excluding auto-timestamp columns without asking the user**
+→ Always present detected auto-timestamp columns (Step 4) and get explicit confirmation. In migration scenarios, `created_at` should be *identical* — excluding it silently hides real bugs.
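The Step 4 flow ends with a user decision. One hypothetical way an agent could fold that decision back into the `extra_columns` list (the helper name and signature are assumptions for this sketch, not part of the skill or the tool):

```typescript
// Hypothetical helper: start from all comparable columns, drop the detected
// auto-timestamp columns, but keep any the user explicitly asked to include.
function applyExclusions(allColumns: string[], detected: string[], userKeeps: string[]): string[] {
  const keep = new Set(userKeeps.map((c) => c.toLowerCase()))
  const drop = new Set(
    detected.filter((c) => !keep.has(c.toLowerCase())).map((c) => c.toLowerCase()),
  )
  return allColumns.filter((c) => !drop.has(c.toLowerCase()))
}
```

In the migration scenario from the "Why ask?" note, the user keeps `created_at`, so it survives the filter and gets compared.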

packages/opencode/src/altimate/native/connections/data-diff.ts

Lines changed: 140 additions & 21 deletions
@@ -147,8 +147,70 @@ function isAuditColumn(columnName: string): boolean {
   return AUDIT_COLUMN_PATTERNS.some((pattern) => pattern.test(columnName))
 }

+// ---------------------------------------------------------------------------
+// Auto-timestamp default detection (schema-level)
+// ---------------------------------------------------------------------------
+
+/**
+ * Patterns that detect auto-generated timestamp/date defaults in column_default
+ * expressions. These functions produce the current time when a row is inserted
+ * (or updated), meaning the column value will inherently differ between source
+ * and target — not because of actual data discrepancies, but because of when
+ * each copy was written.
+ *
+ * Covers: PostgreSQL, MySQL/MariaDB, Snowflake, SQL Server, Oracle,
+ * ClickHouse, DuckDB, SQLite, Redshift, BigQuery, Databricks.
+ */
+const AUTO_TIMESTAMP_DEFAULT_PATTERNS = [
+  // PostgreSQL, DuckDB, Redshift
+  /\bnow\s*\(\)/i,
+  /\bclock_timestamp\s*\(\)/i,
+  /\bstatement_timestamp\s*\(\)/i,
+  /\btransaction_timestamp\s*\(\)/i,
+  /\blocaltimestamp\b/i,
+  // Standard SQL — used by most dialects
+  /\bcurrent_timestamp\b/i,
+  // MySQL / MariaDB — "ON UPDATE CURRENT_TIMESTAMP" in the EXTRA column
+  /\bon\s+update\s+current_timestamp/i,
+  // Snowflake
+  /\bsysdate\s*\(\)/i,
+  // SQL Server
+  /\bgetdate\s*\(\)/i,
+  /\bsysdatetime\s*\(\)/i,
+  /\bsysutcdatetime\s*\(\)/i,
+  /\bsysdatetimeoffset\s*\(\)/i,
+  // Oracle
+  /\bSYSDATE\b/i,
+  /\bSYSTIMESTAMP\b/i,
+  // ClickHouse
+  /\btoday\s*\(\)/i,
+  // SQLite
+  /\bdatetime\s*\(\s*'now'/i,
+]
+
+/**
+ * Check whether a column_default expression contains an auto-generating
+ * timestamp function. Also matches expressions that *contain* these functions
+ * (e.g. `(now() + '1 mon'::interval)`).
+ */
+function isAutoTimestampDefault(defaultExpr: string | null): boolean {
+  if (!defaultExpr) return false
+  return AUTO_TIMESTAMP_DEFAULT_PATTERNS.some((pattern) => pattern.test(defaultExpr))
+}
+
+// ---------------------------------------------------------------------------
+// Column discovery (names + defaults) — dialect-aware
+// ---------------------------------------------------------------------------
+
+interface ColumnInfo {
+  name: string
+  defaultExpr: string | null
+}
+
 /**
- * Build a query to discover column names for a table, appropriate for the dialect.
+ * Build a query to discover column names and default expressions for a table.
+ * Returns both pieces of information in a single round-trip so we can detect
+ * auto-timestamp defaults without an extra query.
  */
 function buildColumnDiscoverySQL(tableName: string, dialect: string): string {
   // Parse schema.table or db.schema.table
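As a quick check of the containment behavior the doc comment above mentions, two of the patterns can be exercised in isolation. This reproduces a subset of `AUTO_TIMESTAMP_DEFAULT_PATTERNS` for the example; the full list is in the diff above.

```typescript
// Two patterns reproduced from the list above. RegExp.test() scans anywhere
// in the string, so a default that merely *contains* the function still matches.
const patterns: RegExp[] = [/\bnow\s*\(\)/i, /\bcurrent_timestamp\b/i]

function matchesAutoTimestamp(defaultExpr: string | null): boolean {
  if (!defaultExpr) return false
  return patterns.some((p) => p.test(defaultExpr))
}
```

This is why `DEFAULT (now() + '1 mon'::interval)` is treated as auto-generating even though the default is an expression, not a bare function call.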
@@ -168,33 +230,85 @@ function buildColumnDiscoverySQL(tableName: string, dialect: string): string {

   switch (dialect) {
     case "clickhouse":
+      // Returns: name, type, default_type, default_expression, ...
       return `DESCRIBE TABLE ${tableName}`
     case "snowflake":
+      // Returns: table_name, schema_name, column_name, data_type, null?, default, ...
       return `SHOW COLUMNS IN TABLE ${tableName}`
+    case "mysql":
+    case "mariadb": {
+      // MySQL puts "on update CURRENT_TIMESTAMP" in the EXTRA column, not column_default
+      const conditions = [tableFilter]
+      if (schemaFilter) conditions.push(schemaFilter)
+      return `SELECT column_name, column_default, extra FROM information_schema.columns WHERE ${conditions.join(" AND ")} ORDER BY ordinal_position`
+    }
+    case "oracle": {
+      // Oracle uses ALL_TAB_COLUMNS (no information_schema)
+      const oracleTable = parts[parts.length - 1]
+      const conditions = [`TABLE_NAME = '${oracleTable.toUpperCase()}'`]
+      if (parts.length >= 2) {
+        conditions.push(`OWNER = '${parts[parts.length - 2].toUpperCase()}'`)
+      }
+      return `SELECT COLUMN_NAME, DATA_DEFAULT FROM ALL_TAB_COLUMNS WHERE ${conditions.join(" AND ")} ORDER BY COLUMN_ID`
+    }
+    case "sqlite": {
+      // PRAGMA table_info returns: cid, name, type, notnull, dflt_value, pk
+      const table = parts[parts.length - 1]
+      return `PRAGMA table_info('${table}')`
+    }
     default: {
-      // Postgres, MySQL, Redshift, DuckDB, etc. — use information_schema
+      // Postgres, Redshift, DuckDB, SQL Server, BigQuery, Databricks, etc.
       const conditions = [tableFilter]
       if (schemaFilter) conditions.push(schemaFilter)
-      return `SELECT column_name FROM information_schema.columns WHERE ${conditions.join(" AND ")} ORDER BY ordinal_position`
+      return `SELECT column_name, column_default FROM information_schema.columns WHERE ${conditions.join(" AND ")} ORDER BY ordinal_position`
     }
   }
 }

 /**
- * Parse column names from the discovery query result, handling dialect differences.
+ * Parse column info (name + default expression) from the discovery query result,
+ * handling dialect-specific output formats.
  */
-function parseColumnNames(rows: (string | null)[][], dialect: string): string[] {
+function parseColumnInfo(rows: (string | null)[][], dialect: string): ColumnInfo[] {
   switch (dialect) {
     case "clickhouse":
-      // DESCRIBE returns: name, type, default_type, default_expression, ...
-      return rows.map((r) => r[0] ?? "").filter(Boolean)
+      // DESCRIBE: name[0], type[1], default_type[2], default_expression[3], ...
+      return rows.map((r) => ({
+        name: r[0] ?? "",
+        defaultExpr: r[3] ?? null,
+      })).filter((c) => c.name)
     case "snowflake":
-      // SHOW COLUMNS returns: table_name, schema_name, column_name, data_type, ...
-      // column_name is at index 2
-      return rows.map((r) => r[2] ?? "").filter(Boolean)
+      // SHOW COLUMNS: table_name[0], schema_name[1], column_name[2], data_type[3], null?[4], default[5], ...
+      return rows.map((r) => ({
+        name: r[2] ?? "",
+        defaultExpr: r[5] ?? null,
+      })).filter((c) => c.name)
+    case "oracle":
+      // ALL_TAB_COLUMNS: COLUMN_NAME[0], DATA_DEFAULT[1]
+      return rows.map((r) => ({
+        name: r[0] ?? "",
+        defaultExpr: r[1] ?? null,
+      })).filter((c) => c.name)
+    case "sqlite":
+      // PRAGMA table_info: cid[0], name[1], type[2], notnull[3], dflt_value[4], pk[5]
+      return rows.map((r) => ({
+        name: r[1] ?? "",
+        defaultExpr: r[4] ?? null,
+      })).filter((c) => c.name)
+    case "mysql":
+    case "mariadb":
+      // column_name[0], column_default[1], extra[2]
+      // Merge default + extra — MySQL puts "on update CURRENT_TIMESTAMP" in extra
+      return rows.map((r) => ({
+        name: r[0] ?? "",
+        defaultExpr: [r[1], r[2]].filter(Boolean).join(" ") || null,
+      })).filter((c) => c.name)
     default:
-      // information_schema returns: column_name
-      return rows.map((r) => r[0] ?? "").filter(Boolean)
+      // Postgres, Redshift, DuckDB, SQL Server, BigQuery: column_name[0], column_default[1]
+      return rows.map((r) => ({
+        name: r[0] ?? "",
+        defaultExpr: r[1] ?? null,
+      })).filter((c) => c.name)
   }
 }

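A condensed sketch of the dialect dispatch in `buildColumnDiscoverySQL`, reduced to three branches and without the schema-filter and name-parsing handling of the real function (so the function name here is the sketch's, not the source's):

```typescript
// Minimal dialect dispatch: each branch returns a single query that yields
// both the column name and its default expression in one round-trip.
function discoverySQL(table: string, dialect: string): string {
  switch (dialect) {
    case "sqlite":
      // PRAGMA table_info yields dflt_value at index 4
      return `PRAGMA table_info('${table}')`
    case "clickhouse":
      // DESCRIBE yields default_expression at index 3
      return `DESCRIBE TABLE ${table}`
    default:
      // information_schema path used by Postgres, Redshift, DuckDB, etc.
      return `SELECT column_name, column_default FROM information_schema.columns WHERE table_name = '${table}' ORDER BY ordinal_position`
  }
}
```

The single-round-trip property is the point: defaults come back alongside names, so auto-timestamp detection adds no extra catalog query.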
@@ -204,8 +318,13 @@ function parseColumnNames(rows: (string | null)[][], dialect: string): string[]
  * When the caller omits `extra_columns`, we query the source table's schema to
  * find all columns, then exclude:
  * 1. Key columns (already used for matching)
- * 2. Audit/timestamp columns (updated_at, created_at, etc.) that typically
- *    differ between source and target due to ETL timing
+ * 2. Audit/timestamp columns matched by name pattern (updated_at, created_at, etc.)
+ * 3. Columns with auto-generating timestamp defaults (DEFAULT NOW(), CURRENT_TIMESTAMP,
+ *    GETDATE(), SYSDATE, etc.) — detected from the database catalog
+ *
+ * The schema-level default detection (layer 3) catches columns that don't follow
+ * naming conventions but still auto-generate values on INSERT — these inherently
+ * differ between source and target due to when each copy was written.
  *
  * Returns the list of columns to compare, or undefined if discovery fails
  * (in which case the engine falls back to key-only comparison).
@@ -222,20 +341,20 @@ async function discoverExtraColumns(
   try {
     const sql = buildColumnDiscoverySQL(tableName, dialect)
     const rows = await executeQuery(sql, warehouseName)
-    const allColumns = parseColumnNames(rows, dialect)
+    const columnInfos = parseColumnInfo(rows, dialect)

-    if (allColumns.length === 0) return undefined
+    if (columnInfos.length === 0) return undefined

     const keySet = new Set(keyColumns.map((k) => k.toLowerCase()))
     const extraColumns: string[] = []
     const excludedAudit: string[] = []

-    for (const col of allColumns) {
-      if (keySet.has(col.toLowerCase())) continue
-      if (isAuditColumn(col)) {
-        excludedAudit.push(col)
+    for (const col of columnInfos) {
+      if (keySet.has(col.name.toLowerCase())) continue
+      if (isAuditColumn(col.name) || isAutoTimestampDefault(col.defaultExpr)) {
+        excludedAudit.push(col.name)
       } else {
-        extraColumns.push(col)
+        extraColumns.push(col.name)
       }
     }

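Stripped of I/O, the loop in `discoverExtraColumns` reduces to a pure partition over column metadata. A minimal standalone version (the regexes here are simplified stand-ins for `isAuditColumn` and `isAutoTimestampDefault`):

```typescript
interface Col { name: string; defaultExpr: string | null }

// Partition non-key columns into "compare" vs "exclude" using the two layers:
// name patterns first, then auto-generating defaults from the catalog.
function splitColumns(cols: Col[], keyColumns: string[]) {
  const keySet = new Set(keyColumns.map((k) => k.toLowerCase()))
  const isAuditName = (n: string) => /(^|_)(created|updated|inserted|modified)_at$/i.test(n)
  const isAutoDefault = (d: string | null) =>
    d !== null && (/\bnow\s*\(\)/i.test(d) || /\bcurrent_timestamp\b/i.test(d))
  const extra: string[] = []
  const excluded: string[] = []
  for (const c of cols) {
    if (keySet.has(c.name.toLowerCase())) continue
    if (isAuditName(c.name) || isAutoDefault(c.defaultExpr)) excluded.push(c.name)
    else extra.push(c.name)
  }
  return { extra, excluded }
}
```

Note how `load_ts` (no audit-style name) is excluded purely by its `CURRENT_TIMESTAMP` default, the case layer 2 exists for.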
packages/opencode/src/altimate/tools/data-diff.ts

Lines changed: 4 additions & 3 deletions
@@ -38,7 +38,8 @@ export const DataDiffTool = Tool.define("data_diff", {
     .describe(
       "Columns to compare beyond the key columns. " +
       "IMPORTANT: If omitted AND source is a plain table name, columns are auto-discovered from the schema " +
-      "(excluding key columns and audit/timestamp columns like updated_at, created_at, inserted_at, modified_at). " +
+      "(excluding key columns, audit/timestamp columns matched by name like updated_at/created_at, " +
+      "and columns with auto-generating timestamp defaults like DEFAULT NOW()/CURRENT_TIMESTAMP/GETDATE()/SYSDATE). " +
       "If omitted AND source is a SQL query, ONLY key columns are compared — value changes in non-key columns will NOT be detected. " +
       "Always provide explicit extra_columns when comparing SQL queries to ensure value-level comparison."
     ),
@@ -117,10 +118,10 @@ export const DataDiffTool = Tool.define("data_diff", {
       output += formatPartitionResults(result.partition_results, args.partition_column!)
     }

-    // Report auto-excluded audit columns so the LLM and user know what was skipped
+    // Report auto-excluded columns so the LLM and user know what was skipped
     const excluded = (result as any).excluded_audit_columns as string[] | undefined
     if (excluded && excluded.length > 0) {
-      output += `\n\n  Note: ${excluded.length} audit/timestamp column${excluded.length === 1 ? "" : "s"} auto-excluded from comparison: ${excluded.join(", ")}`
+      output += `\n\n  Note: ${excluded.length} column${excluded.length === 1 ? "" : "s"} auto-excluded from comparison (audit name patterns + auto-timestamp defaults like NOW()/CURRENT_TIMESTAMP): ${excluded.join(", ")}`
     }

     return {