
Commit fcfd122

feat: add partition support to data_diff

Split large tables by a date or numeric column before diffing. Each partition is diffed independently, then results are aggregated.

New params:
- partition_column: column to split on (date or numeric)
- partition_granularity: day | week | month | year (for dates)
- partition_bucket_size: bucket width for numeric columns

New output field:
- partition_results: per-partition breakdown (identical / differ / error)

Dialect-aware SQL: Postgres, Snowflake, BigQuery, ClickHouse, MySQL. Skill updated with partition guidance and examples.

1 parent 71d91ee · commit fcfd122
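The numeric-bucketing rule the commit message describes can be sketched as follows. This is an illustrative TypeScript helper, not code from the commit: a key k lands in the bucket starting at floor(k / size) * size, and each bucket is the half-open range [start, start + size).

```typescript
// Hypothetical helpers mirroring the partition_bucket_size rule described
// in the commit message (not the tool's actual internals).
function bucketStart(key: number, bucketSize: number): number {
  // FLOOR(key / size) * size — the start of the bucket containing `key`
  return Math.floor(key / bucketSize) * bucketSize
}

function bucketRange(key: number, bucketSize: number): [number, number] {
  // Half-open range [lo, hi) used to scope one partition's WHERE clause
  const lo = bucketStart(key, bucketSize)
  return [lo, lo + bucketSize]
}

console.log(bucketStart(123456, 100000)) // 100000
console.log(bucketRange(99999, 100000)) // [0, 100000]
```

With partition_bucket_size=100000, key 123456 is diffed inside the [100000, 200000) partition.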

File tree

4 files changed: +348 −4 lines changed

.opencode/skills/data-parity/SKILL.md

Lines changed: 28 additions & 2 deletions

@@ -44,6 +44,9 @@ description: Validate that two tables or query results are identical — or diag
 - `extra_columns` — columns to compare beyond keys (omit = compare all)
 - `algorithm` — `auto`, `joindiff`, `hashdiff`, `profile`, `cascade`
 - `where_clause` — filter applied to both tables
+- `partition_column` — split the table by this column and diff each group independently (recommended for large tables)
+- `partition_granularity` — `day` | `week` | `month` | `year` for date columns (default: `month`)
+- `partition_bucket_size` — for numeric columns: bucket width (e.g. `100000` splits by ranges of 100K)

 > **CRITICAL — Algorithm choice:**
 > - If `source_warehouse` ≠ `target_warehouse` — **always use `hashdiff`** (or `auto`).
@@ -117,8 +120,31 @@ SELECT COUNT(*) FROM orders

 Use this to choose the algorithm:
 - **< 1M rows**: `joindiff` (same DB) or `hashdiff` (cross-DB) — either is fine
-- **1M–100M rows**: `hashdiff` or `cascade`
-- **> 100M rows**: `hashdiff` with a `where_clause` date filter to validate a recent window first
+- **1M–100M rows**: `hashdiff` with `partition_column` for faster, more precise results
+- **> 100M rows**: `hashdiff` + `partition_column` — required; bisection alone may miss rows at this scale
+
+**When to use `partition_column`:**
+- Table has a natural time or key column (e.g. `created_at`, `order_id`, `event_date`)
+- Table has > 500K rows and bisection is slow or returning incomplete results
+- You need per-partition visibility (which month/range has the problem)
+
+```
+// Date column — partition by month
+data_diff(source="lineitem", target="lineitem",
+          key_columns=["l_orderkey", "l_linenumber"],
+          source_warehouse="pg_source", target_warehouse="pg_target",
+          partition_column="l_shipdate", partition_granularity="month",
+          algorithm="hashdiff")
+
+// Numeric column — partition by key ranges of 100K
+data_diff(source="orders", target="orders",
+          key_columns=["o_orderkey"],
+          source_warehouse="pg_source", target_warehouse="pg_target",
+          partition_column="o_orderkey", partition_bucket_size=100000,
+          algorithm="hashdiff")
+```
+
+Output includes an aggregate diff plus a per-partition table showing exactly which ranges differ.

 ### Step 4: Profile first for unknown tables
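The date-partition guidance above amounts to decomposing one big diff into independent, month-scoped diffs. A minimal TypeScript sketch of that decomposition, assuming Postgres-style `DATE_TRUNC` syntax (the helper name and shape here are illustrative, not from the skill):

```typescript
// Illustrative only: turn a list of month-start values into per-partition
// WHERE clauses, one scoped diff per clause. Assumes Postgres DATE_TRUNC syntax.
function partitionWhereClauses(column: string, monthStarts: string[]): string[] {
  return monthStarts.map((m) => `DATE_TRUNC('month', ${column}) = '${m}'`)
}

const clauses = partitionWhereClauses("l_shipdate", ["1995-01-01", "1995-02-01"])
console.log(clauses[0]) // DATE_TRUNC('month', l_shipdate) = '1995-01-01'
```

Each clause is then combined with any user-supplied `where_clause`, so every partition's diff sees only its own rows.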

packages/opencode/src/altimate/native/connections/data-diff.ts

Lines changed: 232 additions & 1 deletion

@@ -7,7 +7,7 @@
  * This file is the bridge between that engine and altimate-code's drivers.
  */

-import type { DataDiffParams, DataDiffResult } from "../types"
+import type { DataDiffParams, DataDiffResult, PartitionDiffResult } from "../types"
 import * as Registry from "./registry"

 // ---------------------------------------------------------------------------
@@ -119,7 +119,238 @@ async function executeQuery(sql: string, warehouseName: string | undefined): Pro

 const MAX_STEPS = 200

+// ---------------------------------------------------------------------------
+// Partition support
+// ---------------------------------------------------------------------------
+
+/**
+ * Build a DATE_TRUNC expression appropriate for the warehouse dialect.
+ */
+function dateTruncExpr(granularity: string, column: string, dialect: string): string {
+  const g = granularity.toLowerCase()
+  switch (dialect) {
+    case "bigquery":
+      return `DATE_TRUNC(${column}, ${g.toUpperCase()})`
+    case "clickhouse":
+      return `toStartOf${g.charAt(0).toUpperCase() + g.slice(1)}(${column})`
+    case "mysql":
+    case "mariadb": {
+      const fmt = { day: "%Y-%m-%d", week: "%Y-%u", month: "%Y-%m-01", year: "%Y-01-01" }[g] ?? "%Y-%m-01"
+      return `DATE_FORMAT(${column}, '${fmt}')`
+    }
+    default:
+      // Postgres, Snowflake, Redshift, DuckDB, etc.
+      return `DATE_TRUNC('${g}', ${column})`
+  }
+}
+
+/**
+ * Build SQL to discover distinct partition values from the source table.
+ */
+function buildPartitionDiscoverySQL(
+  table: string,
+  partitionColumn: string,
+  granularity: string | undefined,
+  bucketSize: number | undefined,
+  dialect: string,
+  whereClause?: string,
+): string {
+  const isNumeric = bucketSize != null
+
+  let expr: string
+  if (isNumeric) {
+    expr = `FLOOR(${partitionColumn} / ${bucketSize}) * ${bucketSize}`
+  } else {
+    expr = dateTruncExpr(granularity ?? "month", partitionColumn, dialect)
+  }
+
+  const where = whereClause ? `WHERE ${whereClause}` : ""
+  return `SELECT DISTINCT ${expr} AS _p FROM ${table} ${where} ORDER BY _p`
+}
+
+/**
+ * Build a WHERE clause that scopes to a single partition.
+ */
+function buildPartitionWhereClause(
+  partitionColumn: string,
+  partitionValue: string,
+  granularity: string | undefined,
+  bucketSize: number | undefined,
+  dialect: string,
+): string {
+  if (bucketSize != null) {
+    const lo = Number(partitionValue)
+    const hi = lo + bucketSize
+    return `${partitionColumn} >= ${lo} AND ${partitionColumn} < ${hi}`
+  }
+
+  const expr = dateTruncExpr(granularity ?? "month", partitionColumn, dialect)
+
+  // Cast the literal appropriately per dialect
+  switch (dialect) {
+    case "bigquery":
+      return `${expr} = '${partitionValue}'`
+    case "clickhouse":
+      return `${expr} = toDate('${partitionValue}')`
+    case "mysql":
+    case "mariadb":
+      return `${expr} = '${partitionValue}'`
+    default:
+      return `${expr} = '${partitionValue}'`
+  }
+}
+
+/**
+ * Extract DiffStats from a successful outcome (if present).
+ */
+function extractStats(outcome: unknown): {
+  rows_source: number
+  rows_target: number
+  differences: number
+  status: "identical" | "differ"
+} {
+  const o = outcome as any
+  if (!o) return { rows_source: 0, rows_target: 0, differences: 0, status: "identical" }
+
+  if (o.Match) {
+    return {
+      rows_source: o.Match.row_count ?? 0,
+      rows_target: o.Match.row_count ?? 0,
+      differences: 0,
+      status: "identical",
+    }
+  }
+
+  if (o.Diff) {
+    const d = o.Diff
+    return {
+      rows_source: d.total_source_rows ?? 0,
+      rows_target: d.total_target_rows ?? 0,
+      differences: (d.rows_only_in_source ?? 0) + (d.rows_only_in_target ?? 0) + (d.rows_updated ?? 0),
+      status: "differ",
+    }
+  }
+
+  return { rows_source: 0, rows_target: 0, differences: 0, status: "identical" }
+}
+
+/**
+ * Merge two Diff outcomes into one aggregated Diff outcome.
+ */
+function mergeOutcomes(accumulated: unknown, next: unknown): unknown {
+  const a = accumulated as any
+  const n = next as any
+
+  const aD = a?.Diff ?? (a?.Match ? { total_source_rows: a.Match.row_count, total_target_rows: a.Match.row_count, rows_only_in_source: 0, rows_only_in_target: 0, rows_updated: 0, rows_identical: a.Match.row_count, sample_diffs: [] } : null)
+  const nD = n?.Diff ?? (n?.Match ? { total_source_rows: n.Match.row_count, total_target_rows: n.Match.row_count, rows_only_in_source: 0, rows_only_in_target: 0, rows_updated: 0, rows_identical: n.Match.row_count, sample_diffs: [] } : null)
+
+  if (!aD && !nD) return { Match: { row_count: 0 } }
+  if (!aD) return next
+  if (!nD) return accumulated
+
+  const merged = {
+    total_source_rows: (aD.total_source_rows ?? 0) + (nD.total_source_rows ?? 0),
+    total_target_rows: (aD.total_target_rows ?? 0) + (nD.total_target_rows ?? 0),
+    rows_only_in_source: (aD.rows_only_in_source ?? 0) + (nD.rows_only_in_source ?? 0),
+    rows_only_in_target: (aD.rows_only_in_target ?? 0) + (nD.rows_only_in_target ?? 0),
+    rows_updated: (aD.rows_updated ?? 0) + (nD.rows_updated ?? 0),
+    rows_identical: (aD.rows_identical ?? 0) + (nD.rows_identical ?? 0),
+    sample_diffs: [...(aD.sample_diffs ?? []), ...(nD.sample_diffs ?? [])].slice(0, 20),
+  }
+
+  const totalDiff = merged.rows_only_in_source + merged.rows_only_in_target + merged.rows_updated
+  if (totalDiff === 0) {
+    return { Match: { row_count: merged.total_source_rows, algorithm: "partitioned" } }
+  }
+  return { Diff: merged }
+}
+
+/**
+ * Run a partitioned diff: discover partition values, diff each partition independently,
+ * then aggregate results.
+ */
+async function runPartitionedDiff(params: DataDiffParams): Promise<DataDiffResult> {
+  const resolveDialect = (warehouse: string | undefined): string => {
+    if (warehouse) {
+      const cfg = Registry.getConfig(warehouse)
+      return cfg?.type ?? "generic"
+    }
+    const warehouses = Registry.list().warehouses
+    return warehouses[0]?.type ?? "generic"
+  }
+
+  const sourceDialect = resolveDialect(params.source_warehouse)
+  const { table1Name } = resolveTableSources(params.source, params.target)
+
+  // Discover partition values from source
+  const discoverySql = buildPartitionDiscoverySQL(
+    table1Name,
+    params.partition_column!,
+    params.partition_granularity,
+    params.partition_bucket_size,
+    sourceDialect,
+    params.where_clause,
+  )
+
+  let partitionValues: string[]
+  try {
+    const rows = await executeQuery(discoverySql, params.source_warehouse)
+    partitionValues = rows.map((r) => String(r[0] ?? "")).filter(Boolean)
+  } catch (e) {
+    return { success: false, error: `Partition discovery failed: ${e}`, steps: 0 }
+  }
+
+  if (partitionValues.length === 0) {
+    return { success: true, steps: 1, outcome: { Match: { row_count: 0, algorithm: "partitioned" } }, partition_results: [] }
+  }
+
+  // Diff each partition
+  const partitionResults: PartitionDiffResult[] = []
+  let aggregatedOutcome: unknown = null
+  let totalSteps = 1
+
+  for (const pVal of partitionValues) {
+    const partWhere = buildPartitionWhereClause(
+      params.partition_column!,
+      pVal,
+      params.partition_granularity,
+      params.partition_bucket_size,
+      sourceDialect,
+    )
+    const fullWhere = params.where_clause ? `(${params.where_clause}) AND (${partWhere})` : partWhere
+
+    const result = await runDataDiff({
+      ...params,
+      where_clause: fullWhere,
+      partition_column: undefined, // prevent recursion
+    })
+
+    totalSteps += result.steps
+
+    if (!result.success) {
+      partitionResults.push({ partition: pVal, rows_source: 0, rows_target: 0, differences: 0, status: "error", error: result.error })
+      continue
+    }
+
+    const stats = extractStats(result.outcome)
+    partitionResults.push({ partition: pVal, ...stats })
+    aggregatedOutcome = aggregatedOutcome == null ? result.outcome : mergeOutcomes(aggregatedOutcome, result.outcome)
+  }
+
+  return {
+    success: true,
+    steps: totalSteps,
+    outcome: aggregatedOutcome ?? { Match: { row_count: 0, algorithm: "partitioned" } },
+    partition_results: partitionResults,
+  }
+}
+
 export async function runDataDiff(params: DataDiffParams): Promise<DataDiffResult> {
+  // Dispatch to partitioned diff if partition_column is set
+  if (params.partition_column) {
+    return runPartitionedDiff(params)
+  }
+
   // Dynamically import NAPI module (not available in test environments without the binary)
   let DataParitySession: new (specJson: string) => {
     start(): string
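The aggregation idea behind the mergeOutcomes function in this file can be shown on a simplified stats shape. This is a reduced sketch, not the commit's code: the field names mirror the Diff payload above, and the collapse-to-Match rule is the same (zero total differences after merging means the tables match overall).

```typescript
// Simplified restatement of the merge-and-collapse logic (illustrative only).
interface DiffStats {
  rows_only_in_source: number
  rows_only_in_target: number
  rows_updated: number
  rows_identical: number
}

function mergeStats(a: DiffStats, b: DiffStats): DiffStats {
  // Per-partition counters are independent, so aggregation is field-wise addition.
  return {
    rows_only_in_source: a.rows_only_in_source + b.rows_only_in_source,
    rows_only_in_target: a.rows_only_in_target + b.rows_only_in_target,
    rows_updated: a.rows_updated + b.rows_updated,
    rows_identical: a.rows_identical + b.rows_identical,
  }
}

function isMatch(s: DiffStats): boolean {
  // Collapse to a Match outcome only when no partition contributed a difference.
  return s.rows_only_in_source + s.rows_only_in_target + s.rows_updated === 0
}

const merged = mergeStats(
  { rows_only_in_source: 0, rows_only_in_target: 0, rows_updated: 2, rows_identical: 98 },
  { rows_only_in_source: 1, rows_only_in_target: 0, rows_updated: 0, rows_identical: 50 },
)
console.log(merged.rows_updated, isMatch(merged)) // 2 false
```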

packages/opencode/src/altimate/native/types.ts

Lines changed: 35 additions & 0 deletions

@@ -985,13 +985,48 @@ export interface DataDiffParams {
   numeric_tolerance?: number
   /** Timestamp tolerance in milliseconds */
   timestamp_tolerance_ms?: number
+  /**
+   * Column to partition on before diffing. The table is split into groups by
+   * this column and each group is diffed independently. Results are aggregated.
+   * Use for large tables where bisection alone is too slow or imprecise.
+   *
+   * Examples: "l_shipdate" (date column), "l_orderkey" (numeric column)
+   */
+  partition_column?: string
+  /**
+   * Granularity for date partition columns: "day" | "week" | "month" | "year".
+   * For numeric columns, ignored — use partition_bucket_size instead.
+   * Defaults to "month".
+   */
+  partition_granularity?: "day" | "week" | "month" | "year"
+  /**
+   * For numeric partition columns: size of each bucket.
+   * E.g. 100000 splits l_orderkey into [0, 100000), [100000, 200000), …
+   */
+  partition_bucket_size?: number
+}
+
+export interface PartitionDiffResult {
+  /** The partition value (date string or numeric bucket start) */
+  partition: string
+  /** Source row count in this partition */
+  rows_source: number
+  /** Target row count in this partition */
+  rows_target: number
+  /** Total differences found (exclusive + updated) */
+  differences: number
+  /** "identical" | "differ" | "error" */
+  status: "identical" | "differ" | "error"
+  error?: string
 }

 export interface DataDiffResult {
   success: boolean
   steps: number
   outcome?: unknown
   error?: string
+  /** Per-partition breakdown when partition_column is used */
+  partition_results?: PartitionDiffResult[]
 }

 // --- Method registry ---
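A hypothetical consumer of the new partition_results field might summarize which partitions need attention. The interface below restates the PartitionDiffResult shape from this commit; the reporting helper itself is an assumption for illustration:

```typescript
// PartitionDiffResult as declared in the commit's types.ts.
interface PartitionDiffResult {
  partition: string
  rows_source: number
  rows_target: number
  differences: number
  status: "identical" | "differ" | "error"
  error?: string
}

// Hypothetical report helper: list only partitions that differ or errored.
function problemPartitions(results: PartitionDiffResult[]): string[] {
  return results
    .filter((r) => r.status !== "identical")
    .map((r) => `${r.partition}: ${r.status} (${r.differences} differences)`)
}

const report = problemPartitions([
  { partition: "1995-01-01", rows_source: 10, rows_target: 10, differences: 0, status: "identical" },
  { partition: "1995-02-01", rows_source: 12, rows_target: 11, differences: 3, status: "differ" },
])
console.log(report) // ["1995-02-01: differ (3 differences)"]
```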
