|
| 1 | +# Data Parity (Table Diff) |
| 2 | + |
| 3 | +Validate that two tables — or two query results — are identical across databases, or diagnose exactly how they differ. Use for **migration validation**, **ETL regression**, and **query refactor verification**. |
| 4 | + |
| 5 | +altimate-code ships a dedicated `data_diff` tool and a `data-parity` skill that orchestrates the full workflow: plan, inspect schema, confirm keys, profile, then diff. |
| 6 | + |
| 7 | +## Supported warehouse pairs |
| 8 | + |
| 9 | +Works across any combination of: |
| 10 | + |
| 11 | +- PostgreSQL |
| 12 | +- Snowflake |
| 13 | +- BigQuery |
| 14 | +- Databricks (SQL Warehouses) |
| 15 | +- ClickHouse |
| 16 | +- MySQL |
| 17 | +- Redshift |
| 18 | +- SQL Server |
| 19 | +- Microsoft Fabric |
| 20 | +- DuckDB |
| 21 | +- SQLite |
| 22 | +- Oracle |
| 23 | + |
| 24 | +Same-dialect comparisons use a fast FULL OUTER JOIN. Cross-database comparisons use a bisection hashing algorithm that streams checksums rather than raw rows — so you can diff a 100M-row Postgres table against its Snowflake replica without pulling the data out. |
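To make the bisection idea concrete, here is a minimal sketch in Python. It is not altimate-code's implementation: the in-memory dicts stand in for the two tables, and `checksum`, `bisect_diff`, and `threshold` are hypothetical names chosen for illustration. In a real run each checksum would be a single aggregate query executed inside the warehouse, so only hashes cross the network until a segment is small enough to inspect row by row.

```python
import hashlib


def checksum(table, lo, hi):
    """Hash every (key, row) pair with lo <= key < hi.

    Stands in for a SQL aggregate computed inside the database.
    """
    h = hashlib.md5()
    for key in sorted(k for k in table if lo <= k < hi):
        h.update(repr((key, table[key])).encode())
    return h.hexdigest()


def bisect_diff(source, target, lo, hi, threshold=4):
    """Return the sorted keys in [lo, hi) whose rows differ or are missing."""
    if checksum(source, lo, hi) == checksum(target, lo, hi):
        return []  # segment matches: no rows need to be fetched
    if hi - lo <= threshold:
        # Segment is small: compare the actual rows.
        keys = {k for k in source.keys() | target.keys() if lo <= k < hi}
        return sorted(k for k in keys if source.get(k) != target.get(k))
    mid = (lo + hi) // 2
    return (bisect_diff(source, target, lo, mid, threshold)
            + bisect_diff(source, target, mid, hi, threshold))


source = {i: ("order", i * 10) for i in range(100)}
target = dict(source)
target[42] = ("order", -1)  # changed row
del target[77]              # missing row

print(bisect_diff(source, target, 0, 100))  # [42, 77]
```

Matching segments are discarded after one checksum comparison, which is why the cost grows with the number of differences rather than with table size.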
| 25 | + |
| 26 | +## Quick start |
| 27 | + |
| 28 | +```bash |
| 29 | +altimate |
| 30 | +``` |
| 31 | + |
| 32 | +In the TUI, just describe what you want to compare: |
| 33 | + |
| 34 | +```text |
| 35 | +Compare orders in postgres_prod with orders in snowflake_dw using id as the primary key. |
| 36 | +``` |
| 37 | + |
| 38 | +The agent will: |
| 39 | + |
| 40 | +1. List your warehouse connections. |
| 41 | +2. Inspect both schemas, propose primary keys, and flag audit/timestamp columns to exclude. |
| 42 | +3. Confirm your choices. |
| 43 | +4. Run a column profile first (cheap: aggregate statistics only, no row values returned). |
| 44 | +5. Run the row-level diff only on columns that diverged. |
| 45 | + |
| 46 | +## Algorithms |
| 47 | + |
| 48 | +| Algorithm | When to use | Cost | |
| 49 | +|-----------|-------------|------| |
| 50 | +| `auto` | Default. Picks JoinDiff for same-dialect, HashDiff for cross-database. | Cheapest valid choice | |
| 51 | +| `joindiff` | Same-database comparison. Fast. | One FULL OUTER JOIN | |
| 52 | +| `hashdiff` | Cross-database. Works at any scale. | Bisection over checksums | |
| 53 | +| `profile` | Compliance-safe. Column stats only — no row values leave the database. | Cheapest | |
| 54 | +| `cascade` | Profile first, then HashDiff on columns that diverged. Balanced choice for exploratory diffs. | Column stats + targeted row diff |
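The profile step that `cascade` starts with can be pictured as comparing per-column aggregates from each side. The sketch below is an illustration under assumptions, not the tool's code: `profile` and `diverged_columns` are hypothetical helpers, rows are plain dicts, and the chosen statistics (count, nulls, distinct, min, max) are a guess at what a column profile might contain.

```python
def profile(rows, columns):
    """Compute simple per-column statistics (assumed set of stats)."""
    stats = {}
    for col in columns:
        values = [row[col] for row in rows]
        non_null = [v for v in values if v is not None]
        stats[col] = {
            "count": len(values),
            "nulls": len(values) - len(non_null),
            "distinct": len(set(non_null)),
            "min": min(non_null, default=None),
            "max": max(non_null, default=None),
        }
    return stats


def diverged_columns(source_rows, target_rows, columns):
    """Return the columns whose statistics differ between the two sides."""
    src = profile(source_rows, columns)
    tgt = profile(target_rows, columns)
    return [col for col in columns if src[col] != tgt[col]]


source_rows = [
    {"id": 1, "amount": 10, "status": "ok"},
    {"id": 2, "amount": 20, "status": "ok"},
]
target_rows = [
    {"id": 1, "amount": 10, "status": "ok"},
    {"id": 2, "amount": 25, "status": "ok"},
]

print(diverged_columns(source_rows, target_rows, ["id", "amount", "status"]))
# ['amount']
```

Statistics like these can miss differences that leave the aggregates unchanged (for example, two values swapped between rows), so a clean profile is a strong hint rather than proof of parity.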
| 55 | + |
| 56 | +## Partitioning large tables |
| 57 | + |
| 58 | +For tables beyond ~10M rows, partition the diff into independent batches: |
| 59 | + |
| 60 | +```text |
| 61 | +Compare orders between postgres and snowflake, partitioned by order_date month. |
| 62 | +``` |
| 63 | + |
| 64 | +Three partition modes: |
| 65 | + |
| 66 | +| Mode | How to trigger | Example | |
| 67 | +|------|----------------|---------| |
| 68 | +| **Date** | Set `partition_column` + `partition_granularity` | `l_shipdate` + `month` | |
| 69 | +| **Numeric** | Set `partition_column` + `partition_bucket_size` | `l_orderkey` + `100000` | |
| 70 | +| **Categorical** | Set `partition_column` alone (no granularity/bucket) | `region`, `status`, `country` | |
| 71 | + |
| 72 | +Each partition is diffed independently. Results are aggregated with a per-partition breakdown so you can see *which* groups have differences. |
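As a rough illustration of how the three modes could map a row to a partition, consider the Python sketch below. The helper names `partition_key` and `group_by_partition` are hypothetical, and the `day` and `year` granularities are assumptions (this page only shows `month`); the real tool presumably pushes the grouping down into SQL rather than grouping rows in memory.

```python
from datetime import date

# Only "month" appears in the docs above; "day" and "year" are assumed.
DATE_FORMATS = {"day": "%Y-%m-%d", "month": "%Y-%m", "year": "%Y"}


def partition_key(value, granularity=None, bucket_size=None):
    """Map a column value to the partition it belongs to."""
    if granularity is not None:   # date mode
        return value.strftime(DATE_FORMATS[granularity])
    if bucket_size is not None:   # numeric mode: lower bound of the bucket
        return (value // bucket_size) * bucket_size
    return value                  # categorical mode: the value itself


def group_by_partition(rows, column, **mode):
    """Group rows by partition so each group can be diffed independently."""
    groups = {}
    for row in rows:
        groups.setdefault(partition_key(row[column], **mode), []).append(row)
    return groups


print(partition_key(date(2024, 3, 15), granularity="month"))  # 2024-03
print(partition_key(250123, bucket_size=100000))              # 200000
print(partition_key("EMEA"))                                  # EMEA
```

Because each partition key identifies a disjoint set of rows, a mismatch in one group does not force the others to be rescanned.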
| 73 | + |
| 74 | +## SQL Server and Microsoft Fabric |
| 75 | + |
| 76 | +Both `sqlserver` and `fabric` are supported. For Azure AD / Entra ID authentication, altimate-code recognizes all of the major flows through `tedious`: |
| 77 | + |
| 78 | +| `authentication` | Config fields | Use case | |
| 79 | +|------------------|---------------|----------| |
| 80 | +| `azure-active-directory-password` | `azure_client_id`, `azure_tenant_id`, `user`, `password` | User credentials | |
| 81 | +| `azure-active-directory-access-token` (or `access-token`) | `access_token` | Pre-fetched token | |
| 82 | +| `service-principal-secret` (`service-principal`) | `azure_tenant_id`, `azure_client_id`, `azure_client_secret` | Service principals | |
| 83 | +| `azure-active-directory-msi-vm` (`msi`) | `azure_client_id` (optional) | Azure VM managed identity | |
| 84 | +| `azure-active-directory-msi-app-service` | `azure_client_id` (optional) | App Service managed identity | |
| 85 | +| `azure-active-directory-default` (`default` / `CLI`) | — | DefaultAzureCredential chain (CLI, env, MSI) | |
| 86 | + |
| 87 | +All Azure AD connections force TLS encryption. |
| 88 | + |
| 89 | +## Compliance and sensitive data |
| 90 | + |
| 91 | +!!! warning "PII / PHI / PCI data" |
| 92 | + `data_diff` prints up to 5 sample diff rows in tool output. Those rows become part of the conversation and are sent to your LLM provider. |
| 93 | + |
| 94 | + When comparing tables that might contain regulated data: |
| 95 | + |
| 96 | + - Start with `algorithm: "profile"` — column-level statistics only, no row values leave the database. |
| 97 | + - If a row-level diff is genuinely required, scope it with a `where_clause` that excludes sensitive customers / accounts. |
| 98 | + - The `data-parity` skill asks for confirmation before sending sample rows to the LLM when the table name matches common regulated patterns (`customers`, `patients`, `orders`, `payments`, `accounts`, `users`). |
| 99 | + |
| 100 | +## Column auto-discovery and audit exclusion |
| 101 | + |
| 102 | +When you omit `extra_columns` and the source is a plain table name, altimate-code: |
| 103 | + |
| 104 | +1. Queries `information_schema` (or the dialect-specific equivalent) on both sides. |
| 105 | +2. Excludes audit/timestamp columns by name pattern (`updated_at`, `created_at`, `_fivetran_synced`, `_airbyte_emitted_at`, etc.). |
| 106 | +3. Queries column defaults and excludes anything with an auto-generating timestamp default (`NOW()`, `CURRENT_TIMESTAMP`, `GETDATE()`, `SYSDATE`, `SYSTIMESTAMP`). |
| 107 | +4. Reports excluded columns so you can override if the timestamps are part of what you're validating. |
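The exclusion rules in steps 2 and 3 can be sketched as a small filter. This is an illustrative Python approximation, not the tool's actual logic: `split_columns` is a hypothetical helper, column metadata is assumed to arrive as `(name, default)` pairs, and the name set only contains the four patterns listed above (the real list is longer, per the "etc.").

```python
import re

# Name patterns taken from the list above; the real set is likely larger.
AUDIT_NAMES = {"updated_at", "created_at", "_fivetran_synced", "_airbyte_emitted_at"}

# Auto-generating timestamp defaults named in step 3.
AUTO_TIMESTAMP = re.compile(
    r"\b(NOW|CURRENT_TIMESTAMP|GETDATE|SYSDATE|SYSTIMESTAMP)\b", re.IGNORECASE
)


def split_columns(columns):
    """Split (name, default) pairs into compared names and excluded (name, reason)."""
    compared, excluded = [], []
    for name, default in columns:
        if name.lower() in AUDIT_NAMES:
            excluded.append((name, "audit column name"))
        elif default is not None and AUTO_TIMESTAMP.search(default):
            excluded.append((name, "auto-generated timestamp default"))
        else:
            compared.append(name)
    return compared, excluded


columns = [
    ("id", None),
    ("amount", "0"),
    ("updated_at", None),
    ("loaded", "CURRENT_TIMESTAMP"),
    ("_fivetran_synced", None),
]
compared, excluded = split_columns(columns)
print(compared)  # ['id', 'amount']
print(excluded)
```

Returning the reason alongside each excluded column mirrors step 4: the exclusions are reported so they can be overridden when timestamps are part of the validation.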
| 108 | + |
| 109 | +When the source is a SQL query, only the key columns are compared unless you explicitly list `extra_columns`. Always provide `extra_columns` for query-mode comparisons. |
| 110 | + |
| 111 | +## The `data_diff` tool |
| 112 | + |
| 113 | +Direct tool invocation (if you prefer not to use the skill): |
| 114 | + |
| 115 | +``` |
| 116 | +data_diff( |
| 117 | + source = "orders", |
| 118 | + target = "orders", |
| 119 | + source_warehouse = "postgres_prod", |
| 120 | + target_warehouse = "snowflake_dw", |
| 121 | + key_columns = ["id"], |
| 122 | + algorithm = "auto", |
| 123 | +) |
| 124 | +``` |
| 125 | + |
| 126 | +See the [tool reference](../tools/warehouse-tools.md) for the full parameter list. |