Skip to content

EP-DATACOMPARE: row-level data comparison and sync between two databases #132

Description

@ilijevicdenis

Summary

PgProj has rich schema comparison (SchemaCompare, SchemaComparer, IdentityDiffEngine) but no data comparison. SSDT/VS Data Compare compares row-level data between a source and target database, categorizes rows, and synchronizes by updating the target or emitting an editable DML script. Closing this gap brings PgProj to SSDT parity for the data-diff workflow and complements the existing schema-deploy story (reference data / seed tables can drift independently of schema).

SSDT / SqlPackage reference

VS Data Compare provides:

  • A New Data Comparison wizard to pick source/target databases and the tables/views to compare (eligible objects must have a primary key, unique constraint, or unique index).
  • Comparison-key selection — choose which key columns identify a logical row and which columns participate in the value comparison.
  • Difference review/filtering — rows are categorized Different / Only-in-Source / Only-in-Target / Identical, with drill-in to per-column diffs and per-row include/exclude toggles.
  • Synchronization by Update Target (transaction-scoped, applied directly) or by generating an editable DML script.
  • Compare against a reference database as the source of truth.

Proposed PgProj approach

Add a data-compare capability that mirrors the existing schema-compare ergonomics and reuses deploy-planner infrastructure:

  • Eligibility + key mapping — reuse the parsed semantic model to map SSDT comparison keys onto PG PRIMARY KEY / UNIQUE constraint / unique index; skip keyless tables/views (report them). Allow explicit key/column overrides per table.
  • Row diff engine — stream rows via the existing parallel ADO.NET reader, bucket into Different / Only-in-Source / Only-in-Target / Identical, and surface per-column diffs. Honor identity/generated columns (excluded from value comparison by default).
  • DML synthesis — emit a deterministic INSERT/UPDATE/DELETE script (or INSERT ... ON CONFLICT (key) DO UPDATE upsert) ordered by FK topo-sort (reuse the PhasedDeployer / DeployScriptGenerator ordering already used by the schema planner). Run destructiveness/risk classification through RiskAnalyzer.
  • Apply — transaction-wrap by default (rollback on failure), reusing DatabaseDeployer for execution; validate against the shadow-DB harness in tests.
  • CLIpgproj data-compare --source <conn> --target <conn> [--tables ...] [--output diff.json|script.sql] to match schema-compare; support compare-against-reference-database and profile-driven defaults (.pgproj profiles / .pgpkg).

Acceptance criteria

  • pgproj data-compare --source --target enumerates eligible tables/views (PK / UNIQUE constraint / unique index) and reports skipped keyless objects.
  • Diff output categorizes every compared row as Different / Only-in-Source / Only-in-Target / Identical with per-column detail, writable to --output diff.json.
  • --output script.sql emits a deterministic, FK-topo-sorted INSERT/UPDATE/DELETE (or ON CONFLICT upsert) DML script that, applied to the target, makes its data match the source.
  • Apply path (Update Target) is transaction-wrapped by default and rolls back on any failure; identity/generated columns are honored.
  • Comparison key and value-column selection can be overridden per table via CLI/profile.
  • Tests cover all four diff categories and a round-trip (compare → apply → re-compare yields Identical) on the shadow-DB harness.

References

  • docs/ssdt/how-to-compare-and-synchronize-the-data-of-two-databases.md (MicrosoftDocs/sql-docs)
  • Priority: medium

Metadata

Metadata

Assignees

No one assigned

    Labels

    featureNew capabilityssdt-paritySSDT-for-PostgreSQL feature-parity effort

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions