Commit 289dab9

Allow deduplication of historical snapshots (#13)
* add dedup option for historical snapshots
* fix warning for keys_only dedup on snapshots
* exclude _metadata column from dedup logic for snapshots
* edit docstring for DeduplicateMode for cdc from snapshots
1 parent 5cfae0a commit 289dab9

7 files changed

Lines changed: 557 additions & 93 deletions

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-v0.5.0
+v0.6.0

docs/source/dataflow_spec_ref_cdc.rst

Lines changed: 6 additions & 0 deletions
@@ -87,6 +87,12 @@ The ``cdcSnapshotSettings`` object contains the following properties:
 * - **track_history_except_column_list**
   - ``list``
   - (*optional*) A subset of output columns to be excluded from history tracking in the target table. Use this to specify which columns should not be tracked. This cannot be used in conjunction with ``track_history_column_list``.
+* - **deduplicateMode**
+  - ``string``
+  - (*optional*) How to deduplicate source snapshot data before CDC. Default: ``off`` (no deduplication). Use ``full_row`` to deduplicate based on the full row (deterministic) excluding the ``_metadata`` column if present. Use ``keys_only`` to keep the first row per key(s): This is non-deterministic; as it preserves the first row per key(s) without ordering on any other columns.
+
+.. warning::
+   The ``keys_only`` option is **non-deterministic**. It preserves the first row per key(s). Use it with caution and only when you accept that which duplicate row is kept may vary between runs.

 .. _cdc-apply-changes-from-snapshot-source:

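
The difference between the two modes documented above can be illustrated with a minimal, framework-independent sketch. This is not the implementation in this commit: the helper names `dedup_full_row` and `dedup_keys_only`, the sample rows, and the key column `id` are all hypothetical, and plain Python lists stand in for snapshot DataFrames. It assumes that "first row" means the first row encountered, so changing the arrival order changes which duplicate survives under `keys_only`.

```python
from typing import Dict, List, Sequence

Row = Dict[str, object]


def dedup_full_row(rows: List[Row], ignore: Sequence[str] = ("_metadata",)) -> List[Row]:
    """Keep one row per distinct set of column values, ignoring `ignore` columns."""
    seen = set()
    kept = []
    for row in rows:
        # Fingerprint every column except the ignored ones (hypothetical approach).
        fingerprint = tuple(
            sorted((col, repr(val)) for col, val in row.items() if col not in ignore)
        )
        if fingerprint not in seen:
            seen.add(fingerprint)
            kept.append(row)
    return kept


def dedup_keys_only(rows: List[Row], keys: Sequence[str]) -> List[Row]:
    """Keep the first row encountered for each key; other columns are not compared."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[k] for k in keys)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Hypothetical snapshot: id=1 appears three times, twice with identical data.
rows = [
    {"id": 1, "name": "Ann", "_metadata": "file_a"},
    {"id": 1, "name": "Ann", "_metadata": "file_b"},
    {"id": 1, "name": "Anne", "_metadata": "file_c"},
    {"id": 2, "name": "Bob", "_metadata": "file_a"},
]

full = dedup_full_row(rows)
forward = dedup_keys_only(rows, ["id"])
backward = dedup_keys_only(list(reversed(rows)), ["id"])

print([r["name"] for r in full])
print([r["name"] for r in forward])
print([r["name"] for r in backward])
```

In this sketch `full_row` collapses only the two rows that differ solely in `_metadata`, while `keys_only` keeps a single row per `id` and returns a different `name` for `id` 1 depending on input order, which is the behaviour the warning describes.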

samples/bronze_sample/src/dataflows/feature_samples/dataflowspec/historical_snapshot_files_datetime_recursive_and_partitioned_parquet_main.json

Lines changed: 2 additions & 1 deletion
@@ -23,7 +23,8 @@
 "path": "{sample_file_location}/snapshot_customer_partitioned_parquet/{version}/customer.parquet",
 "versionType": "timestamp",
 "datetimeFormat": "YEAR=%Y/MONTH=%m/DAY=%d",
-"recursiveFileLookup": true
+"recursiveFileLookup": true,
+"deduplicateMode": "keys_only"
 },
 "track_history_except_column_list":[
 "LOAD_TIMESTAMP"
