Commit cd26a84
[SPARK-56870][SDP] Implement SCD1 Batch Processor; Extend Microbatch with CDC Metadata
Approved AutoCDC SPIP: https://lists.apache.org/thread/j6sj9wo9odgdpgzlxtvhoy7szs0jplf7
--------
**Preamble:**
The SCD type 1 (SCD1) flow is a `foreachBatch` streaming query over an input change data feed. It is responsible for reconciling the incoming change data onto a target table that follows SCD1 replication semantics.
SCD1 flows also maintain an "auxiliary" table that tracks state for early-arriving, out-of-order events. Each microbatch must reconcile against this auxiliary table as well, and update its state appropriately for future microbatches.
**Extend Microbatch with CDC Metadata:**
After deduplication, every incoming row can be classified as either a delete event or an upsert event (mutually exclusive), and there is at most one row per key.
If we identify a row as a delete event, we record its sequencing as its `deleteSequence`. If we identify a row as an upsert event, we record its sequencing as its `upsertSequence`. That is, `deleteSequence`/`upsertSequence` encode both the sequencing of the row and its classification (delete or upsert).
We need to persist this encoded information now, because in future stages we may drop the columns that `deleteCondition` needed to do the classification in the first place, depending on which columns were selected by `ChangeArgs.columnSelection`.
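The classification described above can be modeled in a few lines of plain Python. This is only an illustrative sketch, not the Spark implementation in this commit: the helper names (`attach_cdc_metadata`, `is_delete_event`), the sample columns (`op`, `seq`), and the choice to leave the unused sequence field as `None` are all assumptions made for the example.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def attach_cdc_metadata(
    rows: List[Row],
    delete_condition: Callable[[Row], bool],
    sequence_col: str,
) -> List[Row]:
    """Tag each deduplicated row with a `_cdc_metadata` struct.

    Assumption for this sketch: exactly one of deleteSequence /
    upsertSequence is set per row, and the other is None.
    """
    tagged = []
    for row in rows:
        seq = row[sequence_col]
        is_delete = bool(delete_condition(row))
        metadata = {
            "deleteSequence": seq if is_delete else None,
            "upsertSequence": None if is_delete else seq,
        }
        tagged.append({**row, "_cdc_metadata": metadata})
    return tagged


def is_delete_event(row: Row) -> bool:
    """Recover the classification from the metadata alone."""
    return row["_cdc_metadata"]["deleteSequence"] is not None


# Hypothetical deduplicated microbatch: at most one row per key ("id").
batch = [
    {"id": 1, "name": "a", "op": "UPSERT", "seq": 10},
    {"id": 2, "name": None, "op": "DELETE", "seq": 11},
]
tagged = attach_cdc_metadata(batch, lambda r: r["op"] == "DELETE", "seq")

# A later stage drops "op", the column the delete condition depended on.
projected = [{k: v for k, v in r.items() if k != "op"} for r in tagged]
```

Even after `op` is dropped, `is_delete_event` can still classify every row in `projected`, which is the reason for persisting the metadata before column selection runs.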
**Where is the CDC Metadata stored?**
Within the microbatch, we append a `_cdc_metadata` struct column that stores the `deleteSequence` and `upsertSequence`.
This `_cdc_metadata` column will eventually also land in the persisted target and auxiliary tables, which are the artifacts of an AutoCDC flow. This column represents operational metadata that the AutoCDC flow has tagged a row with, and is necessary for out-of-order correctness of the SCD decomposition.
Users cannot use `ChangeArgs.columnSelection` to opt out of persisting this column in the target table, because it is necessary for correctness. The column will not have a stable public contract, and users should make no assumptions about its contents.
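As a rough illustration of the "no opt-out" rule, the sketch below applies a user column selection to a row while always retaining `_cdc_metadata`. It is a plain-Python model under stated assumptions: `apply_column_selection` is a hypothetical helper, and the real `ChangeArgs.columnSelection` handling in Spark may look quite different.

```python
from typing import Any, Dict, List

CDC_METADATA_COL = "_cdc_metadata"


def apply_column_selection(
    row: Dict[str, Any], selected_columns: List[str]
) -> Dict[str, Any]:
    """Keep only the user-selected columns, plus the CDC metadata.

    Hypothetical helper: the metadata column is retained even when the
    selection does not name it.
    """
    kept = {name: row[name] for name in selected_columns if name in row}
    kept[CDC_METADATA_COL] = row[CDC_METADATA_COL]
    return kept


row = {
    "id": 1,
    "name": "a",
    "op": "UPSERT",
    "_cdc_metadata": {"deleteSequence": None, "upsertSequence": 10},
}
selected = apply_column_selection(row, ["id", "name"])
```

Here `selected` contains `id`, `name`, and `_cdc_metadata`, even though the selection listed only the first two.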
Closes #55970 from AnishMahto/SPARK-56870-extend-microbatch-with-cdc-metadata.
Authored-by: AnishMahto <anish.mahto99@gmail.com>
Signed-off-by: DB Tsai <dbtsai@dbtsai.com>
(cherry picked from commit 12807c5)
Signed-off-by: DB Tsai <dbtsai@dbtsai.com>
1 parent f6312fb commit cd26a84
3 files changed
Lines changed: 419 additions & 17 deletions
File tree
- common/utils/src/main/resources/error
- sql/pipelines/src
- main/scala/org/apache/spark/sql/pipelines/autocdc
- test/scala/org/apache/spark/sql/pipelines/autocdc
Lines changed: 6 additions & 0 deletions
Lines changed: 115 additions & 3 deletions