[SPARK-56956][SDP] Introduce AutoCDC Flow Dataclasses #56042

Open
AnishMahto wants to merge 4 commits into apache:master from AnishMahto:SPARK-56956-introduce-flow-data-classes
@AnishMahto AnishMahto commented May 21, 2026

Approved AutoCDC SPIP: https://lists.apache.org/thread/j6sj9wo9odgdpgzlxtvhoy7szs0jplf7


What changes were proposed in this pull request?

Introduce dataclasses for the unresolved AutoCDC flow (AutoCdcFlow) and the resolved AutoCDC flow (AutoCdcMergeFlow). Add wiring to analyze an AutoCdcFlow into an AutoCdcMergeFlow.

This PR also makes a small refactor to the UnresolvedFlow and ResolvedFlow class hierarchy.

Why are the changes needed?

Support AutoCDC flow registration and analysis. AutoCDC flow execution will be supported in a future PR. Previously, an UnresolvedFlow also always represented an untyped flow: a flow whose execution type (streaming, append-once, etc.) is not yet known.

AutoCdcFlow is a specialized flow that supports only streaming, so its execution type is known at construction. It is still unresolved at registration time, and needs to go through resolution to determine its position in the DAG and its input/output schemas.

Hence we introduce UntypedFlow as an intermediary child of UnresolvedFlow; all previously existing flows are classified as UntypedFlow during registration. An AutoCdcFlow directly implements UnresolvedFlow (skipping UntypedFlow in its inheritance chain) because it is not untyped.
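To make the hierarchy described above concrete, here is a minimal Python sketch of the relationships only. It is not the code in this PR: the field names (name, input_schema, output_schema) and the resolve helper are assumptions made for illustration.

```python
from abc import ABC
from dataclasses import dataclass


class UnresolvedFlow(ABC):
    """A registered flow that has not been analyzed yet."""


class ResolvedFlow(ABC):
    """A flow whose DAG position and input/output schemas are known."""


@dataclass(frozen=True)
class UntypedFlow(UnresolvedFlow):
    """Unresolved flow whose execution type is not yet known."""
    name: str  # hypothetical field


@dataclass(frozen=True)
class AutoCdcFlow(UnresolvedFlow):
    """Streaming-only flow: typed at construction, so it skips UntypedFlow."""
    name: str  # hypothetical field


@dataclass(frozen=True)
class AutoCdcMergeFlow(ResolvedFlow):
    """Resolved counterpart of AutoCdcFlow."""
    name: str             # hypothetical field
    input_schema: tuple   # hypothetical field
    output_schema: tuple  # hypothetical field


def resolve(flow: AutoCdcFlow, input_schema, output_schema) -> AutoCdcMergeFlow:
    # Hypothetical stand-in for analysis: attach the schemas found during resolution.
    return AutoCdcMergeFlow(flow.name, tuple(input_schema), tuple(output_schema))
```

In this sketch an AutoCdcFlow is an UnresolvedFlow but not an UntypedFlow, and resolving it yields an AutoCdcMergeFlow, which is a ResolvedFlow.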

Does this PR introduce any user-facing change?

No, the AutoCDC feature is not released anywhere yet.

How was this patch tested?

ConnectValidPipelineSuite and AutoCdcFlowSuite

Was this patch authored or co-authored using generative AI tooling?

Co-authored.

Generated-by: Claude-Opus-4.7-thinking-xhigh

@AnishMahto AnishMahto force-pushed the SPARK-56956-introduce-flow-data-classes branch from 02f656a to 6df7b64 Compare May 22, 2026 18:19
@AnishMahto (Contributor, Author) commented

@szehon-ho

This is actually a fairly small change btw, 600 LOC is just tests. The only real logic added here is some validation on construction of an AutoCdcFlow.
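For readers unfamiliar with the pattern, "validation on construction" could look roughly like the Python sketch below. The class name, the fields (keys, sequence_by), and the specific checks are all assumptions for illustration; the rules this PR actually enforces are not listed in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AutoCdcFlowSketch:
    """Hypothetical stand-in for AutoCdcFlow; fields and checks are assumed."""
    name: str
    keys: tuple
    sequence_by: str

    def __post_init__(self):
        # Reject invalid definitions at construction, before any analysis runs.
        if not self.keys:
            raise ValueError("at least one key column is required")
        if len(set(self.keys)) != len(self.keys):
            raise ValueError("key columns must be unique")
        if not self.sequence_by:
            raise ValueError("a sequencing column is required")
```

Failing fast in the constructor means a malformed flow never reaches registration or analysis.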
