[FLINK-40038][pipeline-connector/mysql] Cache inferred row DataType in MySQL deserialization#4458
Open
monologuist wants to merge 1 commit into
Open
[FLINK-40038][pipeline-connector/mysql] Cache inferred row DataType in MySQL deserialization#4458monologuist wants to merge 1 commit into
monologuist wants to merge 1 commit into
Conversation
…n MySQL deserialization
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What is changed
This PR reduces incremental deserialization overhead in the MySQL pipeline connector by caching the inferred row
DataTypeper table inMySqlEventDeserializer.Previously, the deserializer inferred the row
DataTypefrom the Debezium struct for every incoming record. Under hotspot UPDATE workloads, this repeated inference becomes a visible CPU hotspot in the incremental synchronization path.This PR makes the following changes:
DataTypebyTableIdinMySqlEventDeserializerDebeziumEventDeserializationSchemaso subclasses can reuse the existing converter logic with a precomputedDataTypeWhy
In a MySQL-to-Doris pipeline with large-table hotspot UPDATE traffic, we observed low incremental synchronization throughput and lag accumulation between upstream and downstream.
Based on flame graph analysis, a noticeable portion of CPU time was spent in the MySQL pipeline deserialization path, especially on repeated schema/data type inference for the same table.
FLINK-35715 addressed the correctness issue caused by timestamp precision inference and
BinaryRecordDatalayout mismatch. However, repeated schema inference is still present on the current master branch and remains a measurable CPU hotspot in MySQL pipeline deserialization.This PR focuses on that remaining performance overhead.
The optimization relies on a natural assumption: for the same table, the row
DataTyperemains stable until a schema change event arrives. Based on this assumption, the inferred rowDataTypecan be safely reused across records and invalidated when schema changes are observed.Tests
The following tests were added or updated:
In our internal verification, this optimization improved throughput by about 18% in a test environment and about 19% in a production-like workload.
Notes
This PR focuses on reducing redundant row type inference in the MySQL pipeline deserialization path. It does not change changelog semantics or introduce new user-facing configuration.