Skip to content

[FLINK-40038][pipeline-connector/mysql] Cache inferred row DataType in MySQL deserialization#4458

Open
monologuist wants to merge 1 commit into
apache:masterfrom
monologuist:master
Open

[FLINK-40038][pipeline-connector/mysql] Cache inferred row DataType in MySQL deserialization#4458
monologuist wants to merge 1 commit into
apache:masterfrom
monologuist:master

Conversation

@monologuist

Copy link
Copy Markdown

What is changed

This PR reduces incremental deserialization overhead in the MySQL pipeline connector by caching the inferred row DataType per table in MySqlEventDeserializer.

Previously, the deserializer inferred the row DataType from the Debezium struct for every incoming record. Under hotspot UPDATE workloads, this repeated inference becomes a visible CPU hotspot in the incremental synchronization path.

This PR makes the following changes:

  • cache inferred row DataType by TableId in MySqlEventDeserializer
  • invalidate the cache when schema change events are observed
  • normalize table identifiers consistently in case-insensitive mode so cache writes and invalidation use the same key
  • add a narrow protected helper in DebeziumEventDeserializationSchema so subclasses can reuse the existing converter logic with a precomputed DataType

Why

In a MySQL-to-Doris pipeline with large-table hotspot UPDATE traffic, we observed low incremental synchronization throughput and lag accumulation between upstream and downstream.

Based on flame graph analysis, a noticeable portion of CPU time was spent in the MySQL pipeline deserialization path, especially on repeated schema/data type inference for the same table.

FLINK-35715 addressed the correctness issue caused by timestamp precision inference and BinaryRecordData layout mismatch. However, repeated schema inference is still present on the current master branch and remains a measurable CPU hotspot in MySQL pipeline deserialization.

This PR focuses on that remaining performance overhead.

The optimization relies on a natural assumption: for the same table, the row DataType remains stable until a schema change event arrives. Based on this assumption, the inferred row DataType can be safely reused across records and invalidated when schema changes are observed.

Tests

The following tests were added or updated:

  • verify the cache lifecycle across repeated records and schema change invalidation
  • verify cache isolation across different tables
  • verify case-insensitive table IDs are normalized consistently for cache reuse and invalidation

In our internal verification, this optimization improved throughput by about 18% in a test environment and about 19% in a production-like workload.

Notes

This PR focuses on reducing redundant row type inference in the MySQL pipeline deserialization path. It does not change changelog semantics or introduce new user-facing configuration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant