Problem
mysql_cdc reads each table's initial snapshot through a single goroutine iterating rows with keyset pagination. Even after #4320 adds inter-table parallelism, the snapshot of any individual table is still strictly sequential, so pipelines dominated by a single very large table see essentially no speedup.
Concretely, on a representative message-style table:
- 80M rows, observed throughput: ~30M rows/hour
- Extrapolated to a 400M-row production table: ~12 hours
- AWS DMS completes the same 400M rows in ~45 minutes
The ~16x gap vs DMS is almost entirely intra-table: DMS and Debezium's incremental snapshot both split the primary-key range of a single table across multiple workers.
Root cause
readSnapshotTable does one keyset-paginated SELECT per batch:
SELECT * FROM t ORDER BY pk LIMIT N
SELECT * FROM t WHERE (pk) > (last) ORDER BY pk LIMIT N
...
Only one goroutine ever drives that loop per table. With #4320 a different table can run on a different worker, but inside any one table the reads are strictly serial.
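For illustration, that loop has roughly the following shape (function and helper names here are hypothetical stand-ins, not the actual connector code):

// Rough shape of the current per-table snapshot loop. fetchBatch and emit
// are hypothetical stand-ins for the keyset-paginated SELECT above and
// the downstream hand-off.
func snapshotTable(ctx context.Context, tx *sql.Tx, table string) error {
    var lastKey []any // nil selects the first batch with no WHERE clause
    for {
        rows, last, err := fetchBatch(ctx, tx, table, lastKey) // one SELECT ... LIMIT N
        if err != nil {
            return err
        }
        if len(rows) == 0 {
            return nil // table exhausted
        }
        emit(rows) // hand the batch to the pipeline
        // Each batch's WHERE clause depends on the previous batch's final
        // key, so iterations form a chain and cannot overlap or fan out.
        lastKey = last
    }
}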
For workloads where 90%+ of snapshot time is spent in one table, the effective parallelism is 1 regardless of how snapshot_max_parallel_tables is set.
Desired behaviour
Operators should be able to partition a single table's primary-key space into N chunks and have the existing worker pool consume chunks in parallel. The consistency guarantee (every worker reads under one FLUSH TABLES WITH READ LOCK window at the same binlog position) must be preserved.
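For context on how that guarantee extends to multiple readers: each chunk connection just has to open its consistent-read transaction inside the same lock window. A minimal sketch using database/sql (hypothetical orchestration; cleanup of partially opened connections on error is elided):

import (
    "context"
    "database/sql"
)

// openChunkSessions opens n sessions that all share one point-in-time view.
func openChunkSessions(ctx context.Context, db *sql.DB, n int) ([]*sql.Conn, error) {
    // A control connection takes the global read lock, blocking writes.
    ctl, err := db.Conn(ctx)
    if err != nil {
        return nil, err
    }
    defer ctl.Close()
    if _, err := ctl.ExecContext(ctx, "FLUSH TABLES WITH READ LOCK"); err != nil {
        return nil, err
    }
    // Record the binlog position here while writes are blocked
    // (SHOW MASTER STATUS; SHOW BINARY LOG STATUS on newer MySQL).

    // Every chunk reader starts its transaction while the lock is still
    // held, so all sessions observe the same snapshot of the data.
    conns := make([]*sql.Conn, n)
    for i := range conns {
        c, err := db.Conn(ctx)
        if err != nil {
            return nil, err
        }
        if _, err := c.ExecContext(ctx, "SET TRANSACTION ISOLATION LEVEL REPEATABLE READ"); err != nil {
            return nil, err
        }
        if _, err := c.ExecContext(ctx, "START TRANSACTION WITH CONSISTENT SNAPSHOT"); err != nil {
            return nil, err
        }
        conns[i] = c
    }
    // Releasing the lock is safe: open transactions keep their snapshot.
    _, err = ctl.ExecContext(ctx, "UNLOCK TABLES")
    return conns, err
}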
Target throughput: match AWS DMS on the 400M-row reference workload, i.e. ~45 minutes, with 1 hour as an acceptable ceiling. That is ~530M rows/hour, roughly 18x the observed single-stream rate, so it requires meaningful concurrency (~16 workers x a reasonable chunk count) on a single table.
Proposed solution
A new advanced field snapshot_chunks_per_table that, when set above 1:
- Under the existing consistent-snapshot window, probes MIN(pk) and MAX(pk) on the first primary-key column.
- Splits [MIN, MAX] into N half-open chunks (see the sketch after this list).
- Emits one work unit per chunk and fans them out across the worker pool controlled by snapshot_max_parallel_tables.
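The boundary computation itself is plain integer arithmetic once the bounds are known. A minimal sketch, assuming an int64 leading PK column (splitChunks and chunk are illustrative names, not the actual implementation):

// chunk is a half-open key range [lo, hi).
type chunk struct{ lo, hi int64 }

// splitChunks divides [minPK, maxPK] (as probed via SELECT MIN(pk), MAX(pk)
// under the consistent-snapshot window) into n half-open chunks.
func splitChunks(minPK, maxPK int64, n int) []chunk {
    span := maxPK - minPK + 1 // +1 so maxPK itself lands in the last chunk
    step := span / int64(n)
    if step == 0 {
        // Fewer distinct key values than requested chunks: one chunk.
        return []chunk{{minPK, maxPK + 1}}
    }
    chunks := make([]chunk, 0, n)
    lo := minPK
    for i := 0; i < n; i++ {
        hi := lo + step
        if i == n-1 {
            hi = maxPK + 1 // last chunk absorbs the rounding remainder
        }
        chunks = append(chunks, chunk{lo, hi})
        lo = hi
    }
    return chunks
}

Each chunk then runs the existing keyset-paginated loop with an extra pk >= lo AND pk < hi predicate, so an empty or sparse chunk finishes quickly.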
Scope of the first cut:
- Single-column integer PKs: fully supported.
- Composite PKs: chunk on the leading PK column; per-chunk keyset pagination continues to respect the full PK ordering (query shape sketched after this list). This is the common shape for (tenant_id, id) / (shard, ts) composites.
- Non-numeric first PK columns (VARCHAR, UUID, binary, etc.): fall back to a single whole-table read with an informational log line. Supporting these cleanly requires boundary discovery via sampling or OFFSET-based probing and is deliberately out of scope here.
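To make the composite-PK case concrete, the per-chunk batch query could take roughly this shape (illustrative only; assumes a (tenant_id, id) PK chunked on tenant_id):

// Illustrative per-chunk batch query. The chunk bounds constrain only the
// leading column, while ordering and the keyset resume tuple use the full PK.
const chunkBatchSQL = `
SELECT * FROM t
WHERE tenant_id >= ?            -- chunk lower bound (inclusive)
  AND tenant_id <  ?            -- chunk upper bound (exclusive)
  AND (tenant_id, id) > (?, ?)  -- resume point: last row of the previous batch
ORDER BY tenant_id, id
LIMIT ?`

The first batch in each chunk simply omits the row-constructor predicate.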
Skewed leading-column distributions (e.g. 90% of rows under one tenant_id) will produce uneven chunks; operators hitting that can leave snapshot_chunks_per_table at 1 and rely on table-level parallelism alone. There is no correctness impact, only uneven chunk runtimes.
PR implementing this: to be linked below.