What did you do?
Bulk-create many TiCDC changefeeds whose tables have no row traffic.
Observed workload:
- Creating 800 changefeeds can OOM a 96 GB machine.
- Creating 400 changefeeds can quickly push TiCDC memory to about 60 GB and total CPU usage to about 93%.
- After roughly 3 minutes, memory drops back to about 17 GB.
Code inspection on upstream/master found that changefeed creation can schedule many maintainer bootstraps concurrently.
What did you expect to see?
Bulk changefeed creation should respect the configured scheduler concurrency limit and avoid launching hundreds of maintainer bootstraps at the same time.
Memory and CPU should increase gradually during creation, and should not spike high enough to OOM a machine that can comfortably run those changefeeds at steady state.
What did you see instead?
The coordinator is created with hard-coded scheduling settings:
- max task concurrency: 10000
- balance interval: time.Minute
This bypasses the server scheduler config, whose default max-task-concurrency is 10.
The basic scheduler uses this value as its batch size for absent changefeeds. When hundreds of changefeeds are bulk-created, many AddMaintainer operators can be issued almost at once.
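The batching behavior described above can be illustrated with a minimal Go sketch. The names `scheduleBatch` and `limit` are illustrative only, not TiCDC's actual identifiers; the point is that a hard-coded limit of 10000 lets all 800 absent changefeeds be scheduled in a single tick, while the server default of 10 would spread them out.

```go
package main

import "fmt"

// scheduleBatch takes at most `limit` absent changefeeds per scheduling
// tick. With limit=10000 the whole absent set is scheduled at once; with
// the configured default of 10, bootstraps are issued gradually.
func scheduleBatch(absent []string, limit int) []string {
	if limit < len(absent) {
		return absent[:limit]
	}
	return absent
}

func main() {
	absent := make([]string, 800)
	for i := range absent {
		absent[i] = fmt.Sprintf("cf-%d", i)
	}
	// Hard-coded limit of 10000: all 800 AddMaintainer operators at once.
	fmt.Println(len(scheduleBatch(absent, 10000))) // 800
	// Server default max-task-concurrency of 10: only 10 per tick.
	fmt.Println(len(scheduleBatch(absent, 10))) // 10
}
```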
Each maintainer bootstrap performs startup work even when tables have no row traffic, including loading table metadata from schema store and building schema/span info. loadAllPhysicalTablesAtTs currently also loads full table metadata before applying table filters. This makes creation-time memory and CPU scale poorly with the number of concurrently bootstrapping changefeeds.
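One possible shape of a fix is to apply the table filter to table names before loading full metadata, so filtered-out tables never incur the load cost. The sketch below is a hypothetical illustration of that ordering, not the actual loadAllPhysicalTablesAtTs implementation; `tableMeta`, `loadFilteredTables`, and the callbacks are all made-up names.

```go
package main

import (
	"fmt"
	"strings"
)

// tableMeta stands in for full table metadata; in TiCDC this would include
// schema and span info, which is what makes eager loading expensive.
type tableMeta struct {
	name    string
	columns []string // placeholder for the bulk of the metadata
}

// loadFilteredTables applies the filter to names first and only loads full
// metadata for tables that pass, instead of loading everything up front.
func loadFilteredTables(names []string, keep func(string) bool,
	load func(string) tableMeta) []tableMeta {
	var out []tableMeta
	for _, n := range names {
		if !keep(n) { // filter before the expensive load
			continue
		}
		out = append(out, load(n))
	}
	return out
}

func main() {
	names := []string{"test.t1", "test.t2", "metrics.m1"}
	keep := func(n string) bool { return strings.HasPrefix(n, "test.") }
	load := func(n string) tableMeta {
		return tableMeta{name: n, columns: []string{"id"}}
	}
	for _, m := range loadFilteredTables(names, keep, load) {
		fmt.Println(m.name)
	}
}
```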
Versions of the cluster
Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):
Not captured from the original workload.
Upstream TiKV version (execute tikv-server --version):
Not captured from the original workload.
TiCDC version (execute cdc version):
Code issue verified by inspection on upstream/master at 0a418b4132466aa084517ec7137b3d5f24013dcc.