Seeking help from k8s experts.
I used client-go / controller-runtime to implement a controller for my CRD. I have noticed that the controller's performance does not improve no matter how many shards I add to the controller or how far I raise max-requests-inflight / max-mutating-requests-inflight on the API server.
Below is an overview of how my controller reconciles a CR:
- Add a finalizer.
- Set the CR status to pending.
- Create a CR of another CRD and wait for its status to become ready.
- Set the CR status to running.
The average latency of the above 4 steps is around 1s - 5s.
I simulated the creation of 10000 CRs and found that the E2E duration for all of them to become running is around 20s.
I observed that the first reconcile (step #1) sometimes starts 8s after the CR is created on the API server side.
When I checked the API server logs, I found the following messages, which are emitted from
https://github.com/kubernetes/kubernetes/blob/release-1.28/staging/src/k8s.io/apiserver/pkg/storage/etcd3/watcher.go#L139
- around 80k occurrences of "Fast watcher, slow processing. Probably caused by slow decoding, user not receiving fast, or other processing logic" incomingEvents=100 objectType="*unstructured.Unstructured" ...
- around 500 occurrences of "Fast watcher, slow processing. Probably caused by slow dispatching events to watchers" outgoingEvents=100 objectType="*unstructured.Unstructured" ...
I cannot tell whether the bottleneck is on the controller side or the API server side. I tried increasing the number of controller shards, but it did not help. I also checked the CPU/memory usage of the API servers; it is around 50%, which is not very high.
Any suggestions on how to troubleshoot this further and improve the controller's performance?
The parameters I used:
- controller: 3 shards, with max_concurrent_reconciles set to 2000 on each shard (the load is balanced across all shards).
- API server: 3 API servers, with max-requests-inflight = 2000 and max-mutating-requests-inflight = 2000 on each of them.
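For reference, here is a back-of-the-envelope calculation from the figures in this post. It is a rough sketch under two assumptions that are not measured: each of the four reconcile steps costs exactly one write request, and the system is close to steady state, so that Little's law (items in progress = arrival rate × time in system) applies.

```go
package main

import "fmt"

// Figures taken from the post.
const (
	totalCRs        = 10000
	e2eSeconds      = 20.0
	shards          = 3
	workersPerShard = 2000
)

// Assumption: one write request per reconcile step (finalizer, pending,
// create the second CR, running). The real number is likely higher.
const writesPerCR = 4

// throughput returns completed items per second.
func throughput(total int, seconds float64) float64 {
	return float64(total) / seconds
}

// inProgress applies Little's law: rate (items/s) times latency (s).
func inProgress(ratePerSec, latencySec float64) float64 {
	return ratePerSec * latencySec
}

func main() {
	rate := throughput(totalCRs, e2eSeconds)

	fmt.Printf("CR throughput: %.0f/s\n", rate)                   // 500/s
	fmt.Printf("write requests: >= %.0f/s\n", rate*writesPerCR) // >= 2000/s
	fmt.Printf("CRs in progress at 1s-5s latency: %.0f-%.0f\n",
		inProgress(rate, 1), inProgress(rate, 5)) // 500-2500
	fmt.Printf("reconcile worker slots: %d\n", shards*workersPerShard) // 6000
}
```

If these assumptions roughly hold, at most about 2500 of the 6000 reconcile workers would be busy at any moment, which would be consistent with extra shards not helping. It does not by itself show where the remaining time goes.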