Description
The api-gateway-controller-manager pod shows a large, monotonic memory increase under increasing pod load. Memory grows from 73 Mi at baseline to 2070 Mi at 5000 fake pods — a +1997 Mi increase. This makes the controller unreliable in any large production cluster.
Pods load test
| pod (namespace/name) | baseline | 500 res | 1000 res | 2000 res | 3000 res | 4000 res | 5000 res | delta |
|---|---|---|---|---|---|---|---|---|
| kyma-system/api-gateway-controller-manager-66d9d4d67d-f867w | 73 Mi | 195 Mi | 370 Mi | 574 Mi | 873 Mi | 1258 Mi | 2070 Mi | +1997 Mi |
The growth is continuous and there is no plateau: each step up to 4000 pods adds roughly 200 to 400 Mi per 1000 pods, and the last step (4000 to 5000) adds 812 Mi. This suggests the controller accumulates unbounded in-memory state that scales with the number of pods in the cluster, most likely by watching pods cluster-wide without any filtering or cache eviction.
Expected result
The api-gateway-controller-manager should handle a large production cluster with thousands of pods without unbounded memory growth. Watches should be scoped to resources relevant to API Gateway (e.g. filtered by namespace or labels), and in-memory caches should be bounded.
Actual result
api-gateway-controller-manager memory grows from 73 Mi to 2070 Mi as pod count increases from 0 to 5000, a ~28x increase.
Steps to reproduce
- Deploy Kyma with the api-gateway module enabled
- Create fake pods using KWOK in steps up to 5000 (500, 1000, 2000, 3000, 4000, 5000)
- Observe api-gateway-controller-manager memory growing monotonically with each step
Possible Solutions
From a business perspective, api-gateway-controller-manager should only be aware of pods that are directly relevant to API Gateway management. Watching and caching every pod in the cluster is unnecessary and causes unbounded memory growth in large clusters.
1. Scope the pod/resource cache to relevant objects using label selectors
The controller could restrict the informer cache to only objects it actually manages (e.g. pods or resources in the kyma-system namespace, or carrying a specific label). This bounds cache size to the number of API Gateway-managed resources rather than the total cluster count:
    cache.Options{
        ByObject: map[client.Object]cache.ByObject{
            // Only pods matching this selector are listed, watched and cached.
            &corev1.Pod{}: {
                Label: labels.SelectorFromSet(labels.Set{"app.kubernetes.io/managed-by": "api-gateway"}),
            },
        },
    }
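The fragment above could be wired into the manager roughly as follows. This is a sketch, not the controller's actual setup code: it assumes controller-runtime v0.15 or later (where `ctrl.Options` has a `Cache` field), and the label key is the illustrative one from the fragment, not a label the module is known to set.

```go
package main

import (
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	// Assumption: pods relevant to API Gateway carry this label.
	podSelector := labels.SelectorFromSet(labels.Set{
		"app.kubernetes.io/managed-by": "api-gateway",
	})

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Cache: cache.Options{
			// Restrict the pod informer to matching pods only; other
			// resource types keep the default (unfiltered) behaviour.
			ByObject: map[client.Object]cache.ByObject{
				&corev1.Pod{}: {Label: podSelector},
			},
		},
	})
	if err != nil {
		os.Exit(1)
	}

	// Controllers would be registered here before starting the manager.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```

With this in place, pods that do not match the selector are invisible to the cached client, so any code that needs to read an arbitrary pod has to go through an uncached reader.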
2. Use client.Reader (uncached) for read-only lookups
For operations that only need to look up a resource once (not watch it continuously), use the API reader that bypasses the cache:
    type MyReconciler struct {
        client.Client
        APIReader client.Reader // set to mgr.GetAPIReader()
    }

    // Use APIReader for one-off lookups instead of r.Client.Get(...)
    err := r.APIReader.Get(ctx, key, obj)
The two options should be used together: APIReader at the call site, and the cache configuration from option 1 so that other code paths going through the cached client do not populate an unfiltered pod informer.
Anything else?