api-gateway-controller-manager memory grows monotonically under pod load #2671

@pPrecel

Description

The api-gateway-controller-manager pod's memory usage grows monotonically as the number of pods in the cluster increases. It rises from 73 Mi at baseline to 2070 Mi with 5000 fake pods, an increase of 1997 Mi. This makes the controller unreliable in any large production cluster.

Pods load test

| pod (namespace/name) | baseline | 500 pods | 1000 pods | 2000 pods | 3000 pods | 4000 pods | 5000 pods | delta |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| kyma-system/api-gateway-controller-manager-66d9d4d67d-f867w | 73 Mi | 195 Mi | 370 Mi | 574 Mi | 873 Mi | 1258 Mi | 2070 Mi | +1997 Mi |

The growth is continuous with no plateau, and the increments get larger at higher pod counts (+385 Mi from 3000 to 4000 pods, +812 Mi from 4000 to 5000). This suggests the controller accumulates unbounded in-memory state that scales with the number of pods in the cluster, most likely because it watches pods cluster-wide without any filtering or cache eviction.

Expected result

The api-gateway-controller-manager should handle a large production cluster with thousands of pods without unbounded memory growth. Watches should be scoped to resources relevant to API Gateway (e.g. filtered by namespace or labels), and in-memory caches should be bounded.

Actual result

api-gateway-controller-manager memory grows from 73 Mi to 2070 Mi as pod count increases from 0 to 5000, a ~28x increase.

Steps to reproduce

  1. Deploy Kyma with the api-gateway module enabled
  2. Create fake pods using KWOK in increasing batches, up to 5000
  3. Observe api-gateway-controller-manager memory growing monotonically after each batch

Possible Solutions

From a business perspective, api-gateway-controller-manager should only be aware of pods that are directly relevant to API Gateway management. Watching and caching every pod in the cluster is unnecessary and causes unbounded memory growth in large clusters.

1. Scope the pod/resource cache to relevant objects using label selectors

The controller could restrict the informer cache to only the objects it actually manages (e.g. pods or resources in the kyma-system namespace, or those carrying a specific label). This bounds the cache size by the number of API Gateway-managed resources rather than by the total number of pods in the cluster:

cache.Options{
    ByObject: map[client.Object]cache.ByObject{
        &corev1.Pod{}: {
            Label: labels.SelectorFromSet(labels.Set{"app.kubernetes.io/managed-by": "api-gateway"}),
        },
    },
}

2. Use client.Reader (uncached) for read-only lookups

For operations that only need to look up a resource once (not watch it continuously), use the API reader that bypasses the cache:

type MyReconciler struct {
    client.Client
    APIReader client.Reader // set to mgr.GetAPIReader()
}

// Use APIReader for one-off lookups instead of r.Client.Get(...)
err := r.APIReader.Get(ctx, key, obj)

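The two approaches should be used together: APIReader at the call site, and cache configuration to keep the Pod informer small. A Get or List through the cached client starts an informer for that type on first use, so without the cache configuration any other code path that reads pods through r.Client would still populate an unfiltered cache.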
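The two snippets above could be wired into the manager roughly as follows. This is a configuration sketch, assuming controller-runtime v0.15 or later (where `ctrl.Options` has a `Cache` field and `cache.ByObject` has a `Label` selector). The label is the illustrative one from option 1, not a label the module is known to set, and `podCacheOptions` is a hypothetical helper name.

```go
package main

import (
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// podCacheOptions restricts the Pod informer to pods carrying the given label,
// so the cache no longer grows with every pod in the cluster.
func podCacheOptions() cache.Options {
	return cache.Options{
		ByObject: map[client.Object]cache.ByObject{
			&corev1.Pod{}: {
				Label: labels.SelectorFromSet(labels.Set{
					// Illustrative label; the real selector depends on how
					// API Gateway-managed pods are labeled.
					"app.kubernetes.io/managed-by": "api-gateway",
				}),
			},
		},
	}
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Cache: podCacheOptions(),
	})
	if err != nil {
		os.Exit(1)
	}

	// The API reader bypasses the cache and talks to the API server directly.
	// It would be passed to reconcilers as their APIReader field.
	apiReader := mgr.GetAPIReader()
	_ = apiReader

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```

Note that with a label-filtered cache, reads through the cached client only see matching pods, so any lookup of an unlabeled pod has to go through the API reader.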
