api-gateway-controller-manager memory grows monotonically under pod load

---

**Description**

The `api-gateway-controller-manager` pod shows a large, monotonic memory increase as the number of pods in the cluster grows. Memory grows from 73 Mi at baseline to 2070 Mi at 5000 fake pods, an increase of **+1997 Mi**. This makes the controller unreliable in any large production cluster.

### Pods load test

| pod (namespace/name)                                        | baseline | 500 res | 1000 res | 2000 res | 3000 res | 4000 res | 5000 res | delta    |
| ----------------------------------------------------------- | -------- | ------- | -------- | -------- | -------- | -------- | -------- | -------- |
| kyma-system/api-gateway-controller-manager-66d9d4d67d-f867w | 73 Mi    | 195 Mi  | 370 Mi   | 574 Mi   | 873 Mi   | 1258 Mi  | 2070 Mi  | +1997 Mi |

The growth is continuous with no plateau, and it accelerates at higher pod counts: roughly 200 to 350 Mi per 1000 pods up to 3000 pods, then 385 Mi and 812 Mi per 1000 pods in the last two steps. This suggests the controller accumulates in-memory state that scales with the number of pods in the cluster, most likely by watching pods cluster-wide without any filtering or cache eviction.

**Expected result**

The `api-gateway-controller-manager` should handle a large production cluster with thousands of pods without unbounded memory growth. Watches should be scoped to resources relevant to API Gateway (e.g. filtered by namespace or labels), and in-memory caches should be bounded.

**Actual result**

`api-gateway-controller-manager` memory grows from 73 Mi to 2070 Mi as the number of fake pods increases from 0 to 5000, a ~28x increase.

**Steps to reproduce**

1. Deploy Kyma with the api-gateway module enabled
2. Create fake pods using [KWOK](https://github.com/kubernetes-sigs/kwok), increasing the total in steps (500, 1000, 2000, 3000, 4000, 5000)
3. After each step, observe that `api-gateway-controller-manager` memory has grown and does not level off

## Possible Solutions

From a business perspective, `api-gateway-controller-manager` should only be aware of pods that are directly relevant to API Gateway management. Watching and caching every pod in the cluster is unnecessary and causes unbounded memory growth in large clusters.

### 1. Scope the pod/resource cache to relevant objects using label selectors

The controller could restrict the informer cache to only objects it actually manages (e.g. pods or resources in the `kyma-system` namespace, or carrying a specific label). This bounds cache size to the number of API Gateway-managed resources rather than the total cluster count:

```go
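// Pass this as the Cache field of ctrl.Options when calling ctrl.NewManager.
// The label is illustrative; use one that the relevant pods actually carry.
cacheOpts := cache.Options{
    ByObject: map[client.Object]cache.ByObject{
        // Only pods matching the selector are listed, watched, and cached.
        &corev1.Pod{}: {
            Label: labels.SelectorFromSet(labels.Set{"app.kubernetes.io/managed-by": "api-gateway"}),
        },
    },
}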
```
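The namespace-based variant mentioned above could look roughly like the sketch below. It assumes a controller-runtime version in which `cache.ByObject` has a `Namespaces` field of type `map[string]cache.Config` (v0.16 or later), and it assumes that the pods of interest live in `kyma-system`; neither assumption has been verified against the api-gateway code base.

```go
package main

import (
	"os"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Cache: cache.Options{
			ByObject: map[client.Object]cache.ByObject{
				// Restrict the Pod informer to a single namespace.
				// Assumption: kyma-system is the only namespace whose
				// pods the controller needs to see.
				&corev1.Pod{}: {
					Namespaces: map[string]cache.Config{
						"kyma-system": {},
					},
				},
			},
		},
	})
	if err != nil {
		os.Exit(1)
	}

	// Controller registration is omitted in this sketch.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```

With this configuration, pods outside the listed namespace are not held in the cache, so any code path that needs them would have to use an uncached reader.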

### 2. Use `client.Reader` (uncached) for read-only lookups

For operations that only need to look up a resource once (not watch it continuously), use the API reader that bypasses the cache:

```go
type MyReconciler struct {
    client.Client
    APIReader client.Reader // set to mgr.GetAPIReader()
}

// Use APIReader for one-off lookups instead of r.Client.Get(...)
err := r.APIReader.Get(ctx, key, obj)
```

The `APIReader` and cache exclusion should be used together: use `APIReader` at the call site, and configure the cache so that other code paths reading the same type through the cached client do not start a cluster-wide informer for it.
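One possible way to combine the two is sketched below. It assumes a controller-runtime version that provides `client.CacheOptions.DisableFor` and the `Client` field on `ctrl.Options` (v0.15 or later). `PodLookupReconciler` is a hypothetical type used only for illustration, not the actual api-gateway reconciler.

```go
package main

import (
	"os"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// PodLookupReconciler is a hypothetical reconciler used for illustration.
type PodLookupReconciler struct {
	client.Client
	APIReader client.Reader
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Client: client.Options{
			Cache: &client.CacheOptions{
				// Pod reads through mgr.GetClient() go straight to the
				// API server instead of starting a Pod informer.
				DisableFor: []client.Object{&corev1.Pod{}},
			},
		},
	})
	if err != nil {
		os.Exit(1)
	}

	r := &PodLookupReconciler{
		Client:    mgr.GetClient(),
		APIReader: mgr.GetAPIReader(), // always bypasses the cache
	}
	_ = r // registration with the manager is omitted in this sketch

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```

`DisableFor` only affects reads through the manager's client. A controller that explicitly watches pods (for example through `Watches` or `Owns`) would still create an informer, so such watches would need to be scoped or removed separately.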

## Anything else?

- Load test results: https://github.tools.sap/kyma/warden/issues/173#issuecomment-24684347
- Test script: https://github.com/kyma-project/manager-toolkit/pull/66
- controller-runtime cache options: https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/cache#Options
