fix(keda): correct EKS production autoscaling — Value semantics, reachable threshold, CPU safety net by skettkepalli · Pull Request #10892 · EclipseFdn/open-vsx.org

skettkepalli · 2026-06-04T19:34:30Z

Production autoscaling on EKS has never fired. Three issues in the ScaledObject:

The 0.75 busy-thread threshold is unreachable. Measured steady state since the cutover is 0.011–0.03, worst hour 0.08.
The trigger used the default AverageValue metricType. Our query is a ratio that never exceeds 1.0, so the HPA could never ask for more than ceil(1/0.75) = 2 pods, even fully saturated.
The query wasn't pinned to a cluster, so before cutover it averaged OKD and EKS pods together.
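The replica ceiling in the second issue can be reproduced with a little arithmetic. This is a minimal sketch of the two HPA formulas for external metrics, not code from this PR. The function names are hypothetical, the 0.30 signal is an assumed illustration, and the sketch ignores the HPA tolerance band, stabilization windows, and min/max clamping.

```python
import math
from fractions import Fraction


def _exact(x):
    """Convert a number to an exact fraction via its decimal repr (avoids float noise)."""
    return Fraction(str(x))


def desired_average_value(metric_total, target_per_pod):
    """AverageValue: replicas = ceil(total metric / per-pod target)."""
    return math.ceil(_exact(metric_total) / _exact(target_per_pod))


def desired_value(current_replicas, metric, target):
    """Value: replicas = ceil(current replicas * metric / target)."""
    return math.ceil(current_replicas * _exact(metric) / _exact(target))


# Old config: a ratio that tops out at 1.0 against a 0.75 target.
old_ceiling = desired_average_value(1.0, 0.75)

# New config: 8 pods and an assumed signal at twice the 0.15 threshold.
new_demand = desired_value(8, 0.30, 0.15)

print(old_ceiling, new_demand)
```

Under these assumptions the old trigger tops out at 2 pods however saturated the fleet is, while the Value form asks for 16, which maxReplicas would then clamp.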

This also makes the manual minReplicas bump to 8 (done during cutover) permanent, so the next deploy doesn't silently drop it back to 6.

Changes:

metricType: Value — desired replicas now scale proportionally with the signal
threshold 0.15 (5× the 0.03 steady-state baseline, roughly 2× the worst observed hour of 0.08), smoothing 5m → 3m, scale-up 4 pods/min
query pinned to eks-production-openvsx via a new keda.clusterName value
in-cluster CPU trigger (70%) through metrics-server, so scaling keeps working if Grafana Cloud is unreachable
maxReplicas capped at 15: each pod pins a 30-connection Hikari pool and RDS max_connections is 500, so 20 pods (600 connections) would exhaust it mid-burst; 15×30 = 450 leaves operational headroom. Can go back up once the pool size is reduced (fleet-wide peak active connections this week was 29).
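The connection budget behind the maxReplicas cap can be checked the same way. This is a rough sketch using only the figures quoted above (30 connections per pod, RDS max_connections of 500). The helper names are hypothetical, and it assumes every pod holds its full pool open.

```python
POOL_SIZE = 30              # Hikari pool size per pod, from the description
RDS_MAX_CONNECTIONS = 500   # RDS max_connections, from the description


def connection_budget(pods, pool_size=POOL_SIZE, max_connections=RDS_MAX_CONNECTIONS):
    """Return (connections needed, headroom); negative headroom means exhaustion."""
    needed = pods * pool_size
    return needed, max_connections - needed


def max_pods_within(pool_size=POOL_SIZE, max_connections=RDS_MAX_CONNECTIONS):
    """Largest pod count whose pools fit inside max_connections with nothing spare."""
    return max_connections // pool_size


for pods in (15, 20):
    needed, headroom = connection_budget(pods)
    print(f"{pods} pods -> {needed} connections, headroom {headroom}")
```

With these numbers the hard limit is 16 pods (480 connections). Capping at 15 keeps 50 connections free for operational access, while 20 pods would overshoot by 100.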

Validation: server-side dry-run against prod, the exact rendered query checked against Grafana Cloud, and a replay of the proposed signal over the whole post-cutover week — zero scale events, no flapping, min-8 floor holds. We also rehearsed the full loop on eks-staging: under synthetic load the app scaled 2→6 in about 3 minutes with the expected math, and drained back 1 pod per 3 minutes after the 15-minute calm window. Staging was restored to its original state afterwards.
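The replay check can be approximated in a few lines. This is a simplified sketch, not the replay that was actually run: the samples are just the three figures quoted in the description (0.011, 0.03, 0.08) rather than the full week of data, the `replay` helper is hypothetical, and it ignores HPA tolerance, stabilization windows, and scale-rate policies.

```python
import math
from fractions import Fraction


def replay(samples, threshold, start, min_replicas, max_replicas):
    """Walk a metric series with Value semantics; return (final replicas, scale events)."""
    replicas = start
    events = 0
    for metric in samples:
        # Value semantics: scale the current replica count by metric / threshold.
        ratio = Fraction(str(metric)) / Fraction(str(threshold))
        wanted = math.ceil(replicas * ratio)
        # Clamp to the configured floor and ceiling.
        wanted = min(max_replicas, max(min_replicas, wanted))
        if wanted != replicas:
            events += 1
            replicas = wanted
    return replicas, events


# Reported steady-state range and worst hour, against the proposed 0.15 threshold.
quiet_week = replay([0.011, 0.03, 0.08], threshold=0.15, start=8,
                    min_replicas=8, max_replicas=15)
print(quiet_week)
```

For these samples the wanted count never rises above the floor of 8, which is consistent with the reported zero scale events. Only a signal above 0.15 pushes the count up.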

One known non-goal: the May 29 thread saturation was cold JVMs during a rollout, not a capacity problem — warmup/readiness gating is a separate follow-up.

Companion PR for aws-main: #10893

Signed-off-by: skettkepalli <sridhar.ettkepalli@eclipse-foundation.org>

…hold, CPU safety net Old trigger could never fire (threshold 0.75 vs observed max 0.08) and AverageValue math capped demand at 2 pods regardless of load. Pins query to the EKS cluster, codifies min 8, adds in-cluster CPU trigger. Co-Authored-By: skettkepalli <sridhar.ettkepalli@eclipse-foundation.org>

netomi · 2026-06-04T21:03:31Z

 esReplicaCount: 3
 keda:
  enabled: true
+  clusterName: eks-staging


For staging the name is eks-staging, but for production it is eks-production-openvsx. Either use the suffix for both or for neither, I guess?

We will need to recreate the staging cluster using IaC, so we will align the name there!

skettkepalli mentioned this pull request Jun 4, 2026

fix(keda): ScaledObject correctness — Value semantics, cluster-pinned query, optional CPU trigger #10893

Open

skettkepalli added 2 commits June 4, 2026 15:59

fix(keda): cap maxReplicas at 15 to fit RDS connection budget

3f216b2

Each pod pins a 30-connection Hikari pool; 20 pods would need 600 of RDS max_connections=500. 15 x 30 = 450 leaves headroom for operations. Co-Authored-By: skettkepalli <sridhar.ettkepalli@eclipse-foundation.org>

chore: make keda values comment generic

478b577

Co-Authored-By: skettkepalli <sridhar.ettkepalli@eclipse-foundation.org>

netomi reviewed Jun 4, 2026

skettkepalli wants to merge 3 commits into EclipseFdn:aws-production from skettkepalli:fix/eks-production-autoscaling

2 participants
