ci: prevent SSE e2e pod zombies blocking the SSE node #1599

Merged
sre-ci-robot merged 1 commit into zilliztech:main from jamesgao-jpg:fix/ci-sse-pod-zombie-cleanup
Apr 28, 2026

@jamesgao-jpg
Collaborator

@jamesgao-jpg jamesgao-jpg commented Apr 23, 2026

Problem

The SSE e2e pipeline occasionally leaves pods in `1/2 NotReady` state for days after the `main` container exits successfully. In a recent incident, pod `knowhere-kn2-sse--pr-1584-3-grlk2-j07vb-dsqc7` sat on node `k8s-ci-node11` for 7+ days. The `main` container had exited `Completed` with `exitCode=0`, but the `jnlp` sidecar stayed `Running`, which kept the Pod phase at `Running`. As a result, the Pod's full memory reservation (dominated by `main`'s `3Gi` request) stayed held on the node even though `main`'s process was already gone.
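The zombie pattern described above can be recognized from a pod's status alone. The following is a minimal sketch, assuming the pod is available as a dict shaped like the output of `kubectl get pod -o json`; the function name `is_sidecar_zombie` and the sample pod are hypothetical and written by hand for illustration, not captured from the cluster.

```python
from typing import Any


def is_sidecar_zombie(pod: dict[str, Any], main: str = "main", sidecar: str = "jnlp") -> bool:
    """Return True if `main` exited with code 0 while `sidecar` is still running.

    Only pods whose phase is still `Running` are considered, since that is
    the phase that keeps the memory reservation held on the node.
    """
    status = pod.get("status", {})
    if status.get("phase") != "Running":
        return False
    # Map container name -> state object (one of: running / terminated / waiting).
    states = {c["name"]: c.get("state", {}) for c in status.get("containerStatuses", [])}
    main_done = states.get(main, {}).get("terminated", {}).get("exitCode") == 0
    sidecar_up = "running" in states.get(sidecar, {})
    return main_done and sidecar_up


# Hand-written sample mirroring the incident; not real cluster output.
sample_pod = {
    "status": {
        "phase": "Running",
        "containerStatuses": [
            {"name": "main", "state": {"terminated": {"exitCode": 0, "reason": "Completed"}}},
            {"name": "jnlp", "state": {"running": {}}},
        ],
    }
}

print(is_sidecar_zombie(sample_pod))  # True
```

A check like this could be run over every pod on the SSE node to flag candidates for manual cleanup; it does not delete anything itself.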

The CI cluster only has one node labeled `node.kubernetes.io/cpu-feature.sse=true` (`k8s-ci-node11`, 8 CPU / ~3.9Gi). A single zombie consumes >99% of its allocatable memory, which in turn causes every subsequent SSE build (PR 1588, 1590, 1591, 1592, 1593, 1594, 1595, 1596 were all affected at observation time) to stay `Pending` with:

```
FailedScheduling: 0/7 nodes are available: 1 Insufficient cpu, 1 Insufficient memory,
2 node(s) didn't match Pod's node affinity/selector, ...
```
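The scheduling failure follows from simple arithmetic on memory requests. The sketch below uses the figures from this PR (a node with about 3.9Gi and SSE pods requesting 3.5Gi each) and, as a simplifying assumption, treats the 3.9Gi figure as the node's allocatable memory and ignores any other pods on the node. The helpers `parse_gi` and `fits` are hypothetical names and only handle the `Gi` suffix.

```python
GI = 1024 ** 3  # bytes in one gibibyte


def parse_gi(quantity: str) -> int:
    """Convert a quantity such as '3.5Gi' to bytes (only the Gi suffix is handled)."""
    if not quantity.endswith("Gi"):
        raise ValueError(f"unsupported quantity: {quantity!r}")
    return int(float(quantity[:-2]) * GI)


def fits(node_allocatable: str, held_requests: list[str], new_request: str) -> bool:
    """Return True if `new_request` fits in the memory not already reserved."""
    free = parse_gi(node_allocatable) - sum(parse_gi(q) for q in held_requests)
    return parse_gi(new_request) <= free


# Empty node: one SSE pod fits.
print(fits("3.9Gi", [], "3.5Gi"))         # True
# One zombie still holding its request: the next SSE pod cannot be scheduled.
print(fits("3.9Gi", ["3.5Gi"], "3.5Gi"))  # False
```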

Root cause

The Kubernetes Plugin cannot always guarantee pod deletion after build termination (Jenkins master restart, network blip between master and agent, or the plugin losing the agent session). Without additional safeguards, the `jnlp` sidecar keeps its TCP connection to an already-torn-down build and the pod is never cleaned up. Compounding this, the timeout in `ci/E2E2-SSE.groovy` was commented out, so nothing bounded hung builds either.

Fix — three layers of defense

| Layer | Mechanism | Triggers when |
|---|---|---|
| 1 | `podRetention never()` + `idleMinutes 0` on both agent blocks | Build reaches any terminal state (success / failure / abort); plugin deletes pod immediately |
| 2 | Restored Jenkins `timeout(480 min)` | Build genuinely hangs; 480 min picked to safely exceed legitimate 5–6h SSE test runs |
| 3 | Pod-level `activeDeadlineSeconds: 36000` (10h) + `restartPolicy: Never` | Jenkins loses the agent entirely; kubelet enforces the wall independently |

The ordering is intentional: Jenkins timeout (8h) < k8s hard wall (10h), so under normal degradation Jenkins gets to finalize and clean up first, and k8s only fires as the last line of defense.
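Because the Jenkins timeout is expressed in minutes and `activeDeadlineSeconds` in seconds, the ordering is easy to get wrong when either value is changed later. As a rough sanity check, the values from this PR can be normalized to a common unit; the function name `layers_ordered` is hypothetical, and 6h is taken as the upper end of the "5–6h" legitimate run time.

```python
from datetime import timedelta

# Values from this PR.
longest_legit_run = timedelta(hours=6)      # upper end of a normal SSE run
jenkins_timeout = timedelta(minutes=480)    # layer 2
pod_deadline = timedelta(seconds=36000)     # layer 3 (activeDeadlineSeconds)


def layers_ordered(run: timedelta, jenkins: timedelta, pod: timedelta) -> bool:
    """Normal runs must finish before Jenkins aborts, and Jenkins must abort before k8s does."""
    return run < jenkins < pod


print(jenkins_timeout, pod_deadline)  # 8:00:00 10:00:00
print(layers_ordered(longest_legit_run, jenkins_timeout, pod_deadline))  # True
```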

Note on layer 3: when `activeDeadlineSeconds` fires, k8s marks the Pod `Failed` and kills its containers. That releases the node memory reservation immediately, which is the property we need here. The Pod object itself lingers until GC, but that's an object-count concern, not the resource-hold problem this PR solves.

Scope

Only touches the SSE pipeline:

  • `ci/E2E2-SSE.groovy` — 6 lines added, 2 uncommented
  • `ci/pod/e2e-sse.yaml` — 3 lines added

Other pipelines (CPU / GPU / ARM) are not modified — SSE is the singleton-node bottleneck and the only one where this has caused outages. The same pattern can be applied later to the rest if we see the problem spread.

Verification

  • Ran `groovy -v` and checked the syntax by eye: `podRetention` and `idleMinutes` are standard Jenkins Kubernetes Plugin DSL, and `timeout` is a standard Jenkins Pipeline step.
  • `activeDeadlineSeconds` is valid on bare Pods per the k8s pod lifecycle docs.
  • Diff kept to the minimum needed; no refactor of surrounding code.

Will monitor node11 over the next few SSE builds after merge to confirm pods are deleted on build completion and the memory allocation drops back to baseline.

The SSE e2e pipeline sometimes left pods in 1/2 NotReady state for days
after the main container exited successfully, because the jnlp sidecar
kept running and held the memory reservation on the only SSE-labeled
node. Since that node has ~3.9Gi capacity and each SSE pod requests
3.5Gi, a single zombie blocked every subsequent SSE build from scheduling.

Three layers of defense, from most to least likely to fire:

1. podRetention never() + idleMinutes 0 on both agent blocks so the
   Kubernetes Plugin deletes the pod the moment the build terminates,
   regardless of outcome.
2. Restore the Jenkins build timeout (uncommented, raised to 480 min to
   accommodate legitimate 5-6h SSE runs) so hung builds are bounded.
3. activeDeadlineSeconds: 36000 on the pod spec as a kubelet-level
   backstop when Jenkins loses control of the agent entirely.

restartPolicy and terminationGracePeriodSeconds are set explicitly on
the pod to make the lifecycle unambiguous.

Signed-off-by: jamesgao-jpg <james.gao@zilliz.com>
@mergify

mergify Bot commented Apr 23, 2026

@jamesgao-jpg 🔍 Important: PR Classification Needed!

For efficient project management and a seamless review process, it's essential to classify your PR correctly. Here's how:

  1. If you're fixing a bug, label it as kind/bug.
  2. For small tweaks (less than 20 lines without altering any functionality), please use kind/improvement.
  3. Significant changes that don't modify existing functionalities should be tagged as kind/enhancement.
  4. Adjusting APIs or changing functionality? Go with kind/feature.

For any PR outside the kind/improvement category, ensure you link to the associated issue using the format: “issue: #”.

Thanks for your efforts and contribution to the community!

Collaborator

@foxspy foxspy left a comment

/lgtm

@sre-ci-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: foxspy, jamesgao-jpg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mergify

mergify Bot commented Apr 24, 2026

@jamesgao-jpg e2e jenkins job failed, comment /run-e2e-sse can trigger the job again.

@alexanderguzhva
Collaborator

issue: #1604
/kind improvement

@alexanderguzhva
Collaborator

/lgtm

@alexanderguzhva
Collaborator

issue: #1604
/kind improvement

@sre-ci-robot sre-ci-robot merged commit 9e442d7 into zilliztech:main Apr 28, 2026
9 of 10 checks passed

4 participants