ci: prevent SSE e2e pod zombies blocking the SSE node #1599
Conversation
The SSE e2e pipeline sometimes left pods in 1/2 NotReady state for days after the main container exited successfully, because the jnlp sidecar kept running and held the memory reservation on the only SSE-labeled node. Since that node has ~3.9Gi capacity and each SSE pod requests 3.5Gi, a single zombie blocked every subsequent SSE build from scheduling.

Three layers of defense, from most to least likely to fire:

1. podRetention never() + idleMinutes 0 on both agent blocks so the Kubernetes Plugin deletes the pod the moment the build terminates, regardless of outcome.
2. Restore the Jenkins build timeout (uncommented, raised to 480 min to accommodate legitimate 5-6h SSE runs) so hung builds are bounded.
3. activeDeadlineSeconds: 36000 on the pod spec as a kubelet-level backstop when Jenkins loses control of the agent entirely.

restartPolicy and terminationGracePeriodSeconds are set explicitly on the pod to make the lifecycle unambiguous.

Signed-off-by: jamesgao-jpg <james.gao@zilliz.com>
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: foxspy, jamesgao-jpg
@jamesgao-jpg e2e jenkins job failed, comment
issue: #1604
/lgtm
issue: #1604 |
Problem
The SSE e2e pipeline occasionally leaves pods in `1/2 NotReady` state for days after the `main` container exits successfully. In the most recent incident, pod `knowhere-kn2-sse--pr-1584-3-grlk2-j07vb-dsqc7` sat on node `k8s-ci-node11` for 7+ days. The `main` container had exited `Completed` with `exitCode=0`, but the `jnlp` sidecar stayed `Running`, which kept the Pod phase at `Running`. As a result, the Pod's full memory reservation (dominated by `main`'s `3Gi` request) stayed held on the node even though `main`'s process was already gone.
The CI cluster only has one node labeled `node.kubernetes.io/cpu-feature.sse=true` (`k8s-ci-node11`, 8 CPU / ~3.9Gi). A single zombie consumes >99% of its allocatable memory, which in turn causes every subsequent SSE build (PR 1588, 1590, 1591, 1592, 1593, 1594, 1595, 1596 were all affected at observation time) to stay `Pending` with:
```
FailedScheduling: 0/7 nodes are available: 1 Insufficient cpu, 1 Insufficient memory,
2 node(s) didn't match Pod's node affinity/selector, ...
```
Root cause
The Kubernetes Plugin cannot always guarantee pod deletion after build termination, for example after a Jenkins master restart, a network blip between master and agent, or the plugin losing the agent session. Without additional safeguards, the `jnlp` sidecar keeps its TCP connection to an already-torn-down build and the pod is never cleaned up. Compounding this, the timeout in `ci/E2E2-SSE.groovy` was commented out, so nothing bounded hung builds either.
Fix — three layers of defense
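1. `podRetention never()` and `idleMinutes 0` on both agent blocks, so the Kubernetes Plugin deletes the pod the moment the build terminates, regardless of outcome.
2. The Jenkins build timeout is restored (uncommented and raised to 480 min to accommodate legitimate 5-6h SSE runs), so hung builds are bounded.
3. `activeDeadlineSeconds: 36000` on the pod spec, as a kubelet-level backstop for when Jenkins loses control of the agent entirely.

`restartPolicy` and `terminationGracePeriodSeconds` are also set explicitly on the pod to make the lifecycle unambiguous.

The ordering is intentional: the Jenkins timeout (8h) is shorter than the k8s hard wall (10h), so under normal degradation Jenkins gets to finalize and clean up first, and k8s only fires as the last line of defense.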
Note on layer 3: when `activeDeadlineSeconds` fires, k8s marks the Pod `Failed` and kills its containers. That releases the node memory reservation immediately, which is the property we need here. The Pod object itself lingers until GC, but that's an object-count concern, not the resource-hold problem this PR solves.
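For readers who want to see how the three layers fit together, here is a minimal sketch of a declarative pipeline with all of them applied. It is not the actual contents of `ci/E2E2-SSE.groovy`: the image name, the stage body, and the values of `restartPolicy` and `terminationGracePeriodSeconds` are assumptions made for illustration, and the real file has two agent blocks rather than one.

```groovy
// Illustrative sketch only; names and values marked "assumed" or
// "hypothetical" are not taken from the PR.
pipeline {
    agent {
        kubernetes {
            // Layer 1: delete the pod as soon as the build ends, whatever the result.
            podRetention never()
            idleMinutes 0
            defaultContainer 'main'
            yaml '''
apiVersion: v1
kind: Pod
spec:
  # Layer 3: kubelet-level backstop, 36000 s = 10 h.
  activeDeadlineSeconds: 36000
  restartPolicy: Never                 # assumed value; the PR only says it is set explicitly
  terminationGracePeriodSeconds: 30    # assumed value
  containers:
  - name: main
    image: example/sse-e2e:latest      # hypothetical image
    command: ["cat"]
    tty: true
    resources:
      requests:
        memory: 3Gi                    # main's request, as described in the Problem section
'''
        }
    }
    options {
        // Layer 2: Jenkins build timeout, 480 min = 8 h.
        timeout(time: 480, unit: 'MINUTES')
    }
    stages {
        stage('E2E') {
            steps {
                sh 'echo "SSE e2e tests would run here"'  // placeholder step
            }
        }
    }
}
```

With these values the Jenkins timeout expires at 28800 s, well before the 36000 s pod deadline, which preserves the intended ordering of layer 2 before layer 3.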
Scope
This PR only touches the SSE pipeline. Other pipelines (CPU / GPU / ARM) are not modified, because SSE is the singleton-node bottleneck and the only one where this has caused outages. The same pattern can be applied later to the rest if we see the problem spread.
Verification
We will monitor `k8s-ci-node11` over the next few SSE builds after merge to confirm that pods are deleted on build completion and that the node's memory allocation drops back to baseline.