fix(hyper-pod): make destroy clean up the InferenceEndpointConfig and ALB reliably #85

Open
remote-swe[bot] wants to merge 3 commits into feature/hyperpod-inference-1772787819 from feature/hyperpod-destroy-cleanup-1777087841

remote-swe[bot] commented on Apr 25, 2026


Summary

Resolves two destroy-time cleanup issues that HyperPodVllmNxdInferenceService exposed during the PR #56 integration test run. Both issues cause cdk destroy to hang for 45+ minutes and eventually fail with orphaned resources that have to be cleaned up by hand.

This PR is stacked on top of #56. The base branch will switch to main automatically once #56 is merged; there is no conflict between the two PRs outside of snapshot regeneration.

Issues addressed

1. InferenceEndpointConfig finalizer deadlock

During stack destroy, the EKS system nodegroup is torn down before the HyperPod worker node is drained. The remaining node keeps the node.kubernetes.io/unreachable:NoSchedule taint, and the hyperpod-inference-controller-manager Deployment (provisioned by the amazon-sagemaker-hyperpod-inference addon) has no toleration for that taint, so the controller pod becomes Pending. The inference.sagemaker.aws.InferenceEndpointConfigFinalizer on the InferenceEndpointConfig CR is never released, the Custom::AWSCDK-EKS-KubernetesResource that manages the manifest hangs in DELETE_IN_PROGRESS, and CloudFormation eventually aborts the destroy with the CR still resident on the cluster.

The addon's configurationValues schema does not currently expose tolerations for the controller-manager Deployment (only for alb and keda), so we cannot fix this by keeping the controller scheduled. Instead this PR installs a KubernetesPatch with a destroy-time restorePatch that strips metadata.finalizers on the CR directly. The patch runs through the CDK-managed kubectl provider Lambda (outside the cluster), so it is unaffected by whether the controller can be scheduled.

The KubernetesPatch construct depends on the InferenceEndpointConfig manifest, so on destroy CloudFormation deletes the patch resource first, which runs its restorePatch before the manifest itself is deleted. On create/update the ordering is reversed (the manifest is applied first) and the patch's applyPatch is a no-op.
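
To illustrate what a merge-type restorePatch of `{ metadata: { finalizers: [] } }` does to the CR, here is a minimal sketch of JSON merge patch semantics (RFC 7386). It is not the code in this PR or in the kubectl provider; `mergePatch` and the `example-endpoint` name are hypothetical and exist only for this example.

```typescript
// Minimal JSON merge patch (RFC 7386) over plain JSON values.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

function isObject(value: Json): value is { [key: string]: Json } {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

function mergePatch(target: Json, patch: Json): Json {
  // A non-object patch (including an array) replaces the target wholesale.
  if (!isObject(patch)) return patch;
  const result: { [key: string]: Json } = isObject(target) ? { ...target } : {};
  for (const [key, value] of Object.entries(patch)) {
    if (value === null) {
      delete result[key]; // null means "remove this member"
    } else {
      result[key] = mergePatch(result[key] ?? null, value);
    }
  }
  return result;
}

// Hypothetical CR metadata; only the finalizer name comes from this PR.
const cr: Json = {
  metadata: {
    name: 'example-endpoint',
    finalizers: ['inference.sagemaker.aws.InferenceEndpointConfigFinalizer'],
  },
};

const applyPatch: Json = {};
const restorePatch: Json = { metadata: { finalizers: [] } };

console.log(JSON.stringify(mergePatch(cr, applyPatch)) === JSON.stringify(cr)); // true: no-op
console.log(JSON.stringify(mergePatch(cr, restorePatch)));
// {"metadata":{"name":"example-endpoint","finalizers":[]}}
```

Because arrays are replaced as a whole under merge patch, the empty list clears every finalizer on the object while sibling fields such as `metadata.name` are left alone.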

2. ALB Controller orphan SG/TG on VPC delete

The AWS Load Balancer Controller (also bundled by the inference addon) creates SecurityGroups and TargetGroups from inside Kubernetes, outside of CloudFormation. If the controller pod is evicted during node drain before it has a chance to process the Ingress deletion event, those SGs/TGs get orphaned and AWS::EC2::VPC fails its delete with DependencyViolation.

This PR adds destroy-time tolerations for the unreachable and not-ready taints to the alb and keda sections of the addon's configurationValues, so that the ALB controller pod can keep processing Ingress deletion events while its nodes drain. This is a mitigation rather than a fix: the orphan path is still possible if the controller terminates unexpectedly, but the change removes the primary failure mode observed in PR #56's run.
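
As a rough sketch of the payload shape, the tolerations could be serialized like this. The `alb.tolerations` / `keda.tolerations` placement and the taint keys come from this PR; the `operator: 'Exists'` choice and the omitted `effect` are assumptions for illustration, and the values actually committed may differ.

```typescript
// Trimmed version of the Kubernetes core/v1 Toleration shape.
interface Toleration {
  key: string;
  operator: 'Exists' | 'Equal';
  // effect is omitted here (an assumption); an empty effect matches every
  // taint effect for the given key.
  effect?: 'NoSchedule' | 'NoExecute' | 'PreferNoSchedule';
}

const destroyTimeTolerations: Toleration[] = [
  { key: 'node.kubernetes.io/unreachable', operator: 'Exists' },
  { key: 'node.kubernetes.io/not-ready', operator: 'Exists' },
];

// EKS addon configurationValues is passed as a JSON string.
const configurationValues: string = JSON.stringify({
  alb: { tolerations: destroyTimeTolerations },
  keda: { tolerations: destroyTimeTolerations },
});

console.log(configurationValues);
```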

Known limitation & follow-up

The addon's configurationValues schema does not yet let us pass tolerations to the controller-manager itself. Until that field is exposed upstream, the KubernetesPatch in this PR is the only way to unblock a destroy when the controller pod goes Pending. We will file a feature request with the addon team to expose tolerations on the controller-manager Deployment.

Changes

  • src/hyper-pod/hyper-pod-vllm-nxd-inference-service.ts
    • Add alb.tolerations / keda.tolerations to the addon's configurationValues.
    • Install a KubernetesPatch with applyPatch: {} and restorePatch: { metadata: { finalizers: [] } }, depending on the InferenceEndpointConfig manifest so that on destroy the patch runs first.
  • src/hyper-pod/hyper-pod-vllm-nxd-inference-service.test.ts
    • Unit test asserting the addon's configurationValues contains tolerations for both alb and keda.
    • Unit test asserting the Custom::AWSCDK-EKS-KubernetesPatch resource targets an inferenceendpointconfig/*, uses PatchType: merge, and clears the finalizer list in its restorePatch.
  • Regenerated test/integ.hyper-pod-cluster.ts.snapshot to reflect the new addon configurationValues payload and the new KubernetesPatch custom resource.

Verification

  • npx projen build passes (70 unit tests, 2 integ snapshots verify UNCHANGED after regeneration).
  • End-to-end integ not rerun in this PR; PR #56's integ run already reproduced the two failure modes, and the fixes are targeted at those exact symptoms.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

During stack destroy the EKS system nodegroup drains before the HyperPod
worker node does, which leaves every remaining node tainted with
`node.kubernetes.io/unreachable:NoSchedule` or
`node.kubernetes.io/not-ready:NoSchedule`. The AWS Load Balancer
Controller pod bundled by the `amazon-sagemaker-hyperpod-inference`
addon has no toleration for those taints, so it gets evicted before it
can observe the `Ingress` deletion event. As a result the controller
never cleans up the ALB's SecurityGroup and TargetGroup, and those
resources are orphaned outside of CloudFormation and block VPC
deletion with `DependencyViolation` for the rest of the destroy.

Pass tolerations for the `unreachable` and `not-ready` taints through
the addon's `configurationValues` for both the `alb` and `keda`
components so those pods can keep processing cleanup events while
their nodes drain.
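
For context, the matching rule that decides whether a pod tolerates a node's taints can be sketched as below. This is a simplified illustration of the standard Kubernetes rule, not scheduler code; `tolerates` and `podToleratesNode` are hypothetical helper names.

```typescript
interface Taint { key: string; value?: string; effect: string; }
interface Toleration { key?: string; operator?: 'Exists' | 'Equal'; value?: string; effect?: string; }

// A toleration matches a taint when the effect matches (or is empty),
// the key matches (or is empty), and the operator is satisfied.
function tolerates(t: Toleration, taint: Taint): boolean {
  if (t.effect && t.effect !== taint.effect) return false;
  if (t.key && t.key !== taint.key) return false;
  const operator = t.operator ?? 'Equal'; // Equal is the default operator
  if (operator === 'Exists') return true;
  return (t.value ?? '') === (taint.value ?? '');
}

// Every NoSchedule / NoExecute taint on the node must be tolerated.
function podToleratesNode(tolerations: Toleration[], taints: Taint[]): boolean {
  return taints
    .filter((taint) => taint.effect === 'NoSchedule' || taint.effect === 'NoExecute')
    .every((taint) => tolerations.some((t) => tolerates(t, taint)));
}

const drainingNodeTaints: Taint[] = [
  { key: 'node.kubernetes.io/unreachable', effect: 'NoSchedule' },
  { key: 'node.kubernetes.io/not-ready', effect: 'NoSchedule' },
];

const added: Toleration[] = [
  { key: 'node.kubernetes.io/unreachable', operator: 'Exists' },
  { key: 'node.kubernetes.io/not-ready', operator: 'Exists' },
];

console.log(podToleratesNode([], drainingNodeTaints));    // false
console.log(podToleratesNode(added, drainingNodeTaints)); // true
```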

Note: the `hyperpod-inference-controller-manager` Deployment does not
yet expose tolerations through the addon's configuration schema. The
companion commit (destroy-time `KubernetesPatch` to remove the
`InferenceEndpointConfig` finalizer) handles the finalizer deadlock
that arises when that pod becomes Pending.
…ia KubernetesPatch

The HyperPod inference controller manages the
`inference.sagemaker.aws.InferenceEndpointConfigFinalizer` on the
`InferenceEndpointConfig` CR we create. During stack destroy the EKS
system nodegroup is torn down before the HyperPod worker node is
drained, and the controller-manager Deployment has no toleration for
the `node.kubernetes.io/unreachable:NoSchedule` taint left behind on
the remaining node. The controller pod becomes `Pending` and never
processes the CR deletion, so its finalizer is never released and the
`Custom::AWSCDK-EKS-KubernetesResource` managing the manifest hangs in
`DELETE_IN_PROGRESS` for 45+ minutes before CloudFormation aborts.

The upstream addon's `configurationValues` schema does not currently
expose tolerations for the controller-manager Deployment, so we cannot
keep it scheduled during teardown. Instead, install a
`KubernetesPatch` with a destroy-time `restorePatch` that strips
`metadata.finalizers` directly via the CDK-managed kubectl provider.

The patch `applyPatch` is a no-op (the controller owns the finalizer
lifecycle during create/update). Because the patch's construct depends
on the `InferenceEndpointConfig` manifest, CloudFormation orders the
patch deletion before the manifest deletion on destroy, so the
finalizer is cleared before the CR itself is removed.
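
The ordering argument can be sketched as a dependency graph: resources are created dependencies-first, and deletion walks the same order in reverse. The resource names below are hypothetical labels, and the sketch assumes an acyclic graph (no cycle detection).

```typescript
// Resource name -> names of the resources it depends on.
const dependsOn: Record<string, string[]> = {
  FinalizerCleanupPatch: ['InferenceEndpointConfigManifest'],
  InferenceEndpointConfigManifest: [],
};

// Depth-first topological order: each resource appears after its dependencies.
function createOrder(graph: Record<string, string[]>): string[] {
  const order: string[] = [];
  const visited = new Set<string>();
  const visit = (name: string): void => {
    if (visited.has(name)) return;
    visited.add(name);
    for (const dep of graph[name] ?? []) visit(dep);
    order.push(name);
  };
  Object.keys(graph).forEach((name) => visit(name));
  return order;
}

// Deletion removes dependents before the resources they depend on.
function deleteOrder(graph: Record<string, string[]>): string[] {
  return createOrder(graph).reverse();
}

console.log(createOrder(dependsOn)); // [ 'InferenceEndpointConfigManifest', 'FinalizerCleanupPatch' ]
console.log(deleteOrder(dependsOn)); // [ 'FinalizerCleanupPatch', 'InferenceEndpointConfigManifest' ]
```
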
Adds two unit tests that assert the new destroy-time cleanup affordances:

- The `amazon-sagemaker-hyperpod-inference` EKS addon renders
  `tolerations` for the `node.kubernetes.io/unreachable` and
  `node.kubernetes.io/not-ready` taints on both the `alb` and
  `keda` sections of its `configurationValues`. Because CFN serializes
  the configuration values as a tokenized `Fn::Join`, the test reaches
  into the join parts to assert the substrings that should be present.

- A `Custom::AWSCDK-EKS-KubernetesPatch` resource is installed that
  targets an `inferenceendpointconfig/*` resource, with
  `PatchType: merge` and a `RestorePatchJson` that clears the
  finalizer list (`metadata.finalizers = []`).
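
A minimal sketch of the "reach into the join parts" approach from the first test: collect the literal string fragments of an `Fn::Join` value and assert on the concatenated text. `literalParts`, the `clusterName` key, and the `Cluster` logical ID are hypothetical, and the sample payload is illustrative rather than the rendered template from this PR.

```typescript
// Collect literal string fragments from a value that may be a plain string
// or an { 'Fn::Join': [delimiter, parts] } intrinsic. Tokens such as Ref or
// Fn::GetAtt carry no literal text and are skipped.
function literalParts(value: unknown): string[] {
  if (typeof value === 'string') return [value];
  if (typeof value === 'object' && value !== null && 'Fn::Join' in value) {
    const [, parts] = (value as { 'Fn::Join': [string, unknown[]] })['Fn::Join'];
    const out: string[] = [];
    for (const part of parts) out.push(...literalParts(part));
    return out;
  }
  return [];
}

// Illustrative tokenized configurationValues, as it might appear in a template.
const rendered = {
  'Fn::Join': ['', [
    '{"alb":{"tolerations":[{"key":"node.kubernetes.io/unreachable","operator":"Exists"}]},"clusterName":"',
    { Ref: 'Cluster' },
    '"}',
  ]],
};

const text = literalParts(rendered).join('');
console.log(text.includes('node.kubernetes.io/unreachable')); // true
```

Asserting on substrings avoids parsing JSON that is split around unresolved tokens, at the cost of being sensitive to key ordering inside a fragment.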