fix(hyper-pod): make destroy clean up the InferenceEndpointConfig and ALB reliably by remote-swe[bot] · Pull Request #85 · WinterYukky/aws-cdk-neuronx-patterns

remote-swe · 2026-04-25T03:44:30Z

Summary

Resolves two destroy-time cleanup issues that HyperPodVllmNxdInferenceService exposed during the PR #56 integration test run. Both issues cause cdk destroy to hang for 45+ minutes and eventually fail with orphaned resources that have to be cleaned up by hand.

This PR is stacked on top of #56. The base branch will switch to main automatically once #56 is merged; there is no conflict between the two PRs outside of snapshot regeneration.

Issues addressed

1. `InferenceEndpointConfig` finalizer deadlock

During stack destroy, the EKS system nodegroup is torn down before the HyperPod worker node is drained. The remaining node keeps the node.kubernetes.io/unreachable:NoSchedule taint, and the hyperpod-inference-controller-manager Deployment (provisioned by the amazon-sagemaker-hyperpod-inference addon) has no toleration for that taint, so the controller pod becomes Pending. The inference.sagemaker.aws.InferenceEndpointConfigFinalizer on the InferenceEndpointConfig CR is never released, the Custom::AWSCDK-EKS-KubernetesResource that manages the manifest hangs in DELETE_IN_PROGRESS, and CloudFormation eventually aborts the destroy with the CR still resident on the cluster.

The addon's configurationValues schema does not currently expose tolerations for the controller-manager Deployment (only for alb and keda), so we cannot fix this by keeping the controller scheduled. Instead this PR installs a KubernetesPatch with a destroy-time restorePatch that strips metadata.finalizers on the CR directly. The patch runs through the CDK-managed kubectl provider Lambda (outside the cluster), so it is unaffected by whether the controller can be scheduled.

The KubernetesPatch construct's node depends on the InferenceEndpointConfig manifest, so on destroy CloudFormation deletes the patch resource first, which runs its restorePatch before the manifest itself is deleted. On create/update the ordering is reversed and the patch's applyPatch is a no-op.
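
To illustrate why a merge patch carrying an empty `finalizers` array is enough to unblock deletion, here is a minimal, dependency-free sketch of JSON Merge Patch semantics (RFC 7386) applied to a stuck CR. It is not the code in this PR: `mergePatch`, `stuckResource`, and the resource name are hypothetical, and the real patch is applied by kubectl through the CDK-managed provider.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

function isObject(value: Json | undefined): value is { [key: string]: Json } {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Minimal JSON Merge Patch (RFC 7386): objects merge recursively, null deletes
// a key, and any other value (arrays included) replaces the target wholesale.
function mergePatch(target: Json | undefined, patch: Json): Json {
  if (!isObject(patch)) {
    return patch;
  }
  const result: { [key: string]: Json } = isObject(target) ? { ...target } : {};
  for (const key of Object.keys(patch)) {
    const value = patch[key];
    if (value === undefined) {
      continue;
    }
    if (value === null) {
      delete result[key];
    } else {
      result[key] = mergePatch(result[key], value);
    }
  }
  return result;
}

// Hypothetical CR that is stuck in deletion because its finalizer is never released.
const stuckResource: Json = {
  kind: "InferenceEndpointConfig",
  metadata: {
    name: "example-endpoint", // hypothetical name
    deletionTimestamp: "2026-04-25T03:00:00Z", // hypothetical timestamp
    finalizers: ["inference.sagemaker.aws.InferenceEndpointConfigFinalizer"],
  },
};

// The two payloads described in this PR.
const applyPatch: Json = {};
const restorePatch: Json = { metadata: { finalizers: [] } };

const afterApply = mergePatch(stuckResource, applyPatch);
const afterRestore = mergePatch(stuckResource, restorePatch);

console.log(JSON.stringify(afterApply));
console.log(JSON.stringify(afterRestore));
```

Because a merge patch replaces arrays rather than merging them, the empty array clears every finalizer while leaving the rest of `metadata` untouched. The empty `applyPatch` leaves the resource unchanged.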

2. ALB Controller orphan SG/TG on VPC delete

The AWS Load Balancer Controller (also bundled by the inference addon) creates SecurityGroups and TargetGroups from inside Kubernetes, outside of CloudFormation. If the controller pod is evicted during node drain before it has a chance to process the Ingress deletion event, those SGs/TGs get orphaned and AWS::EC2::VPC fails its delete with DependencyViolation.

This PR adds the same destroy-time tolerations (unreachable and not-ready) to the alb and keda sections of the addon's configurationValues, so that the ALB controller pod can keep processing Ingress deletion events while its nodes drain. This is a mitigation rather than a fix — the orphan path is still possible if the controller terminates unexpectedly — but it removes the primary failure mode observed in PR #56's run.
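
As a rough sketch of the shape this takes, the tolerations can be expressed as plain objects and serialized into the addon's `configurationValues`. Only the `alb.tolerations` and `keda.tolerations` keys come from this PR. The `Toleration` and `Taint` types, the `isTolerated` helper, and the choice of `operator: "Exists"` with no `effect` are assumptions made for illustration.

```typescript
// Subset of the standard Kubernetes toleration shape. Leaving `effect` unset
// makes the toleration match every effect for that key.
interface Toleration {
  key: string;
  operator: "Exists";
  effect?: "NoSchedule" | "NoExecute" | "PreferNoSchedule";
}

interface Taint {
  key: string;
  effect: "NoSchedule" | "NoExecute" | "PreferNoSchedule";
}

// Assumed toleration set for the two destroy-time taints named in this PR.
const destroyTimeTolerations: Toleration[] = [
  { key: "node.kubernetes.io/unreachable", operator: "Exists" },
  { key: "node.kubernetes.io/not-ready", operator: "Exists" },
];

// Serialized payload for the addon; only alb and keda accept tolerations today.
const configurationValues: string = JSON.stringify({
  alb: { tolerations: destroyTimeTolerations },
  keda: { tolerations: destroyTimeTolerations },
});

// Simplified matcher: a taint is tolerated when some toleration has the same
// key and either no effect or the same effect.
function isTolerated(taint: Taint, tolerations: Toleration[]): boolean {
  return tolerations.some(
    (t) => t.key === taint.key && (t.effect === undefined || t.effect === taint.effect),
  );
}

const unreachableTaint: Taint = {
  key: "node.kubernetes.io/unreachable",
  effect: "NoSchedule",
};

console.log(configurationValues);
console.log(isTolerated(unreachableTaint, destroyTimeTolerations));
console.log(isTolerated(unreachableTaint, []));
```

The last call models the controller-manager case from issue 1: with no tolerations configurable, the taint is not tolerated and the pod cannot be scheduled.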

Known limitation & follow-up

The addon's configurationValues schema does not yet let us pass tolerations to the controller-manager itself. Until that field is exposed upstream, the KubernetesPatch in this PR is the only way to unblock a destroy when the controller pod goes Pending. We will file a feature request with the addon team to expose tolerations on the controller-manager Deployment.

Changes

src/hyper-pod/hyper-pod-vllm-nxd-inference-service.ts
- Add alb.tolerations / keda.tolerations to the addon's configurationValues.
- Install a KubernetesPatch with applyPatch: {} and restorePatch: { metadata: { finalizers: [] } }, depending on the InferenceEndpointConfig manifest so that on destroy the patch runs first.

src/hyper-pod/hyper-pod-vllm-nxd-inference-service.test.ts
- Unit test asserting the addon's configurationValues contains tolerations for both alb and keda.
- Unit test asserting the Custom::AWSCDK-EKS-KubernetesPatch resource targets an inferenceendpointconfig/*, uses PatchType: merge, and clears the finalizer list in its restorePatch.

Regenerated test/integ.hyper-pod-cluster.ts.snapshot to reflect the new addon configurationValues payload and the new KubernetesPatch custom resource.

Verification

npx projen build passes (70 unit tests, 2 integ snapshots verify UNCHANGED after regeneration).
The end-to-end integ test was not rerun for this PR. The integ run for #56 (feat: add SageMaker HyperPod inference support) already reproduced the two failure modes, and the fixes target those exact symptoms.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

During stack destroy the EKS system nodegroup drains before the HyperPod worker node does, which leaves every remaining node tainted with `node.kubernetes.io/unreachable:NoSchedule` or `node.kubernetes.io/not-ready:NoSchedule`. The AWS Load Balancer Controller pod bundled by the `amazon-sagemaker-hyperpod-inference` addon has no toleration for those taints, so it gets evicted before it can observe the `Ingress` deletion event. As a result the controller never cleans up the ALB's SecurityGroup and TargetGroup, and those resources are orphaned outside of CloudFormation and block VPC deletion with `DependencyViolation` for the rest of the destroy.

Pass tolerations for the `unreachable` and `not-ready` taints through the addon's `configurationValues` for both the `alb` and `keda` components so those pods can keep processing cleanup events while their nodes drain.

Note: the `hyperpod-inference-controller-manager` Deployment does not yet expose tolerations through the addon's configuration schema. The companion commit (destroy-time `KubernetesPatch` to remove the `InferenceEndpointConfig` finalizer) handles the finalizer deadlock that arises when that pod becomes Pending.

…ia KubernetesPatch

The HyperPod inference controller manages the `inference.sagemaker.aws.InferenceEndpointConfigFinalizer` on the `InferenceEndpointConfig` CR we create. During stack destroy the EKS system nodegroup is torn down before the HyperPod worker node is drained, and the controller-manager Deployment has no toleration for the `node.kubernetes.io/unreachable:NoSchedule` taint left behind on the remaining node. The controller pod becomes `Pending` and never processes the CR deletion, so its finalizer is never released and the `Custom::AWSCDK-EKS-KubernetesResource` managing the manifest hangs in `DELETE_IN_PROGRESS` for 45+ minutes before CloudFormation aborts.

The upstream addon's `configurationValues` schema does not currently expose tolerations for the controller-manager Deployment, so we cannot keep it scheduled during teardown. Instead, install a `KubernetesPatch` with a destroy-time `restorePatch` that strips `metadata.finalizers` directly via the CDK-managed kubectl provider. The patch's `applyPatch` is a no-op (the controller owns the finalizer lifecycle during create/update).

Because the patch's construct depends on the `InferenceEndpointConfig` manifest, CloudFormation orders the patch deletion before the manifest deletion on destroy, so the finalizer is cleared before the CR itself is removed.

Adds two unit tests that assert the new destroy-time cleanup affordances:
- The `amazon-sagemaker-hyperpod-inference` EKS addon renders `tolerations` for the `node.kubernetes.io/unreachable` and `node.kubernetes.io/not-ready` taints on both the `alb` and `keda` sections of its `configurationValues`. Because CFN serializes the configuration values as a tokenized `Fn::Join`, the test reaches into the join parts to assert the substrings that should be present.
- A `Custom::AWSCDK-EKS-KubernetesPatch` resource is installed that targets an `inferenceendpointconfig/*` resource, with `PatchType: merge` and a `RestorePatchJson` that clears the finalizer list (`metadata.finalizers = []`).
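
The `Fn::Join` handling in the first test can be sketched roughly as follows. This is a standalone illustration and not the test code from this PR: `literalParts`, `containsAll`, the `executionRoleArn` key, and the `HypotheticalRole` logical ID are made up to show the idea of asserting on the literal fragments only.

```typescript
// A rendered CloudFormation value is either a plain string or a (simplified) Fn::Join.
type JoinPart = string | { [intrinsic: string]: unknown };
interface FnJoin {
  "Fn::Join": [string, JoinPart[]];
}

// Keep only the literal string parts; unresolved tokens (Ref, Fn::GetAtt, ...) are skipped.
function literalParts(value: string | FnJoin): string[] {
  if (typeof value === "string") {
    return [value];
  }
  const parts = value["Fn::Join"][1];
  return parts.filter((part): part is string => typeof part === "string");
}

// True when every expected substring appears somewhere in the literal text.
function containsAll(value: string | FnJoin, expected: string[]): boolean {
  const text = literalParts(value).join("");
  return expected.every((needle) => text.indexOf(needle) !== -1);
}

// Hypothetical rendered value: a token splits the JSON into literal fragments.
const renderedConfigurationValues: FnJoin = {
  "Fn::Join": [
    "",
    [
      '{"executionRoleArn":"',
      { "Fn::GetAtt": ["HypotheticalRole", "Arn"] },
      '","alb":{"tolerations":[{"key":"node.kubernetes.io/unreachable","operator":"Exists"},' +
        '{"key":"node.kubernetes.io/not-ready","operator":"Exists"}]}}',
    ],
  ],
};

const hasTolerations = containsAll(renderedConfigurationValues, [
  "node.kubernetes.io/unreachable",
  "node.kubernetes.io/not-ready",
]);

console.log(hasTolerations);
```

Asserting on substrings avoids parsing JSON that is only complete once CloudFormation resolves the tokens at deploy time.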

WinterYukky added 3 commits April 25, 2026 03:33

remote-swe[bot] wants to merge 3 commits into feature/hyperpod-inference-1772787819 from feature/hyperpod-destroy-cleanup-1777087841
