fix(hyper-pod): make destroy clean up the InferenceEndpointConfig and ALB reliably#85
Open
remote-swe[bot] wants to merge 3 commits into
Conversation
During stack destroy the EKS system nodegroup drains before the HyperPod worker node does, which leaves every remaining node tainted with `node.kubernetes.io/unreachable:NoSchedule` or `node.kubernetes.io/not-ready:NoSchedule`. The AWS Load Balancer Controller pod bundled by the `amazon-sagemaker-hyperpod-inference` addon has no toleration for those taints, so it gets evicted before it can observe the `Ingress` deletion event. As a result the controller never cleans up the ALB's SecurityGroup and TargetGroup, and those resources are orphaned outside of CloudFormation and block VPC deletion with `DependencyViolation` for the rest of the destroy. Pass tolerations for the `unreachable` and `not-ready` taints through the addon's `configurationValues` for both the `alb` and `keda` components so those pods can keep processing cleanup events while their nodes drain. Note: the `hyperpod-inference-controller-manager` Deployment does not yet expose tolerations through the addon's configuration schema. The companion commit (destroy-time `KubernetesPatch` to remove the `InferenceEndpointConfig` finalizer) handles the finalizer deadlock that arises when that pod becomes Pending.
…ia KubernetesPatch The HyperPod inference controller manages the `inference.sagemaker.aws.InferenceEndpointConfigFinalizer` on the `InferenceEndpointConfig` CR we create. During stack destroy the EKS system nodegroup is torn down before the HyperPod worker node is drained, and the controller-manager Deployment has no toleration for the `node.kubernetes.io/unreachable:NoSchedule` taint left behind on the remaining node. The controller pod becomes `Pending` and never processes the CR deletion, so its finalizer is never released and the `Custom::AWSCDK-EKS-KubernetesResource` managing the manifest hangs in `DELETE_IN_PROGRESS` for 45+ minutes before CloudFormation aborts. The upstream addon's `configurationValues` schema does not currently expose tolerations for the controller-manager Deployment, so we cannot keep it scheduled during teardown. Instead, install a `KubernetesPatch` with a destroy-time `restorePatch` that strips `metadata.finalizers` directly via the CDK-managed kubectl provider. The patch `applyPatch` is a no-op (the controller owns the finalizer lifecycle during create/update). Because the patch's construct depends on the `InferenceEndpointConfig` manifest, CloudFormation orders the patch deletion before the manifest deletion on destroy, so the finalizer is cleared before the CR itself is removed.
Adds two unit tests that assert the new destroy-time cleanup affordances: - The `amazon-sagemaker-hyperpod-inference` EKS addon renders `tolerations` for the `node.kubernetes.io/unreachable` and `node.kubernetes.io/not-ready` taints on both the `alb` and `keda` sections of its `configurationValues`. Because CFN serializes the configuration values as a tokenized `Fn::Join`, the test reaches into the join parts to assert the substrings that should be present. - A `Custom::AWSCDK-EKS-KubernetesPatch` resource is installed that targets an `inferenceendpointconfig/*` resource, with `PatchType: merge` and a `RestorePatchJson` that clears the finalizer list (`metadata.finalizers = []`).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Resolves two destroy-time cleanup issues that
HyperPodVllmNxdInferenceServiceexposed during the PR #56 integration test run. Both issues causecdk destroyto hang for 45+ minutes and eventually fail with orphaned resources that have to be cleaned up by hand.This PR is stacked on top of #56. The base branch will switch to
mainautomatically once #56 is merged; there is no conflict between the two PRs outside of snapshot regeneration.Issues addressed
1.
InferenceEndpointConfigfinalizer deadlockDuring stack destroy, the EKS system nodegroup is torn down before the HyperPod worker node is drained. The remaining node keeps the
node.kubernetes.io/unreachable:NoScheduletaint, and thehyperpod-inference-controller-managerDeployment (provisioned by theamazon-sagemaker-hyperpod-inferenceaddon) has no toleration for that taint, so the controller pod becomesPending. Theinference.sagemaker.aws.InferenceEndpointConfigFinalizeron theInferenceEndpointConfigCR is never released, theCustom::AWSCDK-EKS-KubernetesResourcethat manages the manifest hangs inDELETE_IN_PROGRESS, and CloudFormation eventually aborts the destroy with the CR still resident on the cluster.The addon's
configurationValuesschema does not currently expose tolerations for the controller-manager Deployment (only foralbandkeda), so we cannot fix this by keeping the controller scheduled. Instead this PR installs aKubernetesPatchwith a destroy-timerestorePatchthat stripsmetadata.finalizerson the CR directly. The patch runs through the CDK-managed kubectl provider Lambda (outside the cluster), so it is unaffected by whether the controller can be scheduled.The
KubernetesPatchconstruct's node depends on theInferenceEndpointConfigmanifest, so on destroy CloudFormation runs the patch'srestorePatchbefore deleting the manifest itself. On create/update the ordering is reversed and the patchapplyPatchis a no-op.2. ALB Controller orphan SG/TG on VPC delete
The AWS Load Balancer Controller (also bundled by the inference addon) creates SecurityGroups and TargetGroups from inside Kubernetes, outside of CloudFormation. If the controller pod is evicted during node drain before it has a chance to process the
Ingressdeletion event, those SGs/TGs get orphaned andAWS::EC2::VPCfails its delete withDependencyViolation.This PR adds the same destroy-time tolerations (
unreachableandnot-ready) to thealbandkedasections of the addon'sconfigurationValues, so that the ALB controller pod can keep processingIngressdeletion events while its nodes drain. This is a mitigation rather than a fix — the orphan path is still possible if the controller terminates unexpectedly — but it removes the primary failure mode observed in PR #56's run.Known limitation & follow-up
The addon's
configurationValuesschema does not yet let us pass tolerations to the controller-manager itself. Until that field is exposed upstream, theKubernetesPatchin this PR is the only way to unblock a destroy when the controller pod goesPending. We will file a feature request with the addon team to exposetolerationson the controller-manager Deployment.Changes
src/hyper-pod/hyper-pod-vllm-nxd-inference-service.tsalb.tolerations/keda.tolerationsto the addon'sconfigurationValues.KubernetesPatchwithapplyPatch: {}andrestorePatch: { metadata: { finalizers: [] } }, depending on theInferenceEndpointConfigmanifest so that on destroy the patch runs first.src/hyper-pod/hyper-pod-vllm-nxd-inference-service.test.tsconfigurationValuescontains tolerations for bothalbandkeda.Custom::AWSCDK-EKS-KubernetesPatchresource targets aninferenceendpointconfig/*, usesPatchType: merge, and clears the finalizer list in itsrestorePatch.test/integ.hyper-pod-cluster.ts.snapshotto reflect the new addonconfigurationValuespayload and the newKubernetesPatchcustom resource.Verification
npx projen buildpasses (70 unit tests, 2 integ snapshots verifyUNCHANGEDafter regeneration).By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license