
feat: wire vLLM SageMaker entrypoint to standard-supervisor#6044

Merged
junpuf merged 6 commits into main from mhcs-launcher
May 8, 2026


@Lokiiiiii
Contributor

Summary

Adds standard-supervisor (from model-hosting-container-standards) to the vLLM SageMaker launch chain, aligning the DLC with the vLLM open source SageMaker stage which already uses standard-supervisor as its entrypoint wrapper.

Why this is necessary

  1. Parity with vLLM open source

The upstream vLLM project's sagemaker-entrypoint.sh already launches via standard-supervisor.

  2. Process supervision and auto-recovery

Without standard-supervisor, the container exits immediately if the vLLM process crashes. With it, the process is managed by supervisord, which:

- Automatically restarts the vLLM process on unexpected exits (configurable via PROCESS_AUTO_RECOVERY, default: true)
- Retries up to N times before giving up (configurable via PROCESS_MAX_START_RETRIES, default: 3)
- Provides structured exit codes so SageMaker can distinguish transient failures from permanent ones
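
The restart behavior described above can be sketched roughly as follows. This is only an illustration of the semantics, not how standard-supervisor or supervisord implements them. The function name `supervise` and the `run_once` callable are hypothetical, and the sketch assumes that any non-zero exit counts as unexpected and that "N retries" means N restarts after the initial start.

```python
import os
from typing import Callable, Mapping


def supervise(run_once: Callable[[], int], env: Mapping[str, str] = os.environ) -> int:
    """Run a launcher and restart it on non-zero exit (illustrative only).

    `run_once` is a hypothetical callable that starts the server, blocks
    until it exits, and returns its exit code.
    """
    # Variable names and defaults come from the PR description.
    auto_recovery = env.get("PROCESS_AUTO_RECOVERY", "true").lower() == "true"
    max_retries = int(env.get("PROCESS_MAX_START_RETRIES", "3"))

    exit_code = run_once()  # initial start
    retries = 0
    # Assumption: restart only on non-zero exit, at most max_retries times.
    while exit_code != 0 and auto_recovery and retries < max_retries:
        retries += 1
        exit_code = run_once()
    return exit_code
```

With `PROCESS_AUTO_RECOVERY=false` the loop never runs, which mirrors the pre-change behavior where a crash ends the container.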

  3. Dynamic dependency installation

standard-supervisor (v0.1.15+) automatically installs dependencies from requirements.txt before starting the server. This enables customers to bundle custom dependencies with their model artifacts without needing a custom container.

Controlled by:

- STANDARD_AUTO_INSTALL_REQ (default: true): enables or disables automatic installation
- STANDARD_PIP_ARGS: supplies explicit pip arguments that override the defaults
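
As a rough sketch of how these two variables might interact: the helper `build_install_command` below is hypothetical, and it assumes that STANDARD_PIP_ARGS replaces the default `-r requirements.txt` arguments entirely. The actual precedence rules and requirements.txt lookup path in standard-supervisor may differ.

```python
import shlex
from typing import List, Mapping, Optional


def build_install_command(env: Mapping[str, str], requirements_path: str) -> Optional[List[str]]:
    """Return the pip command to run before server start, or None to skip.

    Hypothetical helper; only the environment variable names come from
    the PR description.
    """
    if env.get("STANDARD_AUTO_INSTALL_REQ", "true").lower() != "true":
        return None  # auto-install disabled
    pip_args = env.get("STANDARD_PIP_ARGS")
    if pip_args:
        # Assumption: explicit arguments fully replace the default ones.
        return ["python3", "-m", "pip", "install", *shlex.split(pip_args)]
    return ["python3", "-m", "pip", "install", "-r", requirements_path]
```

Returning an argument list rather than a shell string avoids quoting problems when the command is later passed to `subprocess.run`.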

Testing

- Local validation
- CI will run: sanity tests, security tests, telemetry tests, upstream vLLM tests (GPU), and a real SageMaker endpoint deployment test

Update sagemaker_entrypoint.sh to use standard-supervisor for process
management and automatic requirements.txt installation.

Changes:
- sagemaker_entrypoint.sh: exec standard-supervisor instead of raw exec
- Dockerfile: bump model-hosting-container-standards pin to >=0.1.15

Signed-off-by: Loki Ravi <lokravi@amazon.com>
Lokiiiiii added 3 commits May 5, 2026 23:39
The entrypoint now uses 'exec standard-supervisor python3 ...' instead of
'exec python3 ...'. Update the dry-run regex in TestEntrypointArgHandling
to match both formats.

Signed-off-by: Loki Ravi <lokravi@amazon.com>
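
A regex that accepts both entrypoint formats could look roughly like the sketch below. The pattern and the helper `is_entrypoint_exec` are hypothetical and are not the actual expression used in TestEntrypointArgHandling. The module name and arguments in the usage example are placeholders, since the commit message elides the real command line.

```python
import re

# Hypothetical pattern: "standard-supervisor " is optional between
# "exec" and "python3", so both the old and new formats match.
ENTRYPOINT_EXEC = re.compile(r"^exec (?:standard-supervisor )?python3\b")


def is_entrypoint_exec(line: str) -> bool:
    """Return True if a dry-run output line is the final exec of python3."""
    return ENTRYPOINT_EXEC.search(line.strip()) is not None
```

For example, `is_entrypoint_exec("exec standard-supervisor python3 -m placeholder.module")` and `is_entrypoint_exec("exec python3 -m placeholder.module")` both return `True`, while a line that execs a different program does not match.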
log4j-core 2.17.1 is bundled in ray_dist.jar and cannot be patched
without a Ray upgrade. Extend review_by to 2026-06-04.

Signed-off-by: Loki Ravi <lokravi@amazon.com>
rustls-webpki 0.103.12 is statically linked in /usr/local/bin/uv.
Fix requires rustls-webpki>=0.104.0-alpha.7 (pre-release, not yet
available in stable uv). Not exploitable in our usage context.

Signed-off-by: Loki Ravi <lokravi@amazon.com>
Comment thread docker/vllm/Dockerfile
Lokiiiiii added 2 commits May 6, 2026 19:06
…ckerfile

Align the AL2023 vLLM image with the Ubuntu image by requiring
model-hosting-container-standards>=0.1.15 for standard-supervisor
dependency auto-install support.

Signed-off-by: Loki Ravi <lokravi@amazon.com>
Same rustls-webpki vulnerability in uv binary, same justification as
the vllm (Ubuntu) allowlist entry.

Signed-off-by: Loki Ravi <lokravi@amazon.com>
@Lokiiiiii Lokiiiiii requested a review from junpuf May 6, 2026 23:18
@junpuf junpuf enabled auto-merge (squash) May 7, 2026 20:20
@junpuf junpuf merged commit 5b354ff into main May 8, 2026
313 of 329 checks passed
@junpuf junpuf deleted the mhcs-launcher branch May 8, 2026 00:31