feat: add vLLM-Omni EC2 and SageMaker DLC images #5868

Merged
Yadan-Wei merged 65 commits into main from omni on Apr 7, 2026

Conversation


Yadan-Wei (Contributor) commented on Apr 2, 2026

vLLM-Omni DLC: EC2 and SageMaker Deep Learning Containers for Omni-Modality Models

Summary

Adds vLLM-Omni DLC images for EC2 and SageMaker on Amazon Linux 2023, enabling
serving of omni-modality models (TTS, image generation, video generation,
multimodal chat) via vllm-omni==0.18.0.

What's Included

Dockerfile (docker/vllm/Dockerfile.amzn2023)

  • New stages: omni-deps, builder-oss-omni, omni-base, vllm-omni-ec2-amzn2023,
    vllm-omni-sagemaker-amzn2023
  • Installs vllm-omni==0.18.0 via pip on top of vLLM runtime
  • SPAL system deps: espeak-ng, sox, ffmpeg-free
  • Pre-built runtime base support (RUNTIME_BASE arg) to skip vLLM compile in PR
    builds (~120min → ~2min)
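
The staged layout might look roughly like this. The stage and ARG names come from the summary above; the instruction bodies are illustrative assumptions, not the actual diff:

```dockerfile
# Illustrative sketch only — the real Dockerfile.amzn2023 differs.
ARG RUNTIME_BASE
FROM ${RUNTIME_BASE} AS omni-base

# SPAL system deps (SPAL requires system-release 2023.9+)
RUN dnf upgrade -y system-release && \
    dnf install -y espeak-ng sox ffmpeg-free

# vllm-omni is a pure-Python layer on top of the vLLM runtime
RUN pip install vllm-omni==0.18.0

FROM omni-base AS vllm-omni-ec2-amzn2023
COPY omni_dockerd_entrypoint.sh /usr/local/bin/
ENTRYPOINT ["/usr/local/bin/omni_dockerd_entrypoint.sh"]
```

Passing a cached image via `--build-arg RUNTIME_BASE=...` is what lets PR builds skip the vLLM compile.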

Entrypoints

  • omni_dockerd_entrypoint.sh — EC2: exec vllm serve --omni "$@"
  • omni_sagemaker_entrypoint.sh — SageMaker: parses SM_VLLM_* env vars, adds
    --middleware for routing
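
As a sketch of the SageMaker entrypoint's env-var handling — the `SM_VLLM_*` → flag mapping convention shown here is an assumption, not necessarily what the actual script does:

```shell
# Hypothetical sketch: turn SM_VLLM_* env vars into vllm CLI flags,
# e.g. SM_VLLM_MAX_MODEL_LEN=4096 -> --max-model-len 4096.
sm_vllm_to_flags() {
  env | grep '^SM_VLLM_' | while IFS='=' read -r key val; do
    # Strip the prefix, lowercase, and turn underscores into dashes.
    flag=$(printf '%s' "${key#SM_VLLM_}" | tr 'A-Z_' 'a-z-')
    printf -- '--%s %s\n' "$flag" "$val"
  done
}

# The real entrypoint would end with something along the lines of:
# exec vllm serve --omni $(sm_vllm_to_flags) --middleware ... "$@"
```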

SageMaker Routing Middleware (omni_sagemaker_serve.py)

  • ASGI middleware injected via --middleware flag (single process, no proxy)
  • Routes /invocations to the correct vllm-omni endpoint based on
    CustomAttributes: route=
  • Falls through to vLLM's built-in /invocations handler for chat/completion/
    embed
  • Supports: /v1/audio/speech, /v1/images/generations, /v1/videos,
    /v1/chat/completions
  • 12 unit tests covering routing, fallthrough, adapter coexistence
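
A minimal sketch of the routing idea (not the actual omni_sagemaker_serve.py): a single-process ASGI middleware that rewrites /invocations to the endpoint named in CustomAttributes, which SageMaker forwards to the container as the X-Amzn-SageMaker-Custom-Attributes header. The parsing details here are assumptions:

```python
class OmniRoutingMiddleware:
    """Rewrite /invocations based on a `route=` custom attribute."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope.get("type") == "http" and scope.get("path") == "/invocations":
            attrs = dict(scope.get("headers", [])).get(
                b"x-amzn-sagemaker-custom-attributes", b""
            ).decode()
            for part in attrs.split(";"):
                part = part.strip()
                if part.startswith("route="):
                    # Rewrite the path so vllm-omni's handler serves it.
                    # Without a route= attribute we fall through untouched
                    # to vLLM's built-in /invocations handler.
                    scope = dict(scope, path=part[len("route="):])
                    break
        await self.app(scope, receive, send)
```

Because the rewrite happens inside the same ASGI app, no proxy process or extra port is needed.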

CI Configs & Workflows

  • vllm-omni-ec2-amzn2023.yml, vllm-omni-sagemaker-amzn2023.yml — framework
    configs
  • pr-vllm-omni-ec2-amzn2023.yml, pr-vllm-omni-sagemaker-amzn2023.yml — PR
    workflows with build-runtime caching
  • reusable-vllm-omni-model-tests.yml — generic smoke test workflow supporting
    JSON and multipart/form-data
  • vllm-omni-model-tests.yml — per-model test config with route, request
    payload, and validation

Smoke Tests (4 models, 4 routes)

Model                        Route                   Runner
Qwen3-TTS-1.7B-CustomVoice   /v1/audio/speech        g6xl
FLUX.2-klein-4B              /v1/images/generations  g6xl
Wan2.1-T2V-1.3B              /v1/videos              g6e4xl
Qwen2.5-Omni-3B              /v1/chat/completions    g6e12xl
  • EC2 tests use real entrypoint, hit OpenAI-compatible API directly
  • SageMaker tests use real entrypoint, hit /invocations with middleware
    routing
  • Container log dump (500 lines) on failure
  • Orphaned endpoint cleanup step

SageMaker Endpoint Tests

  • Sync endpoint test with retry for torch.compile warmup (60s SageMaker
    timeout)
  • Async endpoint test using AsyncInferenceConfig (bypasses 60s limit)
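
A hedged sketch of how a sync endpoint test might carry the route. boto3's sagemaker-runtime `invoke_endpoint` does accept a `CustomAttributes` parameter; the helper, endpoint name, and payload below are illustrative:

```python
import json

def build_invocation(endpoint_name, route, payload):
    """Assemble kwargs for sagemaker-runtime invoke_endpoint();
    CustomAttributes carries the route the middleware dispatches on."""
    return {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",
        "CustomAttributes": f"route={route}",
        "Body": json.dumps(payload),
    }

# Usage (names are placeholders):
#   runtime = boto3.client("sagemaker-runtime")
#   resp = runtime.invoke_endpoint(**build_invocation(
#       "omni-tts-endpoint", "/v1/audio/speech",
#       {"model": "qwen3-tts", "input": "hello"}))
```

The async variant would instead call `invoke_endpoint_async` with an `InputLocation` in S3, which is what sidesteps the 60s sync timeout during torch.compile warmup.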

S3 Model Cache

  • Models pre-cached in s3://dlc-cicd-models/omni-models/ as tar.gz
  • Uses download-model action with ETag-based caching and flock-based
    concurrency
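
The flock pattern can be sketched as below; the lock/marker layout and the s3 path inside the comment are illustrative, not the download-model action's real internals:

```shell
# Sketch: serialize a cached download with flock so concurrent CI jobs
# on the same runner don't re-download (and re-extract) the same tarball.
cached_fetch() {
  lock="$1" marker="$2"
  (
    flock -x 9   # block until we hold the exclusive lock on fd 9
    if [ ! -f "$marker" ]; then
      # Real action would be roughly:
      #   aws s3 cp s3://dlc-cicd-models/omni-models/<model>.tar.gz - | tar xz
      touch "$marker"
      echo downloaded
    else
      echo cache-hit
    fi
  ) 9>"$lock"
}
```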

Key Design Decisions

  • Middleware over proxy: Single-process ASGI middleware (--middleware flag)
    instead of a separate proxy process. Reuses vLLM's existing /invocations,
    /ping, /health handlers.
  • Per-model test config: Route, request payload, content type, and
    validation defined in YAML. Adding a new model = adding a config entry, no
    code changes.
  • Pre-built runtime base: build-runtime job checks ECR for cached runtime
    image, builds only on first run per vLLM version. Subsequent PR builds skip
    the ~120min compile.
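
Under the per-model scheme, adding a model to vllm-omni-model-tests.yml might look something like this — the field names here are assumptions, not the schema actually defined by the PR:

```yaml
# Hypothetical entry; the real config's field names may differ.
- model: Qwen3-TTS-1.7B-CustomVoice
  route: /v1/audio/speech
  runner: g6xl
  content_type: application/json
  request:
    model: Qwen3-TTS-1.7B-CustomVoice
    input: "Hello from the smoke test."
  validate:
    min_bytes: 1000   # non-trivial WAV output
```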

Testing

  • Middleware unit tests (12 tests)
  • EC2 smoke tests (4 models)
  • SageMaker smoke tests (4 models)
  • SageMaker endpoint test (sync + async)
  • Pre-commit checks
  • Middleware verified on upstream vllm/vllm-omni:v0.18.0 Docker Hub image

Toggle if you are merging into master Branch

By default, docker image builds and tests are disabled. Two ways to run builds and tests:

  1. Using dlc_developer_config.toml
  2. Using this PR description (currently only supported for PyTorch, TensorFlow, vllm, and base images)
How to use the helper utility for updating dlc_developer_config.toml

Assuming your remote is called origin (you can find out more with git remote -v)...

  • Run default builds and tests for a particular buildspec - also commits and pushes changes to remote; Example:

python src/prepare_dlc_dev_environment.py -b </path/to/buildspec.yml> -cp origin

  • Enable specific tests for a buildspec or set of buildspecs - also commits and pushes changes to remote; Example:

python src/prepare_dlc_dev_environment.py -b </path/to/buildspec.yml> -t sanity_tests -cp origin

  • Restore TOML file when ready to merge

python src/prepare_dlc_dev_environment.py -rcp origin

NOTE: If you are creating a PR for a new framework version, please ensure success of the local, standard, rc, and efa sagemaker tests by updating the dlc_developer_config.toml file:

  • sagemaker_remote_tests = true
  • sagemaker_efa_tests = true
  • sagemaker_rc_tests = true
  • sagemaker_local_tests = true
How to use PR description

Use the code block below to uncomment commands and run the PR CodeBuild jobs. There are two commands available:
  • # /buildspec <buildspec_path>
    • e.g.: # /buildspec pytorch/training/buildspec.yml
    • If this line is commented out, dlc_developer_config.toml will be used.
  • # /tests <test_list>
    • e.g.: # /tests sanity security ec2
    • If this line is commented out, it will run the default set of tests (same as the defaults in dlc_developer_config.toml): sanity, security, ec2, ecs, eks, sagemaker, sagemaker-local.
# /buildspec <buildspec_path>
# /tests <test_list>
Toggle if you are merging into main Branch

PR Checklist

  • [ ] I ran pre-commit run --all-files locally before creating this PR. (Read DEVELOPMENT.md for details.)

- Add omni-deps, builder-oss-omni, omni-base, ec2, sagemaker stages to Dockerfile.amzn2023
- Install vllm-omni as pure Python layer on top of vLLM runtime
- Add omni entrypoints (vllm serve --omni) for EC2 and SageMaker
- Add PR workflows for both EC2 and SageMaker omni images
- Add reusable model smoke tests (Qwen3-TTS, FLUX.2-klein-4B)
- Add SageMaker endpoint integration test with Qwen3-TTS
- System deps: espeak-ng, ffmpeg, sox, libsox-fmt-all for audio/TTS
- OSS compliance runs against omni venv separately

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
The aws-deep-learning-containers-ci bot added the authorized and Size:XL (Determines the size of the PR) labels on Apr 2, 2026
Yadan Wei and others added 27 commits April 2, 2026 08:24
- espeak (not espeak-ng) available in AL2023 repos
- sox available in AL2023 repos
- ffmpeg installed from static build (not in AL2023 repos)
- Removed libsox-fmt-all (not available on AL2023)

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
- espeak/sox not available in AL2023 minimal CUDA runtime image
- sox binary only needed for Qwen3-TTS 25Hz tokenizer (not 12Hz)
- ffmpeg needed by pydub/imageio-ffmpeg for audio/video I/O
- Removed dnf install for unavailable packages

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
- Upgrade system-release to latest to enable SPAL (requires 2023.9+)
- Install espeak-ng, sox, ffmpeg-free from SPAL (Supplementary Packages for Amazon Linux)
- Replaces static binary approach with official AL2023 package repo

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
- Add test/vllm-omni/sagemaker/requirements.txt with sagemaker>=2,<3
- Install test deps via uv pip matching reusable-vllm-sagemaker-tests pattern
- Run pytest from test/ directory with relative path

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
- Add --stage-init-timeout 600 to server start (TTS models need multi-stage init)
- Add stage_init_timeout=600 to offline Omni() calls
- Increase server wait loop from 120s to 300s

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
- Use existing download-model GitHub action with caching, locking, eviction
- Downloads to /dlc-models/ (root fs) instead of /tmp
- Proper cleanup of lock PIDs and docker images

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
- Pin gradio>=6.7.0 in omni-base CVE patch layer

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
- TTS models use OpenAI-compatible speech endpoint, not chat completions
- Validate output WAV file size instead of JSON response

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
- Both models are public (Apache 2.0, no gating)
- Eliminates S3 download/extract issues (corrupted tarballs, disk space)
- Models downloaded from HF at runtime inside container
- Removed s3_prefix and s3_model from config

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
- Parse response JSON, extract and decode base64 image
- Print only image size instead of full base64 payload
- Validate decoded image is non-trivial (>1000 bytes)

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…16GB T4)

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
- Reusable workflow uses customer-type input (ec2 or sagemaker)
- Maps to vllm_omni_{customer-type}_smoke_test.sh
- No extra test-type parameter needed

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
* fix telemetry ingress rules

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* add test

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* temp test

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* revert workflow

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

---------

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>
* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 254
X-AI-Prompt: I have uploaded test resources training/ and inference/ in s3://dlc-cicd-models/xgboost/container_test_resources/, I need you to create container_tests/ and add the following tests in xgboost test dir - The tests need a helper that replaces ai_algorithms_container_tests using docker-py directly:

test/xgboost/container/
├── conftest.py              # pytest fixtures: --image flag, S3 download, docker client
├── container_helper.py      # replaces ai_algorithms_container_tests
├── test_training.py         # rewritten training tests
├── test_scoring.py          # rewritten inference tests
└── test_batch_transform.py  # rewritten batch transform tests

The container_helper.py needs to:
- Download test resources from S3 to a temp dir (once per session)
- Create /opt/ml/ directory structure in temp dirs
- Write config JSON files (hyperparameters, inputdataconfig, resourceconfig)
- Mount volumes and run the container via docker-py
- For training: wait for exit, return exit code + logs + model files
- For inference: start container, wait for health check, send HTTP requests, you can refer to https://code.amazon.com/packages/SMFrameworksXGBoost3_0-5Tests/trees/mainline/--/src/container_tests

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 135
X-AI-Prompt: Add this in release workflow, comment benchmark tests for now, add on push trigger, create parallel test execution for each test case in wf and prepare cr

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 143
X-AI-Prompt: create a new workflow for xgboost benchmarking, container and integration tests and use that workflow in release wrkflow

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 101
X-AI-Prompt: change the name to - sagemaker-xgboost-integ-tests.yml and remove the integ tests steps it is a todo, comment benchmark tests as i need to test container tests now.

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 25
X-AI-Prompt: change on push current branch

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 13
X-AI-Prompt: remove main this wf will never be pr triggered it is manually triggered

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 41
X-AI-Prompt: yeah lets do with option b

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 22
X-AI-Prompt: E
E         Invoking script with the following command:
E
E         /miniconda3/bin/python3 -m sagemaker_xgboost_container.training:main --alpha 0.0 --base_score 0.5 --booster gbtree --colsample_bylevel 1 --colsample_bytree 1.0 --csv_weights 1 --dsplit row --early_stopping_rounds 5 --eta 0.3 --eval_metric error --gamma 0.0 --grow_policy depthwise --lambda 1.0 --lambda_bias 0.0 --max_bin 256 --max_delta_step 0 --max_depth 6 --max_leaves 0 --min_child_weight 1.0 --normalize_type tree --nthread 8 --num_round 10 --objective binary:logistic --one_drop 0 --predictor cpu_predictor --process_type default --rate_drop 0.0 --refresh_leaf 1 --sample_type uniform --scale_pos_weight 1.0 --silent 0 --sketch_eps 0.03 --skip_drop 0.0 --subsample 1.0 --tree_method auto --tweedie_variance_power 1.5 --updater grow_colmaker,prune
E
E
E         /miniconda3/bin/python3: No module named sagemaker_xgboost_container.training:main
E         [2026-03-31:21:26:07:ERROR] ExecuteUserScriptError:
E         Command "/miniconda3/bin/python3 -m sagemaker_xgboost_container.training:main --alpha 0.0 --base_score 0.5 --booster gbtree --colsample_bylevel 1 --colsample_bytree 1.0 --csv_weights 1 --dsplit row --early_stopping_rounds 5 --eta 0.3 --eval_metric error --gamma 0.0 --grow_policy depthwise --lambda 1.0 --lambda_bias 0.0 --max_bin 256 --max_delta_step 0 --max_depth 6 --max_leaves 0 --min_child_weight 1.0 --normalize_type tree --nthread 8 --num_round 10 --objective binary:logistic --one_drop 0 --predictor cpu_predictor --process_type default --rate_drop 0.0 --refresh_leaf 1 --sample_type uniform --scale_pos_weight 1.0 --silent 0 --sketch_eps 0.03 --skip_drop 0.0 --subsample 1.0 --tree_method auto --tweedie_variance_power 1.5 --updater grow_colmaker,prune"
E
E       assert 1 == 0

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 50
X-AI-Prompt: scan for red flags

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 47
X-AI-Prompt: can we regrenate the model durng test time and upload back to s3?

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 38
X-AI-Prompt: RuntimeError: Model /opt/ml/model/mnist-pkl-model cannot be loaded:
Pickle load error=[21:37:57] /workspace/src/learner.cc:1185: Check failed: header == serialisation_header_: If you are loading a serialized model (like pickle in Python, RDS in R) or
configuration generated by an older version of XGBoost, please export the model by calling
`Booster.save_model` from that version first, then load it back in current version. See:

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 30
X-AI-Prompt: During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/miniconda3/lib/python3.10/site-packages/sagemaker_containers/_trainer.py", line 84, in train
entrypoint()
File "/miniconda3/lib/python3.10/site-packages/sagemaker_xgboost_container/training.py", line 102, in main
train(framework.training_env())
File "/miniconda3/lib/python3.10/site-packages/sagemaker_xgboost_container/training.py", line 98, in train
run_algorithm_mode()
File "/miniconda3/lib/python3.10/site-packages/sagemaker_xgboost_container/training.py", line 64, in run_algorithm_mode
sagemaker_train(
File "/miniconda3/lib/python3.10/site-packages/sagemaker_xgboost_container/algorithm_mode/train.py", line 144, in sagemaker_train
validated_train_config = hyperparameters.validate(train_config)
File "/miniconda3/lib/python3.10/site-packages/sagemaker_algorithm_toolkit/hyperparameter_validation.py", line 278, in validate
raise exc.UserError("Extraneous hyperparameter found: {}".format(hp))
sagemaker_algorithm_toolkit.exceptions.UserError: Extraneous hyperparameter found: silent

Extraneous hyperparameter found: silent

assert 1 == 0
FAILED xgboost/container/test_training.py::TestValidTraining::test_checkpoint_and_reload - assert 1 == 0

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 21
X-AI-Prompt: The fix is a one-liner in ServingContainer.__enter__. The XGBoost serving entrypoint (sagemaker_xgboost_container.serving) reads
/opt/ml/input/config/resourceconfig.json on startup. Without it, the Python app fails to initialize, gunicorn workers exit with code 3, and you
get the HaltServer 'Worker failed to boot.' error.

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 14
X-AI-Prompt: ### 2. container_helper.py — tmpdir not cleaned up in __exit__

Both run_training and ServingContainer create temp dirs but never clean them up. The training function at least returns paths so the caller
could clean up, but ServingContainer stores self._opt_ml and never removes it.

Fix: Add cleanup in __exit__:

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 19
X-AI-Prompt: test_training.py — test_checkpoint_and_reload has inline import json

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 31
X-AI-Prompt: test_training.py — test_checkpoint_and_reload phase 2 container not cleaned up on timeout

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 33
X-AI-Prompt: container-test-training installs docker pytest boto3 but not requests. The training tests import run_training from container_helper, which
imports requests at module level. This will fail at import time.

* Human changes made during kiro-cli session after prompt completion.
---
X-AI-Tool: Human
X-AI-Prompt: tests are still failing with same reason

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 35
X-AI-Prompt: scan for red flags

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 84
X-AI-Prompt: RuntimeError: Model /opt/ml/model/mnist-pkl-model cannot be loaded:
Pickle load error=[23:48:50] /workspace/src/learner.cc:1185: Check failed: header == serialisation_header_: If you are loading a serialized model (like pickle in Python, RDS in R) or
configuration generated by an older version of XGBoost, please export the model by calling
`Booster.save_model` from that version first, then load it back in current version. See:

* Human changes made during kiro-cli session after prompt completion.
---
X-AI-Tool: Human
X-AI-Prompt: RuntimeError: Model /opt/ml/model/mnist-pkl-model cannot be loaded:
Pickle load error=[23:48:50] /workspace/src/learner.cc:1185: Check failed: header == serialisation_header_: If you are loading a serialized model (like pickle in Python, RDS in R) or
configuration generated by an older version of XGBoost, please export the model by calling
`Booster.save_model` from that version first, then load it back in current version. See:

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 32
X-AI-Prompt:
============================= test session starts ==============================
platform linux -- Python 3.12.12, pytest-9.0.2, pluggy-1.6.0 -- /tmp/codebuild-b0ba6d93-4eb5-444e-b8c3-bebc7c5b99fa/output/src3763/src/eeeffba7_95a5_4ce7_9fdc_ed0e3f9ffdaa/actions-runner/_work/deep-learning-containers/deep-learning-containers/.venv/bin/python3
cachedir: .pytest_cache
rootdir: /tmp/codebuild-b0ba6d93-4eb5-444e-b8c3-bebc7c5b99fa/output/src3763/src/eeeffba7_95a5_4ce7_9fdc_ed0e3f9ffdaa/actions-runner/_work/deep-learning-containers/deep-learning-containers
configfile: pyproject.toml
collecting ... collected 3 items
xgboost/container/test_batch_transform.py::TestBatchTransform::test_libsvm_batch FAILED
xgboost/container/test_batch_transform.py::TestBatchTransform::test_recordio_protobuf_batch PASSED
xgboost/container/test_batch_transform.py::TestBatchTransform::test_csv_batch PASSED
=================================== FAILURES ===================================
_____________________ TestBatchTransform.test_libsvm_batch _____________________
self = <container.test_batch_transform.TestBatchTransform object at 0x7fd663720d40>
docker_client = <docker.client.DockerClient object at 0x7fd6638eec60>
image_uri = '404426647817.dkr.ecr.us-west-2.amazonaws.com/ci:xgboost-3.0.5-cpu-py310-cu126-ubuntu20.04-sagemaker-23864956268'
inference_resources = '/tmp/xgb-container-test-o7vvveha/inference'
def test_libsvm_batch(self, docker_client, image_uri, inference_resources):
responses = _send_batch_requests(
docker_client, image_uri, inference_resources, "mnist-xgb-model", "text/x-libsvm",
["mnist-1.libsvm", "mnist-less-dim-1.libsvm",
"mnist-plus-onedim-1.libsvm", "mnist-700.libsvm"],
)
_validate_batch_response(responses[0], 1)
_validate_batch_response(responses[1], 1)
>       _validate_batch_response(responses[2], 1)
xgboost/container/test_batch_transform.py:72:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
resp = <Response [400]>, expected_length = 1
def _validate_batch_response(resp, expected_length):
"""Batch responses are newline-delimited; trailing newline adds +1."""
>       assert resp.status_code == httplib.OK, resp.text
E       AssertionError: Unable to evaluate payload provided: [18:45:55] /workspace/src/learner.cc:1483: Check failed: learner_model_param_.num_feature >= p_fmat->Info().num_col_ (785 vs. 786) : Number of columns does not match number of features in booster.
E         Stack trace:
E           [bt] (0) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x2a6e7c) [0x7fc72964de7c]
E           [bt] (1) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x6777a9) [0x7fc729a1e7a9]
E           [bt] (2) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x68d962) [0x7fc729a34962]
E           [bt] (3) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(XGBoosterPredictFromDMatrix+0x2de) [0x7fc72956196e]
E           [bt] (4) /miniconda3/lib/python3.10/lib-dynload/../../libffi.so.8(+0x702a) [0x7fc74a42302a]
E           [bt] (5) /miniconda3/lib/python3.10/lib-dynload/../../libffi.so.8(+0x64a9) [0x7fc74a4224a9]
E           [bt] (6) /miniconda3/lib/python3.10/lib-dynload/../../libffi.so.8(ffi_call+0xdd) [0x7fc74a422bbd]
E           [bt] (7) /miniconda3/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so(+0x8c7b) [0x7fc74a430c7b]
E           [bt] (8) /miniconda3/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so(+0x8565) [0x7fc74a430565]
E
E
E       assert 400 == <HTTPStatus.OK: 200>
E        +  where 400 = <Response [400]>.status_code
E        +  and   <HTTPStatus.OK: 200> = httplib.OK
xgboost/container/test_batch_transform.py:53: AssertionError
==================================== PASSES ====================================
=========================== short test summary info ============================
PASSED xgboost/container/test_batch_transform.py::TestBatchTransform::test_recordio_protobuf_batch
PASSED xgboost/container/test_batch_transform.py::TestBatchTransform::test_csv_batch
FAILED xgboost/container/test_batch_transform.py::TestBatchTransform::test_libsvm_batch - AssertionError: Unable to evaluate payload provided: [18:45:55] /workspace/src/learner.cc:1483: Check failed: learner_model_param_.num_feature >= p_fmat->Info().num_col_ (785 vs. 786) : Number of columns does not match number of features in booster.
Stack trace:
[bt] (0) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x2a6e7c) [0x7fc72964de7c]
[bt] (1) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x6777a9) [0x7fc729a1e7a9]
[bt] (2) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x68d962) [0x7fc729a34962]
[bt] (3) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(XGBoosterPredictFromDMatrix+0x2de) [0x7fc72956196e]
[bt] (4) /miniconda3/lib/python3.10/lib-dynload/../../libffi.so.8(+0x702a) [0x7fc74a42302a]
[bt] (5) /miniconda3/lib/python3.10/lib-dynload/../../libffi.so.8(+0x64a9) [0x7fc74a4224a9]
[bt] (6) /miniconda3/lib/python3.10/lib-dynload/../../libffi.so.8(ffi_call+0xdd) [0x7fc74a422bbd]
[bt] (7) /miniconda3/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so(+0x8c7b) [0x7fc74a430c7b]
[bt] (8) /miniconda3/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so(+0x8565) [0x7fc74a430565]

assert 400 == <HTTPStatus.OK: 200>
+  where 400 = <Response [400]>.status_code
+  and   <HTTPStatus.OK: 200> = httplib.OK
========================= 1 failed, 2 passed in 37.90s =========================
Error: Process completed with exit code 1.
how is the test passing? we must need to know what the logs are?

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 29
X-AI-Prompt: same here, xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_weights PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_hpo_param PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_multiclass_hpo PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_iterate_objectives FAILED
xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_threshold_eval_metric PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_verbosity PASSED
xgboost/container/test_training.py::TestValidTraining::test_multi_files_libsvm PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_csv PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_csv_weights PASSED
xgboost/container/test_training.py::TestValidTraining::test_multi_file_csv PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_csv_space_separated PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_csv_sci_notation PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_csv_empty_cells PASSED
xgboost/container/test_training.py::TestValidTraining::test_checkpoint_and_reload FAILED
xgboost/container/test_training.py::TestInvalidTraining::test_no_training_data PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_no_validation_data PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_data_csv_content_type PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_csv_alpha_with_csv_content_type PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_csv_data_with_libsvm_content_type PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_data_with_libsvm_content_type PASSED

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 106
X-AI-Prompt: Run source .venv/bin/activate
============================= test session starts ==============================
platform linux -- Python 3.12.12, pytest-9.0.2, pluggy-1.6.0 -- /tmp/codebuild-8acc520a-64b1-45e6-8ddc-2078a24507b5/output/src787/src/b09928cc_a4a3_4b96_9bee_901575f815e0/actions-runner/_work/deep-learning-containers/deep-learning-containers/.venv/bin/python3
cachedir: .pytest_cache
rootdir: /tmp/codebuild-8acc520a-64b1-45e6-8ddc-2078a24507b5/output/src787/src/b09928cc_a4a3_4b96_9bee_901575f815e0/actions-runner/_work/deep-learning-containers/deep-learning-containers
configfile: pyproject.toml
collecting ... collected 45 items

xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_weights PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_hpo_param PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_multiclass_hpo PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_iterate_objectives FAILED
xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_threshold_eval_metric PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_verbosity PASSED
xgboost/container/test_training.py::TestValidTraining::test_multi_files_libsvm PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_csv PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_csv_weights PASSED
xgboost/container/test_training.py::TestValidTraining::test_multi_file_csv PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_csv_space_separated PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_csv_sci_notation PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_csv_empty_cells PASSED
xgboost/container/test_training.py::TestValidTraining::test_checkpoint_and_reload FAILED
xgboost/container/test_training.py::TestInvalidTraining::test_no_training_data PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_no_validation_data PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_data_csv_content_type PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_csv_alpha_with_csv_content_type PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_csv_data_with_libsvm_content_type PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_data_with_libsvm_content_type PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[eta-values0] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[gamma-values1] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[max_depth-values2] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[min_child_weight-values3] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[max_delta_step-values4] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[colsample_bytree-values5] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[colsample_bylevel-values6] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[tree_method-values7] FAILED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[sketch_eps-values8] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[refresh_leaf-values9] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[process_type-values10] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[grow_policy-values11] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[sample_type-values12] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[normalize_type-values13] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[rate_drop-values14] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[one_drop-values15] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[skip_drop-values16] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[tweedie_variance_power-values17] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[eval_metric-values18] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[booster-values19] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[verbosity-values20] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_missing_num_round PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_multiclass_without_num_class PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_pipe_mode_rejected PASSED

=================================== FAILURES ===================================
_________ TestValidTraining.test_single_file_libsvm_iterate_objectives _________

self = <container.test_training.TestValidTraining object at 0x7f11f6c34ce0>
docker_client = <docker.client.DockerClient object at 0x7f11f6f7d0d0>
image_uri = '404426647817.dkr.ecr.us-west-2.amazonaws.com/ci:xgboost-3.0.5-cpu-py310-cu126-ubuntu20.04-sagemaker-23865911659'
training_resources = '/tmp/xgb-container-test-ptswvydm/training'

    def test_single_file_libsvm_iterate_objectives(self, docker_client, image_uri, training_resources):
        hp = copy.deepcopy(STD_HP)
        d = _libsvm_dir(training_resources)
        for obj in ["reg:squarederror", "binary:logistic", "count:poisson",
                    "reg:gamma", "reg:tweedie"]:
            hp["objective"] = obj
            result = _run(docker_client, image_uri, training_resources, hp, STD_IDC, STD_RC,
                          [os.path.join(d, "agaricus.libsvm.train")],
                          [os.path.join(d, "agaricus.libsvm.test")])
>           _assert_success(result)

xgboost/container/test_training.py:170:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

result = (1, '/miniconda3/lib/python3.10/site-packages/sagemaker_containers/_server.py:22: UserWarning: pkg_resources is deprec... '/tmp/xgb-train-bkhw5xxo/input/data/train', 'input_validation': '/tmp/xgb-train-bkhw5xxo/input/data/validation', ...})
regex = None

    def _assert_success(result, regex=None):
        exit_code, logs, model_files, _ = result
>       assert exit_code == 0, f"Training failed:\n{logs}"
E       AssertionError: Training failed:
E         /miniconda3/lib/python3.10/site-packages/sagemaker_containers/_server.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
E           import pkg_resources
E         [2026-04-01:19:09:22:INFO] Imported framework sagemaker_xgboost_container.training
E         [2026-04-01:19:09:22:INFO] Failed to parse hyperparameter eval_metric value error to Json.
E         Returning the value itself
E         [2026-04-01:19:09:22:INFO] Failed to parse hyperparameter predictor value cpu_predictor to Json.
E         Returning the value itself
E         [2026-04-01:19:09:22:INFO] Failed to parse hyperparameter tree_method value auto to Json.
E         Returning the value itself
E         [2026-04-01:19:09:22:INFO] Failed to parse hyperparameter normalize_type value tree to Json.
E         Returning the value itself
E         [2026-04-01:19:09:22:INFO] Failed to parse hyperparameter sample_type value uniform to Json.
E         Returning the value itself
E         [2026-04-01:19:09:22:INFO] Failed to parse hyperparameter booster value gbtree to Json.
E         Returning the value itself
E         [2026-04-01:19:09:22:INFO] Failed to parse hyperparameter objective value reg:gamma to Json.
E         Returning the value itself
E         [2026-04-01:19:09:22:INFO] Failed to parse hyperparameter updater value grow_colmaker,prune to Json.
E         Returning the value itself
E         [2026-04-01:19:09:22:INFO] Failed to parse hyperparameter process_type value default to Json.
E         Returning the value itself
E         [2026-04-01:19:09:22:INFO] Failed to parse hyperparameter dsplit value row to Json.
E         Returning the value itself
E         [2026-04-01:19:09:22:INFO] Failed to parse hyperparameter grow_policy value depthwise to Json.
E         Returning the value itself
E         [2026-04-01:19:09:22:INFO] No GPUs detected (normal if no gpus installed)
E         [2026-04-01:19:09:22:INFO] Running XGBoost Sagemaker in algorithm mode
E         [2026-04-01:19:09:22:INFO] Determined 0 GPU(s) available on the instance.
E         [2026-04-01:19:09:22:INFO] File path /opt/ml/input/data/train of input files
E         [2026-04-01:19:09:22:INFO] Making smlinks from folder /opt/ml/input/data/train to folder /tmp/sagemaker_xgboost_input_data
E         [2026-04-01:19:09:22:INFO] creating symlink between Path /opt/ml/input/data/train/agaricus.libsvm.train and destination /tmp/sagemaker_xgboost_input_data/agaricus.libsvm.train1664359970552213804
E         [2026-04-01:19:09:22:INFO] files path: /tmp/sagemaker_xgboost_input_data
E         [2026-04-01:19:09:22:INFO] File path /opt/ml/input/data/validation of input files
E         [2026-04-01:19:09:22:INFO] Making smlinks from folder /opt/ml/input/data/validation to folder /tmp/sagemaker_xgboost_input_data
E         [2026-04-01:19:09:22:INFO] creating symlink between Path /opt/ml/input/data/validation/agaricus.libsvm.test and destination /tmp/sagemaker_xgboost_input_data/agaricus.libsvm.test1757920320072049626
E         [2026-04-01:19:09:22:INFO] files path: /tmp/sagemaker_xgboost_input_data
E         [2026-04-01:19:09:22:INFO] Single node training.
E         [2026-04-01:19:09:22:INFO] TRAIN_JOB_DEBUG: Received is_master=True
E         TRAIN_JOB_DEBUG: Received is_master=True
E         [2026-04-01:19:09:22:INFO] Train matrix has 6513 rows and 127 columns
E         [2026-04-01:19:09:22:INFO] Validation matrix has 1611 rows
E         [2026-04-01:19:09:22:INFO] CALLBACK_SETUP_DEBUG: save_model_on_termination=false, is_master=True
E         [2026-04-01:19:09:22:INFO] CALLBACK_SKIPPING save_model_on_termination=false, is_master=True)
E         /miniconda3/lib/python3.10/site-packages/xgboost/callback.py:386: UserWarning: [19:09:22] WARNING: /workspace/src/common/error_msg.cc:33: You have manually specified the `updater` parameter. The `tree_method` parameter will be ignored. Incorrect sequence of updaters will produce undefined behavior. For common uses, we recommend using `tree_method` parameter instead.
E           self.starting_round = model.num_boosted_rounds()
E         /miniconda3/lib/python3.10/site-packages/xgboost/callback.py:386: UserWarning: [19:09:22] WARNING: /workspace/src/learner.cc:738:
E         Parameters: { "dsplit", "lambda_bias", "normalize_type", "one_drop", "predictor", "rate_drop", "sample_type", "sketch_eps", "skip_drop", "tweedie_variance_power" } are not used.
E
E           self.starting_round = model.num_boosted_rounds()
E         [2026-04-01:19:09:22:ERROR] Reporting training FAILURE
E         [2026-04-01:19:09:22:ERROR] framework error:
E         Traceback (most recent call last):
E           File "/miniconda3/lib/python3.10/site-packages/sagemaker_xgboost_container/algorithm_mode/train.py", line 367, in train_job
E             bst = xgb.train(
E           File "/miniconda3/lib/python3.10/site-packages/xgboost/core.py", line 729, in inner_f
E             return func(**kwargs)
E           File "/miniconda3/lib/python3.10/site-packages/xgboost/training.py", line 183, in train
E             bst.update(dtrain, iteration=i, fobj=obj)
E           File "/miniconda3/lib/python3.10/site-packages/xgboost/core.py", line 2246, in update
E             _check_call(
E           File "/miniconda3/lib/python3.10/site-packages/xgboost/core.py", line 310, in _check_call
E             raise XGBoostError(py_str(_LIB.XGBGetLastError()))
E         xgboost.core.XGBoostError: [19:09:22] /workspace/src/objective/regression_obj.cu:88: label must be positive for gamma regression.
E         Stack trace:
E           [bt] (0) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x2a6e7c) [0x7fb583957e7c]
E           [bt] (1) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0xf02dcb) [0x7fb5845b3dcb]
E           [bt] (2) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0xf03333) [0x7fb5845b4333]
E           [bt] (3) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x68d2a2) [0x7fb583d3e2a2]
E           [bt] (4) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x77) [0x7fb583867f57]
E           [bt] (5) /miniconda3/lib/python3.10/lib-dynload/../../libffi.so.8(+0x702a) [0x7fb5b767602a]
E           [bt] (6) /miniconda3/lib/python3.10/lib-dynload/../../libffi.so.8(+0x64a9) [0x7fb5b76754a9]
E           [bt] (7) /miniconda3/lib/python3.10/lib-dynload/../../libffi.so.8(ffi_call+0xdd) [0x7fb5b7675bbd]
E           [bt] (8) /miniconda3/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so(+0x8c7b) [0x7fb5b7683c7b]
E
E
E
E         During handling of the above exception, another exception occurred:
E
E         Traceback (most recent call last):
E           File "/miniconda3/lib/python3.10/site-packages/sagemaker_containers/_trainer.py", line 84, in train
E             entrypoint()
E           File "/miniconda3/lib/python3.10/site-packages/sagemaker_xgboost_container/training.py", line 102, in main
E             train(framework.training_env())
E           File "/miniconda3/lib/python3.10/site-packages/sagemaker_xgboost_container/training.py", line 98, in train
E             run_algorithm_mode()
E           File "/miniconda3/lib/python3.10/site-packages/sagemaker_xgboost_container/training.py", line 64, in run_algorithm_mode
E             sagemaker_train(
E           File "/miniconda3/lib/python3.10/site-packages/sagemaker_xgboost_container/algorithm_mode/train.py", line 278, in sagemaker_train
E             train_job(**train_args)
E           File "/miniconda3/lib/python3.10/site-packages/sagemaker_xgboost_container/algorithm_mode/train.py", line 467, in train_job
E             raise exc.AlgorithmError(f"{exception_prefix}:\n {str(e)}")
E         sagemaker_algorithm_toolkit.exceptions.AlgorithmError: XGB train call failed with exception:
E          [19:09:22] /workspace/src/objective/regression_obj.cu:88: label must be positive for gamma regression.
E         Stack trace:
E           [bt] (0) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x2a6e7c) [0x7fb583957e7c]
E           [bt] (1) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0xf02dcb) [0x7fb5845b3dcb]
E           [bt] (2) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0xf03333) [0x7fb5845b4333]
E           [bt] (3) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x68d2a2) [0x7fb583d3e2a2]
E           [bt] (4) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x77) [0x7fb583867f57]
E           [bt] (5) /miniconda3/lib/python3.10/lib-dynload/../../libffi.so.8(+0x702a) [0x7fb5b767602a]
E           [bt] (6) /miniconda3/lib/python3.10/lib-dynload/../../libffi.so.8(+0x64a9) [0x7fb5b76754a9]
E           [bt] (7) /miniconda3/lib/python3.10/lib-dynload/../../libffi.so.8(ffi_call+0xdd) [0x7fb5b7675bbd]
E           [bt] (8) /miniconda3/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so(+0x8c7b) [0x7fb5b7683c7b]
E
E
E
E         XGB train call failed with exception:
E          [19:09:22] /workspace/src/objective/regression_obj.cu:88: label must be positive for gamma regression.
E         Stack trace:
E           [bt] (0) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x2a6e7c) [0x7fb583957e7c]
E           [bt] (1) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0xf02dcb) [0x7fb5845b3dcb]
E           [bt] (2) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0xf03333) [0x7fb5845b4333]
E           [bt] (3) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x68d2a2) [0x7fb583d3e2a2]
E           [bt] (4) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x77) [0x7fb583867f57]
E           [bt] (5) /miniconda3/lib/python3.10/lib-dynload/../../libffi.so.8(+0x702a) [0x7fb5b767602a]
E           [bt] (6) /miniconda3/lib/python3.10/lib-dynload/../../libffi.so.8(+0x64a9) [0x7fb5b76754a9]
E           [bt] (7) /miniconda3/lib/python3.10/lib-dynload/../../libffi.so.8(ffi_call+0xdd) [0x7fb5b7675bbd]
E           [bt] (8) /miniconda3/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so(+0x8c7b) [0x7fb5b7683c7b]
E
E
E
E       assert 1 == 0

xgboost/container/test_training.py:104: AssertionError
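
The failure above is a data/objective mismatch rather than a container bug: XGBoost's `reg:gamma` objective requires strictly positive labels, while the agaricus dataset used by the test is binary 0/1, so iterating the objective list over that data cannot succeed for gamma regression. A minimal sketch of a pre-flight label check that would surface this before `xgb.train` is called — the helper name and the exact constraint set are illustrative, not part of sagemaker_xgboost_container:

```python
# Hypothetical pre-flight check mirroring the label constraint XGBoost
# enforces at train time ("label must be positive for gamma regression").

def check_labels_for_objective(labels, objective):
    """Return None if labels satisfy the objective's constraint, else a message."""
    if objective == "reg:gamma" and any(y <= 0 for y in labels):
        return "label must be positive for gamma regression"
    if objective == "count:poisson" and any(y < 0 for y in labels):
        return "label must be non-negative for poisson regression"
    return None

# agaricus labels are 0/1, so reg:gamma is guaranteed to trip the check:
print(check_labels_for_objective([0, 1, 1, 0], "reg:gamma"))
```

Under this reading, the fix is on the test side: either drop `reg:gamma`/`reg:tweedie` from the iterated objectives for this dataset, or shift the labels to a strictly positive range for those objectives.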
_________________ TestValidTraining.test_checkpoint_and_reload _________________

self = <container.test_training.TestValidTraining object at 0x7f11f6c37380>
docker_client = <docker.client.DockerClient object at 0x7f11f6f7d0d0>
image_uri = '404426647817.dkr.ecr.us-west-2.amazonaws.com/ci:xgboost-3.0.5-cpu-py310-cu126-ubuntu20.04-sagemaker-23865911659'
training_resources = '/tmp/xgb-container-test-ptswvydm/training'

    def test_checkpoint_and_reload(self, docker_client, image_uri, training_resources):
        """Train 10 rounds, verify checkpoints, then resume to 20 rounds."""
        hp1 = copy.deepcopy(STD_HP)
        hp1["num_round"] = 10
        hp1["eval_metric"] = "error"
        hp1.pop("early_stopping_rounds", None)

        idc = copy.deepcopy(STD_IDC)
        idc["train"]["ContentType"] = "text/libsvm"
        idc.pop("validation", None)

        d = _libsvm_dir(training_resources)
        train_files = [os.path.join(d, "agaricus.libsvm.train")]

        # Phase 1: train 10 rounds
        exit_code, logs, model_files, paths = run_training(
            docker_client, image_uri, hp1, idc, STD_RC,
            training_files=train_files, checkpointconfig=STD_CPC,
        )
        assert exit_code == 0
        assert len(model_files) == 1

        ckpt_files = os.listdir(paths["checkpoints"])
        assert all(f.startswith("xgboost-checkpoint") for f in ckpt_files)
        regex = r"\[\d+\].*(?=.*train-error:.*)"
        assert len(re.findall(regex, logs)) == 10
>       assert len(ckpt_files) == 5
E       AssertionError: assert 1 == 5
E        +  where 1 = len(['xgboost-checkpoint_0.ubj'])

xgboost/container/test_training.py:283: AssertionError
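
The assertion expects five round-indexed checkpoint files after 10 rounds, but only `xgboost-checkpoint_0.ubj` survives, which suggests the checkpoint callback either fired once or overwrote a single file instead of retaining a window of recent rounds. A sketch of the contract the test appears to assume — one checkpoint per round with only the most recent five retained. The function name, the retention size, and the naming scheme are inferred from the assertion and the observed filename, not taken from the container source:

```python
# Illustrative model of the checkpoint-retention contract implied by the
# failing assertion (len(ckpt_files) == 5 after num_round=10).

def expected_checkpoints(num_round, max_to_keep=5, ext="ubj"):
    """Filenames left on disk if one checkpoint is written per round and only
    the most recent `max_to_keep` are retained."""
    start = max(0, num_round - max_to_keep)
    return [f"xgboost-checkpoint_{i}.{ext}" for i in range(start, num_round)]

print(expected_checkpoints(10))  # rounds 5..9, five files
```

That the sole surviving file is for round 0 (not the most recent round) points at the callback writing once at setup and never again, which is consistent with the `CALLBACK_SKIPPING save_model_on_termination=false` debug line in the logs.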
_____ TestInvalidTraining.test_invalid_hyperparameter[tree_method-values7] _____

self = <container.test_training.TestInvalidTraining object at 0x7f11f6c37f20>
docker_client = <docker.client.DockerClient object at 0x7f11f6f7d0d0>
image_uri = '404426647817.dkr.ecr.us-west-2.amazonaws.com/ci:xgboost-3.0.5-cpu-py310-cu126-ubuntu20.04-sagemaker-23865911659'
training_resources = '/tmp/xgb-container-test-ptswvydm/training'
param = 'tree_method', values = ['invalid_method', 'gpu_exact', 'gpu_hist']

    @pytest.mark.parametrize("param,values", [
        ("eta", ["-0.1", "1.01", "invalid_string"]),
        ("gamma", ["-0.1", "invalid_string"]),
        ("max_depth", ["-0.1", "invalid_string"]),
        ("min_child_weight", ["-0.1", "invalid_string"]),
        ("max_delta_step", ["-0.1", "invalid_string"]),
        ("colsample_bytree", ["-0.1", "0", "invalid_string"]),
        ("colsample_bylevel", ["-0.1", "0", "invalid_string"]),
        ("tree_method", ["invalid_method", "gpu_exact", "gpu_hist"]),
        ("sketch_eps", ["0", "1", "invalid_string"]),
        ("refresh_leaf", ["invalid", "2"]),
        ("process_type", ["invalid", "0.01"]),
        ("grow_policy", ["invalid", "0.01"]),
        ("sample_type", ["invalid", "0.01"]),
        ("normalize_type", ["invalid", "0.01"]),
        ("rate_drop", ["invalid", "-0.01", "1.01"]),
        ("one_drop", ["invalid", "-0.01", "1.01"]),
        ("skip_drop", ["invalid", "-0.01", "1.01"]),
        ("tweedie_variance_power", ["invalid", "1", "2"]),
        ("eval_metric", ["invalid", "1", "rmse,invalid", "error@nonfloat"]),
        ("booster", ["invalid", "1"]),
        ("verbosity", ["invalid", "-1", "4", "0.5"]),
    ])
    def test_invalid_hyperparameter(self, docker_client, image_uri, training_resources,
                                    param, values):
        train, val = self._get_libsvm_data(training_resources)
        hp = copy.deepcopy(STD_HP)
        for v in values:
            hp[param] = v
            result = _run(docker_client, image_uri, training_resources, hp, STD_IDC, STD_RC,
                          train, val)
>           _assert_failed(result)

xgboost/container/test_training.py:405:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

result = (0, '/miniconda3/lib/python3.10/site-packages/sagemaker_containers/_server.py:22: UserWarning: pkg_resources is deprec... '/tmp/xgb-train-4tccj7i0/input/data/train', 'input_validation': '/tmp/xgb-train-4tccj7i0/input/data/validation', ...})
regex = 'UserError:'

    def _assert_failed(result, regex="UserError:"):
        exit_code, logs, _, _ = result
>       assert re.search(regex, logs), f"Pattern {regex!r} not found in logs"
E       AssertionError: Pattern 'UserError:' not found in logs
E       assert None
E        +  where None = <function search at 0x7f11f9e60680>('UserError:', '/miniconda3/lib/python3.10/site-packages/sagemaker_containers/_server.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.\n  import pkg_resources\n[2026-04-01:19:11:48:INFO] Imported framework sagemaker_xgboost_container.training\n[2026-04-01:19:11:48:INFO] Failed to parse hyperparameter eval_metric value error to Json.\nReturning the value itself\n[2026-04-01:19:11:48:INFO] Failed to parse hyperparameter predictor value cpu_predictor to Json.\nReturning the value itself\n[2026-04-01:19:11:48:INFO] Failed to parse hyperparameter tree_method value gpu_hist to Json.\nReturning the value itself\n[2026-04-01:19:11:48:INFO] Failed to parse hyperparameter normalize_type value tree to Json.\nReturning the value itself\n[2026-04-01:19:11:48:INFO] Failed to parse hyperparameter sample_type value uniform to Json.\nReturning the value itself\n[2026-04-01:19:11:48:INFO] Failed to parse hyperparameter booster value gbtree to Json.\nReturning the value itself\n[2026-04-01:19:11:48:INFO] Failed to pa...61\tvalidation-error:0.00000\n[4]\ttrain-error:0.00000\tvalidation-error:0.00000\n/miniconda3/lib/python3.10/site-packages/xgboost/callback.py:503: UserWarning: [19:11:48] WARNING: /workspace/src/gbm/gbtree.cc:359: \n  Loading from a raw memory buffer (like pickle in Python, RDS in R) on a CPU-only\n  machine. Consider using `save_model/load_model` instead. See:\n\n    https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html\n\n  for more details about differences between saving model and serializing.  
Changing `tree_method` to `hist`.\n  model = model[: best_iteration + 1]\n[2026-04-01:19:11:48:INFO] FINAL_MODEL_DEBUG: is_master=True, model_dir=/opt/ml/model\nFINAL_MODEL_DEBUG: is_master=True, model_dir=/opt/ml/model\n[2026-04-01:19:11:48:INFO] FINAL_MODEL_SAVE: Saving final model as master\nFINAL_MODEL_SAVE: Saving final model as master\n/miniconda3/lib/python3.10/site-packages/sagemaker_xgboost_container/algorithm_mode/train.py:480: UserWarning: [19:11:48] WARNING: /workspace/src/c_api/c_api.cc:1427: Saving model in the UBJSON format as default.  You can use file extension: `json`, `ubj` or `deprecated` to choose between formats.\n  bst.save_model(model_location)\n')
E        +    where <function search at 0x7f11f9e60680> = re.search

xgboost/container/test_training.py:112: AssertionError
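
This failure is a behavior change rather than a flake: the test expects `tree_method=gpu_hist` to be rejected on a CPU instance, but the captured log shows XGBoost 3.x remapping the deprecated alias instead ("Changing `tree_method` to `hist`"), so training exits 0 and the expected `UserError:` never appears. A sketch of a validator that keeps the test's intent by flagging the deprecated GPU aliases itself instead of relying on XGBoost to error. The names and the valid set are illustrative, not taken from sagemaker_xgboost_container:

```python
# Illustrative hyperparameter validator reflecting the XGBoost 3.x behavior
# seen above: gpu_hist is silently remapped to hist, so a test that expects
# it to fail must reject the alias before handing it to xgboost.

VALID_TREE_METHODS = {"auto", "exact", "approx", "hist"}
DEPRECATED_GPU_ALIASES = {"gpu_hist", "gpu_exact"}

def tree_method_is_invalid(value):
    """True if the container's validation should reject this tree_method."""
    if value in DEPRECATED_GPU_ALIASES:
        return True  # remapped (not errored) by XGBoost 3.x, so flag it here
    return value not in VALID_TREE_METHODS

assert tree_method_is_invalid("gpu_hist")
assert not tree_method_is_invalid("hist")
```

The alternative fix is to drop `gpu_hist`/`gpu_exact` from the parametrized invalid values and treat the remap-plus-warning as the accepted behavior on CPU images.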
==================================== PASSES ====================================
=========================== short test summary info ============================
PASSED xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm
PASSED xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_weights
PASSED xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_hpo_param
PASSED xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_multiclass_hpo
PASSED xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_threshold_eval_metric
PASSED xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_verbosity
PASSED xgboost/container/test_training.py::TestValidTraining::test_multi_files_libsvm
PASSED xgboost/container/test_training.py::TestValidTraining::test_single_file_csv
PASSED xgboost/container/test_training.py::TestValidTraining::test_single_file_csv_weights
PASSED xgboost/container/test_training.py::TestValidTraining::test_multi_file_csv
PASSED xgboost/container/test_training.py::TestValidTraining::test_single_file_csv_space_separated
PASSED xgboost/container/test_training.py::TestValidTraining::test_single_file_csv_sci_notation
PASSED xgboost/container/test_training.py::TestValidTraining::test_single_file_csv_empty_cells
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_no_training_data
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_no_validation_data
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_data_csv_content_type
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_csv_alpha_with_csv_content_type
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_csv_data_with_libsvm_content_type
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_data_with_libsvm_content_type
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[eta-values0]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[gamma-values1]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[max_depth-values2]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[min_child_weight-values3]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[max_delta_step-values4]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[colsample_bytree-values5]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[colsample_bylevel-values6]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[sketch_eps-values8]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[refresh_leaf-values9]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[process_type-values10]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[grow_policy-values11]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[sample_type-values12]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[normalize_type-values13]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[rate_drop-values14]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[one_drop-values15]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[skip_drop-values16]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[tweedie_variance_power-values17]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[eval_metric-values18]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[booster-values19]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[verbosity-values20]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_missing_num_round
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_multiclass_without_num_class
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_pipe_mode_rejected
FAILED xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_iterate_objectives - AssertionError: Training failed:
FAILED xgboost/container/test_training.py::TestValidTraining::test_checkpoint_and_reload - AssertionError: assert 1 == 5
+  where 1 = len(['xgboost-checkpoint_0.ubj'])
FAILED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[tree_method-values7] - AssertionError: Pattern 'UserError:' not found in logs
assert None
+  where None = <function search at 0x7f11f9e60680>('UserError:', '/miniconda3/lib/python3.10/site-packages/sagemaker_containers/_server.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.\n  import pkg_resources\n[2026-04-01:19:11:48:INFO] Imported framework sagemaker_xgboost_container.training\n[2026-04-01:19:11:48:INFO] Failed to parse hyperparameter eval_metric value error to Json.\nReturning the value itself\n[2026-04-01:19:11:48:INFO] Failed to parse hyperparameter predictor value cpu_predictor to Json.\nReturning the value itself\n[2026-04-01:19:11:48:INFO] Failed to parse hyperparameter tree_method value gpu_hist to Json.\nReturning the value itself\n[2026-04-01:19:11:48:INFO] Failed to parse hyperparameter normalize_type value tree to Json.\nReturning the value itself\n[2026-04-01:19:11:48:INFO] Failed to parse hyperparameter sample_type value uniform to Json.\nReturning the value itself\n[2026-04-01:19:11:48:INFO] Failed to parse hyperparameter booster value gbtree to Json.\nReturning the value itself\n[2026-04-01:19:11:48:INFO] Failed to pa...61\tvalidation-error:0.00000\n[4]\ttrain-error:0.00000\tvalidation-error:0.00000\n/miniconda3/lib/python3.10/site-packages/xgboost/callback.py:503: UserWarning: [19:11:48] WARNING: /workspace/src/gbm/gbtree.cc:359: \n  Loading from a raw memory buffer (like pickle in Python, RDS in R) on a CPU-only\n  machine. Consider using `save_model/load_model` instead. See:\n\n    https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html\n\n  for more details about differences between saving model and serializing.  
Changing `tree_method` to `hist`.\n  model = model[: best_iteration + 1]\n[2026-04-01:19:11:48:INFO] FINAL_MODEL_DEBUG: is_master=True, model_dir=/opt/ml/model\nFINAL_MODEL_DEBUG: is_master=True, model_dir=/opt/ml/model\n[2026-04-01:19:11:48:INFO] FINAL_MODEL_SAVE: Saving final model as master\nFINAL_MODEL_SAVE: Saving final model as master\n/miniconda3/lib/python3.10/site-packages/sagemaker_xgboost_container/algorithm_mode/train.py:480: UserWarning: [19:11:48] WARNING: /workspace/src/c_api/c_api.cc:1427: Saving model in the UBJSON format as default.  You can use file extension: `json`, `ubj` or `deprecated` to choose between formats.\n  bst.save_model(model_location)\n')
+    where <function search at 0x7f11f9e60680> = re.search
=================== 3 failed, 42 passed in 357.53s (0:05:57) ===================
__________________ TestValidScoring.test_execution_parameters __________________

self = <container.test_scoring.TestValidScoring object at 0x7f92d4618500>
docker_client = <docker.client.DockerClient object at 0x7f92d38fd0d0>
image_uri = '404426647817.dkr.ecr.us-west-2.amazonaws.com/ci:xgboost-3.0.5-cpu-py310-cu126-ubuntu20.04-sagemaker-23865911659'
inference_resources = '/tmp/xgb-container-test-n8qucxal/inference'

def test_execution_parameters(self, docker_client, image_uri, inference_resources):
model_dir = _model_path(inference_resources, "mnist-xgb-model")
env = {"MAX_CONTENT_LENGTH": str(21 * 1024 ** 2)}
with ServingContainer(docker_client, image_uri, model_dir, env) as ctx:
resp = ctx.execution_parameters()
params = json.loads(resp.text)
assert params["BatchStrategy"] == "MULTI_RECORD"
assert params["MaxConcurrentTransforms"] == multiprocessing.cpu_count()
>       assert params["MaxPayloadInMB"] == 20
E       assert 21 == 20

xgboost/container/test_scoring.py:74: AssertionError
_____________________ TestValidScoring.test_csv_inference ______________________

self = <container.test_scoring.TestValidScoring object at 0x7f92d3553e30>
docker_client = <docker.client.DockerClient object at 0x7f92d38fd0d0>
image_uri = '404426647817.dkr.ecr.us-west-2.amazonaws.com/ci:xgboost-3.0.5-cpu-py310-cu126-ubuntu20.04-sagemaker-23865911659'
inference_resources = '/tmp/xgb-container-test-n8qucxal/inference'

def test_csv_inference(self, docker_client, image_uri, inference_resources):
# mnist xgb model
responses = _send_requests(
docker_client, image_uri, inference_resources, "mnist-xgb-model", "text/csv",
["mnist-1.csv", "mnist-empty-cell.csv", "mnist-equal-dim.csv", "mnist-700.csv"],
)
_validate_response(responses[0], 1)
_validate_response(responses[1], 1)
_validate_response(responses[2], 1)
>       _validate_response(responses[3], 700)

xgboost/container/test_scoring.py:85:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

resp = <Response [200]>, expected_length = 700

def _validate_response(resp, expected_length):
assert resp.status_code == httplib.OK, resp.text
predicted = resp.text.split(",")
>       assert len(predicted) == expected_length
E       AssertionError: assert 1 == 700
E        +  where 1 = len(['3.0\n8.0\n6.0\n9.0\n6.0\n4.0\n5.0\n3.0\n8.0\n4.0\n5.0\n2.0\n3.0\n8.0\n4.0\n8.0\n1.0\n5.0\n0.0\n5.0\n9.0\n7.0\n4.0\n1.0\n3.0\n3.0\n0.0\n6.0\n2.0\n9.0\n9.0\n4.0\n1.0\n3.0\n6.0\n8.0\n0.0\n7.0\n7.0\n6.0\n8.0\n9.0\n0.0\n3.0\n8.0\n3.0\n7.0\n7.0\n5.0\n1.0\n4.0\n2.0\n2.0\n9.0\n8.0\n1.0\n1.0\n0.0\n6.0\n6.0\n5.0\n0.0\n1.0\n1.0\n7.0\n2.0\n7.0\n3.0\n1.0\n4.0\n0.0\n5.0\n0.0\n6.0\n8.0\n7.0\n6.0\n8.0\n2.0\n9.0\n4.0\n0.0\n6.0\n1.0\n9.0\n2.0\n6.0\n3.0\n8.0\n4.0\n1.0\n5.0\n6.0\n6.0\n1.0\n7.0\n2.0\n8.0\n6.0\n9.0\n7.0\n0.0\n9.0\n8.0\n6.0\n2.0\n8.0\n3.0\n6.0\n4.0\n9.0\n2.0\n8.0\n6.0\n8.0\n7.0\n8.0\n8.0\n6.0\n9.0\n7.0\n7.0\n6.0\n0.0\n3.0\n6.0\n7.0\n0.0\n9.0\n7.0\n1.0\n3.0\n6.0\n8.0\n9.0\n6.0\n1.0\n7.0\n5.0\n1.0\n3.0\n3.0\n5.0\n7.0\n9.0\n9.0\n6.0\n7.0\n3.0\n6.0\n1.0\n0.0\n4.0\n2.0\n4.0\n5.0\n0.0\n0.0\n1.0\n6.0\n6.0\n4.0\n7.0\n9.0\n4.0\n6.0\n5.0\n2.0\n6.0\n9.0\n8.0\n8.0\n8.0\n5.0\n9.0\n3.0\n8.0\n9.0\n1.0\n8.0\n8.0\n3.0\n4.0\n4.0\n3.0\n0.0\n1.0\n5.0\n4.0\n4.0\n1.0\n8.0\n0.0\n6.0\n1.0\n3.0\n1.0\n0.0\n5.0\n6.0\n0.0\n3.0\n5.0\n4.0\n9.0\n0.0\n3.0\n1.0\n0.0\n9.0\n3.0\n2.0\n8.0\n3.0\n3.0\n7.0\n4.0\n9.0\n2.0\n1.0\n6.0\n2.0\n1.0\n8.0\n1.0\n1.0\n9.0\n7.0\n9.0\n2.0\n2.0\n8.0\n1.0\n7.0\n7.0\n0.0\n1.0\n1.0\n8.0\n2...\n2.0\n7.0\n0.0\n7.0\n1.0\n4.0\n9.0\n7.0\n6.0\n5.0\n4.0\n1.0\n9.0\n2.0\n2.0\n0.0\n1.0\n2.0\n2.0\n0.0\n3.0\n1.0\n7.0\n5.0\n0.0\n4.0\n2.0\n7.0\n1.0\n9.0\n3.0\n0.0\n1.0\n6.0\n2.0\n2.0\n5.0\n1.0\n8.0\n3.0\n1.0\n4.0\n6.0\n2.0\n4.0\n8.0\n5.0\n2.0\n6.0\n4.0\n0.0\n8.0\n5.0\n3.0\n9.0\n3.0\n4.0\n0.0\n9.0\n7.0\n2.0\n8.0\n0.0\n8.0\n5.0\n0.0\n2.0\n9.0\n3.0\n8.0\n4.0\n8.0\n5.0\n0.0\n8.0\n7.0\n9.0\n2.0\n0.0\n5.0\n1.0\n0.0\n2.0\n9.0\n3.0\n2.0\n4.0\n8.0\n5.0\n1.0\n6.0\n8.0\n7.0\n3.0\n8.0\n4.0\n7.0\n9.0\n0.0\n3.0\n1.0\n7.0\n2.0\n4.0\n3.0\n0.0\n4.0\n2.0\n5.0\n5.0\n8.0\n2.0\n5.0\n8.0\n2.0\n4.0\n1.0\n9.0\n7.0\n6.0\n2.0\n1.0\n4.0\n6.0\n1.0\n0.0\n4.0\n6.0\n1.0\n6.0\n4.0\n5.0\n9.0\n8.0\n6.0\n8.0\n8.0\n6.0\n4.0\n1.0\n5.0\n5.0\n3.0\n8.0\n7.0\n4.0\n8.0\n6.0\n4.0\n6.0\n3.0\n6.0\n3.0\n9.0\n5
.0\n4.0\n0.0\n0.0\n6.0\n7.0\n1.0\n6.0\n6.0\n9.0\n8.0\n3.0\n7.0\n0.0\n3.0\n0.0\n1.0\n2.0\n5.0\n8.0\n6.0\n4.0\n0.0\n0.0\n8.0\n2.0\n5.0\n5.0\n0.0\n6.0\n6.0\n1.0\n1.0\n8.0\n5.0\n5.0\n8.0\n1.0\n4.0\n0.0\n7.0\n4.0\n6.0\n3.0\n9.0\n3.0\n1.0\n5.0\n9.0\n7.0\n7.0\n6.0\n1.0\n7.0\n2.0\n6.0\n3.0\n3.0\n4.0\n2.0\n5.0\n2.0\n5.0\n1.0\n3.0\n3.0\n7.0\n1.0\n3.0\n0.0\n1.0\n1.0\n8.0\n3.0\n2.0\n5.0\n2.0\n3.0\n3.0\n4.0\n2.0\n6.0\n7.0\n2.0\n4.0\n'])

xgboost/container/test_scoring.py:57: AssertionError
____________________ TestValidScoring.test_libsvm_inference ____________________

self = <con

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 20
X-AI-Prompt: you can change the runner to use gpu fleet for container tests

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 19
X-AI-Prompt: _________________ TestValidScoring.test_binary_classification __________________

self = <container.test_scoring.TestValidScoring object at 0x7f92d3553380>
docker_client = <docker.client.DockerClient object at 0x7f92d38fd0d0>
image_uri = '404426647817.dkr.ecr.us-west-2.amazonaws.com/ci:xgboost-3.0.5-cpu-py310-cu126-ubuntu20.04-sagemaker-23865911659'
inference_resources = '/tmp/xgb-container-test-n8qucxal/inference'

def test_binary_classification(self, docker_client, image_uri, inference_resources):
>       responses = _send_requests(
docker_client, image_uri, inference_resources,
"diabetes-binary-xgb-model", "text/csv",
["diabetes_inference.csv"],
)

xgboost/container/test_scoring.py:124:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
xgboost/container/test_scoring.py:43: in _send_requests
with ServingContainer(docker_client, image_uri, model_dir, environment) as ctx:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
xgboost/container/container_helper.py:152: in __enter__
self._wait_healthy()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <container.container_helper.ServingContainer object at 0x7f92d297c0e0>

def _wait_healthy(self):
deadline = time.time() + SERVE_STARTUP_TIMEOUT
while time.time() < deadline:
self._container.reload()
if self._container.status != "running":
>               raise RuntimeError(
f"Container exited: {self._container.logs().decode()}"
)

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 79
X-AI-Prompt: show the output of tests 1-2 lines for validation. also run generate models script once per every test.

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 54
X-AI-Prompt: XGBoost version: 3.0.5
Downloading training data...
Traceback (most recent call last):
File "/work/test/xgboost/container/generate_models.py", line 85, in <module>
main()
File "/work/test/xgboost/container/generate_models.py", line 48, in…
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…/audio/speech, not /invocations)"

This reverts commit a80c193.
Switch all non-omni PR workflow triggers from pull_request to
workflow_dispatch so only vllm-omni EC2 and SageMaker workflows
run on PRs to the omni branch.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…-omni endpoint

- omni_sagemaker_serve.py: FastAPI proxy on port 8080, routes to vllm-omni on 8081
- Supports explicit route via CustomAttributes header (route=/v1/audio/speech)
- Falls back to payload inspection (TTS vs chat vs completion)
- Entrypoint starts vllm-omni in background, proxy in foreground
- Endpoint test uses explicit route for TTS
Yadan Wei and others added 19 commits April 5, 2026 22:47
…model support, consolidate tests

- Model config: CosyVoice3-0.5B, Qwen2.5-Omni-3B, BAGEL-7B-MoT, Wan2.1-T2V-1.3B
- Covers all routes: /v1/audio/speech, /v1/chat/completions, /v1/images/generations, /v1/videos
- Workflow handles both S3 and HF model sources (HF_TOKEN for downloads)
- Removed separate unit-test job, runs in sagemaker-endpoint-test
- Fixed async endpoint test (AWSSessionManager.sts for account ID)
- Added starlette to sagemaker test requirements
…irst

New models (CosyVoice3, Qwen2.5-Omni, BAGEL, Wan2.1) OOM during HF download.
Need S3 tarballs and per-model validation before adding to CI.
- CosyVoice3: /v1/audio/speech (different TTS arch)
- Qwen2.5-Omni-3B: /v1/chat/completions (tests fallthrough, no middleware)
- BAGEL and Wan2.1 pending S3 upload
Tested models that don't work in CI:
- CosyVoice3: no model_type in config.json, unrecognized by transformers
- Qwen2.5-Omni-3B: OOMs on g6e.xlarge (multi-stage needs >48GB)
- BAGEL/Wan2.1: need --stage-configs-path, untested
- CosyVoice3-0.5B: /v1/audio/speech (g6e.4xl, config.json added to tarball)
- Wan2.1-T2V-1.3B: /v1/videos (g6e.4xl, diffusers auto-detect)
- BAGEL-7B-MoT: /v1/chat/completions (g6e.4xl, multimodal image gen)
- Qwen2.5-Omni-3B: /v1/chat/completions (g6e.12xl, text+audio omni)
- 6 models covering 4 routes: speech, images, videos, chat
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
- CosyVoice3 on g6e12xl, Wan2.1 on g6e4xl, BAGEL on g6e4xl, Qwen2.5-Omni on g6e12xl
- Wan2.1 uses /v1/videos/sync with multipart/form-data
- Smoke tests support content_type param for form vs JSON
- Orphaned endpoint cleanup step (if: always)
- Container log dump increased to 500 lines
…del_type

EngineCore subprocess fails at AutoTokenizer.from_pretrained because
AutoConfig can't resolve cosyvoice3. The model uses ONNX tokenizers,
not HuggingFace tokenizers. Only works with offline Omni() API.
Verified on L40S with SM image:
- Model loads and serves on g6e.xlarge (L40S 48GB)
- /v1/videos returns queued job with id
- Middleware routes /invocations -> /v1/videos with form data
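As a client-side companion to the routing above, a caller can select the omni endpoint through the CustomAttributes field of a sagemaker-runtime invoke_endpoint call. The helper below is a hypothetical sketch of how those arguments could be assembled, not code from this PR:

```python
import json

def build_invocation(route: str, payload, content_type: str = "application/json") -> dict:
    """Assemble invoke_endpoint keyword arguments so the routing middleware
    forwards /invocations to the requested vllm-omni route.

    `payload` may be a dict (JSON-encoded here) or raw bytes/str, e.g. a
    pre-built multipart/form-data body for /v1/videos.
    """
    body = payload if isinstance(payload, (bytes, str)) else json.dumps(payload)
    return {
        "ContentType": content_type,
        "CustomAttributes": f"route={route}",  # consumed by the middleware
        "Body": body,
    }

# A text-to-video request, routed to /v1/videos by the middleware; the
# multipart body here is a placeholder, not a real encoded form.
kwargs = build_invocation(
    "/v1/videos",
    b"--boundary...",
    content_type="multipart/form-data; boundary=boundary",
)
```

The returned dict would be splatted into `client.invoke_endpoint(EndpointName=..., **kwargs)`; passing bytes through untouched is what lets the same helper cover both the JSON routes and the form-data /v1/videos route.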
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
os_version: "amzn2023"
customer_type: "ec2"
arch_type: "x86"
prod_image: "vllm-omni:0.18-gpu-py312-ec2"
we'll use the same repo name "vllm" instead of creating a new repo

Will update this section when we have a real prod image.

junpuf previously approved these changes Apr 7, 2026
Yadan Wei added 2 commits April 6, 2026 21:01
- sglang: add aiohttp>=3.13.4 to CVE patch block
- vllm: remove expired CVE-2026-33055 allowlist (fixed in uv tar 0.4.45)

Fixes: CVE-2026-34520, CVE-2026-34516, CVE-2026-22815
@Yadan-Wei Yadan-Wei enabled auto-merge (squash) April 7, 2026 15:24
@Yadan-Wei Yadan-Wei merged commit e781a45 into main Apr 7, 2026
264 of 277 checks passed
@Yadan-Wei Yadan-Wei deleted the omni branch April 7, 2026 16:38
Labels: authorized, Size:XL