feat: add vLLM-Omni EC2 and SageMaker DLC images #5868

Merged
Yadan-Wei merged 65 commits into main from omni on Apr 7, 2026

Conversation


Yadan-Wei (Contributor) commented on Apr 2, 2026

vLLM-Omni DLC: EC2 and SageMaker Deep Learning Containers for Omni-Modality Models

Summary

Adds vLLM-Omni DLC images for EC2 and SageMaker on Amazon Linux 2023, enabling
serving of omni-modality models (TTS, image generation, video generation,
multimodal chat) via vllm-omni==0.18.0.

What's Included

Dockerfile (docker/vllm/Dockerfile.amzn2023)

  • New stages: omni-deps, builder-oss-omni, omni-base, vllm-omni-ec2-amzn2023,
    vllm-omni-sagemaker-amzn2023
  • Installs vllm-omni==0.18.0 via pip on top of vLLM runtime
  • SPAL system deps: espeak-ng, sox, ffmpeg-free
  • Pre-built runtime base support (RUNTIME_BASE arg) to skip vLLM compile in PR
    builds (~120min → ~2min)
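
The staged layout might look roughly like this. The stage and ARG names come from the summary above; the instruction bodies are illustrative assumptions, not the actual diff:

```dockerfile
# Illustrative sketch only — the real Dockerfile.amzn2023 differs.
ARG RUNTIME_BASE
FROM ${RUNTIME_BASE} AS omni-base

# SPAL system deps (SPAL requires system-release 2023.9+)
RUN dnf upgrade -y system-release && \
    dnf install -y espeak-ng sox ffmpeg-free

# vllm-omni is a pure-Python layer on top of the vLLM runtime
RUN pip install vllm-omni==0.18.0

FROM omni-base AS vllm-omni-ec2-amzn2023
COPY omni_dockerd_entrypoint.sh /usr/local/bin/
ENTRYPOINT ["/usr/local/bin/omni_dockerd_entrypoint.sh"]
```

Passing a cached image via `--build-arg RUNTIME_BASE=...` is what lets PR builds skip the vLLM compile.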

Entrypoints

  • omni_dockerd_entrypoint.sh — EC2: exec vllm serve --omni "$@"
  • omni_sagemaker_entrypoint.sh — SageMaker: parses SM_VLLM_* env vars, adds
    --middleware for routing
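
As a sketch of the SageMaker entrypoint's env-var handling — the `SM_VLLM_*` → flag mapping convention shown here is an assumption, not necessarily what the actual script does:

```shell
# Hypothetical sketch: turn SM_VLLM_* env vars into vllm CLI flags,
# e.g. SM_VLLM_MAX_MODEL_LEN=4096 -> --max-model-len 4096.
sm_vllm_to_flags() {
  env | grep '^SM_VLLM_' | while IFS='=' read -r key val; do
    # Strip the prefix, lowercase, and turn underscores into dashes.
    flag=$(printf '%s' "${key#SM_VLLM_}" | tr 'A-Z_' 'a-z-')
    printf -- '--%s %s\n' "$flag" "$val"
  done
}

# The real entrypoint would end with something along the lines of:
# exec vllm serve --omni $(sm_vllm_to_flags) --middleware ... "$@"
```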

SageMaker Routing Middleware (omni_sagemaker_serve.py)

  • ASGI middleware injected via --middleware flag (single process, no proxy)
  • Routes /invocations to the correct vllm-omni endpoint based on
    CustomAttributes: route=
  • Falls through to vLLM's built-in /invocations handler for chat/completion/
    embed
  • Supports: /v1/audio/speech, /v1/images/generations, /v1/videos,
    /v1/chat/completions
  • 12 unit tests covering routing, fallthrough, adapter coexistence
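
A minimal sketch of the routing idea (not the actual omni_sagemaker_serve.py): a single-process ASGI middleware that rewrites /invocations to the endpoint named in CustomAttributes, which SageMaker forwards to the container as the X-Amzn-SageMaker-Custom-Attributes header. The parsing details here are assumptions:

```python
class OmniRoutingMiddleware:
    """Rewrite /invocations based on a `route=` custom attribute."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope.get("type") == "http" and scope.get("path") == "/invocations":
            attrs = dict(scope.get("headers", [])).get(
                b"x-amzn-sagemaker-custom-attributes", b""
            ).decode()
            for part in attrs.split(";"):
                part = part.strip()
                if part.startswith("route="):
                    # Rewrite the path so vllm-omni's handler serves it.
                    # Without a route= attribute we fall through untouched
                    # to vLLM's built-in /invocations handler.
                    scope = dict(scope, path=part[len("route="):])
                    break
        await self.app(scope, receive, send)
```

Because the rewrite happens inside the same ASGI app, no proxy process or extra port is needed.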

CI Configs & Workflows

  • vllm-omni-ec2-amzn2023.yml, vllm-omni-sagemaker-amzn2023.yml — framework
    configs
  • pr-vllm-omni-ec2-amzn2023.yml, pr-vllm-omni-sagemaker-amzn2023.yml — PR
    workflows with build-runtime caching
  • reusable-vllm-omni-model-tests.yml — generic smoke test workflow supporting
    JSON and multipart/form-data
  • vllm-omni-model-tests.yml — per-model test config with route, request
    payload, and validation

Smoke Tests (4 models, 4 routes)

Model                        Route                   Runner
Qwen3-TTS-1.7B-CustomVoice   /v1/audio/speech        g6xl
FLUX.2-klein-4B              /v1/images/generations  g6xl
Wan2.1-T2V-1.3B              /v1/videos              g6e4xl
Qwen2.5-Omni-3B              /v1/chat/completions    g6e12xl
  • EC2 tests use real entrypoint, hit OpenAI-compatible API directly
  • SageMaker tests use real entrypoint, hit /invocations with middleware
    routing
  • Container log dump (500 lines) on failure
  • Orphaned endpoint cleanup step

SageMaker Endpoint Tests

  • Sync endpoint test with retry for torch.compile warmup (60s SageMaker
    timeout)
  • Async endpoint test using AsyncInferenceConfig (bypasses 60s limit)
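
A hedged sketch of how a sync endpoint test might carry the route. boto3's sagemaker-runtime `invoke_endpoint` does accept a `CustomAttributes` parameter; the helper, endpoint name, and payload below are illustrative:

```python
import json

def build_invocation(endpoint_name, route, payload):
    """Assemble kwargs for sagemaker-runtime invoke_endpoint();
    CustomAttributes carries the route the middleware dispatches on."""
    return {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",
        "CustomAttributes": f"route={route}",
        "Body": json.dumps(payload),
    }

# Usage (names are placeholders):
#   runtime = boto3.client("sagemaker-runtime")
#   resp = runtime.invoke_endpoint(**build_invocation(
#       "omni-tts-endpoint", "/v1/audio/speech",
#       {"model": "qwen3-tts", "input": "hello"}))
```

The async variant would instead call `invoke_endpoint_async` with an `InputLocation` in S3, which is what sidesteps the 60s sync timeout during torch.compile warmup.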

S3 Model Cache

  • Models pre-cached in s3://dlc-cicd-models/omni-models/ as tar.gz
  • Uses download-model action with ETag-based caching and flock-based
    concurrency
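
The flock pattern can be sketched as below; the lock/marker layout and the s3 path inside the comment are illustrative, not the download-model action's real internals:

```shell
# Sketch: serialize a cached download with flock so concurrent CI jobs
# on the same runner don't re-download (and re-extract) the same tarball.
cached_fetch() {
  lock="$1" marker="$2"
  (
    flock -x 9   # block until we hold the exclusive lock on fd 9
    if [ ! -f "$marker" ]; then
      # Real action would be roughly:
      #   aws s3 cp s3://dlc-cicd-models/omni-models/<model>.tar.gz - | tar xz
      touch "$marker"
      echo downloaded
    else
      echo cache-hit
    fi
  ) 9>"$lock"
}
```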

Key Design Decisions

  • Middleware over proxy: Single-process ASGI middleware (--middleware flag)
    instead of a separate proxy process. Reuses vLLM's existing /invocations,
    /ping, /health handlers.
  • Per-model test config: Route, request payload, content type, and
    validation defined in YAML. Adding a new model = adding a config entry, no
    code changes.
  • Pre-built runtime base: build-runtime job checks ECR for cached runtime
    image, builds only on first run per vLLM version. Subsequent PR builds skip
    the ~120min compile.
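
Under the per-model scheme, adding a model to vllm-omni-model-tests.yml might look something like this — the field names here are assumptions, not the schema actually defined by the PR:

```yaml
# Hypothetical entry; the real config's field names may differ.
- model: Qwen3-TTS-1.7B-CustomVoice
  route: /v1/audio/speech
  runner: g6xl
  content_type: application/json
  request:
    model: Qwen3-TTS-1.7B-CustomVoice
    input: "Hello from the smoke test."
  validate:
    min_bytes: 1000   # non-trivial WAV output
```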

Testing

  • Middleware unit tests (12 tests)
  • EC2 smoke tests (4 models)
  • SageMaker smoke tests (4 models)
  • SageMaker endpoint test (sync + async)
  • Pre-commit checks
  • Middleware verified on upstream vllm/vllm-omni:v0.18.0 Docker Hub image

Toggle if you are merging into master Branch

By default, docker image builds and tests are disabled. Two ways to run builds and tests:

  1. Using dlc_developer_config.toml
  2. Using this PR description (currently only supported for PyTorch, TensorFlow, vllm, and base images)
How to use the helper utility for updating dlc_developer_config.toml

Assuming your remote is called origin (you can find out more with git remote -v)...

  • Run default builds and tests for a particular buildspec - also commits and pushes changes to remote; Example:

python src/prepare_dlc_dev_environment.py -b </path/to/buildspec.yml> -cp origin

  • Enable specific tests for a buildspec or set of buildspecs - also commits and pushes changes to remote; Example:

python src/prepare_dlc_dev_environment.py -b </path/to/buildspec.yml> -t sanity_tests -cp origin

  • Restore TOML file when ready to merge

python src/prepare_dlc_dev_environment.py -rcp origin

NOTE: If you are creating a PR for a new framework version, please ensure success of the local, standard, rc, and efa sagemaker tests by updating the dlc_developer_config.toml file:

  • sagemaker_remote_tests = true
  • sagemaker_efa_tests = true
  • sagemaker_rc_tests = true
  • sagemaker_local_tests = true
How to use PR description

Use the code block below to uncomment commands and run the PR CodeBuild jobs. There are two commands available:
  • # /buildspec <buildspec_path>
    • e.g.: # /buildspec pytorch/training/buildspec.yml
    • If this line is commented out, dlc_developer_config.toml will be used.
  • # /tests <test_list>
    • e.g.: # /tests sanity security ec2
    • If this line is commented out, it will run the default set of tests (same as the defaults in dlc_developer_config.toml): sanity, security, ec2, ecs, eks, sagemaker, sagemaker-local.
# /buildspec <buildspec_path>
# /tests <test_list>
Toggle if you are merging into main Branch

PR Checklist

  • [ ] I ran pre-commit run --all-files locally before creating this PR. (Read DEVELOPMENT.md for details.)

- Add omni-deps, builder-oss-omni, omni-base, ec2, sagemaker stages to Dockerfile.amzn2023
- Install vllm-omni as pure Python layer on top of vLLM runtime
- Add omni entrypoints (vllm serve --omni) for EC2 and SageMaker
- Add PR workflows for both EC2 and SageMaker omni images
- Add reusable model smoke tests (Qwen3-TTS, FLUX.2-klein-4B)
- Add SageMaker endpoint integration test with Qwen3-TTS
- System deps: espeak-ng, ffmpeg, sox, libsox-fmt-all for audio/TTS
- OSS compliance runs against omni venv separately

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
The aws-deep-learning-containers-ci bot added the authorized and Size:XL (Determines the size of the PR) labels on Apr 2, 2026
Yadan Wei and others added 27 commits April 2, 2026 08:24
- espeak (not espeak-ng) available in AL2023 repos
- sox available in AL2023 repos
- ffmpeg installed from static build (not in AL2023 repos)
- Removed libsox-fmt-all (not available on AL2023)

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
- espeak/sox not available in AL2023 minimal CUDA runtime image
- sox binary only needed for Qwen3-TTS 25Hz tokenizer (not 12Hz)
- ffmpeg needed by pydub/imageio-ffmpeg for audio/video I/O
- Removed dnf install for unavailable packages

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
- Upgrade system-release to latest to enable SPAL (requires 2023.9+)
- Install espeak-ng, sox, ffmpeg-free from SPAL (Supplementary Packages for Amazon Linux)
- Replaces static binary approach with official AL2023 package repo

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
- Add test/vllm-omni/sagemaker/requirements.txt with sagemaker>=2,<3
- Install test deps via uv pip matching reusable-vllm-sagemaker-tests pattern
- Run pytest from test/ directory with relative path

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
- Add --stage-init-timeout 600 to server start (TTS models need multi-stage init)
- Add stage_init_timeout=600 to offline Omni() calls
- Increase server wait loop from 120s to 300s

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
- Use existing download-model GitHub action with caching, locking, eviction
- Downloads to /dlc-models/ (root fs) instead of /tmp
- Proper cleanup of lock PIDs and docker images

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
- Pin gradio>=6.7.0 in omni-base CVE patch layer

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
- TTS models use OpenAI-compatible speech endpoint, not chat completions
- Validate output WAV file size instead of JSON response

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
- Both models are public (Apache 2.0, no gating)
- Eliminates S3 download/extract issues (corrupted tarballs, disk space)
- Models downloaded from HF at runtime inside container
- Removed s3_prefix and s3_model from config

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
- Parse response JSON, extract and decode base64 image
- Print only image size instead of full base64 payload
- Validate decoded image is non-trivial (>1000 bytes)

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…16GB T4)

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
- Reusable workflow uses customer-type input (ec2 or sagemaker)
- Maps to vllm_omni_{customer-type}_smoke_test.sh
- No extra test-type parameter needed

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
* fix telemetry ingress rules

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* add test

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* temp test

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* revert workflow

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

---------

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>
* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 254
X-AI-Prompt: I have uploaded test resources training/ and inference/ in s3://dlc-cicd-models/xgboost/container_test_resources/, I need you to create container_tests/ and add the following tests in xgboost test dir - The tests need a helper that replaces ai_algorithms_container_tests using docker-py directly:

test/xgboost/container/
├── conftest.py              # pytest fixtures: --image flag, S3 download, docker client
├── container_helper.py      # replaces ai_algorithms_container_tests
├── test_training.py         # rewritten training tests
├── test_scoring.py          # rewritten inference tests
└── test_batch_transform.py  # rewritten batch transform tests

The container_helper.py needs to:
- Download test resources from S3 to a temp dir (once per session)
- Create /opt/ml/ directory structure in temp dirs
- Write config JSON files (hyperparameters, inputdataconfig, resourceconfig)
- Mount volumes and run the container via docker-py
- For training: wait for exit, return exit code + logs + model files
- For inference: start container, wait for health check, send HTTP requests, you can refer to https://code.amazon.com/packages/SMFrameworksXGBoost3_0-5Tests/trees/mainline/--/src/container_tests

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 135
X-AI-Prompt: Add this in release workflow, comment benchmark tests for now, add on push trigger, create parallel test execution for each test case in wf and prepare cr

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 143
X-AI-Prompt: create a new workflow for xgboost benchmarking, container and integration tests and use that workflow in release wrkflow

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 101
X-AI-Prompt: change the name to - sagemaker-xgboost-integ-tests.yml and remove the integ tests steps it is a todo, comment benchmark tests as i need to test container tests now.

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 25
X-AI-Prompt: change on push current branch

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 13
X-AI-Prompt: remove main this wf will never be pr triggered it is manually triggered

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 41
X-AI-Prompt: yeah lets do with option b

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 22
X-AI-Prompt: E
E         Invoking script with the following command:
E
E         /miniconda3/bin/python3 -m sagemaker_xgboost_container.training:main --alpha 0.0 --base_score 0.5 --booster gbtree --colsample_bylevel 1 --colsample_bytree 1.0 --csv_weights 1 --dsplit row --early_stopping_rounds 5 --eta 0.3 --eval_metric error --gamma 0.0 --grow_policy depthwise --lambda 1.0 --lambda_bias 0.0 --max_bin 256 --max_delta_step 0 --max_depth 6 --max_leaves 0 --min_child_weight 1.0 --normalize_type tree --nthread 8 --num_round 10 --objective binary:logistic --one_drop 0 --predictor cpu_predictor --process_type default --rate_drop 0.0 --refresh_leaf 1 --sample_type uniform --scale_pos_weight 1.0 --silent 0 --sketch_eps 0.03 --skip_drop 0.0 --subsample 1.0 --tree_method auto --tweedie_variance_power 1.5 --updater grow_colmaker,prune
E
E
E         /miniconda3/bin/python3: No module named sagemaker_xgboost_container.training:main
E         [2026-03-31:21:26:07:ERROR] ExecuteUserScriptError:
E         Command "/miniconda3/bin/python3 -m sagemaker_xgboost_container.training:main --alpha 0.0 --base_score 0.5 --booster gbtree --colsample_bylevel 1 --colsample_bytree 1.0 --csv_weights 1 --dsplit row --early_stopping_rounds 5 --eta 0.3 --eval_metric error --gamma 0.0 --grow_policy depthwise --lambda 1.0 --lambda_bias 0.0 --max_bin 256 --max_delta_step 0 --max_depth 6 --max_leaves 0 --min_child_weight 1.0 --normalize_type tree --nthread 8 --num_round 10 --objective binary:logistic --one_drop 0 --predictor cpu_predictor --process_type default --rate_drop 0.0 --refresh_leaf 1 --sample_type uniform --scale_pos_weight 1.0 --silent 0 --sketch_eps 0.03 --skip_drop 0.0 --subsample 1.0 --tree_method auto --tweedie_variance_power 1.5 --updater grow_colmaker,prune"
E
E       assert 1 == 0

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 50
X-AI-Prompt: scan for red flags

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 47
X-AI-Prompt: can we regrenate the model durng test time and upload back to s3?

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 38
X-AI-Prompt: RuntimeError: Model /opt/ml/model/mnist-pkl-model cannot be loaded:
Pickle load error=[21:37:57] /workspace/src/learner.cc:1185: Check failed: header == serialisation_header_: If you are loading a serialized model (like pickle in Python, RDS in R) or
configuration generated by an older version of XGBoost, please export the model by calling
`Booster.save_model` from that version first, then load it back in current version. See:

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 30
X-AI-Prompt: During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/miniconda3/lib/python3.10/site-packages/sagemaker_containers/_trainer.py", line 84, in train
entrypoint()
File "/miniconda3/lib/python3.10/site-packages/sagemaker_xgboost_container/training.py", line 102, in main
train(framework.training_env())
File "/miniconda3/lib/python3.10/site-packages/sagemaker_xgboost_container/training.py", line 98, in train
run_algorithm_mode()
File "/miniconda3/lib/python3.10/site-packages/sagemaker_xgboost_container/training.py", line 64, in run_algorithm_mode
sagemaker_train(
File "/miniconda3/lib/python3.10/site-packages/sagemaker_xgboost_container/algorithm_mode/train.py", line 144, in sagemaker_train
validated_train_config = hyperparameters.validate(train_config)
File "/miniconda3/lib/python3.10/site-packages/sagemaker_algorithm_toolkit/hyperparameter_validation.py", line 278, in validate
raise exc.UserError("Extraneous hyperparameter found: {}".format(hp))
sagemaker_algorithm_toolkit.exceptions.UserError: Extraneous hyperparameter found: silent

Extraneous hyperparameter found: silent

assert 1 == 0
FAILED xgboost/container/test_training.py::TestValidTraining::test_checkpoint_and_reload - assert 1 == 0

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 21
X-AI-Prompt: The fix is a one-liner in ServingContainer.__enter__. The XGBoost serving entrypoint (sagemaker_xgboost_container.serving) reads
/opt/ml/input/config/resourceconfig.json on startup. Without it, the Python app fails to initialize, gunicorn workers exit with code 3, and you
get the HaltServer 'Worker failed to boot.' error.

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 14
X-AI-Prompt: ### 2. container_helper.py — tmpdir not cleaned up in __exit__

Both run_training and ServingContainer create temp dirs but never clean them up. The training function at least returns paths so the caller
could clean up, but ServingContainer stores self._opt_ml and never removes it.

Fix: Add cleanup in __exit__:

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 19
X-AI-Prompt: test_training.py — test_checkpoint_and_reload has inline import json

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 31
X-AI-Prompt: test_training.py — test_checkpoint_and_reload phase 2 container not cleaned up on timeout

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 33
X-AI-Prompt: container-test-training installs docker pytest boto3 but not requests. The training tests import run_training from container_helper, which
imports requests at module level. This will fail at import time.

* Human changes made during kiro-cli session after prompt completion.
---
X-AI-Tool: Human
X-AI-Prompt: tests are still failing with same reason

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 35
X-AI-Prompt: scan for red flags

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 84
X-AI-Prompt: RuntimeError: Model /opt/ml/model/mnist-pkl-model cannot be loaded:
Pickle load error=[23:48:50] /workspace/src/learner.cc:1185: Check failed: header == serialisation_header_: If you are loading a serialized model (like pickle in Python, RDS in R) or
configuration generated by an older version of XGBoost, please export the model by calling
`Booster.save_model` from that version first, then load it back in current version. See:

* Human changes made during kiro-cli session after prompt completion.
---
X-AI-Tool: Human
X-AI-Prompt: RuntimeError: Model /opt/ml/model/mnist-pkl-model cannot be loaded:
Pickle load error=[23:48:50] /workspace/src/learner.cc:1185: Check failed: header == serialisation_header_: If you are loading a serialized model (like pickle in Python, RDS in R) or
configuration generated by an older version of XGBoost, please export the model by calling
`Booster.save_model` from that version first, then load it back in current version. See:

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 32
X-AI-Prompt:
============================= test session starts ==============================
platform linux -- Python 3.12.12, pytest-9.0.2, pluggy-1.6.0 -- /tmp/codebuild-b0ba6d93-4eb5-444e-b8c3-bebc7c5b99fa/output/src3763/src/eeeffba7_95a5_4ce7_9fdc_ed0e3f9ffdaa/actions-runner/_work/deep-learning-containers/deep-learning-containers/.venv/bin/python3
cachedir: .pytest_cache
rootdir: /tmp/codebuild-b0ba6d93-4eb5-444e-b8c3-bebc7c5b99fa/output/src3763/src/eeeffba7_95a5_4ce7_9fdc_ed0e3f9ffdaa/actions-runner/_work/deep-learning-containers/deep-learning-containers
configfile: pyproject.toml
collecting ... collected 3 items
xgboost/container/test_batch_transform.py::TestBatchTransform::test_libsvm_batch FAILED
xgboost/container/test_batch_transform.py::TestBatchTransform::test_recordio_protobuf_batch PASSED
xgboost/container/test_batch_transform.py::TestBatchTransform::test_csv_batch PASSED
=================================== FAILURES ===================================
_____________________ TestBatchTransform.test_libsvm_batch _____________________
self = <container.test_batch_transform.TestBatchTransform object at 0x7fd663720d40>
docker_client = <docker.client.DockerClient object at 0x7fd6638eec60>
image_uri = '404426647817.dkr.ecr.us-west-2.amazonaws.com/ci:xgboost-3.0.5-cpu-py310-cu126-ubuntu20.04-sagemaker-23864956268'
inference_resources = '/tmp/xgb-container-test-o7vvveha/inference'
def test_libsvm_batch(self, docker_client, image_uri, inference_resources):
responses = _send_batch_requests(
docker_client, image_uri, inference_resources, "mnist-xgb-model", "text/x-libsvm",
["mnist-1.libsvm", "mnist-less-dim-1.libsvm",
"mnist-plus-onedim-1.libsvm", "mnist-700.libsvm"],
)
_validate_batch_response(responses[0], 1)
_validate_batch_response(responses[1], 1)
>       _validate_batch_response(responses[2], 1)
xgboost/container/test_batch_transform.py:72:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
resp = <Response [400]>, expected_length = 1
def _validate_batch_response(resp, expected_length):
"""Batch responses are newline-delimited; trailing newline adds +1."""
>       assert resp.status_code == httplib.OK, resp.text
E       AssertionError: Unable to evaluate payload provided: [18:45:55] /workspace/src/learner.cc:1483: Check failed: learner_model_param_.num_feature >= p_fmat->Info().num_col_ (785 vs. 786) : Number of columns does not match number of features in booster.
E         Stack trace:
E           [bt] (0) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x2a6e7c) [0x7fc72964de7c]
E           [bt] (1) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x6777a9) [0x7fc729a1e7a9]
E           [bt] (2) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x68d962) [0x7fc729a34962]
E           [bt] (3) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(XGBoosterPredictFromDMatrix+0x2de) [0x7fc72956196e]
E           [bt] (4) /miniconda3/lib/python3.10/lib-dynload/../../libffi.so.8(+0x702a) [0x7fc74a42302a]
E           [bt] (5) /miniconda3/lib/python3.10/lib-dynload/../../libffi.so.8(+0x64a9) [0x7fc74a4224a9]
E           [bt] (6) /miniconda3/lib/python3.10/lib-dynload/../../libffi.so.8(ffi_call+0xdd) [0x7fc74a422bbd]
E           [bt] (7) /miniconda3/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so(+0x8c7b) [0x7fc74a430c7b]
E           [bt] (8) /miniconda3/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so(+0x8565) [0x7fc74a430565]
E
E
E       assert 400 == <HTTPStatus.OK: 200>
E        +  where 400 = <Response [400]>.status_code
E        +  and   <HTTPStatus.OK: 200> = httplib.OK
xgboost/container/test_batch_transform.py:53: AssertionError
==================================== PASSES ====================================
=========================== short test summary info ============================
PASSED xgboost/container/test_batch_transform.py::TestBatchTransform::test_recordio_protobuf_batch
PASSED xgboost/container/test_batch_transform.py::TestBatchTransform::test_csv_batch
FAILED xgboost/container/test_batch_transform.py::TestBatchTransform::test_libsvm_batch - AssertionError: Unable to evaluate payload provided: [18:45:55] /workspace/src/learner.cc:1483: Check failed: learner_model_param_.num_feature >= p_fmat->Info().num_col_ (785 vs. 786) : Number of columns does not match number of features in booster.
Stack trace:
[bt] (0) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x2a6e7c) [0x7fc72964de7c]
[bt] (1) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x6777a9) [0x7fc729a1e7a9]
[bt] (2) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x68d962) [0x7fc729a34962]
[bt] (3) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(XGBoosterPredictFromDMatrix+0x2de) [0x7fc72956196e]
[bt] (4) /miniconda3/lib/python3.10/lib-dynload/../../libffi.so.8(+0x702a) [0x7fc74a42302a]
[bt] (5) /miniconda3/lib/python3.10/lib-dynload/../../libffi.so.8(+0x64a9) [0x7fc74a4224a9]
[bt] (6) /miniconda3/lib/python3.10/lib-dynload/../../libffi.so.8(ffi_call+0xdd) [0x7fc74a422bbd]
[bt] (7) /miniconda3/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so(+0x8c7b) [0x7fc74a430c7b]
[bt] (8) /miniconda3/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so(+0x8565) [0x7fc74a430565]

assert 400 == <HTTPStatus.OK: 200>
+  where 400 = <Response [400]>.status_code
+  and   <HTTPStatus.OK: 200> = httplib.OK
========================= 1 failed, 2 passed in 37.90s =========================
Error: Process completed with exit code 1.
how is the test passing? we must need to know what the logs are?

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 29
X-AI-Prompt: same here, xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_weights PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_hpo_param PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_multiclass_hpo PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_iterate_objectives FAILED
xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_threshold_eval_metric PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_verbosity PASSED
xgboost/container/test_training.py::TestValidTraining::test_multi_files_libsvm PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_csv PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_csv_weights PASSED
xgboost/container/test_training.py::TestValidTraining::test_multi_file_csv PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_csv_space_separated PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_csv_sci_notation PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_csv_empty_cells PASSED
xgboost/container/test_training.py::TestValidTraining::test_checkpoint_and_reload FAILED
xgboost/container/test_training.py::TestInvalidTraining::test_no_training_data PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_no_validation_data PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_data_csv_content_type PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_csv_alpha_with_csv_content_type PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_csv_data_with_libsvm_content_type PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_data_with_libsvm_content_type PASSED

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 106
X-AI-Prompt: Run source .venv/bin/activate
============================= test session starts ==============================
platform linux -- Python 3.12.12, pytest-9.0.2, pluggy-1.6.0 -- /tmp/codebuild-8acc520a-64b1-45e6-8ddc-2078a24507b5/output/src787/src/b09928cc_a4a3_4b96_9bee_901575f815e0/actions-runner/_work/deep-learning-containers/deep-learning-containers/.venv/bin/python3
cachedir: .pytest_cache
rootdir: /tmp/codebuild-8acc520a-64b1-45e6-8ddc-2078a24507b5/output/src787/src/b09928cc_a4a3_4b96_9bee_901575f815e0/actions-runner/_work/deep-learning-containers/deep-learning-containers
configfile: pyproject.toml
collecting ... collected 45 items

xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_weights PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_hpo_param PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_multiclass_hpo PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_iterate_objectives FAILED
xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_threshold_eval_metric PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_verbosity PASSED
xgboost/container/test_training.py::TestValidTraining::test_multi_files_libsvm PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_csv PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_csv_weights PASSED
xgboost/container/test_training.py::TestValidTraining::test_multi_file_csv PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_csv_space_separated PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_csv_sci_notation PASSED
xgboost/container/test_training.py::TestValidTraining::test_single_file_csv_empty_cells PASSED
xgboost/container/test_training.py::TestValidTraining::test_checkpoint_and_reload FAILED
xgboost/container/test_training.py::TestInvalidTraining::test_no_training_data PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_no_validation_data PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_data_csv_content_type PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_csv_alpha_with_csv_content_type PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_csv_data_with_libsvm_content_type PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_data_with_libsvm_content_type PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[eta-values0] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[gamma-values1] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[max_depth-values2] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[min_child_weight-values3] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[max_delta_step-values4] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[colsample_bytree-values5] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[colsample_bylevel-values6] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[tree_method-values7] FAILED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[sketch_eps-values8] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[refresh_leaf-values9] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[process_type-values10] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[grow_policy-values11] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[sample_type-values12] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[normalize_type-values13] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[rate_drop-values14] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[one_drop-values15] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[skip_drop-values16] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[tweedie_variance_power-values17] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[eval_metric-values18] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[booster-values19] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[verbosity-values20] PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_missing_num_round PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_multiclass_without_num_class PASSED
xgboost/container/test_training.py::TestInvalidTraining::test_pipe_mode_rejected PASSED

=================================== FAILURES ===================================
_________ TestValidTraining.test_single_file_libsvm_iterate_objectives _________

self = <container.test_training.TestValidTraining object at 0x7f11f6c34ce0>
docker_client = <docker.client.DockerClient object at 0x7f11f6f7d0d0>
image_uri = '404426647817.dkr.ecr.us-west-2.amazonaws.com/ci:xgboost-3.0.5-cpu-py310-cu126-ubuntu20.04-sagemaker-23865911659'
training_resources = '/tmp/xgb-container-test-ptswvydm/training'

    def test_single_file_libsvm_iterate_objectives(self, docker_client, image_uri, training_resources):
        hp = copy.deepcopy(STD_HP)
        d = _libsvm_dir(training_resources)
        for obj in ["reg:squarederror", "binary:logistic", "count:poisson",
                    "reg:gamma", "reg:tweedie"]:
            hp["objective"] = obj
            result = _run(docker_client, image_uri, training_resources, hp, STD_IDC, STD_RC,
                          [os.path.join(d, "agaricus.libsvm.train")],
                          [os.path.join(d, "agaricus.libsvm.test")])
>           _assert_success(result)

xgboost/container/test_training.py:170:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

result = (1, '/miniconda3/lib/python3.10/site-packages/sagemaker_containers/_server.py:22: UserWarning: pkg_resources is deprec... '/tmp/xgb-train-bkhw5xxo/input/data/train', 'input_validation': '/tmp/xgb-train-bkhw5xxo/input/data/validation', ...})
regex = None

    def _assert_success(result, regex=None):
        exit_code, logs, model_files, _ = result
>       assert exit_code == 0, f"Training failed:\n{logs}"
E       AssertionError: Training failed:
E         /miniconda3/lib/python3.10/site-packages/sagemaker_containers/_server.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
E           import pkg_resources
E         [2026-04-01:19:09:22:INFO] Imported framework sagemaker_xgboost_container.training
E         [2026-04-01:19:09:22:INFO] Failed to parse hyperparameter eval_metric value error to Json.
E         Returning the value itself
E         [2026-04-01:19:09:22:INFO] Failed to parse hyperparameter predictor value cpu_predictor to Json.
E         Returning the value itself
E         [2026-04-01:19:09:22:INFO] Failed to parse hyperparameter tree_method value auto to Json.
E         Returning the value itself
E         [2026-04-01:19:09:22:INFO] Failed to parse hyperparameter normalize_type value tree to Json.
E         Returning the value itself
E         [2026-04-01:19:09:22:INFO] Failed to parse hyperparameter sample_type value uniform to Json.
E         Returning the value itself
E         [2026-04-01:19:09:22:INFO] Failed to parse hyperparameter booster value gbtree to Json.
E         Returning the value itself
E         [2026-04-01:19:09:22:INFO] Failed to parse hyperparameter objective value reg:gamma to Json.
E         Returning the value itself
E         [2026-04-01:19:09:22:INFO] Failed to parse hyperparameter updater value grow_colmaker,prune to Json.
E         Returning the value itself
E         [2026-04-01:19:09:22:INFO] Failed to parse hyperparameter process_type value default to Json.
E         Returning the value itself
E         [2026-04-01:19:09:22:INFO] Failed to parse hyperparameter dsplit value row to Json.
E         Returning the value itself
E         [2026-04-01:19:09:22:INFO] Failed to parse hyperparameter grow_policy value depthwise to Json.
E         Returning the value itself
E         [2026-04-01:19:09:22:INFO] No GPUs detected (normal if no gpus installed)
E         [2026-04-01:19:09:22:INFO] Running XGBoost Sagemaker in algorithm mode
E         [2026-04-01:19:09:22:INFO] Determined 0 GPU(s) available on the instance.
E         [2026-04-01:19:09:22:INFO] File path /opt/ml/input/data/train of input files
E         [2026-04-01:19:09:22:INFO] Making smlinks from folder /opt/ml/input/data/train to folder /tmp/sagemaker_xgboost_input_data
E         [2026-04-01:19:09:22:INFO] creating symlink between Path /opt/ml/input/data/train/agaricus.libsvm.train and destination /tmp/sagemaker_xgboost_input_data/agaricus.libsvm.train1664359970552213804
E         [2026-04-01:19:09:22:INFO] files path: /tmp/sagemaker_xgboost_input_data
E         [2026-04-01:19:09:22:INFO] File path /opt/ml/input/data/validation of input files
E         [2026-04-01:19:09:22:INFO] Making smlinks from folder /opt/ml/input/data/validation to folder /tmp/sagemaker_xgboost_input_data
E         [2026-04-01:19:09:22:INFO] creating symlink between Path /opt/ml/input/data/validation/agaricus.libsvm.test and destination /tmp/sagemaker_xgboost_input_data/agaricus.libsvm.test1757920320072049626
E         [2026-04-01:19:09:22:INFO] files path: /tmp/sagemaker_xgboost_input_data
E         [2026-04-01:19:09:22:INFO] Single node training.
E         [2026-04-01:19:09:22:INFO] TRAIN_JOB_DEBUG: Received is_master=True
E         TRAIN_JOB_DEBUG: Received is_master=True
E         [2026-04-01:19:09:22:INFO] Train matrix has 6513 rows and 127 columns
E         [2026-04-01:19:09:22:INFO] Validation matrix has 1611 rows
E         [2026-04-01:19:09:22:INFO] CALLBACK_SETUP_DEBUG: save_model_on_termination=false, is_master=True
E         [2026-04-01:19:09:22:INFO] CALLBACK_SKIPPING save_model_on_termination=false, is_master=True)
E         /miniconda3/lib/python3.10/site-packages/xgboost/callback.py:386: UserWarning: [19:09:22] WARNING: /workspace/src/common/error_msg.cc:33: You have manually specified the `updater` parameter. The `tree_method` parameter will be ignored. Incorrect sequence of updaters will produce undefined behavior. For common uses, we recommend using `tree_method` parameter instead.
E           self.starting_round = model.num_boosted_rounds()
E         /miniconda3/lib/python3.10/site-packages/xgboost/callback.py:386: UserWarning: [19:09:22] WARNING: /workspace/src/learner.cc:738:
E         Parameters: { "dsplit", "lambda_bias", "normalize_type", "one_drop", "predictor", "rate_drop", "sample_type", "sketch_eps", "skip_drop", "tweedie_variance_power" } are not used.
E
E           self.starting_round = model.num_boosted_rounds()
E         [2026-04-01:19:09:22:ERROR] Reporting training FAILURE
E         [2026-04-01:19:09:22:ERROR] framework error:
E         Traceback (most recent call last):
E           File "/miniconda3/lib/python3.10/site-packages/sagemaker_xgboost_container/algorithm_mode/train.py", line 367, in train_job
E             bst = xgb.train(
E           File "/miniconda3/lib/python3.10/site-packages/xgboost/core.py", line 729, in inner_f
E             return func(**kwargs)
E           File "/miniconda3/lib/python3.10/site-packages/xgboost/training.py", line 183, in train
E             bst.update(dtrain, iteration=i, fobj=obj)
E           File "/miniconda3/lib/python3.10/site-packages/xgboost/core.py", line 2246, in update
E             _check_call(
E           File "/miniconda3/lib/python3.10/site-packages/xgboost/core.py", line 310, in _check_call
E             raise XGBoostError(py_str(_LIB.XGBGetLastError()))
E         xgboost.core.XGBoostError: [19:09:22] /workspace/src/objective/regression_obj.cu:88: label must be positive for gamma regression.
E         Stack trace:
E           [bt] (0) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x2a6e7c) [0x7fb583957e7c]
E           [bt] (1) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0xf02dcb) [0x7fb5845b3dcb]
E           [bt] (2) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0xf03333) [0x7fb5845b4333]
E           [bt] (3) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x68d2a2) [0x7fb583d3e2a2]
E           [bt] (4) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x77) [0x7fb583867f57]
E           [bt] (5) /miniconda3/lib/python3.10/lib-dynload/../../libffi.so.8(+0x702a) [0x7fb5b767602a]
E           [bt] (6) /miniconda3/lib/python3.10/lib-dynload/../../libffi.so.8(+0x64a9) [0x7fb5b76754a9]
E           [bt] (7) /miniconda3/lib/python3.10/lib-dynload/../../libffi.so.8(ffi_call+0xdd) [0x7fb5b7675bbd]
E           [bt] (8) /miniconda3/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so(+0x8c7b) [0x7fb5b7683c7b]
E
E
E
E         During handling of the above exception, another exception occurred:
E
E         Traceback (most recent call last):
E           File "/miniconda3/lib/python3.10/site-packages/sagemaker_containers/_trainer.py", line 84, in train
E             entrypoint()
E           File "/miniconda3/lib/python3.10/site-packages/sagemaker_xgboost_container/training.py", line 102, in main
E             train(framework.training_env())
E           File "/miniconda3/lib/python3.10/site-packages/sagemaker_xgboost_container/training.py", line 98, in train
E             run_algorithm_mode()
E           File "/miniconda3/lib/python3.10/site-packages/sagemaker_xgboost_container/training.py", line 64, in run_algorithm_mode
E             sagemaker_train(
E           File "/miniconda3/lib/python3.10/site-packages/sagemaker_xgboost_container/algorithm_mode/train.py", line 278, in sagemaker_train
E             train_job(**train_args)
E           File "/miniconda3/lib/python3.10/site-packages/sagemaker_xgboost_container/algorithm_mode/train.py", line 467, in train_job
E             raise exc.AlgorithmError(f"{exception_prefix}:\n {str(e)}")
E         sagemaker_algorithm_toolkit.exceptions.AlgorithmError: XGB train call failed with exception:
E          [19:09:22] /workspace/src/objective/regression_obj.cu:88: label must be positive for gamma regression.
E         Stack trace:
E           [bt] (0) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x2a6e7c) [0x7fb583957e7c]
E           [bt] (1) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0xf02dcb) [0x7fb5845b3dcb]
E           [bt] (2) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0xf03333) [0x7fb5845b4333]
E           [bt] (3) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x68d2a2) [0x7fb583d3e2a2]
E           [bt] (4) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x77) [0x7fb583867f57]
E           [bt] (5) /miniconda3/lib/python3.10/lib-dynload/../../libffi.so.8(+0x702a) [0x7fb5b767602a]
E           [bt] (6) /miniconda3/lib/python3.10/lib-dynload/../../libffi.so.8(+0x64a9) [0x7fb5b76754a9]
E           [bt] (7) /miniconda3/lib/python3.10/lib-dynload/../../libffi.so.8(ffi_call+0xdd) [0x7fb5b7675bbd]
E           [bt] (8) /miniconda3/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so(+0x8c7b) [0x7fb5b7683c7b]
E
E
E
E         XGB train call failed with exception:
E          [19:09:22] /workspace/src/objective/regression_obj.cu:88: label must be positive for gamma regression.
E         Stack trace:
E           [bt] (0) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x2a6e7c) [0x7fb583957e7c]
E           [bt] (1) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0xf02dcb) [0x7fb5845b3dcb]
E           [bt] (2) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0xf03333) [0x7fb5845b4333]
E           [bt] (3) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x68d2a2) [0x7fb583d3e2a2]
E           [bt] (4) /miniconda3/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x77) [0x7fb583867f57]
E           [bt] (5) /miniconda3/lib/python3.10/lib-dynload/../../libffi.so.8(+0x702a) [0x7fb5b767602a]
E           [bt] (6) /miniconda3/lib/python3.10/lib-dynload/../../libffi.so.8(+0x64a9) [0x7fb5b76754a9]
E           [bt] (7) /miniconda3/lib/python3.10/lib-dynload/../../libffi.so.8(ffi_call+0xdd) [0x7fb5b7675bbd]
E           [bt] (8) /miniconda3/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so(+0x8c7b) [0x7fb5b7683c7b]
E
E
E
E       assert 1 == 0

xgboost/container/test_training.py:104: AssertionError
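
The failure above is a data/objective mismatch rather than a container bug: XGBoost's `reg:gamma` objective requires strictly positive labels, while the agaricus dataset used by the test is binary 0/1, so iterating the objective list over that data cannot succeed for gamma regression. A minimal sketch of a pre-flight label check that would surface this before `xgb.train` is called — the helper name and the exact constraint set are illustrative, not part of sagemaker_xgboost_container:

```python
# Hypothetical pre-flight check mirroring the label constraint XGBoost
# enforces at train time ("label must be positive for gamma regression").

def check_labels_for_objective(labels, objective):
    """Return None if labels satisfy the objective's constraint, else a message."""
    if objective == "reg:gamma" and any(y <= 0 for y in labels):
        return "label must be positive for gamma regression"
    if objective == "count:poisson" and any(y < 0 for y in labels):
        return "label must be non-negative for poisson regression"
    return None

# agaricus labels are 0/1, so reg:gamma is guaranteed to trip the check:
print(check_labels_for_objective([0, 1, 1, 0], "reg:gamma"))
```

Under this reading, the fix is on the test side: either drop `reg:gamma`/`reg:tweedie` from the iterated objectives for this dataset, or shift the labels to a strictly positive range for those objectives.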
_________________ TestValidTraining.test_checkpoint_and_reload _________________

self = <container.test_training.TestValidTraining object at 0x7f11f6c37380>
docker_client = <docker.client.DockerClient object at 0x7f11f6f7d0d0>
image_uri = '404426647817.dkr.ecr.us-west-2.amazonaws.com/ci:xgboost-3.0.5-cpu-py310-cu126-ubuntu20.04-sagemaker-23865911659'
training_resources = '/tmp/xgb-container-test-ptswvydm/training'

    def test_checkpoint_and_reload(self, docker_client, image_uri, training_resources):
        """Train 10 rounds, verify checkpoints, then resume to 20 rounds."""
        hp1 = copy.deepcopy(STD_HP)
        hp1["num_round"] = 10
        hp1["eval_metric"] = "error"
        hp1.pop("early_stopping_rounds", None)

        idc = copy.deepcopy(STD_IDC)
        idc["train"]["ContentType"] = "text/libsvm"
        idc.pop("validation", None)

        d = _libsvm_dir(training_resources)
        train_files = [os.path.join(d, "agaricus.libsvm.train")]

        # Phase 1: train 10 rounds
        exit_code, logs, model_files, paths = run_training(
            docker_client, image_uri, hp1, idc, STD_RC,
            training_files=train_files, checkpointconfig=STD_CPC,
        )
        assert exit_code == 0
        assert len(model_files) == 1

        ckpt_files = os.listdir(paths["checkpoints"])
        assert all(f.startswith("xgboost-checkpoint") for f in ckpt_files)
        regex = r"\[\d+\].*(?=.*train-error:.*)"
        assert len(re.findall(regex, logs)) == 10
>       assert len(ckpt_files) == 5
E       AssertionError: assert 1 == 5
E        +  where 1 = len(['xgboost-checkpoint_0.ubj'])

xgboost/container/test_training.py:283: AssertionError
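
The assertion expects five round-indexed checkpoint files after 10 rounds, but only `xgboost-checkpoint_0.ubj` survives, which suggests the checkpoint callback either fired once or overwrote a single file instead of retaining a window of recent rounds. A sketch of the contract the test appears to assume — one checkpoint per round with only the most recent five retained. The function name, the retention size, and the naming scheme are inferred from the assertion and the observed filename, not taken from the container source:

```python
# Illustrative model of the checkpoint-retention contract implied by the
# failing assertion (len(ckpt_files) == 5 after num_round=10).

def expected_checkpoints(num_round, max_to_keep=5, ext="ubj"):
    """Filenames left on disk if one checkpoint is written per round and only
    the most recent `max_to_keep` are retained."""
    start = max(0, num_round - max_to_keep)
    return [f"xgboost-checkpoint_{i}.{ext}" for i in range(start, num_round)]

print(expected_checkpoints(10))  # rounds 5..9, five files
```

That the sole surviving file is for round 0 (not the most recent round) points at the callback writing once at setup and never again, which is consistent with the `CALLBACK_SKIPPING save_model_on_termination=false` debug line in the logs.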
_____ TestInvalidTraining.test_invalid_hyperparameter[tree_method-values7] _____

self = <container.test_training.TestInvalidTraining object at 0x7f11f6c37f20>
docker_client = <docker.client.DockerClient object at 0x7f11f6f7d0d0>
image_uri = '404426647817.dkr.ecr.us-west-2.amazonaws.com/ci:xgboost-3.0.5-cpu-py310-cu126-ubuntu20.04-sagemaker-23865911659'
training_resources = '/tmp/xgb-container-test-ptswvydm/training'
param = 'tree_method', values = ['invalid_method', 'gpu_exact', 'gpu_hist']

    @pytest.mark.parametrize("param,values", [
        ("eta", ["-0.1", "1.01", "invalid_string"]),
        ("gamma", ["-0.1", "invalid_string"]),
        ("max_depth", ["-0.1", "invalid_string"]),
        ("min_child_weight", ["-0.1", "invalid_string"]),
        ("max_delta_step", ["-0.1", "invalid_string"]),
        ("colsample_bytree", ["-0.1", "0", "invalid_string"]),
        ("colsample_bylevel", ["-0.1", "0", "invalid_string"]),
        ("tree_method", ["invalid_method", "gpu_exact", "gpu_hist"]),
        ("sketch_eps", ["0", "1", "invalid_string"]),
        ("refresh_leaf", ["invalid", "2"]),
        ("process_type", ["invalid", "0.01"]),
        ("grow_policy", ["invalid", "0.01"]),
        ("sample_type", ["invalid", "0.01"]),
        ("normalize_type", ["invalid", "0.01"]),
        ("rate_drop", ["invalid", "-0.01", "1.01"]),
        ("one_drop", ["invalid", "-0.01", "1.01"]),
        ("skip_drop", ["invalid", "-0.01", "1.01"]),
        ("tweedie_variance_power", ["invalid", "1", "2"]),
        ("eval_metric", ["invalid", "1", "rmse,invalid", "error@nonfloat"]),
        ("booster", ["invalid", "1"]),
        ("verbosity", ["invalid", "-1", "4", "0.5"]),
    ])
    def test_invalid_hyperparameter(self, docker_client, image_uri, training_resources,
                                    param, values):
        train, val = self._get_libsvm_data(training_resources)
        hp = copy.deepcopy(STD_HP)
        for v in values:
            hp[param] = v
            result = _run(docker_client, image_uri, training_resources, hp, STD_IDC, STD_RC,
                          train, val)
>           _assert_failed(result)

xgboost/container/test_training.py:405:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

result = (0, '/miniconda3/lib/python3.10/site-packages/sagemaker_containers/_server.py:22: UserWarning: pkg_resources is deprec... '/tmp/xgb-train-4tccj7i0/input/data/train', 'input_validation': '/tmp/xgb-train-4tccj7i0/input/data/validation', ...})
regex = 'UserError:'

    def _assert_failed(result, regex="UserError:"):
        exit_code, logs, _, _ = result
>       assert re.search(regex, logs), f"Pattern {regex!r} not found in logs"
E       AssertionError: Pattern 'UserError:' not found in logs
E       assert None
E        +  where None = <function search at 0x7f11f9e60680>('UserError:', '/miniconda3/lib/python3.10/site-packages/sagemaker_containers/_server.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.\n  import pkg_resources\n[2026-04-01:19:11:48:INFO] Imported framework sagemaker_xgboost_container.training\n[2026-04-01:19:11:48:INFO] Failed to parse hyperparameter eval_metric value error to Json.\nReturning the value itself\n[2026-04-01:19:11:48:INFO] Failed to parse hyperparameter predictor value cpu_predictor to Json.\nReturning the value itself\n[2026-04-01:19:11:48:INFO] Failed to parse hyperparameter tree_method value gpu_hist to Json.\nReturning the value itself\n[2026-04-01:19:11:48:INFO] Failed to parse hyperparameter normalize_type value tree to Json.\nReturning the value itself\n[2026-04-01:19:11:48:INFO] Failed to parse hyperparameter sample_type value uniform to Json.\nReturning the value itself\n[2026-04-01:19:11:48:INFO] Failed to parse hyperparameter booster value gbtree to Json.\nReturning the value itself\n[2026-04-01:19:11:48:INFO] Failed to pa...61\tvalidation-error:0.00000\n[4]\ttrain-error:0.00000\tvalidation-error:0.00000\n/miniconda3/lib/python3.10/site-packages/xgboost/callback.py:503: UserWarning: [19:11:48] WARNING: /workspace/src/gbm/gbtree.cc:359: \n  Loading from a raw memory buffer (like pickle in Python, RDS in R) on a CPU-only\n  machine. Consider using `save_model/load_model` instead. See:\n\n    https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html\n\n  for more details about differences between saving model and serializing.  
Changing `tree_method` to `hist`.\n  model = model[: best_iteration + 1]\n[2026-04-01:19:11:48:INFO] FINAL_MODEL_DEBUG: is_master=True, model_dir=/opt/ml/model\nFINAL_MODEL_DEBUG: is_master=True, model_dir=/opt/ml/model\n[2026-04-01:19:11:48:INFO] FINAL_MODEL_SAVE: Saving final model as master\nFINAL_MODEL_SAVE: Saving final model as master\n/miniconda3/lib/python3.10/site-packages/sagemaker_xgboost_container/algorithm_mode/train.py:480: UserWarning: [19:11:48] WARNING: /workspace/src/c_api/c_api.cc:1427: Saving model in the UBJSON format as default.  You can use file extension: `json`, `ubj` or `deprecated` to choose between formats.\n  bst.save_model(model_location)\n')
E        +    where <function search at 0x7f11f9e60680> = re.search

xgboost/container/test_training.py:112: AssertionError
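
This failure is a behavior change rather than a flake: the test expects `tree_method=gpu_hist` to be rejected on a CPU instance, but the captured log shows XGBoost 3.x remapping the deprecated alias instead ("Changing `tree_method` to `hist`"), so training exits 0 and the expected `UserError:` never appears. A sketch of a validator that keeps the test's intent by flagging the deprecated GPU aliases itself instead of relying on XGBoost to error. The names and the valid set are illustrative, not taken from sagemaker_xgboost_container:

```python
# Illustrative hyperparameter validator reflecting the XGBoost 3.x behavior
# seen above: gpu_hist is silently remapped to hist, so a test that expects
# it to fail must reject the alias before handing it to xgboost.

VALID_TREE_METHODS = {"auto", "exact", "approx", "hist"}
DEPRECATED_GPU_ALIASES = {"gpu_hist", "gpu_exact"}

def tree_method_is_invalid(value):
    """True if the container's validation should reject this tree_method."""
    if value in DEPRECATED_GPU_ALIASES:
        return True  # remapped (not errored) by XGBoost 3.x, so flag it here
    return value not in VALID_TREE_METHODS

assert tree_method_is_invalid("gpu_hist")
assert not tree_method_is_invalid("hist")
```

The alternative fix is to drop `gpu_hist`/`gpu_exact` from the parametrized invalid values and treat the remap-plus-warning as the accepted behavior on CPU images.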
==================================== PASSES ====================================
=========================== short test summary info ============================
PASSED xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm
PASSED xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_weights
PASSED xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_hpo_param
PASSED xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_multiclass_hpo
PASSED xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_threshold_eval_metric
PASSED xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_verbosity
PASSED xgboost/container/test_training.py::TestValidTraining::test_multi_files_libsvm
PASSED xgboost/container/test_training.py::TestValidTraining::test_single_file_csv
PASSED xgboost/container/test_training.py::TestValidTraining::test_single_file_csv_weights
PASSED xgboost/container/test_training.py::TestValidTraining::test_multi_file_csv
PASSED xgboost/container/test_training.py::TestValidTraining::test_single_file_csv_space_separated
PASSED xgboost/container/test_training.py::TestValidTraining::test_single_file_csv_sci_notation
PASSED xgboost/container/test_training.py::TestValidTraining::test_single_file_csv_empty_cells
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_no_training_data
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_no_validation_data
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_data_csv_content_type
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_csv_alpha_with_csv_content_type
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_csv_data_with_libsvm_content_type
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_data_with_libsvm_content_type
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[eta-values0]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[gamma-values1]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[max_depth-values2]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[min_child_weight-values3]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[max_delta_step-values4]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[colsample_bytree-values5]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[colsample_bylevel-values6]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[sketch_eps-values8]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[refresh_leaf-values9]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[process_type-values10]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[grow_policy-values11]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[sample_type-values12]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[normalize_type-values13]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[rate_drop-values14]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[one_drop-values15]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[skip_drop-values16]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[tweedie_variance_power-values17]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[eval_metric-values18]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[booster-values19]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[verbosity-values20]
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_missing_num_round
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_multiclass_without_num_class
PASSED xgboost/container/test_training.py::TestInvalidTraining::test_pipe_mode_rejected
FAILED xgboost/container/test_training.py::TestValidTraining::test_single_file_libsvm_iterate_objectives - AssertionError: Training failed:
FAILED xgboost/container/test_training.py::TestValidTraining::test_checkpoint_and_reload - AssertionError: assert 1 == 5
+  where 1 = len(['xgboost-checkpoint_0.ubj'])
FAILED xgboost/container/test_training.py::TestInvalidTraining::test_invalid_hyperparameter[tree_method-values7] - AssertionError: Pattern 'UserError:' not found in logs
assert None
+  where None = <function search at 0x7f11f9e60680>('UserError:', '/miniconda3/lib/python3.10/site-packages/sagemaker_containers/_server.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.\n  import pkg_resources\n[2026-04-01:19:11:48:INFO] Imported framework sagemaker_xgboost_container.training\n[2026-04-01:19:11:48:INFO] Failed to parse hyperparameter eval_metric value error to Json.\nReturning the value itself\n[2026-04-01:19:11:48:INFO] Failed to parse hyperparameter predictor value cpu_predictor to Json.\nReturning the value itself\n[2026-04-01:19:11:48:INFO] Failed to parse hyperparameter tree_method value gpu_hist to Json.\nReturning the value itself\n[2026-04-01:19:11:48:INFO] Failed to parse hyperparameter normalize_type value tree to Json.\nReturning the value itself\n[2026-04-01:19:11:48:INFO] Failed to parse hyperparameter sample_type value uniform to Json.\nReturning the value itself\n[2026-04-01:19:11:48:INFO] Failed to parse hyperparameter booster value gbtree to Json.\nReturning the value itself\n[2026-04-01:19:11:48:INFO] Failed to pa...61\tvalidation-error:0.00000\n[4]\ttrain-error:0.00000\tvalidation-error:0.00000\n/miniconda3/lib/python3.10/site-packages/xgboost/callback.py:503: UserWarning: [19:11:48] WARNING: /workspace/src/gbm/gbtree.cc:359: \n  Loading from a raw memory buffer (like pickle in Python, RDS in R) on a CPU-only\n  machine. Consider using `save_model/load_model` instead. See:\n\n    https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html\n\n  for more details about differences between saving model and serializing.  
Changing `tree_method` to `hist`.\n  model = model[: best_iteration + 1]\n[2026-04-01:19:11:48:INFO] FINAL_MODEL_DEBUG: is_master=True, model_dir=/opt/ml/model\nFINAL_MODEL_DEBUG: is_master=True, model_dir=/opt/ml/model\n[2026-04-01:19:11:48:INFO] FINAL_MODEL_SAVE: Saving final model as master\nFINAL_MODEL_SAVE: Saving final model as master\n/miniconda3/lib/python3.10/site-packages/sagemaker_xgboost_container/algorithm_mode/train.py:480: UserWarning: [19:11:48] WARNING: /workspace/src/c_api/c_api.cc:1427: Saving model in the UBJSON format as default.  You can use file extension: `json`, `ubj` or `deprecated` to choose between formats.\n  bst.save_model(model_location)\n')
+    where <function search at 0x7f11f9e60680> = re.search
=================== 3 failed, 42 passed in 357.53s (0:05:57) ===================
__________________ TestValidScoring.test_execution_parameters __________________

self = <container.test_scoring.TestValidScoring object at 0x7f92d4618500>
docker_client = <docker.client.DockerClient object at 0x7f92d38fd0d0>
image_uri = '404426647817.dkr.ecr.us-west-2.amazonaws.com/ci:xgboost-3.0.5-cpu-py310-cu126-ubuntu20.04-sagemaker-23865911659'
inference_resources = '/tmp/xgb-container-test-n8qucxal/inference'

def test_execution_parameters(self, docker_client, image_uri, inference_resources):
model_dir = _model_path(inference_resources, "mnist-xgb-model")
env = {"MAX_CONTENT_LENGTH": str(21 * 1024 ** 2)}
with ServingContainer(docker_client, image_uri, model_dir, env) as ctx:
resp = ctx.execution_parameters()
params = json.loads(resp.text)
assert params["BatchStrategy"] == "MULTI_RECORD"
assert params["MaxConcurrentTransforms"] == multiprocessing.cpu_count()
>       assert params["MaxPayloadInMB"] == 20
E       assert 21 == 20

xgboost/container/test_scoring.py:74: AssertionError
_____________________ TestValidScoring.test_csv_inference ______________________

self = <container.test_scoring.TestValidScoring object at 0x7f92d3553e30>
docker_client = <docker.client.DockerClient object at 0x7f92d38fd0d0>
image_uri = '404426647817.dkr.ecr.us-west-2.amazonaws.com/ci:xgboost-3.0.5-cpu-py310-cu126-ubuntu20.04-sagemaker-23865911659'
inference_resources = '/tmp/xgb-container-test-n8qucxal/inference'

def test_csv_inference(self, docker_client, image_uri, inference_resources):
# mnist xgb model
responses = _send_requests(
docker_client, image_uri, inference_resources, "mnist-xgb-model", "text/csv",
["mnist-1.csv", "mnist-empty-cell.csv", "mnist-equal-dim.csv", "mnist-700.csv"],
)
_validate_response(responses[0], 1)
_validate_response(responses[1], 1)
_validate_response(responses[2], 1)
>       _validate_response(responses[3], 700)

xgboost/container/test_scoring.py:85:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

resp = <Response [200]>, expected_length = 700

def _validate_response(resp, expected_length):
assert resp.status_code == httplib.OK, resp.text
predicted = resp.text.split(",")
>       assert len(predicted) == expected_length
E       AssertionError: assert 1 == 700
E        +  where 1 = len(['3.0\n8.0\n6.0\n9.0\n6.0\n4.0\n5.0\n3.0\n8.0\n4.0\n5.0\n2.0\n3.0\n8.0\n4.0\n8.0\n1.0\n5.0\n0.0\n5.0\n9.0\n7.0\n4.0\n1.0\n3.0\n3.0\n0.0\n6.0\n2.0\n9.0\n9.0\n4.0\n1.0\n3.0\n6.0\n8.0\n0.0\n7.0\n7.0\n6.0\n8.0\n9.0\n0.0\n3.0\n8.0\n3.0\n7.0\n7.0\n5.0\n1.0\n4.0\n2.0\n2.0\n9.0\n8.0\n1.0\n1.0\n0.0\n6.0\n6.0\n5.0\n0.0\n1.0\n1.0\n7.0\n2.0\n7.0\n3.0\n1.0\n4.0\n0.0\n5.0\n0.0\n6.0\n8.0\n7.0\n6.0\n8.0\n2.0\n9.0\n4.0\n0.0\n6.0\n1.0\n9.0\n2.0\n6.0\n3.0\n8.0\n4.0\n1.0\n5.0\n6.0\n6.0\n1.0\n7.0\n2.0\n8.0\n6.0\n9.0\n7.0\n0.0\n9.0\n8.0\n6.0\n2.0\n8.0\n3.0\n6.0\n4.0\n9.0\n2.0\n8.0\n6.0\n8.0\n7.0\n8.0\n8.0\n6.0\n9.0\n7.0\n7.0\n6.0\n0.0\n3.0\n6.0\n7.0\n0.0\n9.0\n7.0\n1.0\n3.0\n6.0\n8.0\n9.0\n6.0\n1.0\n7.0\n5.0\n1.0\n3.0\n3.0\n5.0\n7.0\n9.0\n9.0\n6.0\n7.0\n3.0\n6.0\n1.0\n0.0\n4.0\n2.0\n4.0\n5.0\n0.0\n0.0\n1.0\n6.0\n6.0\n4.0\n7.0\n9.0\n4.0\n6.0\n5.0\n2.0\n6.0\n9.0\n8.0\n8.0\n8.0\n5.0\n9.0\n3.0\n8.0\n9.0\n1.0\n8.0\n8.0\n3.0\n4.0\n4.0\n3.0\n0.0\n1.0\n5.0\n4.0\n4.0\n1.0\n8.0\n0.0\n6.0\n1.0\n3.0\n1.0\n0.0\n5.0\n6.0\n0.0\n3.0\n5.0\n4.0\n9.0\n0.0\n3.0\n1.0\n0.0\n9.0\n3.0\n2.0\n8.0\n3.0\n3.0\n7.0\n4.0\n9.0\n2.0\n1.0\n6.0\n2.0\n1.0\n8.0\n1.0\n1.0\n9.0\n7.0\n9.0\n2.0\n2.0\n8.0\n1.0\n7.0\n7.0\n0.0\n1.0\n1.0\n8.0\n2...\n2.0\n7.0\n0.0\n7.0\n1.0\n4.0\n9.0\n7.0\n6.0\n5.0\n4.0\n1.0\n9.0\n2.0\n2.0\n0.0\n1.0\n2.0\n2.0\n0.0\n3.0\n1.0\n7.0\n5.0\n0.0\n4.0\n2.0\n7.0\n1.0\n9.0\n3.0\n0.0\n1.0\n6.0\n2.0\n2.0\n5.0\n1.0\n8.0\n3.0\n1.0\n4.0\n6.0\n2.0\n4.0\n8.0\n5.0\n2.0\n6.0\n4.0\n0.0\n8.0\n5.0\n3.0\n9.0\n3.0\n4.0\n0.0\n9.0\n7.0\n2.0\n8.0\n0.0\n8.0\n5.0\n0.0\n2.0\n9.0\n3.0\n8.0\n4.0\n8.0\n5.0\n0.0\n8.0\n7.0\n9.0\n2.0\n0.0\n5.0\n1.0\n0.0\n2.0\n9.0\n3.0\n2.0\n4.0\n8.0\n5.0\n1.0\n6.0\n8.0\n7.0\n3.0\n8.0\n4.0\n7.0\n9.0\n0.0\n3.0\n1.0\n7.0\n2.0\n4.0\n3.0\n0.0\n4.0\n2.0\n5.0\n5.0\n8.0\n2.0\n5.0\n8.0\n2.0\n4.0\n1.0\n9.0\n7.0\n6.0\n2.0\n1.0\n4.0\n6.0\n1.0\n0.0\n4.0\n6.0\n1.0\n6.0\n4.0\n5.0\n9.0\n8.0\n6.0\n8.0\n8.0\n6.0\n4.0\n1.0\n5.0\n5.0\n3.0\n8.0\n7.0\n4.0\n8.0\n6.0\n4.0\n6.0\n3.0\n6.0\n3.0\n9.0\n5
.0\n4.0\n0.0\n0.0\n6.0\n7.0\n1.0\n6.0\n6.0\n9.0\n8.0\n3.0\n7.0\n0.0\n3.0\n0.0\n1.0\n2.0\n5.0\n8.0\n6.0\n4.0\n0.0\n0.0\n8.0\n2.0\n5.0\n5.0\n0.0\n6.0\n6.0\n1.0\n1.0\n8.0\n5.0\n5.0\n8.0\n1.0\n4.0\n0.0\n7.0\n4.0\n6.0\n3.0\n9.0\n3.0\n1.0\n5.0\n9.0\n7.0\n7.0\n6.0\n1.0\n7.0\n2.0\n6.0\n3.0\n3.0\n4.0\n2.0\n5.0\n2.0\n5.0\n1.0\n3.0\n3.0\n7.0\n1.0\n3.0\n0.0\n1.0\n1.0\n8.0\n3.0\n2.0\n5.0\n2.0\n3.0\n3.0\n4.0\n2.0\n6.0\n7.0\n2.0\n4.0\n'])

xgboost/container/test_scoring.py:57: AssertionError
____________________ TestValidScoring.test_libsvm_inference ____________________

self = <con

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 20
X-AI-Prompt: you can change the runner to use gpu fleet for container tests

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 19
X-AI-Prompt: _________________ TestValidScoring.test_binary_classification __________________

self = <container.test_scoring.TestValidScoring object at 0x7f92d3553380>
docker_client = <docker.client.DockerClient object at 0x7f92d38fd0d0>
image_uri = '404426647817.dkr.ecr.us-west-2.amazonaws.com/ci:xgboost-3.0.5-cpu-py310-cu126-ubuntu20.04-sagemaker-23865911659'
inference_resources = '/tmp/xgb-container-test-n8qucxal/inference'

def test_binary_classification(self, docker_client, image_uri, inference_resources):
>       responses = _send_requests(
docker_client, image_uri, inference_resources,
"diabetes-binary-xgb-model", "text/csv",
["diabetes_inference.csv"],
)

xgboost/container/test_scoring.py:124:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
xgboost/container/test_scoring.py:43: in _send_requests
with ServingContainer(docker_client, image_uri, model_dir, environment) as ctx:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
xgboost/container/container_helper.py:152: in __enter__
self._wait_healthy()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <container.container_helper.ServingContainer object at 0x7f92d297c0e0>

def _wait_healthy(self):
deadline = time.time() + SERVE_STARTUP_TIMEOUT
while time.time() < deadline:
self._container.reload()
if self._container.status != "running":
>               raise RuntimeError(
f"Container exited: {self._container.logs().decode()}"
)

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 79
X-AI-Prompt: show the output of tests 1-2 lines for validation. also run generate models script once per every test.

* AI changes made during Kiro-cli session
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 54
X-AI-Prompt: XGBoost version: 3.0.5
Downloading training data...
Traceback (most recent call last):
File "/work/test/xgboost/container/generate_models.py", line 85, in <module>
main()
File "/work/test/xgboost/container/generate_models.py", line 48, in…
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…/audio/speech, not /invocations)"

This reverts commit a80c193.
Switch all non-omni PR workflow triggers from pull_request to
workflow_dispatch so only vllm-omni EC2 and SageMaker workflows
run on PRs to the omni branch.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…-omni endpoint

- omni_sagemaker_serve.py: FastAPI proxy on port 8080, routes to vllm-omni on 8081
- Supports explicit route via CustomAttributes header (route=/v1/audio/speech)
- Falls back to payload inspection (TTS vs chat vs completion)
- Entrypoint starts vllm-omni in background, proxy in foreground
- Endpoint test uses explicit route for TTS
Yadan Wei and others added 19 commits April 5, 2026 22:47
…model support, consolidate tests

- Model config: CosyVoice3-0.5B, Qwen2.5-Omni-3B, BAGEL-7B-MoT, Wan2.1-T2V-1.3B
- Covers all routes: /v1/audio/speech, /v1/chat/completions, /v1/images/generations, /v1/videos
- Workflow handles both S3 and HF model sources (HF_TOKEN for downloads)
- Removed separate unit-test job, runs in sagemaker-endpoint-test
- Fixed async endpoint test (AWSSessionManager.sts for account ID)
- Added starlette to sagemaker test requirements
…irst

New models (CosyVoice3, Qwen2.5-Omni, BAGEL, Wan2.1) OOM during HF download.
Need S3 tarballs and per-model validation before adding to CI.
- CosyVoice3: /v1/audio/speech (different TTS arch)
- Qwen2.5-Omni-3B: /v1/chat/completions (tests fallthrough, no middleware)
- BAGEL and Wan2.1 pending S3 upload
Tested models that don't work in CI:
- CosyVoice3: no model_type in config.json, unrecognized by transformers
- Qwen2.5-Omni-3B: OOMs on g6e.xlarge (multi-stage needs >48GB)
- BAGEL/Wan2.1: need --stage-configs-path, untested
- CosyVoice3-0.5B: /v1/audio/speech (g6e.4xl, config.json added to tarball)
- Wan2.1-T2V-1.3B: /v1/videos (g6e.4xl, diffusers auto-detect)
- BAGEL-7B-MoT: /v1/chat/completions (g6e.4xl, multimodal image gen)
- Qwen2.5-Omni-3B: /v1/chat/completions (g6e.12xl, text+audio omni)
- 6 models covering 4 routes: speech, images, videos, chat
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
- CosyVoice3 on g6e12xl, Wan2.1 on g6e4xl, BAGEL on g6e4xl, Qwen2.5-Omni on g6e12xl
- Wan2.1 uses /v1/videos/sync with multipart/form-data
- Smoke tests support content_type param for form vs JSON
- Orphaned endpoint cleanup step (if: always)
- Container log dump increased to 500 lines
…del_type

EngineCore subprocess fails at AutoTokenizer.from_pretrained because
AutoConfig can't resolve cosyvoice3. The model uses ONNX tokenizers,
not HuggingFace tokenizers. Only works with offline Omni() API.
Verified on L40S with SM image:
- Model loads and serves on g6e.xlarge (L40S 48GB)
- /v1/videos returns queued job with id
- Middleware routes /invocations -> /v1/videos with form data
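As a client-side companion to the routing above, a caller can select the omni endpoint through the CustomAttributes field of a sagemaker-runtime invoke_endpoint call. The helper below is a hypothetical sketch of how those arguments could be assembled, not code from this PR:

```python
import json

def build_invocation(route: str, payload, content_type: str = "application/json") -> dict:
    """Assemble invoke_endpoint keyword arguments so the routing middleware
    forwards /invocations to the requested vllm-omni route.

    `payload` may be a dict (JSON-encoded here) or raw bytes/str, e.g. a
    pre-built multipart/form-data body for /v1/videos.
    """
    body = payload if isinstance(payload, (bytes, str)) else json.dumps(payload)
    return {
        "ContentType": content_type,
        "CustomAttributes": f"route={route}",  # consumed by the middleware
        "Body": body,
    }

# A text-to-video request, routed to /v1/videos by the middleware; the
# multipart body here is a placeholder, not a real encoded form.
kwargs = build_invocation(
    "/v1/videos",
    b"--boundary...",
    content_type="multipart/form-data; boundary=boundary",
)
```

The returned dict would be splatted into `client.invoke_endpoint(EndpointName=..., **kwargs)`; passing bytes through untouched is what lets the same helper cover both the JSON routes and the form-data /v1/videos route.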
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
os_version: "amzn2023"
customer_type: "ec2"
arch_type: "x86"
prod_image: "vllm-omni:0.18-gpu-py312-ec2"
we'll use the same repo name "vllm" instead of creating a new repo

Will update this section when we have a real prod image.

junpuf previously approved these changes Apr 7, 2026
Yadan Wei added 2 commits April 6, 2026 21:01
- sglang: add aiohttp>=3.13.4 to CVE patch block
- vllm: remove expired CVE-2026-33055 allowlist (fixed in uv tar 0.4.45)

Fixes: CVE-2026-34520, CVE-2026-34516, CVE-2026-22815
@Yadan-Wei Yadan-Wei enabled auto-merge (squash) April 7, 2026 15:24
@Yadan-Wei Yadan-Wei merged commit e781a45 into main Apr 7, 2026
264 of 277 checks passed
@Yadan-Wei Yadan-Wei deleted the omni branch April 7, 2026 16:38
Labels: authorized, Size:XL