PySDK Version
Describe the bug
DJL Serving / vLLM — LLaVA tokenizer.json incompatibility causes endpoint startup failure
When deploying a LLaVA multimodal model to a SageMaker endpoint using the djl-inference:0.29.0-lmi11.0.0-cu124 container with OPTION_ROLLING_BATCH=vllm, the Python engine process crashes immediately on startup with the following error:
Exception: data did not match any variant of untagged enum ModelWrapper at line 277156 column 3
The crash occurs inside the Rust tokenizers library when attempting to deserialize tokenizer.json. Two incompatibilities are present between tokenizer files produced by recent versions of the HuggingFace tokenizers library and the version bundled inside the DJL 0.29.0 container:
• The tokenizer.json BPE model block contains an ignore_merges field introduced in tokenizers >= 0.14.0 that the container's older Rust deserializer does not recognise, causing the entire ModelWrapper enum deserialization to fail.
• The merges array uses the newer list-of-lists format (e.g. ["▁", "t"]) instead of the legacy space-joined string format expected by older tokenizers (e.g. "▁ t").
Either incompatibility alone is sufficient to trigger the crash. The endpoint never becomes healthy and SageMaker eventually times out the deployment.
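For reference, a minimal sketch (not part of the reproduction) that checks a local tokenizer.json for both incompatibility markers before packaging; the `diagnose_tokenizer` helper name is illustrative, not an SDK API:

```python
import json

def diagnose_tokenizer(path):
    """Report which of the two fields that break the older Rust
    deserializer in the DJL 0.29.0 container are present."""
    with open(path, encoding="utf-8") as f:
        model = json.load(f)["model"]
    issues = []
    # `ignore_merges` was introduced in tokenizers >= 0.14.0 and is
    # unknown to the container's bundled deserializer.
    if "ignore_merges" in model:
        issues.append("ignore_merges field present")
    # Newer tokenizers serialize merges as [["▁", "t"], ...] instead of
    # the legacy space-joined strings ("▁ t").
    merges = model.get("merges", [])
    if merges and isinstance(merges[0], list):
        issues.append("merges stored as list-of-lists")
    return issues
```

Running this against the packaged tokenizer.json before upload makes it possible to catch the incompatibility locally instead of after a failed deployment.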
To reproduce
1. Package a LLaVA model (e.g. llava-hf/llava-1.5-7b-hf) whose tokenizer.json was saved with tokenizers >= 0.14.0 into a model.tar.gz and upload to S3.
2. Run the deployment script below:
```python
import boto3
from sagemaker.core.helper.session_helper import Session
from sagemaker.serve.model_builder import ModelBuilder
from sagemaker.serve.mode.function_pointers import Mode
from datetime import datetime

region = "us-east-1"
sm_s3_model_path = "s3:///path/model.tar.gz"
sm_role = "arn:aws:iam:::role/"

boto_session = boto3.Session(region_name=region)
sagemaker_session = Session(boto_session=boto_session)

image_uri = (
    "763104351884.dkr.ecr.us-east-1.amazonaws.com/"
    "djl-inference:0.29.0-lmi11.0.0-cu124"
)

builder = ModelBuilder(
    image_uri=image_uri,
    s3_model_data_url=sm_s3_model_path,
    role_arn=sm_role,
    sagemaker_session=sagemaker_session,
    env_vars={
        "HF_MODEL_ID": "/opt/ml/model",
        "OPTION_ROLLING_BATCH": "vllm",
        "TENSOR_PARALLEL_DEGREE": "1",
        "OPTION_DTYPE": "fp16",
        "OPTION_MAX_MODEL_LEN": "4096",
        "OPTION_TRUST_REMOTE_CODE": "true",
        "OPTION_TASK": "text-generation",
    },
    instance_type="ml.g5.2xlarge",
    mode=Mode.SAGEMAKER_ENDPOINT,
)

builder.build(role_arn=sm_role, sagemaker_session=sagemaker_session)
predictor = builder.deploy(
    endpoint_name=f"llava-{int(datetime.now().timestamp())}",
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_timeout_in_seconds=600,
)
```
3. Observe that the container exits with the ModelWrapper deserialization exception before the endpoint becomes InService.
Expected behavior
The DJL container should either:
• Accept tokenizer.json files produced by modern versions of the HuggingFace tokenizers library (including the ignore_merges field and list-format merges), OR
• Emit a clear, actionable error message indicating that the tokenizer.json format is incompatible and specifying the maximum supported tokenizers library version.
The endpoint should reach InService status and be able to serve LLaVA inference requests.
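As a stopgap until the container accepts the new format, one workaround sketch is to rewrite tokenizer.json into the legacy shape before packaging (equivalently, re-save the tokenizer with tokenizers < 0.14). The `downgrade_tokenizer` helper below is illustrative only and should be validated against the specific model:

```python
import json

def downgrade_tokenizer(path):
    """Rewrite tokenizer.json in place into the legacy shape that the
    DJL 0.29.0 container's deserializer understands. Workaround sketch,
    not an official tool; verify the result loads with your target
    tokenizers version before deploying."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    model = data["model"]
    # Drop the field the old deserializer does not recognise.
    model.pop("ignore_merges", None)
    # Convert list-of-lists merges back to space-joined strings.
    model["merges"] = [
        " ".join(m) if isinstance(m, list) else m
        for m in model.get("merges", [])
    ]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False)
```

Applying this to the tokenizer.json inside the model directory before creating model.tar.gz avoids the crash in our testing, though fixing the container remains the proper resolution.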
Screenshots or Logs
Key error extracted from CloudWatch / container stdout (repeated across all three retry attempts):
```
INFO llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config:
  model='/opt/ml/model', dtype=torch.float16, max_seq_len=4096 ...
File ".../tokenization_utils_fast.py", line 115, in __init__
  fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Exception: data did not match any variant of untagged enum ModelWrapper
  at line 277156 column 3
Python engine process died
[ERROR] ModelServer - Failed register workflow
Caused by: ai.djl.engine.EngineException: Failed to initialize model: prediction failure
[ERROR] ModelServer - Unexpected error
ai.djl.serving.http.ServerStartupException:
  Failed to initialize startup models and workflows
```
The same sequence is logged on each of the three startup retries before SageMaker marks the deployment as failed.
System information
| Field | Value |
| --- | --- |
| SageMaker Python SDK version | PySDK V3 (3.4.0), `sagemaker.serve.model_builder.ModelBuilder` |
| Framework / Algorithm | DJL Serving 0.29.0 with vLLM 0.5.3.post1 rolling batch |
| Framework version | `djl-inference:0.29.0-lmi11.0.0-cu124` |
| Python version | 3.9 (container default) |
| CPU or GPU | GPU, ml.g5.2xlarge (1x NVIDIA A10G, 24 GB VRAM) |
| Custom Docker image | N (official AWS DJL LMI image) |
Additional context
• The DJL log correctly identifies `modelType: llava`, but the `OPTION_TASK=text-generation` override may also conflict with LLaVA's multimodal requirements. Removing this env var is recommended.
• A secondary (non-fatal) CUDA compatibility warning is also logged: the container's CUDA 12.4 compat package expects a driver <= 550.127.08 but the host driver is 535.288.01. This does not cause the crash but may affect future GPU operations.
• The container retries startup three times before giving up, logging the identical tokenizer error each time, which makes the CloudWatch log confusingly long.