
DJL Serving / vLLM — LLaVA tokenizer.json incompatibility causes endpoint startup failure #5752

@jitu1987

Description

PySDK Version

  • PySDK V3 (3.4.0)

Describe the bug
DJL Serving / vLLM — LLaVA tokenizer.json incompatibility causes endpoint startup failure

When deploying a LLaVA multimodal model to a SageMaker endpoint using the djl-inference:0.29.0-lmi11.0.0-cu124 container with OPTION_ROLLING_BATCH=vllm, the Python engine process crashes immediately on startup with the following error:
Exception: data did not match any variant of untagged enum ModelWrapper at line 277156 column 3
The crash occurs inside the Rust tokenizers library when attempting to deserialize tokenizer.json. Two incompatibilities are present between tokenizer files produced by recent versions of the HuggingFace tokenizers library and the version bundled inside the DJL 0.29.0 container:
• The tokenizer.json BPE model block contains an ignore_merges field introduced in tokenizers >= 0.14.0 that the container's older Rust deserializer does not recognise, causing the entire ModelWrapper enum deserialization to fail.
• The merges array uses the newer list-of-lists format (e.g. ["▁", "t"]) instead of the legacy space-joined string format expected by older tokenizers (e.g. "▁ t").
Either incompatibility alone is sufficient to trigger the crash. The endpoint never becomes healthy and SageMaker eventually times out the deployment.
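A possible local workaround, sketched here from the two format differences above (this is not an official DJL or HuggingFace utility, and the exact field layout of tokenizer.json is assumed from the error analysis), is to rewrite tokenizer.json back into the legacy layout before packaging the model:

```python
import json

def downgrade_tokenizer_json(path_in, path_out):
    """Rewrite a modern tokenizer.json into the legacy layout that the
    older Rust tokenizers deserializer bundled in DJL 0.29.0 can parse.

    Unofficial workaround sketch: drops the `ignore_merges` field
    (introduced in tokenizers >= 0.14.0) and converts list-of-lists
    merges back to space-joined strings.
    """
    with open(path_in, encoding="utf-8") as f:
        tok = json.load(f)

    model = tok.get("model", {})
    # Drop the field the old deserializer does not recognise.
    model.pop("ignore_merges", None)
    # Convert [["▁", "t"], ...] back to the legacy ["▁ t", ...] format.
    model["merges"] = [
        " ".join(m) if isinstance(m, list) else m
        for m in model.get("merges", [])
    ]

    with open(path_out, "w", encoding="utf-8") as f:
        json.dump(tok, f, ensure_ascii=False)
```

Repacking the model.tar.gz with the downgraded tokenizer.json should avoid both failure modes, at the cost of losing whatever behavior `ignore_merges` was meant to enable.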

To reproduce

  1. Package a LLaVA model (e.g. llava-hf/llava-1.5-7b-hf) whose tokenizer.json was saved with tokenizers >= 0.14.0 into a model.tar.gz and upload to S3.
  2. Run the deployment script below:
    import boto3
    from sagemaker.core.helper.session_helper import Session
    from sagemaker.serve.model_builder import ModelBuilder
    from sagemaker.serve.mode.function_pointers import Mode
    from datetime import datetime

    region = "us-east-1"
    sm_s3_model_path = "s3:///path/model.tar.gz"
    sm_role = "arn:aws:iam:::role/"

    boto_session = boto3.Session(region_name=region)
    sagemaker_session = Session(boto_session=boto_session)
    image_uri = (
        "763104351884.dkr.ecr.us-east-1.amazonaws.com/"
        "djl-inference:0.29.0-lmi11.0.0-cu124"
    )

    builder = ModelBuilder(
        image_uri=image_uri,
        s3_model_data_url=sm_s3_model_path,
        role_arn=sm_role,
        sagemaker_session=sagemaker_session,
        env_vars={
            "HF_MODEL_ID": "/opt/ml/model",
            "OPTION_ROLLING_BATCH": "vllm",
            "TENSOR_PARALLEL_DEGREE": "1",
            "OPTION_DTYPE": "fp16",
            "OPTION_MAX_MODEL_LEN": "4096",
            "OPTION_TRUST_REMOTE_CODE": "true",
            "OPTION_TASK": "text-generation",
        },
        instance_type="ml.g5.2xlarge",
        mode=Mode.SAGEMAKER_ENDPOINT,
    )
    builder.build(role_arn=sm_role, sagemaker_session=sagemaker_session)
    predictor = builder.deploy(
        endpoint_name=f"llava-{int(datetime.now().timestamp())}",
        initial_instance_count=1,
        instance_type="ml.g5.2xlarge",
        container_timeout_in_seconds=600,
    )
  3. Observe that the container exits with the ModelWrapper deserialization exception before the endpoint becomes InService.

Expected behavior
The DJL container should either:
• Accept tokenizer.json files produced by modern versions of the HuggingFace tokenizers library (including the ignore_merges field and list-format merges), OR
• Emit a clear, actionable error message indicating that the tokenizer.json format is incompatible and specifying the maximum supported tokenizers library version.
The endpoint should reach InService status and be able to serve LLaVA inference requests.
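Until the container accepts both formats or reports them clearly, a pre-flight check along the following lines could fail fast before deployment. This is a hypothetical sketch, not part of the SDK or container; the field names come from the error analysis above:

```python
import json

def check_tokenizer_compat(path):
    """Flag tokenizer.json features that the tokenizers deserializer
    bundled in DJL 0.29.0 reportedly cannot parse.

    Hypothetical pre-deployment check, not an official tool. Returns a
    list of human-readable problem descriptions (empty if none found).
    """
    with open(path, encoding="utf-8") as f:
        model = json.load(f).get("model", {})

    problems = []
    if "ignore_merges" in model:
        problems.append(
            "model block contains `ignore_merges` (tokenizers >= 0.14.0), "
            "which older deserializers reject"
        )
    merges = model.get("merges", [])
    if merges and isinstance(merges[0], list):
        problems.append(
            "merges use the newer list-of-lists format instead of the "
            "legacy space-joined strings"
        )
    return problems
```

Running such a check against the packaged model directory would surface an actionable message locally instead of an opaque ModelWrapper exception buried in CloudWatch.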

Screenshots or Logs
Key error extracted from CloudWatch / container stdout (repeated across all three retry attempts):
INFO llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config:
model='/opt/ml/model', dtype=torch.float16, max_seq_len=4096 ...

File ".../tokenization_utils_fast.py", line 115, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Exception: data did not match any variant of untagged enum ModelWrapper
at line 277156 column 3

Python engine process died
[ERROR] ModelServer - Failed register workflow
Caused by: ai.djl.engine.EngineException: Failed to initialize model: prediction failure
[ERROR] ModelServer - Unexpected error
ai.djl.serving.http.ServerStartupException:
Failed to initialize startup models and workflows
The same sequence is logged on each of the three startup retries before SageMaker marks the deployment as failed.

System information

SageMaker Python SDK version: PySDK V3 (3.4.0) — sagemaker.serve.model_builder.ModelBuilder
Framework / Algorithm: DJL Serving 0.29.0 with vLLM 0.5.3.post1 rolling batch
Framework version: djl-inference:0.29.0-lmi11.0.0-cu124
Python version: 3.9 (container default)
CPU or GPU: GPU — ml.g5.2xlarge (1× NVIDIA A10G, 24 GB VRAM)
Custom Docker image: N — official AWS DJL LMI image

Additional context
• The DJL log correctly identifies modelType: llava but the OPTION_TASK=text-generation override may also conflict with LLaVA's multimodal requirements. Removing this env var is recommended.
• A secondary (non-fatal) CUDA compatibility warning is also logged: the container's CUDA 12.4 compat package expects a driver <= 550.127.08 but the host driver is 535.288.01. This does not cause the crash but may affect future GPU operations.
• The container retries startup three times before giving up, logging the identical tokenizer error each time, resulting in a confusingly long CloudWatch log.
