Update dependency vllm to v0.19.0 [SECURITY] #54

Open
renovate[bot] wants to merge 1 commit into main from renovate/pypi-vllm-vulnerability

renovate[bot] commented May 5, 2026

ℹ️ Note

This PR body was truncated due to platform limits.

This PR contains the following updates:

Package  Change
vllm     0.10.1 → 0.19.0

vllm API endpoints vulnerable to Denial of Service Attacks

CVE-2025-48956 / GHSA-rxc4-3w6r-4v47

Details

Summary

A Denial of Service (DoS) vulnerability can be triggered by sending a single HTTP GET request with an extremely large header to an HTTP endpoint. This results in server memory exhaustion, potentially leading to a crash or unresponsiveness. The attack does not require authentication, making it exploitable by any remote user.

Details

The vulnerability leverages the abuse of HTTP headers. By setting a header such as X-Forwarded-For to a very large value like ("A" * 5_800_000_000), the server's HTTP parser or application logic may attempt to load the entire request into memory, overwhelming system resources.

Impact

What kind of vulnerability is it? Who is impacted?
Type of vulnerability: Denial of Service (DoS)

Resolution

Upgrade to a version of vLLM that includes appropriate HTTP limits by default, or use a proxy in front of vLLM which provides protection against this issue.

Severity

  • CVSS Score: 7.5 / 10 (High)
  • Vector String: CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H

References

This data is provided by the GitHub Advisory Database (CC-BY 4.0).


vLLM has remote code execution vulnerability in the tool call parser for Qwen3-Coder

CVE-2025-9141 / GHSA-79j6-g2m3-jgfw

Details

Summary

An unsafe deserialization vulnerability allows any authenticated user to execute arbitrary code on the server if they are able to get the model to pass the code as an argument to a tool call.

Details

vLLM's Qwen3 Coder tool parser contains a code execution path that uses Python's eval() function to parse tool call parameters. This occurs during the parameter conversion process when the parser attempts to handle unknown data types.

This code path is reached when:

  1. Tool calling is enabled (--enable-auto-tool-choice)
  2. The qwen3_coder parser is specified (--tool-call-parser qwen3_coder)
  3. The parameter type is not explicitly defined or recognized
Impact

Remote Code Execution via Python's eval() function.
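
As a hedged illustration of this class of fix, the sketch below parses an untyped argument with json.loads and ast.literal_eval, which accept only literals, instead of eval(). The helper name convert_param_value is hypothetical and not taken from vLLM, and this is not claimed to be the upstream patch.

```python
import ast
import json
from typing import Any


def convert_param_value(raw: str) -> Any:
    """Hypothetical helper: parse an untyped tool-call argument without eval()."""
    # JSON covers the common cases (numbers, strings, lists, objects).
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # literal_eval accepts only Python literals; function calls and
    # attribute access raise ValueError instead of being executed.
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError, MemoryError, RecursionError):
        # Anything else is kept as an opaque string.
        return raw
```

With this approach, a value such as `__import__('os').system('id')` comes back unchanged as a string rather than being executed.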

Severity

  • CVSS Score: 8.8 / 10 (High)
  • Vector String: CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:H/I:H/A:H



vLLM is vulnerable to timing attack at bearer auth

CVE-2025-59425 / GHSA-wr9h-g72x-mwhm

Details

Summary

The API key support in vLLM performed validation using a method that was vulnerable to a timing attack. This could potentially allow an attacker to discover a valid API key using an approach more efficient than brute force.

Details

https://github.com/vllm-project/vllm/blob/4b946d693e0af15740e9ca9c0e059d5f333b1083/vllm/entrypoints/openai/api_server.py#L1270-L1274

API key validation used a string comparison that will take longer the more characters the provided API key gets correct. Data analysis across many attempts can allow an attacker to determine when it finds the next correct character in the key sequence.

Impact

Deployments relying on vLLM's built-in API key validation are vulnerable to authentication bypass using this technique.
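
A minimal sketch of a comparison that avoids the early exit described above, using the standard library's hmac.compare_digest. The function name api_key_matches is an assumption for illustration; the text does not state how vLLM implemented its fix.

```python
import hashlib
import hmac


def api_key_matches(provided: str, expected: str) -> bool:
    """Compare two API keys without stopping at the first mismatched character."""
    # Hashing first gives both operands the same length, so the
    # comparison time does not depend on the length of the input.
    provided_digest = hashlib.sha256(provided.encode("utf-8")).digest()
    expected_digest = hashlib.sha256(expected.encode("utf-8")).digest()
    return hmac.compare_digest(provided_digest, expected_digest)
```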

Severity

  • CVSS Score: 7.5 / 10 (High)
  • Vector String: CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:N/A:N



vLLM: Resource-Exhaustion (DoS) through Malicious Jinja Template in OpenAI-Compatible Server

CVE-2025-61620 / GHSA-6fvq-23cw-5628

Details

Summary

A resource-exhaustion (denial-of-service) vulnerability exists in multiple endpoints of the OpenAI-Compatible Server due to the ability to specify Jinja templates via the chat_template and chat_template_kwargs parameters. If an attacker can supply these parameters to the API, they can cause a service outage by exhausting CPU and/or memory resources.

Details

When using an LLM as a chat model, the conversation history must be rendered into a text input for the model. In Hugging Face transformers, this rendering is performed using a Jinja template. The OpenAI-Compatible Server launched by vllm serve exposes a chat_template parameter that lets users specify that template. In addition, the server accepts a chat_template_kwargs parameter to pass extra keyword arguments to the rendering function.

Because Jinja templates support programming-language-like constructs (loops, nested iterations, etc.), a crafted template can consume extremely large amounts of CPU and memory and thereby trigger a denial-of-service condition.

Importantly, simply forbidding the chat_template parameter does not fully mitigate the issue. The implementation constructs a dictionary of keyword arguments for apply_hf_chat_template and then updates that dictionary with the user-supplied chat_template_kwargs via dict.update. Since dict.update can overwrite existing keys, an attacker can place a chat_template key inside chat_template_kwargs to replace the template that will be used by apply_hf_chat_template.

# vllm/entrypoints/openai/serving_engine.py#L794-L816
_chat_template_kwargs: dict[str, Any] = dict(
    chat_template=chat_template,
    add_generation_prompt=add_generation_prompt,
    continue_final_message=continue_final_message,
    tools=tool_dicts,
    documents=documents,
)
_chat_template_kwargs.update(chat_template_kwargs or {})

request_prompt: Union[str, list[int]]
if isinstance(tokenizer, MistralTokenizer):
    ...
else:
    request_prompt = apply_hf_chat_template(
        tokenizer=tokenizer,
        conversation=conversation,
        model_config=model_config,
        **_chat_template_kwargs,
    )
Impact

If an OpenAI-Compatible Server exposes endpoints that accept chat_template or chat_template_kwargs from untrusted clients, an attacker can submit a malicious Jinja template (directly or by overriding chat_template inside chat_template_kwargs) that consumes excessive CPU and/or memory. This can result in a resource-exhaustion denial-of-service that renders the server unresponsive to legitimate requests.
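
The override via dict.update can be reproduced in isolation. The sketch below contrasts it with one possible ordering in which server-chosen values win; the function names and sample values are hypothetical, and this is not presented as the upstream fix.

```python
from typing import Any, Optional


def naive_merge(server_kwargs: dict[str, Any],
                user_kwargs: Optional[dict[str, Any]]) -> dict[str, Any]:
    """Mirror the vulnerable pattern: user keys overwrite server keys."""
    merged = dict(server_kwargs)
    merged.update(user_kwargs or {})
    return merged


def server_wins_merge(server_kwargs: dict[str, Any],
                      user_kwargs: Optional[dict[str, Any]]) -> dict[str, Any]:
    """Later entries win in a dict display, so server values take precedence."""
    return {**(user_kwargs or {}), **server_kwargs}


# Hypothetical values for illustration only.
server = {"chat_template": "trusted-template", "add_generation_prompt": True}
user = {"chat_template": "attacker-template", "enable_thinking": False}

overridden = naive_merge(server, user)["chat_template"]
protected = server_wins_merge(server, user)["chat_template"]
```

Here `overridden` holds the attacker's template while `protected` keeps the server's, and unrelated user keys such as `enable_thinking` still pass through.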

Fixes

Severity

  • CVSS Score: 6.5 / 10 (Medium)
  • Vector String: CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H



vLLM is vulnerable to Server-Side Request Forgery (SSRF) through MediaConnector class

CVE-2025-6242 / GHSA-3f6c-7fw2-ppm4

Details

Summary

A Server-Side Request Forgery (SSRF) vulnerability exists in the MediaConnector class within the vLLM project's multimodal feature set. The load_from_url and load_from_url_async methods fetch and process media from user-provided URLs without adequate restrictions on the target hosts. This allows an attacker to coerce the vLLM server into making arbitrary requests to internal network resources.

This vulnerability is particularly critical in containerized environments like llm-d, where a compromised vLLM pod could be used to scan the internal network, interact with other pods, and potentially cause denial of service or access sensitive data. For example, an attacker could make the vLLM pod send malicious requests to an internal llm-d management endpoint, leading to system instability by falsely reporting metrics like the KV cache state.

Vulnerability Details

The core of the vulnerability lies in the MediaConnector.load_from_url method and its asynchronous counterpart. These methods accept a URL string to fetch media content (images, audio, video).

https://github.com/vllm-project/vllm/blob/119f683949dfed10df769fe63b2676d7f1eb644e/vllm/multimodal/utils.py#L97-L113

The function directly processes URLs with http, https, and file schemes. An attacker can supply a URL pointing to an internal IP address or a localhost endpoint. The vLLM server will then initiate a connection to this internal resource.

  • HTTP/HTTPS Scheme: An attacker can craft a request like {"image_url": "http://127.0.0.1:8080/internal_api"}. The vLLM server will send a GET request to this internal endpoint.
  • File Scheme: The _load_file_url method attempts to restrict file access to a subdirectory defined by --allowed-local-media-path. While this is a good security measure for local file access, it does not prevent network-based SSRF attacks.
Impact in llm-d Environments

The risk is significantly amplified in orchestrated environments such as llm-d, where multiple pods communicate over an internal network.

  1. Denial of Service (DoS): An attacker could target internal management endpoints of other services within the llm-d cluster. For instance, if a monitoring or metrics service is exposed internally, an attacker could send malformed requests to it. A specific example is an attacker causing the vLLM pod to call an internal API that reports a false KV cache utilization, potentially triggering incorrect scaling decisions or even a system shutdown.

  2. Internal Network Reconnaissance: Attackers can use the vulnerability to scan the internal network for open ports and services by providing URLs like http://10.0.0.X:PORT and observing the server's response time or error messages.

  3. Interaction with Internal Services: Any unsecured internal service becomes a potential target. This could include databases, internal APIs, or other model pods that might not have robust authentication, as they are not expected to be directly exposed.

Delegating this security responsibility to an upper-level orchestrator like llm-d is problematic. The orchestrator cannot easily distinguish between legitimate requests initiated by the vLLM engine for its own purposes and malicious requests originating from user input, thus complicating traffic filtering rules and increasing management overhead.

Fix

See the --allowed-media-domains option discussed here: https://docs.vllm.ai/en/latest/usage/security.html#4-restrict-domains-access-for-media-urls
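
To illustrate what a domain restriction of this kind checks, here is a minimal sketch assuming a simple exact-match allowlist. The function is_media_url_allowed and its policy are assumptions, not vLLM's implementation of --allowed-media-domains. A hostname that resolves to an internal address would still pass this check, so it is a first filter rather than a complete defence.

```python
import ipaddress
from urllib.parse import urlparse


def is_media_url_allowed(url: str, allowed_domains: set[str]) -> bool:
    """Hypothetical check: permit only http(s) URLs whose host is allowlisted."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = parts.hostname
    if not host:
        return False
    # Reject literal internal addresses outright, whatever the allowlist says.
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        address = None  # not an IP literal, treat as a domain name
    if address is not None and (
        address.is_private or address.is_loopback or address.is_link_local
    ):
        return False
    return host in allowed_domains
```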

Severity

  • CVSS Score: 7.1 / 10 (High)
  • Vector String: CVSS:3.1/AV:N/AC:H/PR:L/UI:N/S:U/C:H/I:L/A:H



vLLM vulnerable to DoS with incorrect shape of multimodal embedding inputs

CVE-2025-62372 / GHSA-pmqf-x6x8-p7qw

Details

Summary

Users can crash the vLLM engine serving multimodal models by passing multimodal embedding inputs with correct ndim but incorrect shape (e.g. hidden dimension is wrong), regardless of whether the model is intended to support such inputs (as defined in the Supported Models page).

The issue has existed ever since we added support for image embedding inputs, i.e. #6613 (released in v0.5.5).

Details

Using image embeddings as an example:

  • For models that support image embedding inputs, the engine crashes when scattering the embeddings to inputs_embeds (mismatched shape)
  • For models that don't support image embedding inputs, the engine crashes when validating the inputs inside get_input_embeddings (validation fails).

This happens because we only validate the ndim of the tensor, but not the full shape, in the input processor (via MultiModalDataParser).
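
The difference between checking ndim and checking the full shape can be sketched without any tensor library. The function name and the example sizes below are hypothetical; the real check would run against the model's configured hidden size.

```python
from typing import Sequence


def validate_embedding_shape(shape: Sequence[int],
                             expected_ndim: int,
                             hidden_size: int) -> None:
    """Hypothetical validator: check rank and the trailing hidden dimension."""
    if len(shape) != expected_ndim:
        raise ValueError(
            f"expected {expected_ndim} dimensions, got {len(shape)}"
        )
    # An ndim-only check stops here; the hidden dimension must match too.
    if shape[-1] != hidden_size:
        raise ValueError(
            f"expected hidden size {hidden_size}, got {shape[-1]}"
        )
```

Rejecting the input with a ValueError at this stage lets the server return a client error instead of failing later inside the engine.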

Impact
  • Denial of service by crashing the engine
Mitigation
  • Use an API key to limit access to trusted users.
  • Set --limit-mm-per-prompt to 0 for all non-text modalities to ban multimodal inputs, which includes multimodal embedding inputs. However, the model would then only accept text, defeating the purpose of using a multi-modal model.
Resolution

Severity

  • CVSS Score: 8.3 / 10 (High)
  • Vector String: CVSS:4.0/AV:N/AC:L/AT:N/PR:L/UI:N/VC:N/VI:N/VA:H/SC:N/SI:N/SA:H



vLLM vulnerable to DoS via large Chat Completion or Tokenization requests with specially crafted chat_template_kwargs

CVE-2025-62426 / GHSA-69j4-grxj-j64p

Details

Summary

The /v1/chat/completions and /tokenize endpoints allow a chat_template_kwargs request parameter that is used in the code before it is properly validated against the chat template. With the right chat_template_kwargs parameters, it is possible to block processing of the API server for long periods of time, delaying all other requests.

Details

In serving_engine.py, the chat_template_kwargs are unpacked into kwargs passed to chat_utils.py apply_hf_chat_template with no validation on the keys or values in that chat_template_kwargs dict. This means they can be used to override optional parameters in the apply_hf_chat_template method, such as tokenize, changing its default from False to True.

https://github.com/vllm-project/vllm/blob/2a6dc67eb520ddb9c4138d8b35ed6fe6226997fb/vllm/entrypoints/openai/serving_engine.py#L809-L814

https://github.com/vllm-project/vllm/blob/2a6dc67eb520ddb9c4138d8b35ed6fe6226997fb/vllm/entrypoints/chat_utils.py#L1602-L1610

Both serving_chat.py and serving_tokenization.py call into this _preprocess_chat method of serving_engine.py and they both pass in chat_template_kwargs.

So, a chat_template_kwargs like {"tokenize": True} makes tokenization happen as part of applying the chat template, even though that is not expected. Tokenization is a blocking operation, and with sufficiently large input can block the API server's event loop, which blocks handling of all other requests until this tokenization is complete.

This optional tokenize parameter to apply_hf_chat_template does not appear to be used, so one option would be to just hard-code that to always be False instead of allowing it to be optionally overridden by callers. A better option may be to not pass chat_template_kwargs as unpacked kwargs but instead as a dict, and only unpack them after the logic in apply_hf_chat_template that resolves the kwargs against the chat template.
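
One way to picture the second option is to drop any user-supplied key that collides with a named parameter of the rendering function before unpacking. The sketch below does this with inspect.signature; render_chat is a hypothetical stand-in for the real function, and the upstream fix may resolve kwargs differently.

```python
import inspect
from typing import Any, Callable


def render_chat(conversation: list, *, chat_template: str = "",
                tokenize: bool = False, **template_vars: Any) -> str:
    """Hypothetical stand-in for a template-rendering function."""
    return (f"{len(conversation)} messages, tokenize={tokenize}, "
            f"vars={sorted(template_vars)}")


def drop_reserved_kwargs(func: Callable[..., Any],
                         user_kwargs: dict[str, Any]) -> dict[str, Any]:
    """Keep only user keys that do not collide with func's named parameters."""
    reserved = {
        name
        for name, param in inspect.signature(func).parameters.items()
        if param.kind is not inspect.Parameter.VAR_KEYWORD
    }
    return {k: v for k, v in user_kwargs.items() if k not in reserved}


safe = drop_reserved_kwargs(render_chat, {"tokenize": True, "enable_thinking": False})
rendered = render_chat([], **safe)
```

The `tokenize` key is filtered out, so the function keeps its default of False, while `enable_thinking` still reaches the template variables.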

Impact

Any authenticated user can cause a denial of service to a vLLM server with Chat Completion or Tokenize requests.

Fix

https://github.com/vllm-project/vllm/pull/27205

Severity

  • CVSS Score: 6.5 / 10 (Medium)
  • Vector String: CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H



vLLM vulnerable to remote code execution via transformers_utils/get_config

CVE-2025-66448 / GHSA-8fr4-5q9j-m8gm

Details

Summary

vllm has a critical remote code execution vector in a config class named Nemotron_Nano_VL_Config. When vllm loads a model config that contains an auto_map entry, the config class resolves that mapping with get_class_from_dynamic_module(...) and immediately instantiates the returned class. This fetches and executes Python from the remote repository referenced in the auto_map string. Crucially, this happens even when the caller explicitly sets trust_remote_code=False in vllm.transformers_utils.config.get_config. In practice, an attacker can publish a benign-looking frontend repo whose config.json points via auto_map to a separate malicious backend repo; loading the frontend will silently run the backend’s code on the victim host.

Details

The vulnerable code resolves and instantiates classes from auto_map entries without checking whether those entries point to a different repo or whether remote code execution is allowed.

class Nemotron_Nano_VL_Config(PretrainedConfig):
    model_type = 'Llama_Nemotron_Nano_VL'

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

        if vision_config is not None:
            assert "auto_map" in vision_config and "AutoConfig" in vision_config["auto_map"]
            # <-- vulnerable dynamic resolution + instantiation happens here
            vision_auto_config = get_class_from_dynamic_module(*vision_config["auto_map"]["AutoConfig"].split("--")[::-1])
            self.vision_config = vision_auto_config(**vision_config)
        else:
            self.vision_config = PretrainedConfig()

get_class_from_dynamic_module(...) is capable of fetching and importing code from the Hugging Face repo specified in the mapping. trust_remote_code is not enforced for this code path. As a result, a frontend repo can redirect the loader to any backend repo and cause code execution, bypassing the trust_remote_code guard.
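
A minimal sketch of the missing gate, assuming the "repo--module.Class" form implied by the split("--") in the snippet above. The function name and the policy of rejecting cross-repo references are assumptions for illustration, not the upstream fix.

```python
def parse_auto_map_entry(entry: str, source_repo: str,
                         trust_remote_code: bool) -> tuple[str, str]:
    """Hypothetical gate: return (repo, class_reference) only if loading is permitted."""
    # Dynamic modules execute repository code, so require the opt-in first.
    if not trust_remote_code:
        raise PermissionError("auto_map requires trust_remote_code=True")
    repo, separator, class_ref = entry.rpartition("--")
    if not separator:
        # No "--" prefix: the class lives in the repo being loaded.
        repo = source_repo
    if repo != source_repo:
        raise PermissionError(f"auto_map points at a different repo: {repo!r}")
    return repo, class_ref
```

The key point is ordering: both checks run before any call that could fetch or import remote code.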

Impact

This is a critical vulnerability because it breaks the documented trust_remote_code safety boundary in a core model-loading utility. The vulnerable code lives in a common loading path, so any application, service, CI job, or developer machine that uses vllm’s transformer utilities to load configs can be affected. The attack requires only two repos and no user interaction beyond loading the frontend model. A successful exploit can execute arbitrary commands on the host.

Fixes

Severity

  • CVSS Score: 7.1 / 10 (High)
  • Vector String: CVSS:3.1/AV:N/AC:H/PR:L/UI:R/S:U/C:H/I:H/A:H



vLLM is vulnerable to DoS in Idefics3 vision models via image payload with ambiguous dimensions

CVE-2026-22773 / GHSA-grg2-63fw-f2qr

Details

Summary

Users can crash the vLLM engine serving multimodal models that use the Idefics3 vision model implementation by sending a specially crafted 1x1 pixel image. This causes a tensor dimension mismatch that results in an unhandled runtime error, leading to complete server termination.

Details

The vulnerability is triggered when the image processor encounters a 1x1 pixel image with shape (1, 1, 3) in HWC (Height, Width, Channel) format. Due to the ambiguous dimensions, the processor incorrectly assumes the image is in CHW (Channel, Height, Width) format with shape (3, H, W). This misinterpretation causes an incorrect calculation of the number of image patches, resulting in a fatal tensor split operation failure.
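
The ambiguity can be shown with a simplified, hypothetical layout heuristic of the kind that produces this misclassification. It is not the actual image processor code; it only assumes that the first axis is tested for a plausible channel count before the last one.

```python
def guess_layout(shape: tuple[int, int, int],
                 channel_counts: tuple[int, ...] = (1, 3)) -> str:
    """Hypothetical heuristic: guess whether a 3-D image shape is CHW or HWC."""
    first, _, last = shape
    # The first axis is tested first, so it wins whenever both the
    # first and the last axis look like a channel count.
    if first in channel_counts:
        return "CHW"
    if last in channel_counts:
        return "HWC"
    raise ValueError(f"cannot infer channel axis from shape {shape}")


ordinary = guess_layout((224, 224, 3))
one_pixel = guess_layout((1, 1, 3))
```

An ordinary HWC image is classified correctly, but a 1x1 pixel image in HWC format is read as channels-first because its height of 1 also looks like a channel count.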

Crash location: vllm/model_executor/models/idefics3.py line 672:

def _process_image_input(self, image_input: ImageInputs) -> torch.Tensor | list[torch.Tensor]:
    # ...
    num_patches = image_input["num_patches"]
    return [e.flatten(0, 1) for e in image_features.split(num_patches.tolist())]

The split() call fails because the computed num_patches value (17) does not match the actual tensor dimension (9):

RuntimeError: split_with_sizes expects split_sizes to sum exactly to 9 
(input tensor's size at dimension 0), but got split_sizes=[17]

This unhandled exception terminates the EngineCore process, crashing the server.

Affected Models

Any model using the Idefics3 architecture. The vulnerability was tested with HuggingFaceTB/SmolVLM-Instruct.

Impact

Denial of service by crashing the engine

Mitigation

Validating the input:

def _validate_image_dimensions(self, image_shape):
    h, w = image_shape[:2] if len(image_shape) == 3 else image_shape
    if h < MIN_IMAGE_SIZE or w < MIN_IMAGE_SIZE:
        raise ValueError(f"Image dimensions too small: {h}x{w}")

Managing the exception:

try:
    return [e.flatten(0, 1) for e in image_features.split(num_patches.tolist())]
except RuntimeError as e:
    logger.error(f"Image processing failed: {e}")
    raise InvalidImageError("Failed to process image features") from e
Fixes

Severity

  • CVSS Score: 6.5 / 10 (Medium)
  • Vector String: CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H



vLLM affected by RCE via auto_map dynamic module loading during model initialization

CVE-2026-22807 / GHSA-2pc9-4j83-qjmr

Details

Summary

vLLM loads Hugging Face auto_map dynamic modules during model resolution without gating on trust_remote_code, allowing attacker-controlled Python code in a model repo/path to execute at server startup.


Impact

An attacker who can influence the model repo/path (local directory or remote Hugging Face repo) can achieve arbitrary code execution on the vLLM host during model load.
This happens before any request handling and does not require API access.


Affected Versions

All versions where vllm/model_executor/models/registry.py resolves auto_map entries with try_get_class_from_dynamic_module without checking trust_remote_code (at least current main).


Details

During model resolution, vLLM unconditionally iterates auto_map entries from the model config and calls try_get_class_from_dynamic_module, which delegates to Transformers’ get_class_from_dynamic_module and executes the module code.

This occurs even when trust_remote_code is false, allowing a malicious model repo to embed code in a referenced module and have it executed during initialization.

Relevant code
  • vllm/model_executor/models/registry.py:856 — auto_map resolution
  • vllm/transformers_utils/dynamic_module.py:13 — delegates to get_class_from_dynamic_module, which executes code

Fixes
Credits

Reported by bugbunny.ai

Severity

  • CVSS Score: 8.8 / 10 (High)
  • Vector String: CVSS:3.1/AV:N/AC:L/PR:N/UI:R/S:U/C:H/I:H/A:H



vLLM vulnerable to Server-Side Request Forgery (SSRF) through MediaConnector

CVE-2026-24779 / GHSA-qh4c-xf7m-gxfc

Details

Summary

A Server-Side Request Forgery (SSRF) vulnerability exists in the MediaConnector class within the vLLM project's multimodal feature set. The load_from_url and load_from_url_async methods obtain and process media from URLs provided by users, using different Python parsing libraries when restricting the target host. These two parsing libraries have different interpretations of backslashes, which allows the host name restriction to be bypassed. This allows an attacker to coerce the vLLM server into making arbitrary requests to internal network resources.

This vulnerability is particularly critical in containerized environments like llm-d, where a compromised vLLM pod could be used to scan the internal network, interact with other pods, and potentially cause Denial of Service or access sensitive data. For example, an attacker could make the vLLM pod send malicious requests to an internal llm-d management endpoint, leading to system instability by falsely reporting metrics like the KV cache state.

Details

The core of the vulnerability lies in the MediaConnector.load_from_url method and its asynchronous counterpart. These methods accept a URL string to fetch media content (images, audio, video).

def load_from_url(
    self,
    url: str,
    media_io: MediaIO[_M],
    *,
    fetch_timeout: int | None = None,
) -> _M:  # type: ignore[type-var]
    url_spec = urlparse(url)

    if url_spec.scheme.startswith("http"):
        self._assert_url_in_allowed_media_domains(url_spec)

        connection = self.connection
        data = connection.get_bytes(
            url,
            timeout=fetch_timeout,
            allow_redirects=envs.VLLM_MEDIA_URL_ALLOW_REDIRECTS,
        )

        return media_io.load_bytes(data)

The URL validation uses the urlparse function from Python's urllib module, while the request is made using the request function from Python's requests module. The requests module's underlying URL parsing is implemented using the parse_url function from Python's urllib3. These two parsing functions follow different URL specifications; one is implemented according to the RFC 3986 specification, and the other is implemented according to the WHATWG Living Standard. There is a difference in how the two functions handle backslashes (\) in URLs, which allows the hostname restriction to be bypassed.
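
The urlparse side of this mismatch can be observed with the standard library alone. The sketch below shows what urlparse reports for a URL containing a backslash, plus a conservative guard that refuses such URLs before any allowlist check. The guard hostname_for_allowlist is an assumption for illustration, not the upstream fix, and the hostnames are placeholders.

```python
from urllib.parse import urlparse


def hostname_for_allowlist(url: str) -> str:
    """Return the hostname to check, refusing URLs that parsers may read differently."""
    if "\\" in url:
        raise ValueError("backslash in URL is ambiguous across parsers")
    host = urlparse(url).hostname
    if not host:
        raise ValueError("URL has no hostname")
    return host


# urlparse treats the backslash as ordinary userinfo text, so it reports
# the part after "@" as the host. A parser that treats "\" like "/"
# would instead end the authority at "evil.example".
ambiguous = "http://evil.example\\@allowed.example/x"
seen_by_urlparse = urlparse(ambiguous).hostname
```

Validating with one parser and fetching with another is only safe when both agree, which is why the guard rejects the ambiguous input instead of trying to interpret it.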

Fix

Severity

  • CVSS Score: 7.1 / 10 (High)
  • Vector String: CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:H/I:N/A:L



vLLM has RCE In Video Processing

CVE-2026-22778 / GHSA-4r2x-xpjr-7cvv

Details

Summary

A chain of vulnerabilities in vLLM allows Remote Code Execution (RCE):

  1. Info Leak - PIL error messages expose memory addresses, bypassing ASLR
  2. Heap Overflow - JPEG2000 decoder in OpenCV/FFmpeg has a heap overflow that lets us hijack code execution

Result: Send a malicious video URL to vLLM Completions or Invocations for a video model -> Execute arbitrary commands on the server

A completely default vLLM instance installed from pip or Docker has no authentication, so no privileges ("None") are required. Even with the non-default api-key configuration enabled, the exploit is feasible through the invocations route, which allows the payload to execute pre-auth.

An example heap target is provided; other heap targets can also be exploited to achieve RCE. The leak allows a simple ASLR bypass, and the leak combined with the heap overflow achieves RCE on versions prior to 0.14.1.

Deployments not serving a video model are not affected.


1. Vulnerability Overview
1.1 The Bug: JPEG2000 cdef Box Heap Overflow

The JPEG2000 decoder used by OpenCV (cv2) honors a cdef box that can remap color channels. When Y (luma) is mapped into the U (chroma) plane buffer, the decoder writes a large Y plane into the smaller U buffer, causing a heap overflow.

Root Cause

  • cdef allows channel remapping (e.g., Y→U, U→Y).
  • Y plane size: W×H; U plane size: (W/2)×(H/2).
  • Overflow size = W×H - (W/2×H/2) = 0.75 × W × H bytes.

Example (150×64)

  • Y plane: 150×64 = 9,600 bytes
  • U plane: 75×32 = 2,400 bytes
  • Overflow: 7,200 bytes past the U buffer
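
The plane-size arithmetic above can be checked with a few lines of Python. The sketch assumes one byte per sample and half-resolution chroma in both directions, as in the example; the function name is illustrative.

```python
def plane_sizes(width: int, height: int) -> tuple[int, int, int]:
    """Return (y_bytes, u_bytes, overflow_bytes), assuming one byte per sample."""
    y_bytes = width * height
    # Chroma is subsampled by two horizontally and vertically.
    u_bytes = (width // 2) * (height // 2)
    return y_bytes, u_bytes, y_bytes - u_bytes


y_bytes, u_bytes, overflow = plane_sizes(150, 64)
```

For even dimensions the overflow is exactly three quarters of the Y plane, so it scales with image area.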
1.2 Malicious cdef Box
Offset  Size  Field           Value
0       4     Box Length      0x00000016 (22 bytes)
4       4     Box Type        'cdef'
8       2     N (channels)    0x0003
10      2     Channel 0 Cn    0x0000 (Y channel)
12      2     Channel 0 Typ   0x0000 (color)
14      2     Channel 0 Asoc  0x0002 (→ maps Y into U plane)
16      2     Channel 1 Cn    0x0001 (U channel)
18      2     Channel 1 Typ   0x0000 (color)
20      2     Channel 1 Asoc  0x0001 (→ maps U into Y plane)
22      2     Channel 2 Cn    0x0002 (V channel)
24      2     Channel 2 Typ   0x0000 (color)
26      2     Channel 2 Asoc  0x0003 (→ maps V plane)

Key control: Asoc=2 for channel 0 forces Y data into the U buffer, triggering the overflow.


Vulnerable Code Chain
1) Entry: vLLM accepts a remote video_url and downloads raw bytes

vLLM’s OpenAI-compatible API supports a video_url content part:

class VideoURL(TypedDict, total=False):
    url: Required[str]

class ChatCompletionContentPartVideoParam(TypedDict, total=False):
    video_url: Required[VideoURL]
    type: Required[Literal["video_url"]]

Source: src/vllm/entrypoints/chat_utils.py.

When the URL is HTTP(S), vLLM downloads it as raw bytes and passes the bytes into the modality loader:

if url_spec.scheme.startswith("http"):
    data = connection.get_bytes(url, timeout=fetch_timeout, allow_redirects=...)
    return media_io.load_bytes(data)

Source: src/vllm/multimodal/utils.py (MediaConnector.load_from_url).


2) Decode: vLLM uses OpenCV (cv2) VideoCapture on an in-memory byte stream

The default video backend is OpenCV, and it constructs cv2.VideoCapture over a BytesIO buffer containing the downloaded bytes:

backend = cls().get_cv2_video_api()
cap = cv2.VideoCapture(BytesIO(data), backend, [])
if not cap.isOpened():
    raise ValueError("Could not open video stream")

Source: src/vllm/multimodal/video.py (OpenCVVideoBackend.load_bytes).

The backend is selected from OpenCV’s stream-buffered backends registry:

import cv2.videoio_registry as vr
for backend in vr.getStreamBufferedBackends():
    if vr.hasBackend(backend) and ...:
        api_pref = backend
        break
return api_pref

Source: src/vllm/multimodal/video.py (OpenCVVideoBackend.get_cv2_video_api).

Implication: vLLM is delegating container parsing + codec decode to OpenCV’s Video I/O stack (which, in typical builds, is backed by FFmpeg for MOV/MP4 and codecs like JPEG2000).


3) The actual overflow: Y (full-res) written into U (quarter-res)

When the decoder honors the remap and writes Y into the U-plane buffer, it writes too many bytes:

  • Y plane bytes: W × H
  • U plane bytes: (W/2) × (H/2)
  • Overflow bytes: W × H - (W/2 × H/2) = 0.75 × W × H

Concrete example tried (150×64):

  • Y: 150 × 64 = 9,600 bytes
  • U: 75 × 32 = 2,400 bytes
  • Overflow: 9,600 - 2,400 = 7,200 bytes past the end of the U allocation

This is a heap buffer overflow into whatever allocations follow the U-plane buffer in the decoder’s heap layout (structures, metadata, other buffers, etc.). The exact victims depend on build + runtime allocator layout.


The Exploit Chain
Vuln 1: PIL BytesIO Address Leak (ASLR Bypass)

When you send an invalid image to vLLM's multimodal endpoint, PIL throws an error like:

cannot identify image file <_io.BytesIO object at 0x7a95e299e750>
                                                   ^^^^^^^^^^^^^^^^
                                                   LEAKED ADDRESS!

vLLM returns this error to the client, leaking a heap address. This address is ~10.33 GB before libc in memory. With this leak, we reduce ASLR from 4 billion guesses to ~8 guesses.
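
On the defensive side, an error string like the one above can be scrubbed before it is returned to a client. This is a minimal sketch assuming the address always appears in the default object repr form " at 0x..."; the function name is hypothetical, and the text does not say this is how vLLM addressed the leak.

```python
import re

# Matches the address suffix that Python's default object repr appends.
_ADDRESS = re.compile(r" at 0x[0-9a-fA-F]+")


def scrub_addresses(message: str) -> str:
    """Remove memory addresses from an error message before returning it."""
    return _ADDRESS.sub("", message)


leaky = "cannot identify image file <_io.BytesIO object at 0x7a95e299e750>"
clean = scrub_addresses(leaky)
```

A stricter alternative is to log the original message server-side and return only a fixed, generic error to the client.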

Vuln 2: JPEG2000 cdef Heap Overflow (RCE)

vLLM uses OpenCV (cv2) to decode videos. OpenCV bundles FFmpeg 5.1.x, which has a heap overflow in the JPEG2000 decoder. Because OpenCV handles video decoding, a video built from JPEG2000 frames reaches the vulnerable code:

vLLM API Request to Completions/Invocation
     ↓
OpenCV cv2.VideoCapture()
     ↓
FFmpeg 5.1 (bundled in OpenCV)
     ↓
JPEG2000 decoder (libopenjp2)
     ↓
HEAP OVERFLOW via malicious "cdef" box
     ↓
Overwrite function pointer → RCE!

How the overflow works:

  • JPEG2000 has a cdef box that remaps color channels
  • We remap Y (luma) into the U (chroma) buffer
  • Y plane = 9,600 bytes, U plane = 2,400 bytes
  • On a small geometry such as a 150x64 pixel image, we get a 7,200-byte overflow past the U buffer. The overflow is 0.75 × W × H bytes, so it grows in proportion to image area as images get bigger.
  • This overwrites an AVBuffer structure containing a free() function pointer. This could be any function pointer or other targets.
  • We set free = system() and opaque = "command string"
  • When the buffer is freed → system("our command") executes

vLLM Attack Surface
Affected Endpoints

Both multimodal endpoints are vulnerable:

POST /v1/chat/completions     (with video_url in content)
POST /v1/invocations          (with video_url in content)
Request Flow
1. Attacker sends request with video_url pointing to malicious .mov file
2. vLLM fetches the video from the URL
3. vLLM passes video bytes to cv2.VideoCapture()
4. OpenCV's bundled FFmpeg decodes JPEG2000 frames
5. Malicious cdef box triggers heap overflow
6. AVBuffer.free pointer overwritten with system()
7. When buffer is released → system("attacker command") executes

Versions Affected
Component     Version                  Notes
vLLM          >= 0.8.3, < 0.14.1       Default config vulnerable when serving a video model
OpenCV (cv2)  4.x with FFmpeg bundle   Bundled FFmpeg is vulnerable
FFmpeg        5.1.x (bundled)          JPEG2000 cdef overflow
libopenjp2    2.x                      Honors malicious cdef box

Fixes

Severity

  • CVSS Score: 9.8 / 10 (Critical)
  • Vector String: CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H



vLLM has Hardcoded Trust Override in Model Files Enables RCE Despite Explicit User Opt-Out

CVE-2026-27893 / GHSA-7972-pg2x-xr59

Details

Summary

Two model implementation files hardcode trust_remote_code=True when loading sub-components, bypassing the user's explicit --trust-remote-code=False security opt-out. This enables remote code execution via malicious model repositories even when the user has explicitly disabled remote code trust.

Details

Affected files (latest main branch):

  1. vllm/model_executor/models/nemotron_vl.py:430
    vision_model = AutoModel.from_config(config.vision_config, trust_remote_code=True)
  2. vllm/model_executor/models/kimi_k25.py:177
    cached_get_image_processor(self.ctx.model_config.model, trust_remote_code=True)

Both pass a hardcoded trust_remote_code=True to HuggingFace API calls, overriding the user's global --trust-remote-code=False setting.

Relation to prior CVEs:

  • CVE-2025-66448 fixed auto_map resolution in vllm/transformers_utils/config.py (config loading path)
  • CVE-2026-22807 fixed broader auto_map at startup
  • Both fixes are present in the current code. The hardcoded instances in these model files survived both patches because they sit on different code paths.

Impact

Remote code execution. An attacker can craft a malicious model repository that executes arbitrary Python code when loaded by vLLM, even when the user has explicitly set --trust-remote-code=False. This undermines the security guarantee that trust_remote_code=False is intended to provide.

Remediation: Replace hardcoded trust_remote_code=True with self.config.model_config.trust_remote_code in both files. Raise a clear error if the model component requires remote code but the user hasn't opted in.
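
The remediation can be illustrated with a minimal sketch. The `ModelConfig` dataclass and the `resolve_trust_remote_code` helper below are hypothetical stand-ins, not vLLM code; they only show the pattern of reading the user's setting instead of a literal True, and failing loudly when a component needs remote code that the user has not opted into.

```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    # Hypothetical stand-in for the model config; only this field matters here.
    trust_remote_code: bool = False


def resolve_trust_remote_code(
    model_config: ModelConfig, component: str, needs_remote_code: bool
) -> bool:
    """Return the value to pass as trust_remote_code for a sub-component.

    The value always comes from the user's configuration. If the component
    cannot load without remote code and the user has not opted in, raise
    instead of silently overriding the opt-out.
    """
    if needs_remote_code and not model_config.trust_remote_code:
        raise ValueError(
            f"{component} requires remote code, but trust_remote_code is "
            "disabled. Re-run with --trust-remote-code to opt in."
        )
    return model_config.trust_remote_code
```

A loader would then pass the returned value through to the HuggingFace call in place of the hardcoded True.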

Severity

  • CVSS Score: 8.8 / 10 (High)
  • Vector String: CVSS:3.1/AV:N/AC:L/PR:N/UI:R/S:U/C:H/I:H/A:H

References

This data is provided by the GitHub Advisory Database (CC-BY 4.0).


vLLM: Unauthenticated OOM Denial of Service via Unbounded n Parameter in OpenAI API Server

CVE-2026-34756 / GHSA-3mwp-wvh9-7528

More information

Details

Summary

A Denial of Service vulnerability exists in the vLLM OpenAI-compatible API server. Due to the lack of an upper bound validation on the n parameter in the ChatCompletionRequest and CompletionRequest Pydantic models, an unauthenticated attacker can send a single HTTP request with an astronomically large n value. This completely blocks the Python asyncio event loop and causes immediate Out-Of-Memory crashes by allocating millions of request object copies in the heap before the request even reaches the scheduling queue.

Details

The root cause of this vulnerability lies in the missing upper bound checks across the request parsing and asynchronous scheduling layers:

  1. Protocol Layer:
    In vllm/entrypoints/openai/chat_completion/protocol.py, the n parameter is defined simply as an integer without any pydantic.Field constraints for an upper bound.
class ChatCompletionRequest(OpenAIBaseModel):
    # Ordered by official OpenAI API documentation
    # https://platform.openai.com/docs/api/reference/chat/create
    messages: list[ChatCompletionMessageParam]
    model: str | None = None
    frequency_penalty: float | None = 0.0
    logit_bias: dict[str, float] | None = None
    logprobs: bool | None = False
    top_logprobs: int | None = 0
    max_tokens: int | None = Field(
        default=None,
        deprecated="max_tokens is deprecated in favor of "
        "the max_completion_tokens field",
    )
    max_completion_tokens: int | None = None
    n: int | None = 1
    presence_penalty: float | None = 0.0
  2. SamplingParams Layer (Incomplete Validation):
    When the API request is converted to internal SamplingParams in vllm/sampling_params.py, the _verify_args method only checks the lower bound (self.n < 1) and omits any upper-bound check.
    def _verify_args(self) -> None:
        if not isinstance(self.n, int):
            raise ValueError(f"n must be an int, but is of type {type(self.n)}")
        if self.n < 1:
            raise ValueError(f"n must be at least 1, got {self.n}.")
  3. Engine Layer (The OOM Trigger):
    When the malicious request reaches the core engine (vllm/v1/engine/async_llm.py), the engine attempts to fan out the request n times to generate identical independent sequences within a synchronous loop.
        # Fan out child requests (for n>1).
        parent_request = ParentRequest(request)
        for idx in range(parent_params.n):
            request_id, child_params = parent_request.get_child_info(idx)
            child_request = request if idx == parent_params.n - 1 else copy(request)
            child_request.request_id = request_id
            child_request.sampling_params = child_params
            await self._add_request(
                child_request, prompt_text, parent_request, idx, queue
            )
        return queue

Because Python's asyncio runs on a single thread and event loop, this monolithic for-loop monopolizes the CPU thread. The server stops responding to all other connections (including liveness probes). Simultaneously, the memory allocator is overwhelmed by cloning millions of request object instances via copy(request), driving the host's Resident Set Size (RSS) up by gigabytes per second until the OS OOM-killer terminates the vLLM process.
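
One way to close the gap described above is to reject oversized n values before any fan-out happens. The following is a minimal sketch in the style of the _verify_args check, not the actual fix shipped in vLLM; `MAX_N` and `verify_n` are hypothetical names and the cap of 128 is an arbitrary illustrative value.

```python
MAX_N = 128  # Hypothetical cap; an assumption for illustration, not a vLLM value.


def verify_n(n: object, max_n: int = MAX_N) -> int:
    """Validate the n parameter with both a lower and an upper bound."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(n, bool) or not isinstance(n, int):
        raise ValueError(f"n must be an int, but is of type {type(n)}")
    if n < 1:
        raise ValueError(f"n must be at least 1, got {n}.")
    if n > max_n:
        raise ValueError(f"n must be at most {max_n}, got {n}.")
    return n
```

Because the check runs in constant time, a request with a very large n is rejected before the fan-out loop can hold the event loop or allocate request copies. The same bound could also be expressed at the protocol layer as a field constraint on the request model.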

Impact

Vulnerability Type: Resource Exhaustion / Denial of Service

Impacted Parties:

  • Any individual or organization hosting a public-facing vLLM API server (vllm.entrypoints.openai.api_server), which happens to be the primary entrypoint for OpenAI-compatible setups.
  • SaaS / AI-as-a-Service platforms acting as reverse proxies sitting in front of vLLM without strict HTTP body payload validation or rate limitations.

Because this vulnerability exploits the control plane rather than the data plane, an unauthenticated remote attacker can achieve a high success rate in taking down production inference hosts with a single HTTP request. This effectively circumvents any hardware-level capacity planning and conventional bandwidth stress limitations.

Severity

  • CVSS Score: 6.5 / 10 (Medium)
  • Vector String: CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H

References

This data is provided by the GitHub Advisory Database (CC-BY 4.0).


vLLM: Denial of Service via Unbounded Frame Count in video/jpeg Base64 Processing

CVE-2026-34755 / GHSA-pq5c-rjhq-qp7p

More information

Details

Summary

The VideoMediaIO.load_base64() method at vllm/multimodal/media/video.py:51-62 splits video/jpeg data URLs by comma to extract individual JPEG frames, but does not enforce a frame count limit. The num_frames parameter (default: 32), which is enforced by the load_bytes() code path at line 47-48, is completely bypassed in the video/jpeg base64 path. An attacker can send a single API request containing thousands of comma-separated base64-encoded JPEG frames, causing the server to decode all frames into memory and crash with OOM.

Details
Vulnerable code
# video.py:51-62
def load_base64(self, media_type: str, data: str) -> tuple[npt.NDArray, dict[str, Any]]:
    if media_type.lower() == "video/jpeg":
        load_frame = partial(self.image_io.load_base64, "image/jpeg")
        return np.stack(
            [np.asarray(load_frame(frame_data)) for frame_data in data.split(",")]
            #                                                       ^^^^^^^^^^
            # Unbounded split — no frame count limit
        ), {}
    return self.load_bytes(base64.b64decode(data))

The load_bytes() path (line 47-48) properly delegates to a video loader that respects self.num_frames (default 32). The load_base64("video/jpeg", ...) path bypasses this limit entirely — data.split(",") produces an unbounded list and every frame is decoded into a numpy array.
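
A bounded version of the split can be sketched as follows. This is an illustration under stated assumptions rather than the upstream patch: `split_frames` is a hypothetical helper, and rejecting the request (instead of truncating or sampling down to num_frames) is a design choice made here for simplicity.

```python
def split_frames(data: str, max_frames: int = 32) -> list[str]:
    """Split comma-separated base64 frames, refusing more than max_frames.

    str.split with maxsplit=max_frames returns at most max_frames + 1
    pieces, so an oversized payload is detected without building a list
    entry for every frame the client sent.
    """
    parts = data.split(",", max_frames)
    if len(parts) > max_frames:
        raise ValueError(
            f"video/jpeg payload contains more than {max_frames} frames"
        )
    return parts
```

With a check like this in front of the decode loop, only the frames that pass the limit would ever be decoded into arrays.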

video/jpeg is part of vLLM's public API

video/jpeg is a vLLM-specific MIME type, not IANA-registered. However it is part of the public API surface:

  • encode_video_url() at vllm/multimodal/utils.py:96-108 generates data:video/jpeg;base64,... URLs
  • Official test suites at tests/entrypoints/openai/test_video.py:62 and tests/entrypoints/test_chat_utils.py:153 both use this format

Memory amplification

Each JPEG frame decodes to a full numpy array. For 640x480 RGB images, each frame is ~921 KB decoded. 5000 frames = ~4.6 GB. np.stack() then creates an additional copy. The compressed JPEG payload is small (~100 KB for 5000 frames) but decompresses to gigabytes.
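
The figures above follow from simple arithmetic, reproduced in the sketch below. It assumes 8-bit RGB (one byte per channel) and treats the extra np.stack() copy as a doubling of peak memory; the function names are illustrative.

```python
def decoded_frame_bytes(width: int, height: int, channels: int = 3) -> int:
    """Size of one decoded frame, assuming one byte per channel (uint8)."""
    return width * height * channels


def decoded_total_bytes(num_frames: int, width: int, height: int) -> int:
    """Memory held by the list of decoded frames before stacking."""
    return num_frames * decoded_frame_bytes(width, height)


if __name__ == "__main__":
    per_frame = decoded_frame_bytes(640, 480)
    attack = decoded_total_bytes(5000, 640, 480)
    default = decoded_total_bytes(32, 640, 480)
    print(f"per frame:     {per_frame / 1e3:.1f} KB")
    print(f"5000 frames:   {attack / 1e9:.2f} GB")
    # The stacked array is a second copy alongside the per-frame list.
    print(f"with np.stack: {2 * attack / 1e9:.2f} GB peak")
    print(f"32 frames:     {default / 1e6:.1f} MB")
```

For comparison, the default limit of 32 frames at the same resolution corresponds to roughly 29.5 MB of decoded data.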

Data flow
POST /v1/chat/completions
  → chat_utils.py:1434   video_url type → mm_parser.parse_video()
  → chat_utils.py:872    parse_video() → self._connector.fetch_video()
  → connector.py:295     fetch_video() → load_from_url(url, self.video_io)
  → connector.py:91 

> ✂ **Note**
> 
> PR body was truncated to here.

@renovate renovate Bot changed the title from "Update dependency vllm to v0.11.1 [SECURITY]" to "Update dependency vllm to v0.19.0 [SECURITY]" on May 5, 2026