|
| 1 | +--- |
| 2 | +title: "VLLMChatGenerator" |
| 3 | +id: vllmchatgenerator |
| 4 | +slug: "/vllmchatgenerator" |
| 5 | +description: "This component enables chat completion using models served with vLLM." |
| 6 | +--- |
| 7 | + |
| 8 | +# VLLMChatGenerator |
| 9 | + |
| 10 | +This component enables chat completion using models served with [vLLM](https://docs.vllm.ai/). |
| 11 | + |
| 12 | +<div className="key-value-table"> |
| 13 | + |
| 14 | +| | | |
| 15 | +| --- | --- | |
| 16 | +| **Most common position in a pipeline** | After a [`ChatPromptBuilder`](../builders/chatpromptbuilder.mdx) | |
| 17 | +| **Mandatory init variables** | `model`: The name of the model served by vLLM | |
| 18 | +| **Mandatory run variables** | `messages`: A list of [`ChatMessage`](../../concepts/data-classes/chatmessage.mdx) objects | |
| 19 | +| **Output variables** | `replies`: A list of [`ChatMessage`](../../concepts/data-classes/chatmessage.mdx) objects | |
| 20 | +| **API reference** | [vLLM](/reference/integrations-vllm) | |
| 21 | +| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/vllm | |
| 22 | + |
| 23 | +</div> |
| 24 | + |
| 25 | +## Overview |
| 26 | + |
| 27 | +[vLLM](https://docs.vllm.ai/) is a high-throughput and memory-efficient inference and serving engine for LLMs. It exposes an OpenAI-compatible HTTP server, which `VLLMChatGenerator` uses to run chat completions. |
| 28 | + |
| 29 | +`VLLMChatGenerator` expects a running vLLM server that is reachable at the URL set in the `api_base_url` parameter (by default, `http://localhost:8000/v1`). The component takes a list of [`ChatMessage`](../../concepts/data-classes/chatmessage.mdx) objects as input. `ChatMessage` is a data class that contains a message, a role (who generated the message, such as `user`, `assistant`, `system`, or `tool`), and optional metadata. |
| 30 | + |
| 31 | +You can pass any text generation parameter that is valid for the vLLM OpenAI-compatible Chat Completion API to this component through `generation_kwargs`, either in `__init__` or in the `run` method. vLLM-specific parameters that are not part of the standard OpenAI API (such as `top_k`, `min_tokens`, `repetition_penalty`) go in `generation_kwargs["extra_body"]`. For more details, see the [vLLM documentation](https://docs.vllm.ai/en/stable/serving/openai_compatible_server/). |
| 32 | + |
| 33 | +If the vLLM server was started with `--api-key`, provide the API key through the `VLLM_API_KEY` environment variable or the `api_key` init parameter using Haystack's [Secret](../../concepts/secret-management.mdx) API. |
| 34 | + |
| 35 | +### Tool Support |
| 36 | + |
| 37 | +`VLLMChatGenerator` supports function calling through the `tools` parameter, which accepts flexible tool configurations: |
| 38 | + |
| 39 | +- **A list of Tool objects**: Pass individual tools as a list |
| 40 | +- **A single Toolset**: Pass an entire Toolset directly |
| 41 | +- **Mixed Tools and Toolsets**: Combine multiple Toolsets with standalone tools in a single list |
| 42 | + |
| 43 | +This allows you to organize related tools into logical groups while also including standalone tools as needed. |
| 44 | + |
| 45 | +For tool calling to work, the vLLM server must be started with `--enable-auto-tool-choice` and `--tool-call-parser`. The available tool call parsers depend on the model. See the [vLLM tool calling docs](https://docs.vllm.ai/en/stable/features/tool_calling/) for the full list. |
| 46 | + |
| 47 | +For more details on working with tools, see the [Tool](../../tools/tool.mdx) and [Toolset](../../tools/toolset.mdx) documentation. |
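The three configurations listed above can be sketched as follows. This is a minimal illustration, not the integration's reference usage: `add`, `multiply`, and `build_generator` are hypothetical names introduced here, and the sketch assumes `create_tool_from_function` and `Toolset` from `haystack.tools` are available in your Haystack version. The Haystack imports sit inside `build_generator` so the plain functions can be exercised without a server.

```python
def add(a: int, b: int) -> int:
    """Add two integers."""
    return a + b


def multiply(a: int, b: int) -> int:
    """Multiply two integers."""
    return a * b


def weather(city: str) -> str:
    """Get the weather in a given city."""
    return f"The weather in {city} is sunny"


def build_generator(model: str = "Qwen/Qwen3-0.6B"):
    """Hypothetical helper: needs `vllm-haystack` installed to be called."""
    # Imported lazily so the plain functions above stay usable on their own.
    from haystack.tools import Toolset, create_tool_from_function
    from haystack_integrations.components.generators.vllm import VLLMChatGenerator

    # Related tools grouped into one Toolset.
    math_tools = Toolset(
        [create_tool_from_function(add), create_tool_from_function(multiply)]
    )
    # A standalone Tool.
    weather_tool = create_tool_from_function(weather)

    # Mixed configuration: a Toolset and a standalone Tool in a single list.
    # Passing `tools=math_tools` or `tools=[weather_tool]` covers the other two cases.
    return VLLMChatGenerator(model=model, tools=[math_tools, weather_tool])
```

Calling `build_generator()` only constructs the component; the model can request these tools only if the server was started with the tool calling flags described above.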
| 48 | + |
| 49 | +### Streaming |
| 50 | + |
| 51 | +`VLLMChatGenerator` supports [streaming](guides-to-generators/choosing-the-right-generator.mdx#streaming-support) responses from the LLM, allowing tokens to be emitted as they are generated. To enable streaming, pass a callable to the `streaming_callback` parameter during initialization. |
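A streaming callback is a callable that receives one chunk at a time. The sketch below assumes each chunk exposes its text in a `content` attribute, as Haystack's `StreamingChunk` does; `on_chunk`, `collected`, and `stream_answer` are hypothetical names used only for illustration.

```python
from typing import Any, List

# Text fragments received so far, in arrival order.
collected: List[str] = []


def on_chunk(chunk: Any) -> None:
    """Print each streamed fragment immediately and keep a copy."""
    # Some chunks carry no text (for example, metadata-only chunks), so skip those.
    text = getattr(chunk, "content", "") or ""
    if text:
        print(text, end="", flush=True)
        collected.append(text)


def stream_answer(question: str, model: str = "Qwen/Qwen3-4B-Instruct-2507") -> str:
    """Hypothetical helper: needs `vllm-haystack` and a running vLLM server."""
    # Imported lazily so the callback can be reused without the integration installed.
    from haystack.dataclasses import ChatMessage
    from haystack_integrations.components.generators.vllm import VLLMChatGenerator

    generator = VLLMChatGenerator(model=model, streaming_callback=on_chunk)
    result = generator.run(messages=[ChatMessage.from_user(question)])
    # The full reply is still returned once streaming has finished.
    return result["replies"][0].text
```

A module-level function is used for the callback instead of a lambda or bound method, on the assumption that this keeps the component easier to serialize with the rest of a pipeline.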
| 52 | + |
| 53 | +### Reasoning models |
| 54 | + |
| 55 | +`VLLMChatGenerator` supports reasoning models. To use them, start the vLLM server with the appropriate `--reasoning-parser`. The reasoning content produced by the model is exposed in the `reasoning` field of the returned `ChatMessage`. |
| 56 | + |
| 57 | +## Usage |
| 58 | + |
| 59 | +Install the `vllm-haystack` package to use the `VLLMChatGenerator`: |
| 60 | + |
| 61 | +```shell |
| 62 | +pip install vllm-haystack |
| 63 | +``` |
| 64 | + |
| 65 | +### Starting the vLLM server |
| 66 | + |
| 67 | +Before using this component, start a vLLM server: |
| 68 | + |
| 69 | +```bash |
| 70 | +vllm serve Qwen/Qwen3-4B-Instruct-2507 |
| 71 | +``` |
| 72 | + |
| 73 | +For reasoning models, start the server with the appropriate reasoning parser: |
| 74 | + |
| 75 | +```bash |
| 76 | +vllm serve Qwen/Qwen3-0.6B --reasoning-parser qwen3 |
| 77 | +``` |
| 78 | + |
| 79 | +For tool calling, start the server with `--enable-auto-tool-choice` and `--tool-call-parser`: |
| 80 | + |
| 81 | +```bash |
| 82 | +vllm serve Qwen/Qwen3-0.6B --enable-auto-tool-choice --tool-call-parser hermes |
| 83 | +``` |
| 84 | + |
| 85 | +For details on server options, see the [vLLM CLI docs](https://docs.vllm.ai/en/stable/cli/serve/). |
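Before wiring the component into an application, it can help to confirm that the server is reachable and to see which model name it serves. The sketch below uses only the Python standard library and queries the OpenAI-compatible `/models` endpoint; `parse_model_ids` and `list_served_models` are hypothetical helper names, and the default URL is the one mentioned above.

```python
import json
import os
import urllib.request
from typing import Any, Dict, List, Optional


def parse_model_ids(payload: Dict[str, Any]) -> List[str]:
    """Extract model names from an OpenAI-style `/models` response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(
    api_base_url: str = "http://localhost:8000/v1",
    api_key: Optional[str] = None,
    timeout: float = 5.0,
) -> List[str]:
    """Ask the server which models it serves. Raises URLError if it is unreachable."""
    # Fall back to the environment variable when no key is passed explicitly.
    api_key = api_key or os.environ.get("VLLM_API_KEY")
    request = urllib.request.Request(f"{api_base_url.rstrip('/')}/models")
    if api_key:
        # Only needed when the server was started with `--api-key`.
        request.add_header("Authorization", f"Bearer {api_key}")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_model_ids(json.load(response))
```

If the server from the first command above is running, `list_served_models()` should include the name given to `vllm serve`, which is the value to pass as `model` to `VLLMChatGenerator`.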
| 86 | + |
| 87 | +### On its own |
| 88 | + |
| 89 | +Basic usage: |
| 90 | + |
| 91 | +```python |
| 92 | +from haystack.dataclasses import ChatMessage |
| 93 | +from haystack_integrations.components.generators.vllm import VLLMChatGenerator |
| 94 | + |
| 95 | +generator = VLLMChatGenerator( |
| 96 | + model="Qwen/Qwen3-4B-Instruct-2507", |
| 97 | + generation_kwargs={"max_tokens": 512, "temperature": 0.7}, |
| 98 | +) |
| 99 | + |
| 100 | +messages = [ChatMessage.from_user("What's Natural Language Processing?")] |
| 101 | +response = generator.run(messages=messages) |
| 102 | +print(response["replies"][0].text) |
| 103 | +``` |
| 104 | + |
| 105 | +### With vLLM-specific parameters |
| 106 | + |
| 107 | +Pass vLLM-specific parameters through the `generation_kwargs["extra_body"]` dictionary: |
| 108 | + |
| 109 | +```python |
| 110 | +from haystack_integrations.components.generators.vllm import VLLMChatGenerator |
| 111 | + |
| 112 | +generator = VLLMChatGenerator( |
| 113 | + model="Qwen/Qwen3-4B-Instruct-2507", |
| 114 | + generation_kwargs={ |
| 115 | + "max_tokens": 512, |
| 116 | + "extra_body": { |
| 117 | + "top_k": 50, |
| 118 | + "min_tokens": 10, |
| 119 | + "repetition_penalty": 1.1, |
| 120 | + }, |
| 121 | + }, |
| 122 | +) |
| 123 | +``` |
| 124 | + |
| 125 | +### With tool calling |
| 126 | + |
| 127 | +Start the vLLM server with `--enable-auto-tool-choice` and `--tool-call-parser`, then: |
| 128 | + |
| 129 | +```python |
| 130 | +from haystack.dataclasses import ChatMessage |
| 131 | +from haystack.tools import tool |
| 132 | +from haystack_integrations.components.generators.vllm import VLLMChatGenerator |
| 133 | + |
| 134 | + |
| 135 | +@tool |
| 136 | +def weather(city: str) -> str: |
| 137 | + """Get the weather in a given city.""" |
| 138 | + return f"The weather in {city} is sunny" |
| 139 | + |
| 140 | + |
| 141 | +generator = VLLMChatGenerator(model="Qwen/Qwen3-0.6B", tools=[weather]) |
| 142 | + |
| 143 | +messages = [ChatMessage.from_user("What is the weather in Paris?")] |
| 144 | +response = generator.run(messages=messages) |
| 145 | +print(response["replies"][0].tool_calls) |
| 146 | +``` |
| 147 | + |
| 148 | +### With reasoning models |
| 149 | + |
| 150 | +Start the vLLM server with `--reasoning-parser`, then: |
| 151 | + |
| 152 | +```python |
| 153 | +from haystack.dataclasses import ChatMessage |
| 154 | +from haystack_integrations.components.generators.vllm import VLLMChatGenerator |
| 155 | + |
| 156 | +generator = VLLMChatGenerator(model="Qwen/Qwen3-0.6B") |
| 157 | + |
| 158 | +messages = [ChatMessage.from_user("Solve step by step: what is 15 * 37?")] |
| 159 | +response = generator.run(messages=messages) |
| 160 | +reply = response["replies"][0] |
| 161 | +if reply.reasoning: |
| 162 | + print("Reasoning:", reply.reasoning.reasoning_text) |
| 163 | +print("Answer:", reply.text) |
| 164 | +``` |
| 165 | + |
| 166 | +### In a pipeline |
| 167 | + |
| 168 | +```python |
| 169 | +from haystack import Pipeline |
| 170 | +from haystack.components.builders import ChatPromptBuilder |
| 171 | +from haystack.dataclasses import ChatMessage |
| 172 | +from haystack_integrations.components.generators.vllm import VLLMChatGenerator |
| 173 | + |
| 174 | +prompt_builder = ChatPromptBuilder() |
| 175 | +llm = VLLMChatGenerator(model="Qwen/Qwen3-4B-Instruct-2507") |
| 176 | + |
| 177 | +pipe = Pipeline() |
| 178 | +pipe.add_component("prompt_builder", prompt_builder) |
| 179 | +pipe.add_component("llm", llm) |
| 180 | +pipe.connect("prompt_builder.prompt", "llm.messages") |
| 181 | + |
| 182 | +messages = [ |
| 183 | + ChatMessage.from_system("Give brief answers."), |
| 184 | + ChatMessage.from_user("Tell me about {{city}}"), |
| 185 | +] |
| 186 | + |
| 187 | +response = pipe.run( |
| 188 | + data={ |
| 189 | + "prompt_builder": { |
| 190 | + "template": messages, |
| 191 | + "template_variables": {"city": "Berlin"}, |
| 192 | + }, |
| 193 | + }, |
| 194 | +) |
| 195 | +print(response) |
| 196 | +``` |