Skip to content

Commit a3b46e2

Browse files
authored
Merge branch 'main' into vllm-ranker-docs
2 parents f88b641 + 1896020 commit a3b46e2

24 files changed

Lines changed: 3481 additions & 1 deletion

File tree

.github/workflows/ci_metrics.yml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -16,7 +16,7 @@ jobs:
1616
send:
1717
runs-on: ubuntu-slim
1818
steps:
19-
- uses: int128/datadog-actions-metrics@d2f2fefbd0145c2d401da6f00f01d41ce6ab6230 # v1.159.0
19+
- uses: int128/datadog-actions-metrics@7b7475c28ed4decbaa92cd401bf46c4b32a8bb79 # v1.161.0
2020
with:
2121
datadog-api-key: ${{ secrets.DATADOG_API_KEY }}
2222
datadog-site: "datadoghq.eu"

docs-website/docs/pipeline-components/generators.mdx

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -56,5 +56,6 @@ Generators are responsible for generating text after you give them a prompt. The
5656
| [VertexAIImageGenerator](generators/vertexaiimagegenerator.mdx) | Enables image generation using Google Vertex AI generative model. ||
5757
| [VertexAIImageQA](generators/vertexaiimageqa.mdx) | Enables text generation (image captioning) using Google Vertex AI generative models. ||
5858
| [VertexAITextGenerator](generators/vertexaitextgenerator.mdx) | Enables text generation using Google Vertex AI generative models. ||
59+
| [VLLMChatGenerator](generators/vllmchatgenerator.mdx) | Enables chat completion using models served with vLLM. ||
5960
| [WatsonxGenerator](generators/watsonxgenerator.mdx) | Enables text generation with IBM Watsonx models. ||
6061
| [WatsonxChatGenerator](generators/watsonxchatgenerator.mdx) | Enables chat completions with IBM Watsonx models. ||
Lines changed: 196 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,196 @@
1+
---
2+
title: "VLLMChatGenerator"
3+
id: vllmchatgenerator
4+
slug: "/vllmchatgenerator"
5+
description: "This component enables chat completion using models served with vLLM."
6+
---
7+
8+
# VLLMChatGenerator
9+
10+
This component enables chat completion using models served with [vLLM](https://docs.vllm.ai/).
11+
12+
<div className="key-value-table">
13+
14+
| | |
15+
| --- | --- |
16+
| **Most common position in a pipeline** | After a [`ChatPromptBuilder`](../builders/chatpromptbuilder.mdx) |
17+
| **Mandatory init variables** | `model`: The name of the model served by vLLM |
18+
| **Mandatory run variables** | `messages`: A list of [`ChatMessage`](../../concepts/data-classes/chatmessage.mdx) objects |
19+
| **Output variables** | `replies`: A list of [`ChatMessage`](../../concepts/data-classes/chatmessage.mdx) objects |
20+
| **API reference** | [vLLM](/reference/integrations-vllm) |
21+
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/vllm |
22+
23+
</div>
24+
25+
## Overview
26+
27+
[vLLM](https://docs.vllm.ai/) is a high-throughput and memory-efficient inference and serving engine for LLMs. It exposes an OpenAI-compatible HTTP server, which `VLLMChatGenerator` uses to run chat completions.
28+
29+
`VLLMChatGenerator` expects a vLLM server to be running and accessible at the `api_base_url` parameter (by default, `http://localhost:8000/v1`). The component needs a list of [`ChatMessage`](../../concepts/data-classes/chatmessage.mdx) objects to operate. `ChatMessage` is a data class that contains a message, a role (who generated the message, such as `user`, `assistant`, `system`, `function`), and optional metadata.
30+
31+
You can pass any text generation parameters valid for the vLLM OpenAI-compatible Chat Completion API directly to this component using the `generation_kwargs` parameter in `__init__` or in the `run` method. vLLM-specific parameters not part of the standard OpenAI API (such as `top_k`, `min_tokens`, `repetition_penalty`) can be passed through `generation_kwargs["extra_body"]`. For more details, see the [vLLM documentation](https://docs.vllm.ai/en/stable/serving/openai_compatible_server/).
32+
33+
If the vLLM server was started with `--api-key`, provide the API key through the `VLLM_API_KEY` environment variable or the `api_key` init parameter using Haystack's [Secret](../../concepts/secret-management.mdx) API.
34+
35+
### Tool Support
36+
37+
`VLLMChatGenerator` supports function calling through the `tools` parameter, which accepts flexible tool configurations:
38+
39+
- **A list of Tool objects**: Pass individual tools as a list
40+
- **A single Toolset**: Pass an entire Toolset directly
41+
- **Mixed Tools and Toolsets**: Combine multiple Toolsets with standalone tools in a single list
42+
43+
This allows you to organize related tools into logical groups while also including standalone tools as needed.
44+
45+
For tool calling to work, the vLLM server must be started with `--enable-auto-tool-choice` and `--tool-call-parser`. The available tool call parsers depend on the model. See the [vLLM tool calling docs](https://docs.vllm.ai/en/stable/features/tool_calling/) for the full list.
46+
47+
For more details on working with tools, see the [Tool](../../tools/tool.mdx) and [Toolset](../../tools/toolset.mdx) documentation.
48+
49+
### Streaming
50+
51+
`VLLMChatGenerator` supports [streaming](guides-to-generators/choosing-the-right-generator.mdx#streaming-support) responses from the LLM, allowing tokens to be emitted as they are generated. To enable streaming, pass a callable to the `streaming_callback` parameter during initialization.
52+
53+
### Reasoning models
54+
55+
`VLLMChatGenerator` supports reasoning models. To use them, start the vLLM server with the appropriate `--reasoning-parser`. The reasoning content produced by the model is exposed in the `reasoning` field of the returned `ChatMessage`.
56+
57+
## Usage
58+
59+
Install the `vllm-haystack` package to use the `VLLMChatGenerator`:
60+
61+
```shell
62+
pip install vllm-haystack
63+
```
64+
65+
### Starting the vLLM server
66+
67+
Before using this component, start a vLLM server:
68+
69+
```bash
70+
vllm serve Qwen/Qwen3-4B-Instruct-2507
71+
```
72+
73+
For reasoning models, start the server with the appropriate reasoning parser:
74+
75+
```bash
76+
vllm serve Qwen/Qwen3-0.6B --reasoning-parser qwen3
77+
```
78+
79+
For tool calling, start the server with `--enable-auto-tool-choice` and `--tool-call-parser`:
80+
81+
```bash
82+
vllm serve Qwen/Qwen3-0.6B --enable-auto-tool-choice --tool-call-parser hermes
83+
```
84+
85+
For details on server options, see the [vLLM CLI docs](https://docs.vllm.ai/en/stable/cli/serve/).
86+
87+
### On its own
88+
89+
Basic usage:
90+
91+
```python
92+
from haystack.dataclasses import ChatMessage
93+
from haystack_integrations.components.generators.vllm import VLLMChatGenerator
94+
95+
generator = VLLMChatGenerator(
96+
model="Qwen/Qwen3-4B-Instruct-2507",
97+
generation_kwargs={"max_tokens": 512, "temperature": 0.7},
98+
)
99+
100+
messages = [ChatMessage.from_user("What's Natural Language Processing?")]
101+
response = generator.run(messages=messages)
102+
print(response["replies"][0].text)
103+
```
104+
105+
### With vLLM-specific parameters
106+
107+
Pass vLLM-specific parameters through the `generation_kwargs["extra_body"]` dictionary:
108+
109+
```python
110+
from haystack_integrations.components.generators.vllm import VLLMChatGenerator
111+
112+
generator = VLLMChatGenerator(
113+
model="Qwen/Qwen3-4B-Instruct-2507",
114+
generation_kwargs={
115+
"max_tokens": 512,
116+
"extra_body": {
117+
"top_k": 50,
118+
"min_tokens": 10,
119+
"repetition_penalty": 1.1,
120+
},
121+
},
122+
)
123+
```
124+
125+
### With tool calling
126+
127+
Start the vLLM server with `--enable-auto-tool-choice` and `--tool-call-parser`, then:
128+
129+
```python
130+
from haystack.dataclasses import ChatMessage
131+
from haystack.tools import tool
132+
from haystack_integrations.components.generators.vllm import VLLMChatGenerator
133+
134+
135+
@tool
136+
def weather(city: str) -> str:
137+
"""Get the weather in a given city."""
138+
return f"The weather in {city} is sunny"
139+
140+
141+
generator = VLLMChatGenerator(model="Qwen/Qwen3-0.6B", tools=[weather])
142+
143+
messages = [ChatMessage.from_user("What is the weather in Paris?")]
144+
response = generator.run(messages=messages)
145+
print(response["replies"][0].tool_calls)
146+
```
147+
148+
### With reasoning models
149+
150+
Start the vLLM server with `--reasoning-parser`, then:
151+
152+
```python
153+
from haystack.dataclasses import ChatMessage
154+
from haystack_integrations.components.generators.vllm import VLLMChatGenerator
155+
156+
generator = VLLMChatGenerator(model="Qwen/Qwen3-0.6B")
157+
158+
messages = [ChatMessage.from_user("Solve step by step: what is 15 * 37?")]
159+
response = generator.run(messages=messages)
160+
reply = response["replies"][0]
161+
if reply.reasoning:
162+
print("Reasoning:", reply.reasoning.reasoning_text)
163+
print("Answer:", reply.text)
164+
```
165+
166+
### In a pipeline
167+
168+
```python
169+
from haystack import Pipeline
170+
from haystack.components.builders import ChatPromptBuilder
171+
from haystack.dataclasses import ChatMessage
172+
from haystack_integrations.components.generators.vllm import VLLMChatGenerator
173+
174+
prompt_builder = ChatPromptBuilder()
175+
llm = VLLMChatGenerator(model="Qwen/Qwen3-4B-Instruct-2507")
176+
177+
pipe = Pipeline()
178+
pipe.add_component("prompt_builder", prompt_builder)
179+
pipe.add_component("llm", llm)
180+
pipe.connect("prompt_builder.prompt", "llm.messages")
181+
182+
messages = [
183+
ChatMessage.from_system("Give brief answers."),
184+
ChatMessage.from_user("Tell me about {{city}}"),
185+
]
186+
187+
response = pipe.run(
188+
data={
189+
"prompt_builder": {
190+
"template": messages,
191+
"template_variables": {"city": "Berlin"},
192+
},
193+
},
194+
)
195+
print(response)
196+
```

0 commit comments

Comments
 (0)