Commit 5d62dd6

committed
fixes
1 parent 5207261 commit 5d62dd6

4 files changed

Lines changed: 81 additions & 13 deletions

.github/workflows/vllm.yml

Lines changed: 32 additions & 5 deletions
@@ -30,6 +30,7 @@ env:
   PYTHONUNBUFFERED: "1"
   FORCE_COLOR: "1"
   VLLM_MODEL: "Qwen/Qwen3-0.6B"
+  VLLM_EMBEDDING_MODEL: "sentence-transformers/all-MiniLM-L6-v2"
   # we only test on Ubuntu to keep vLLM server running simple
   TEST_MATRIX_OS: '["ubuntu-latest"]'
   # vLLM is not compatible with Python 3.14. https://github.com/vllm-project/vllm/issues/34096
@@ -88,12 +89,13 @@ jobs:
             "https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cpu-cp38-abi3-manylinux_2_35_x86_64.whl" \
             --torch-backend cpu
 
-      - name: Start vLLM server
+      - name: Start vLLM chat server
         env:
           VLLM_TARGET_DEVICE: "cpu"
           VLLM_CPU_KVCACHE_SPACE: "4"
         run: |
           nohup hatch run -- vllm serve ${{ env.VLLM_MODEL }} \
+            --port 8000 \
             --reasoning-parser qwen3 \
             --max-model-len 1024 \
             --enforce-eager \
@@ -102,20 +104,45 @@ jobs:
             --tool-call-parser hermes \
             --max-num-seqs 1 &
 
-          # Wait for the vLLM server to be ready with a timeout of 300 seconds
+          # Wait for the vLLM chat server to be ready with a timeout of 300 seconds
           timeout=300
           while [ $timeout -gt 0 ] && ! curl -sSf http://localhost:8000/health > /dev/null 2>&1; do
-            echo "Waiting for vLLM server to start..."
+            echo "Waiting for vLLM chat server to start..."
             sleep 10
             ((timeout-=10))
           done
 
           if [ $timeout -eq 0 ]; then
-            echo "Timed out waiting for vLLM server to start."
+            echo "Timed out waiting for vLLM chat server to start."
             exit 1
           fi
 
-          echo "vLLM server started successfully."
+          echo "vLLM chat server started successfully."
+
+      - name: Start vLLM embedding server
+        env:
+          VLLM_TARGET_DEVICE: "cpu"
+          VLLM_CPU_KVCACHE_SPACE: "4"
+        run: |
+          nohup hatch run -- vllm serve ${{ env.VLLM_EMBEDDING_MODEL }} \
+            --port 8001 \
+            --enforce-eager \
+            --max-num-seqs 1 &
+
+          # Wait for the vLLM embedding server to be ready with a timeout of 300 seconds
+          timeout=300
+          while [ $timeout -gt 0 ] && ! curl -sSf http://localhost:8001/health > /dev/null 2>&1; do
+            echo "Waiting for vLLM embedding server to start..."
+            sleep 10
+            ((timeout-=10))
+          done
+
+          if [ $timeout -eq 0 ]; then
+            echo "Timed out waiting for vLLM embedding server to start."
+            exit 1
+          fi
+
+          echo "vLLM embedding server started successfully."
 
       - name: Lint
         if: matrix.python-version == '3.10' && runner.os == 'Linux'
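
The readiness loop in both "Start vLLM ... server" steps follows the same pattern: probe `/health`, sleep 10 seconds, and give up once the 300-second budget is spent. As a rough illustration only, the same logic could be sketched in Python as below. The names `http_ok` and `wait_until_ready` are hypothetical and not part of this commit; the `probe` and `sleep` parameters are an assumption added so the loop can be exercised without a running server.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_ok(url: str) -> bool:
    """Return True if a GET on `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # Connection refused, HTTP error status, or a malformed reply:
        # treat all of them as "not ready yet".
        return False


def wait_until_ready(
    url: str,
    timeout_s: int = 300,
    interval_s: int = 10,
    probe: Callable[[str], bool] = http_ok,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until `probe` succeeds or the time budget is used up."""
    remaining = timeout_s
    while remaining > 0:
        if probe(url):
            return True
        sleep(interval_s)
        remaining -= interval_s
    # Like the shell loop, no final probe is made once the budget hits zero.
    return False
```

Calling `wait_until_ready("http://localhost:8000/health")` and then `wait_until_ready("http://localhost:8001/health")` would mirror the two waits in the workflow, with the ports taken from the diff above.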

integrations/vllm/README.md

Lines changed: 7 additions & 2 deletions
@@ -13,12 +13,17 @@ Refer to the general [Contribution Guidelines](https://github.com/deepset-ai/hay
 
 To run integration tests locally, you need two vLLM servers running in parallel: one for the chat generator on port `8000` and one for the embedders on port `8001`. Refer to the [workflow file](https://github.com/deepset-ai/haystack-core-integrations/blob/main/.github/workflows/vllm.yml) for more details.
 
-For example, on macOs, you can install [vLLM-metal](https://github.com/vllm-project/vllm-metal) and start both servers with:
+For example, on macOs, you can install [vLLM-metal](https://github.com/vllm-project/vllm-metal) and start the chat generator server with:
 
 ```bash
 # chat generator server (port 8000)
 source ~/.venv-vllm-metal/bin/activate && vllm serve Qwen/Qwen3-0.6B --reasoning-parser qwen3 --max-model-len 1024 --enforce-eager --enable-auto-tool-choice --tool-call-parser hermes
+```
 
+vLLM-metal does not support embedding models. On macOS, you can run the embedding server via CPU Docker image:
+
+```bash
 # embedders server (port 8001)
-source ~/.venv-vllm-metal/bin/activate && vllm serve sergeyzh/rubert-tiny-turbo --port 8001 --enforce-eager --max-num-seqs 1
+docker run --rm -p 8001:8000 -e VLLM_CPU_OMP_THREADS_BIND=0-3 vllm/vllm-openai-cpu:latest \
+  --model sentence-transformers/all-MiniLM-L6-v2 --enforce-eager
 ```
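
Once the embedding server is listening on port `8001`, a quick manual check is to call its OpenAI-compatible `/v1/embeddings` endpoint before running the test suite. The following is a minimal sketch using only the standard library. The helper names (`build_embeddings_request`, `extract_embeddings`, `embed`) are hypothetical, and the response shape (`data[i].index`, `data[i].embedding`) is assumed to follow the OpenAI embeddings format.

```python
import json
import urllib.request

# Values taken from the README and the test modules in this commit.
API_BASE_URL = "http://localhost:8001/v1"
MODEL = "sentence-transformers/all-MiniLM-L6-v2"


def build_embeddings_request(texts, model=MODEL, api_base_url=API_BASE_URL):
    """Build a POST request for the OpenAI-compatible embeddings endpoint."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{api_base_url}/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_embeddings(response_body):
    """Return the embedding vectors ordered by their `index` field."""
    items = sorted(response_body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed(texts):
    """Send `texts` to the local server (requires the server to be running)."""
    request = build_embeddings_request(texts)
    with urllib.request.urlopen(request, timeout=30) as response:
        return extract_embeddings(json.load(response))
```

With the Docker container from the README running, `embed(["The food was delicious"])` should return one vector; the helpers themselves can be inspected without any server.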

integrations/vllm/tests/test_document_embedder.py

Lines changed: 32 additions & 4 deletions
@@ -3,6 +3,7 @@
 # SPDX-License-Identifier: Apache-2.0
 from unittest.mock import AsyncMock, MagicMock
 
+import numpy as np
 import pytest
 from haystack import Document
 from haystack.utils import Secret
@@ -12,7 +13,7 @@
 
 from haystack_integrations.components.embedders.vllm import VLLMDocumentEmbedder
 
-MODEL = "sergeyzh/rubert-tiny-turbo"
+MODEL = "sentence-transformers/all-MiniLM-L6-v2"
 API_BASE_URL = "http://localhost:8001/v1"
 
 
@@ -235,12 +236,13 @@ async def test_run_async(self):
         assert [d.embedding for d in result["documents"]] == [[0.5], [0.6]]
 
     @pytest.mark.integration
-    def test_run(self):
+    def test_live_run(self):
         embedder = VLLMDocumentEmbedder(model=MODEL, api_base_url=API_BASE_URL)
 
         docs = [
-            Document(content="I love cheese", meta={"topic": "Cuisine"}),
-            Document(content="A transformer is a deep learning architecture", meta={"topic": "ML"}),
+            Document(content="I love cheese"),
+            Document(content="Cheddar is my favorite food"),
+            Document(content="A transformer is a deep learning architecture"),
         ]
 
         result = embedder.run(docs)
@@ -250,3 +252,29 @@ def test_run(self):
         for doc in docs_with_embeddings:
             assert isinstance(doc.embedding, list)
             assert isinstance(doc.embedding[0], float)
+
+        embeddings = [np.array(d.embedding) for d in docs_with_embeddings]
+
+        def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
+            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
+
+        assert cosine_similarity(embeddings[0], embeddings[1]) > cosine_similarity(embeddings[0], embeddings[2])
+
+    @pytest.mark.integration
+    @pytest.mark.asyncio
+    async def test_live_run_async(self):
+        embedder = VLLMDocumentEmbedder(model=MODEL, api_base_url=API_BASE_URL)
+
+        docs = [
+            Document(content="I love cheese"),
+            Document(content="Cheddar is my favorite food"),
+            Document(content="A transformer is a deep learning architecture"),
+        ]
+
+        result = await embedder.run_async(docs)
+        docs_with_embeddings = result["documents"]
+
+        assert len(docs_with_embeddings) == len(docs)
+        for doc in docs_with_embeddings:
+            assert isinstance(doc.embedding, list)
+            assert isinstance(doc.embedding[0], float)
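
The new assertion in `test_live_run` checks a relative property: the two cheese-related documents should be closer to each other in embedding space than the first one is to the transformer document. The idea can be sketched without a live server or NumPy, as below. The three vectors are made-up toy values chosen for illustration, not output of `all-MiniLM-L6-v2`, and the pure-Python `cosine_similarity` stands in for the NumPy version in the test.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional "embeddings" (real ones have many more dimensions).
cheese = [0.9, 0.1, 0.0]       # "I love cheese"
cheddar = [0.8, 0.3, 0.1]      # "Cheddar is my favorite food"
transformer = [0.0, 0.2, 0.9]  # "A transformer is a deep learning architecture"

# Same shape as the assertion added in the diff above.
related = cosine_similarity(cheese, cheddar)
unrelated = cosine_similarity(cheese, transformer)
assert related > unrelated
```

Comparing two similarities instead of asserting a fixed threshold keeps the test independent of the exact values a given model or vLLM version produces.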

integrations/vllm/tests/test_text_embedder.py

Lines changed: 10 additions & 2 deletions
@@ -10,7 +10,7 @@
 
 from haystack_integrations.components.embedders.vllm import VLLMTextEmbedder
 
-MODEL = "sergeyzh/rubert-tiny-turbo"
+MODEL = "sentence-transformers/all-MiniLM-L6-v2"
 API_BASE_URL = "http://localhost:8001/v1"
 
 
@@ -175,8 +175,16 @@ async def test_run_async(self):
         assert result["embedding"] == [0.3, 0.4]
 
     @pytest.mark.integration
-    def test_run(self):
+    def test_live_run(self):
         embedder = VLLMTextEmbedder(model=MODEL, api_base_url=API_BASE_URL)
         result = embedder.run("The food was delicious")
         assert isinstance(result["embedding"], list)
         assert all(isinstance(x, float) for x in result["embedding"])
+
+    @pytest.mark.asyncio
+    @pytest.mark.integration
+    async def test_live_run_async(self):
+        embedder = VLLMTextEmbedder(model=MODEL, api_base_url=API_BASE_URL)
+        result = await embedder.run_async("The food was delicious")
+        assert isinstance(result["embedding"], list)
+        assert all(isinstance(x, float) for x in result["embedding"])
