# vllm-haystack



## Contributing

Refer to the general Contribution Guidelines.

To run the integration tests locally, you need three vLLM servers running in parallel: one for the chat generator on port 8000, one for the embedders on port 8001, and one for the ranker on port 8002. Refer to the workflow file for more details.
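
Once all three servers are up (see below for how to start them), a quick way to confirm they are reachable is to query the OpenAI-compatible `/v1/models` endpoint that vLLM exposes on each port:

```shell
# sanity check: each server should report the model it is serving
curl http://localhost:8000/v1/models   # chat generator
curl http://localhost:8001/v1/models   # embedders
curl http://localhost:8002/v1/models   # ranker
```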

For example, on macOS, you can install vLLM-metal and start the chat generator server with:

# chat generator server (port 8000)
source ~/.venv-vllm-metal/bin/activate && vllm serve Qwen/Qwen3-0.6B --reasoning-parser qwen3 --max-model-len 1024 --enforce-eager --enable-auto-tool-choice --tool-call-parser hermes
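
To verify the chat generator end to end, you can send a minimal request to its OpenAI-compatible chat completions endpoint (a small smoke test, assuming the port and model above):

```shell
# smoke test for the chat generator server
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 32}'
```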

vLLM-metal does not support embedding models. On macOS, you can run the embedders server via the CPU Docker image:

# embedders server (port 8001)
docker run --rm -p 8001:8000 -e VLLM_CPU_OMP_THREADS_BIND=0-3 vllm/vllm-openai-cpu:latest \
    --model sentence-transformers/all-MiniLM-L6-v2 --enforce-eager
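
The embedders server can be checked against the OpenAI-compatible embeddings endpoint (again a minimal sketch, using the model served above):

```shell
# smoke test for the embedders server
curl http://localhost:8001/v1/embeddings \
    -H "Content-Type: application/json" \
    -d '{"model": "sentence-transformers/all-MiniLM-L6-v2", "input": "Haystack integration test"}'
```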

To run the ranker server, use the CPU Docker image:

# ranker server (port 8002)
docker run --rm -p 8002:8000 -e VLLM_CPU_OMP_THREADS_BIND=0-3 vllm/vllm-openai-cpu:latest \
    --model BAAI/bge-reranker-base --enforce-eager
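
Recent vLLM versions also expose a rerank endpoint for cross-encoder models; assuming the image above includes it, a quick smoke test for the ranker could look like:

```shell
# smoke test for the ranker server (requires a vLLM build with the rerank endpoint)
curl http://localhost:8002/v1/rerank \
    -H "Content-Type: application/json" \
    -d '{"model": "BAAI/bge-reranker-base", "query": "What is vLLM?", "documents": ["vLLM is a fast inference engine.", "Haystack is an LLM framework."]}'
```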