feat: add vLLM Chat Generator #3099
Merged
Changes from all 26 commits:
All commits are by anakin87:

- d07592a draft vllm integration
- 64afd71 workflow + more tests
- ee7606f simplify installation
- 49ee751 improve
- 837a5d1 increase timeout and pydoc fix
- cd60d84 show logs
- 63124fa cpu
- b551b0d install from github
- e2d6fe2 retry
- 4dc6794 better wf and explanation
- 23bd80e Merge branch 'main' into vllm-chat-generator
- c541848 better readme + 3.14
- 3172864 back to 3.13
- db0d550 drop inheritance
- ff18d66 Merge branch 'main' into vllm-chat-generator
- c17d42b align coverage
- ec626e6 clarify 3.14 incompatibility
- 8a062d8 Merge branch 'main' into vllm-chat-generator
- 286f3bb better tests
- 4cf1da3 test with Qwen3.5-0.8, image tests, better API key handling
- aab1f53 use Qwen 3 0.6B; better tests
- ee91faf better api key handling
- 0a39be7 Merge branch 'main' into vllm-chat-generator
- 662baf5 missing inits
- ec76876 Merge branch 'vllm-chat-generator' of https://github.com/deepset-ai/h…
- 8ca2088 better streaming handling + tests
Change to the GitHub workflows linter workflow (renamed):

```diff
@@ -1,4 +1,4 @@
-name: Github workflows linter
+name: Core / Github workflows linter

 on:
   pull_request:
```
New workflow file, `.github/workflows/vllm.yml` (180 lines added):

```yaml
# This workflow comes from https://github.com/ofek/hatch-mypyc
# https://github.com/ofek/hatch-mypyc/blob/5a198c0ba8660494d02716cfc9d79ce4adfb1442/.github/workflows/test.yml
name: Test / vllm

on:
  schedule:
    - cron: "0 0 * * *"
  pull_request:
    paths:
      - "integrations/vllm/**"
      - "!integrations/vllm/*.md"
      - ".github/workflows/vllm.yml"
  push:
    branches:
      - main
    paths:
      - "integrations/vllm/**"
      - "!integrations/vllm/*.md"
      - ".github/workflows/vllm.yml"

defaults:
  run:
    working-directory: integrations/vllm

concurrency:
  group: vllm-${{ github.head_ref || github.sha }}
  cancel-in-progress: true

env:
  PYTHONUNBUFFERED: "1"
  FORCE_COLOR: "1"
  VLLM_MODEL: "Qwen/Qwen3-0.6B"
  # we only test on Ubuntu to keep vLLM server running simple
  TEST_MATRIX_OS: '["ubuntu-latest"]'
  # vLLM is not compatible with Python 3.14. https://github.com/vllm-project/vllm/issues/34096
  TEST_MATRIX_PYTHON: '["3.10", "3.13"]'

jobs:
  compute-test-matrix:
    runs-on: ubuntu-slim
    defaults:
      run:
        working-directory: .
    outputs:
      os: ${{ steps.set.outputs.os }}
      python-version: ${{ steps.set.outputs.python-version }}
    steps:
      - id: set
        run: |
          echo 'os=${{ github.event_name == 'push' && '["ubuntu-latest"]' || env.TEST_MATRIX_OS }}' >> "$GITHUB_OUTPUT"
          echo 'python-version=${{ github.event_name == 'push' && '["3.10"]' || env.TEST_MATRIX_PYTHON }}' >> "$GITHUB_OUTPUT"

  run:
    name: Python ${{ matrix.python-version }} on ${{ startsWith(matrix.os, 'macos-') && 'macOS' || startsWith(matrix.os, 'windows-') && 'Windows' || 'Linux' }}
    needs: compute-test-matrix
    permissions:
      contents: write
      pull-requests: write
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        os: ${{ fromJSON(needs.compute-test-matrix.outputs.os) }}
        python-version: ${{ fromJSON(needs.compute-test-matrix.outputs.python-version) }}

    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install Hatch
        run: pip install hatch

      - name: Install vLLM (CPU)
        run: |
          # vLLM on PyPI is GPU-only and requires CUDA, so it won't run on CPU-only systems.
          # CPU wheels are not published to PyPI; they are only available as direct downloads from GitHub releases.
          # We fetch the latest release and install the appropriate x86 CPU wheel.
          # The --torch-backend cpu flag ensures uv installs PyTorch from the official CPU-only index,
          # since the required torch+cpu builds are also not available on PyPI.
          VLLM_VERSION="$(curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest | jq -r .tag_name | sed 's/^v//')"
          export VLLM_VERSION
          echo "Installing vLLM ${VLLM_VERSION} (CPU)"
          hatch run -- uv pip install \
            "https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cpu-cp38-abi3-manylinux_2_35_x86_64.whl" \
            --torch-backend cpu

      - name: Start vLLM server
        env:
          VLLM_TARGET_DEVICE: "cpu"
          VLLM_CPU_KVCACHE_SPACE: "4"
        run: |
          nohup hatch run -- vllm serve ${{ env.VLLM_MODEL }} \
            --reasoning-parser qwen3 \
            --max-model-len 1024 \
            --enforce-eager \
            --dtype bfloat16 \
            --enable-auto-tool-choice \
            --tool-call-parser hermes \
            --max-num-seqs 1 &

          # Wait for the vLLM server to be ready with a timeout of 300 seconds
          timeout=300
          while [ $timeout -gt 0 ] && ! curl -sSf http://localhost:8000/health > /dev/null 2>&1; do
            echo "Waiting for vLLM server to start..."
            sleep 10
            ((timeout-=10))
          done

          if [ $timeout -eq 0 ]; then
            echo "Timed out waiting for vLLM server to start."
            exit 1
          fi

          echo "vLLM server started successfully."

      - name: Lint
        if: matrix.python-version == '3.10' && runner.os == 'Linux'
        run: hatch run fmt-check && hatch run test:types

      - name: Run unit tests
        run: hatch run test:unit-cov-retry

      # On PR: posts coverage comment (directly on same-repo PRs; via artifact for fork PRs).
      # On push to main: stores coverage baseline on data branch.
      - name: Store unit tests coverage
        id: coverage_comment
        if: matrix.python-version == '3.10' && runner.os == 'Linux' && github.event_name != 'schedule'
        uses: py-cov-action/python-coverage-comment-action@7188638f871f721a365d644f505d1ff3df20d683 # v3.40
        with:
          GITHUB_TOKEN: ${{ github.token }}
          COVERAGE_PATH: integrations/vllm
          SUBPROJECT_ID: vllm
          MINIMUM_GREEN: 90
          MINIMUM_ORANGE: 60

      - name: Upload coverage comment to be posted
        if: matrix.python-version == '3.10' && runner.os == 'Linux' && github.event_name == 'pull_request' && steps.coverage_comment.outputs.COMMENT_FILE_WRITTEN == 'true'
        uses: actions/upload-artifact@bbbca2ddaa5d8feaa63e36b76fdaad77386f024f # v7.0.0
        with:
          name: coverage-comment-vllm
          path: python-coverage-comment-action-vllm.txt

      - name: Run integration tests
        run: hatch run test:integration-cov-append-retry

      - name: Store combined coverage
        if: github.event_name == 'push'
        uses: py-cov-action/python-coverage-comment-action@7188638f871f721a365d644f505d1ff3df20d683 # v3.40
        with:
          GITHUB_TOKEN: ${{ github.token }}
          COVERAGE_PATH: integrations/vllm
          SUBPROJECT_ID: vllm-combined
          MINIMUM_GREEN: 90
          MINIMUM_ORANGE: 60

      - name: Run unit tests with lowest direct dependencies
        if: github.event_name != 'push'
        run: |
          hatch run uv pip compile pyproject.toml --resolution lowest-direct --output-file requirements_lowest_direct.txt
          hatch -e test env run -- uv pip install -r requirements_lowest_direct.txt
          hatch run test:unit

      - name: Nightly - run unit tests with Haystack main branch
        if: github.event_name == 'schedule'
        run: |
          hatch env prune
          hatch -e test env run -- uv pip install git+https://github.com/deepset-ai/haystack.git@main
          hatch run test:unit

  notify-slack-on-failure:
    needs: run
    if: failure() && github.event_name == 'schedule'
    runs-on: ubuntu-slim
    steps:
      - uses: deepset-ai/notify-slack-action@3cda73b77a148f16f703274198e7771340cf862b # v1
        with:
          slack-webhook-url: ${{ secrets.SLACK_WEBHOOK_URL_NOTIFICATIONS }}
```
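The readiness loop in the "Start vLLM server" step can be sketched as a small reusable helper. This is a hypothetical local-use sketch, not part of the workflow; the `wait_for` function name and argument order are illustrative assumptions:

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the workflow's readiness loop:
# repeatedly run a probe command until it succeeds or a time budget expires.
# Usage: wait_for <timeout_seconds> <interval_seconds> <command...>
wait_for() {
  local timeout=$1 interval=$2
  shift 2
  while [ "$timeout" -gt 0 ] && ! "$@" > /dev/null 2>&1; do
    sleep "$interval"
    timeout=$((timeout - interval))
  done
  # Success only if the probe succeeded before the budget ran out.
  [ "$timeout" -gt 0 ]
}

# Against a live server one would probe the health endpoint, e.g.:
#   wait_for 300 10 curl -sSf http://localhost:8000/health
# Here `true` succeeds on the first check, so the helper returns immediately.
if wait_for 5 1 true; then
  echo "ready"
else
  echo "timed out"
fi
```

Like the workflow's loop, this reports failure when the probe only becomes healthy after the last interval, since the remaining budget is already zero by then.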