docs/source/getting-started/quickstart_vllm.md (12 changes: 11 additions & 1 deletion)

@@ -2,7 +2,7 @@
This document describes how to install unified-cache-management with vLLM on the CUDA platform.

## Prerequisites
- - vllm >=0.9.1, device=cuda (vllm == 0.9.2 to use the Sparse Feature)
+ - vllm >=0.9.1, device=cuda (the Sparse Feature is supported in vllm 0.9.2 and v0.11.0)

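Before picking a patch below, you can confirm which vLLM build is installed (a minimal check; assumes `vllm` is importable in the current environment):

```bash
python -c "import vllm; print(vllm.__version__)"
```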
## Step 1: UCM Installation

@@ -44,6 +44,7 @@ Download the pre-built `vllm/vllm-openai:v0.9.2` docker image and build unified-cache-management
1. Prepare vLLM Environment

For the sake of environment isolation and simplicity, we recommend preparing the vLLM environment by pulling the official, pre-built vLLM Docker image.
> Note: v0.11.0 is newly supported; replace the image tag below with `v0.11.0` if needed.

```bash
docker pull vllm/vllm-openai:v0.9.2
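# Optional sketch (not from the original doc): start an interactive shell in the
# pulled image to run the build steps below. Assumes the NVIDIA Container Toolkit
# is installed; --entrypoint overrides the image's default OpenAI-compatible server.
docker run --gpus all -it --rm --entrypoint bash vllm/vllm-openai:v0.9.2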
Expand Down Expand Up @@ -87,6 +88,15 @@ Download the pre-built `vllm/vllm-openai:v0.9.2` docker image and build unified-
```
Apply the patch that matches your development needs:
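Whichever patch you choose below, you can dry-run it first to confirm it applies cleanly to your checkout (`--check` is a standard `git apply` flag; substitute the patch file for your version):

```bash
# Report whether the patch would apply cleanly, without modifying any files
git apply --check unified-cache-management/ucm/integration/vllm/patch/0.11.0/vllm-adapt-sparse.patch
```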

#### vLLM 0.11.0

Note: v0.11.0 only requires the sparse attention patch.

```bash
git apply unified-cache-management/ucm/integration/vllm/patch/0.11.0/vllm-adapt-sparse.patch
```
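
After it applies, the patched files show up as local modifications (assuming your current directory is the vLLM checkout the patch was applied to):

```bash
git diff --stat  # summarize the files changed by the patch
```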

#### vLLM 0.9.2
- Full UCM integration (recommended):
```bash
git apply unified-cache-management/ucm/integration/vllm/patch/0.9.2/vllm-adapt.patch
```

---
@@ -77,7 +77,7 @@ def build_llm_with_uc(module_path: str, name: str, model: str):
},
}
],
"ucm_sparse_config": {"KvCompOnDevice": {}},
"ucm_sparse_config": {"GSAOnDevice": {}},
},
)
