
How to install vLLM

Install Docker, then start a container from the vllm-rocm image:

docker run -it \
   --network=host \
   --group-add=video \
   --ipc=host \
   --cap-add=SYS_PTRACE \
   --security-opt seccomp=unconfined \
   --device /dev/kfd \
   --device /dev/dri \
   -v /ai/vllm:/app/model \
   vllm-rocm \
   bash
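Before starting the container, it can help to confirm that the two device nodes passed with --device exist on the host. The following is a minimal sketch; `check_devices` is a hypothetical helper name, not something from the original notes or from ROCm itself.

```shell
# Hypothetical helper (not part of the original notes): report which of the
# given device paths exist on the host. Returns 1 if any path is missing.
check_devices() {
    local dev missing=0
    for dev in "$@"; do
        if [ -e "$dev" ]; then
            echo "ok: $dev"
        else
            echo "missing: $dev"
            missing=1
        fi
    done
    return "$missing"
}

# The two paths handed to --device in the docker run command above.
check_devices /dev/kfd /dev/dri \
    || echo "ROCm device nodes not found; check the driver installation"
```

If either path is reported missing, the container will most likely start without GPU access.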

Parameters that work with the 7900 XTX (the most problematic are the caches, which increase VRAM usage):

max_num_seqs => Maximum number of sequences (prompts or requests) that are processed concurrently in one batch. For personal use on a consumer GPU: 16–32.
max_seq_len_to_capture => Maximum sequence length covered by captured CUDA graphs; sequences with a longer context fall back to eager mode. On a consumer GPU: 20–50.
max_num_batched_tokens => Maximum total number of tokens, across all sequences in a batch, that are processed in a single iteration. For personal use on a consumer GPU: 1024–2048.
max_model_len => Should be set to the model's context length; it has no impact on VRAM usage.
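To get a feeling for how much VRAM the KV cache can take, here is a rough back-of-the-envelope estimate. The architecture numbers (36 layers, 8 KV heads, head dimension 128) are assumptions based on the published Qwen3-8B config and should be verified against the model's config.json; `kv_cache_bytes` is a hypothetical helper name.

```shell
# Rough KV-cache size for a given number of cached tokens.
# ASSUMPTION: architecture values for Qwen3-8B (36 layers, 8 KV heads,
# head dimension 128); check config.json before relying on them.
kv_cache_bytes() {
    local tokens=$1
    local layers=36 kv_heads=8 head_dim=128
    local bytes_per_elem=2   # float16
    # Factor 2: one tensor for keys and one for values, per layer.
    echo $(( 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens ))
}

# KV cache needed to hold 8192 tokens, printed in MiB.
echo "$(( $(kv_cache_bytes 8192) / 1024 / 1024 )) MiB"
```

Under these assumptions each token costs 144 KiB, so 8192 cached tokens need 1152 MiB. This only covers the KV cache itself, not model weights or graph capture.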

model=Qwen/Qwen3-8B
tp=1
dtype=float16
kv_cache_dtype=auto
max_num_seqs=32  
max_seq_len_to_capture=128
max_num_batched_tokens=2048
max_model_len=8192
swap_space=16

Run vLLM inside the Docker container:

vllm serve "$model" \
    -tp "$tp" \
    --dtype "$dtype" \
    --kv-cache-dtype "$kv_cache_dtype" \
    --max-num-seqs "$max_num_seqs" \
    --max-seq-len-to-capture "$max_seq_len_to_capture" \
    --max-num-batched-tokens "$max_num_batched_tokens" \
    --max-model-len "$max_model_len" \
    --swap-space "$swap_space"
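Once the server is up, a quick test request shows whether the model answers. This is a minimal sketch that assumes the server listens on vLLM's default port 8000 on localhost (the docker run command uses --network=host, so the port should be reachable from the host). `completion_payload` and `query_vllm` are hypothetical helper names.

```shell
# ASSUMPTION: vllm serve is listening on its default port 8000 on localhost.
BASE_URL="${BASE_URL:-http://localhost:8000}"

# Build the JSON body for the OpenAI-compatible /v1/completions endpoint.
# Note: the prompt is inserted as-is, so it must not contain quotes or
# backslashes (no JSON escaping is done here).
completion_payload() {
    printf '{"model": "%s", "prompt": "%s", "max_tokens": %d}' "$1" "$2" "$3"
}

# Send a test prompt; requires curl and a running server.
query_vllm() {
    curl -s "$BASE_URL/v1/completions" \
        -H "Content-Type: application/json" \
        -d "$(completion_payload "$1" "$2" "$3")"
}

# Example call (commented out so the snippet can be sourced safely):
# query_vllm "Qwen/Qwen3-8B" "Hello, my name is" 32
```

The model name in the request has to match the name passed to vllm serve.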
