How to install vLLM

Install Docker, then start the vllm-rocm container with the GPU devices passed through:

docker run -it \
   --network=host \
   --group-add=video \
   --ipc=host \
   --cap-add=SYS_PTRACE \
   --security-opt seccomp=unconfined \
   --device /dev/kfd \
   --device /dev/dri \
   -v /ai/vllm:/app/model \
   vllm-rocm \
   bash
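
Before starting the container, it can help to confirm that the paths the command relies on exist on the host. The following is a minimal sketch, not part of the original setup: `check_paths` is a hypothetical helper that reports the first missing path among the ROCm device nodes (`/dev/kfd`, `/dev/dri`) and the model directory (`/ai/vllm`) used above.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helper (not part of vLLM or ROCm).
# Prints "ok" if every given path exists, otherwise names the first missing one.
check_paths() {
    local p
    for p in "$@"; do
        if [ ! -e "$p" ]; then
            echo "missing: $p"
            return 1
        fi
    done
    echo "ok"
}

# Usage on the host, with the paths from the docker run command above:
#   check_paths /dev/kfd /dev/dri /ai/vllm
```

A missing `/dev/kfd` usually means the AMD GPU kernel driver is not loaded, in which case the container will start but will not see the GPU.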

Parameters that work on a 7900 XTX. The caches are the most problematic part, because they increase VRAM usage:

  • max_num_seqs => The maximum number of sequences (prompts or requests) that the engine processes concurrently. For personal use on a consumer GPU: 16–32.
  • max_seq_len_to_capture => The maximum sequence length covered by captured GPU graphs; longer sequences fall back to eager mode. It does not limit the length of the sequences the model accepts. On a consumer GPU: 20–50 (the configuration below uses 128).
  • max_num_batched_tokens => The maximum total number of tokens, summed over all sequences in a batch, that the engine processes in one step. For personal use on a consumer GPU: 1024–2048.
  • max_model_len => The maximum context length (prompt plus generated tokens) of a single request. Set it to the model's context length or lower. It does not change how much VRAM vLLM reserves, but the KV cache must be able to hold at least one sequence of this length.
model=Qwen/Qwen3-8B
tp=1
dtype=float16
kv_cache_dtype=auto
max_num_seqs=32  
max_seq_len_to_capture=128
max_num_batched_tokens=2048
max_model_len=8192
swap_space=16
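
The cache pressure mentioned above can be estimated with simple arithmetic. The sketch below computes the KV-cache footprint of a single token and of one full-length sequence. The architecture numbers (36 layers, 8 KV heads, head dimension 128) are assumptions about Qwen3-8B and should be checked against the model's config.json. The formula, 2 × layers × KV heads × head dimension × bytes per element, is the usual size of keys plus values with a 16-bit cache; it ignores vLLM's block granularity and other overheads.

```shell
#!/usr/bin/env bash
# Rough KV-cache estimate. ASSUMED model shape for Qwen3-8B; verify in config.json.
num_layers=36        # assumption
num_kv_heads=8       # assumption
head_dim=128         # assumption
bytes_per_elem=2     # 16-bit cache (kv_cache_dtype=auto with dtype=float16)
max_model_len=8192
max_num_seqs=32

# Keys and values (factor 2) for every layer and KV head.
kv_bytes_per_token=$(( 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem ))
# One sequence that uses the full context window.
kv_mib_per_seq=$(( kv_bytes_per_token * max_model_len / 1024 / 1024 ))
# Worst case: every concurrent sequence at full length.
kv_mib_worst=$(( kv_mib_per_seq * max_num_seqs ))

echo "per token: ${kv_bytes_per_token} bytes"
echo "per full sequence: ${kv_mib_per_seq} MiB"
echo "worst case for ${max_num_seqs} sequences: ${kv_mib_worst} MiB"
```

Under these assumptions one full 8192-token sequence needs about 1.1 GiB of cache, and 32 such sequences would need 36 GiB, which is more than the 24 GB of a 7900 XTX. The 32 slots can therefore only be filled at once when most sequences are much shorter than the limit.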

Run vLLM inside the Docker container:

vllm serve "$model" \
   -tp "$tp" \
   --dtype "$dtype" \
   --kv-cache-dtype "$kv_cache_dtype" \
   --max-num-seqs "$max_num_seqs" \
   --max-seq-len-to-capture "$max_seq_len_to_capture" \
   --max-num-batched-tokens "$max_num_batched_tokens" \
   --max-model-len "$max_model_len" \
   --swap-space "$swap_space"
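
Once the server is up, a quick request confirms that it is serving the model. This is a sketch under assumptions: `vllm serve` exposes an OpenAI-compatible HTTP API, by default on port 8000, and with `--network=host` it is reachable from the host at localhost. `build_payload` and `smoke_test` are hypothetical helper names, and the prompt is only an example.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a smoke test; the names are illustrative only.

# Build a JSON body for the /v1/completions endpoint.
# Arguments: model name, prompt (must not contain quotes or backslashes), max tokens.
build_payload() {
    printf '{"model": "%s", "prompt": "%s", "max_tokens": %d}' "$1" "$2" "$3"
}

# Send one completion request to a running server (default port 8000 assumed).
smoke_test() {
    local base_url="${1:-http://localhost:8000}"
    curl -s "$base_url/v1/completions" \
        -H "Content-Type: application/json" \
        -d "$(build_payload "Qwen/Qwen3-8B" "Hello, my name is" 16)"
}

# Usage, after vllm serve has finished loading the model:
#   smoke_test
```

A JSON response containing a `choices` array indicates that the model loaded and generation works. A connection error usually means the server is still starting or listens on a different port.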