Run open-source LLMs on remote GPUs with vLLM's high-performance inference engine. Includes a lightweight Web UI and an OpenAI-compatible API endpoint.
cd templates/vllm-models
# (Optional) Edit models.json to configure which model to run
# Start the pod
gpu run

After startup, two endpoints are available on your local machine:

| Endpoint | Port | Description |
|---|---|---|
| Web UI | 8080 | Chat interface at http://localhost:8080 |
| vLLM API | 8000 | OpenAI-compatible API at http://localhost:8000 |
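
To confirm that both forwarded ports are reachable after startup, a small check like the following may help. It is a minimal sketch using only the Python standard library; the helper name `is_port_open` is made up for illustration and is not part of the template.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Ports taken from the endpoint table: 8080 (Web UI) and 8000 (vLLM API).
    for name, port in [("Web UI", 8080), ("vLLM API", 8000)]:
        state = "reachable" if is_port_open("localhost", port) else "not reachable"
        print(f"{name} on port {port}: {state}")
```

An open port only shows that forwarding works. The model may still be loading, so the API can take a little longer to answer requests.
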
- High Performance: vLLM typically provides 2-4x higher throughput than serving with vanilla HuggingFace Transformers
- OpenAI-Compatible API: Drop-in replacement for the OpenAI API, so the OpenAI SDK works unchanged
- Streaming: Real-time response streaming in both UI and API
- KV Cache Optimization: Efficient memory management for long contexts
- Continuous Batching: Handles multiple concurrent requests efficiently
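
For the streaming feature, OpenAI-compatible servers send chat completion chunks as server-sent events: each `data:` line carries a JSON chunk, and `data: [DONE]` ends the stream. The sketch below shows one way to reassemble the text from such lines. The function name `collect_stream_text` and the sample lines are illustrative; this is not the code the bundled Web UI uses.

```python
import json
from typing import Iterable


def collect_stream_text(lines: Iterable[str]) -> str:
    """Join the content deltas from OpenAI-style chat completion stream lines."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # sentinel marking the end of the stream
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)


# Hand-written sample lines in the chunk format described above.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": ", world"}}]}',
    "data: [DONE]",
]
print(collect_stream_text(sample))  # Hello, world
```
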
| Aspect | vLLM | Ollama |
|---|---|---|
| Performance | Higher throughput (PagedAttention) | Good, optimized for single-user |
| Model Loading | Single model at startup | Dynamic pull/switch |
| Memory | More VRAM (KV cache) | Less VRAM overhead |
| API | OpenAI-compatible only | Native + OpenAI + Anthropic |
| Best For | Production workloads, high throughput | Development, model experimentation |
Edit models.json to specify which model to run:
{
"model": "Qwen/Qwen2.5-14B-Instruct",
"vllm_args": {
"gpu_memory_utilization": 0.9,
"max_model_len": 32768,
"tensor_parallel_size": 1
}
}

| Field | Description | Default |
|---|---|---|
| model | HuggingFace model ID | Required |
| gpu_memory_utilization | Fraction of GPU memory to use | 0.9 |
| max_model_len | Maximum context length | 32768 |
| tensor_parallel_size | Number of GPUs for tensor parallelism | 1 |

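
The fields under vllm_args correspond to options of the same name on vLLM's command line (`--gpu-memory-utilization`, `--max-model-len`, `--tensor-parallel-size`). The sketch below shows how such a config could be turned into a `vllm serve` command. How this template actually consumes models.json is not shown here, so treat the helper `build_vllm_command` as an assumption for illustration.

```python
import json
import shlex


def build_vllm_command(config: dict) -> list[str]:
    """Translate a models.json-style dict into a `vllm serve` argument list."""
    command = ["vllm", "serve", config["model"]]
    for key, value in config.get("vllm_args", {}).items():
        # vLLM's CLI spells options with dashes: max_model_len -> --max-model-len
        command += [f"--{key.replace('_', '-')}", str(value)]
    return command


config = json.loads("""
{
  "model": "Qwen/Qwen2.5-14B-Instruct",
  "vllm_args": {
    "gpu_memory_utilization": 0.9,
    "max_model_len": 32768,
    "tensor_parallel_size": 1
  }
}
""")
print(shlex.join(build_vllm_command(config)))
```
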
| Model | Size | VRAM | Auth | Best For |
|---|---|---|---|---|
| Qwen/Qwen2.5-7B-Instruct | 14GB | 24GB | No | Default - good quality, no auth |
| Qwen/Qwen2.5-1.5B-Instruct | 3GB | 8GB | No | Fast, small GPU |
| Qwen/Qwen2.5-Coder-7B-Instruct | 14GB | 24GB | No | Code-focused |
| mistralai/Mistral-7B-Instruct-v0.3 | 14GB | 24GB | No | High quality |
| meta-llama/Llama-3.2-3B-Instruct | 6GB | 12GB | Yes | Small, fast (requires HF token) |
| meta-llama/Llama-3.1-8B-Instruct | 16GB | 24GB | Yes | High quality (requires HF token) |
| zai-org/GLM-4.7-Flash | 19GB | 24GB | No | Excellent quality, 200K context |

Some models require a HuggingFace token:
- Create an account at https://huggingface.co
- Accept the model license (e.g., https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)
- Create a token at https://huggingface.co/settings/tokens
- Set the token in GPU CLI:

gpu auth add hf

curl http://localhost:8000/v1/models

curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-14B-Instruct",
"messages": [{"role": "user", "content": "Hello!"}]
}'

curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-14B-Instruct",
"messages": [{"role": "user", "content": "Write a haiku about GPUs"}],
"stream": true
}'

from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="vllm" # Required but ignored
)
response = client.chat.completions.create(
model="Qwen/Qwen2.5-14B-Instruct",
messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
base_url="http://localhost:8000/v1",
api_key="vllm",
model="Qwen/Qwen2.5-14B-Instruct"
)
response = llm.invoke("Hello!")
print(response.content)

Deploy as a serverless API endpoint that auto-scales and scales to zero:

gpu serverless deploy --gpu "NVIDIA GeForce RTX 4090"

This creates a RunPod Serverless endpoint using the official vLLM worker template. The endpoint exposes the same OpenAI-compatible API you've been using locally.

Configuration is read from the serverless section in gpu.jsonc (scaling, volume, env vars).
CLI flags override config values when provided.
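
The precedence rule above (config file first, CLI flags on top) can be pictured as a simple merge. This is only a sketch of the idea, not the CLI's implementation, and the keys `gpu` and `max_workers` are hypothetical placeholders rather than documented gpu.jsonc fields.

```python
def resolve_settings(config: dict, cli_flags: dict) -> dict:
    """Merge settings so that CLI flags win whenever they were provided."""
    resolved = dict(config)
    for key, value in cli_flags.items():
        if value is not None:  # None means the flag was not passed
            resolved[key] = value
    return resolved


# Hypothetical keys, for illustration only.
file_config = {"gpu": "NVIDIA A40", "max_workers": 3}
cli_flags = {"gpu": "NVIDIA GeForce RTX 4090", "max_workers": None}
print(resolve_settings(file_config, cli_flags))
# {'gpu': 'NVIDIA GeForce RTX 4090', 'max_workers': 3}
```
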
curl -X POST "https://api.runpod.ai/v2/<endpoint-id>/runsync" \
-H "Authorization: Bearer $RUNPOD_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"input": {
"openai_route": "/v1/chat/completions",
"openai_input": {
"model": "Qwen/Qwen2.5-14B-Instruct",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
}
}
}'

gpu serverless status <endpoint-id> # Check workers, queue, scaling
gpu serverless list # List all endpoints
gpu serverless warm <endpoint-id> # Pre-warm a worker
gpu serverless delete <id> --force # Delete endpoint

Unlike Ollama, vLLM loads a single model at startup. To change models:
- Edit models.json with the new model ID
- Restart the pod:

gpu restart

The template is configured to use GPUs in this priority order:
- RTX 4090 (24GB) - Great for 7B-14B models
- A40 (48GB) - Good for 30B models
- L40S (48GB) - Newer, good availability
- A100 80GB - For 70B+ models
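
One way to read the priority list: walk it top to bottom and take the first GPU that is both available and large enough. The sketch below illustrates that idea with the VRAM sizes listed above. The actual selection logic of the GPU CLI is not documented here, so `pick_gpu` is an assumption for illustration only.

```python
from typing import Optional

# (name, VRAM in GB), in the priority order listed above.
GPU_PRIORITY = [
    ("RTX 4090", 24),
    ("A40", 48),
    ("L40S", 48),
    ("A100 80GB", 80),
]


def pick_gpu(required_vram_gb: float, available: set[str]) -> Optional[str]:
    """Return the first GPU in priority order that is available and large enough."""
    for name, vram_gb in GPU_PRIORITY:
        if name in available and vram_gb >= required_vram_gb:
            return name
    return None


print(pick_gpu(24, {"RTX 4090", "A40"}))                # RTX 4090
print(pick_gpu(40, {"RTX 4090", "L40S", "A100 80GB"}))  # L40S
```
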
Edit gpu.jsonc to change GPU preferences:
| Field | Description |
|---|---|
| ports | Ports to forward (default: [8000, 8080]) |
| keep_alive_minutes | Idle time before pod stops (default: 20) |
| gpu_types | Preferred GPUs in priority order |
| min_vram | Minimum VRAM in GB (default: 24) |
| inputs | Optional secrets like hf_token |

| Field | Description |
|---|---|
| model | HuggingFace model ID to load |
| vllm_args | vLLM server configuration |

Check if the model fits in GPU memory:
- 7B models: ~14GB VRAM
- 13B models: ~26GB VRAM
- 70B models: ~140GB VRAM (needs multi-GPU)
Try reducing max_model_len or gpu_memory_utilization in models.json.
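
The VRAM figures above follow from a rule of thumb: at 16-bit precision each parameter takes about 2 bytes, so a 7B model needs roughly 14GB for weights alone. The sketch below applies that rule. It ignores the KV cache and activation memory, so treat a "fits" result as a necessary condition, not a guarantee. The helper names are illustrative.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate memory for model weights; 2 bytes per parameter is 16-bit precision."""
    return params_billion * bytes_per_param


def fits_on_gpu(params_billion: float, gpu_vram_gb: float,
                gpu_memory_utilization: float = 0.9) -> bool:
    """Check whether the weights alone fit in the memory budget vLLM may use."""
    budget_gb = gpu_vram_gb * gpu_memory_utilization
    return weight_memory_gb(params_billion) < budget_gb


for size in (7, 13, 70):
    print(f"{size}B: ~{weight_memory_gb(size):.0f}GB weights, "
          f"fits on 24GB: {fits_on_gpu(size, 24)}")
```

Whatever remains of the budget after the weights is what vLLM can use for the KV cache, which is why a smaller max_model_len can help when memory is tight.
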
Lower the GPU memory utilization:
{
"vllm_args": {
"gpu_memory_utilization": 0.8
}
}

Large models can take 30-120 seconds to load. This is normal - vLLM loads the full model into GPU memory at startup.

- Verify you've accepted the model license on HuggingFace
- Check your token is set:

gpu secret list

- Ensure the token has read access

If port 8000 is already in use locally, modify gpu.jsonc:

"ports": [8001, 8080]

Then update ui/app.js to use the new port.

For large models that don't fit on a single GPU, increase tensor parallelism:
{
"model": "meta-llama/Llama-3.1-70B-Instruct",
"vllm_args": {
"tensor_parallel_size": 2,
"gpu_memory_utilization": 0.9
}
}

And update gpu.jsonc to request multiple GPUs:

"gpu_types": [
{ "type": "A100 PCIe 80GB", "count": 2 }
]
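
To get a rough idea of how many GPUs a model needs, divide the 16-bit weight size by the per-GPU memory budget. The sketch below searches power-of-two GPU counts, a common choice because vLLM requires the model's attention head count to be divisible by the tensor parallel size. The function name and the `max_gpus` limit are assumptions for illustration, and the estimate covers weights only.

```python
def min_tensor_parallel_size(params_billion: float, gpu_vram_gb: float,
                             gpu_memory_utilization: float = 0.9,
                             bytes_per_param: float = 2.0,
                             max_gpus: int = 8) -> int:
    """Smallest power-of-two GPU count whose per-GPU weight shard fits the budget."""
    weights_gb = params_billion * bytes_per_param
    budget_gb = gpu_vram_gb * gpu_memory_utilization
    size = 1
    while size <= max_gpus:
        if weights_gb / size <= budget_gb:
            return size
        size *= 2
    raise ValueError("model does not fit within max_gpus")


print(min_tensor_parallel_size(70, 80))  # 2
```

For a 70B model on two 80GB GPUs the weights take about 70GB per GPU out of a 72GB budget, which leaves little room for the KV cache. A lower max_model_len or more GPUs may be needed in practice.
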