vLLM Models - GPU CLI Template

Run open-source LLMs on remote GPUs with vLLM's high-performance inference engine. Includes a lightweight Web UI and an OpenAI-compatible API endpoint.

Quick Start

cd templates/vllm-models

# (Optional) Edit models.json to configure which model to run

# Start the pod
gpu run

After startup, two endpoints are available on your local machine:

Endpoint	Port	Description
Web UI	8080	Chat interface at http://localhost:8080
vLLM API	8000	OpenAI-compatible API at http://localhost:8000

Features

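High Performance: vLLM typically delivers 2-4x higher throughput than plain HuggingFace Transformers inference; the exact gain depends on the model and workload
OpenAI-Compatible API: Works with the OpenAI SDK and other OpenAI clients by changing only the base URL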
Streaming: Real-time response streaming in both UI and API
KV Cache Optimization: Efficient memory management for long contexts
Continuous Batching: Handles multiple concurrent requests efficiently

vLLM vs Ollama

Aspect	vLLM	Ollama
Performance	Higher throughput (PagedAttention)	Good, optimized for single-user
Model Loading	Single model at startup	Dynamic pull/switch
Memory	More VRAM (KV cache)	Less VRAM overhead
API	OpenAI-compatible only	Native + OpenAI + Anthropic
Best For	Production workloads, high throughput	Development, model experimentation

Configuring Models

Edit models.json to specify which model to run:

{
  "model": "Qwen/Qwen2.5-14B-Instruct",
  "vllm_args": {
    "gpu_memory_utilization": 0.9,
    "max_model_len": 32768,
    "tensor_parallel_size": 1
  }
}

Configuration Options

Field	Description	Default
`model`	HuggingFace model ID	Required
`gpu_memory_utilization`	Fraction of GPU memory to use	0.9
`max_model_len`	Maximum context length	32768
`tensor_parallel_size`	Number of GPUs for tensor parallelism	1
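
As a rough illustration of how these fields relate to vLLM itself, the sketch below turns a models.json-style dictionary into a `vllm serve` command line. It assumes each key in vllm_args maps to the vLLM flag of the same name with underscores replaced by hyphens. The helper name build_vllm_command is hypothetical, and the template's own startup.sh may assemble the command differently.

```python
import json
import shlex

# Defaults taken from the Configuration Options table above.
DEFAULT_VLLM_ARGS = {
    "gpu_memory_utilization": 0.9,
    "max_model_len": 32768,
    "tensor_parallel_size": 1,
}


def build_vllm_command(config: dict) -> list[str]:
    """Turn a models.json-style dict into a `vllm serve` argument list.

    Hypothetical helper: assumes each vllm_args key maps to the vLLM flag
    of the same name with underscores replaced by hyphens.
    """
    if "model" not in config:
        raise ValueError("models.json must set 'model'")
    # Values from models.json override the defaults.
    args = {**DEFAULT_VLLM_ARGS, **config.get("vllm_args", {})}
    command = ["vllm", "serve", config["model"]]
    for key, value in args.items():
        command += ["--" + key.replace("_", "-"), str(value)]
    return command


example = json.loads("""
{
  "model": "Qwen/Qwen2.5-14B-Instruct",
  "vllm_args": {"max_model_len": 16384}
}
""")
print(shlex.join(build_vllm_command(example)))
```

Fields left out of vllm_args fall back to the defaults in the table, so a minimal models.json only needs the model field.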

Recommended Models

Model	Size	VRAM	Auth	Best For
`Qwen/Qwen2.5-7B-Instruct`	14GB	24GB	No	Default - good quality, no auth
`Qwen/Qwen2.5-1.5B-Instruct`	3GB	8GB	No	Fast, small GPU
`Qwen/Qwen2.5-Coder-7B-Instruct`	14GB	24GB	No	Code-focused
`mistralai/Mistral-7B-Instruct-v0.3`	14GB	24GB	No	High quality
`meta-llama/Llama-3.2-3B-Instruct`	6GB	12GB	Yes	Small, fast (requires HF token)
`meta-llama/Llama-3.1-8B-Instruct`	16GB	24GB	Yes	High quality (requires HF token)
`zai-org/GLM-4.7-Flash`	19GB	24GB	No	Excellent quality, 200K context

Gated Models (Llama, etc.)

Some models require a HuggingFace token:

1. Create an account at https://huggingface.co
2. Accept the model license (e.g., https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)
3. Create a token at https://huggingface.co/settings/tokens
4. Set the token in GPU CLI:

gpu auth add hf

API Usage

List Models

curl http://localhost:8000/v1/models
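
The same request can be made from Python with only the standard library. This is a minimal sketch that assumes the response follows the OpenAI list format: a JSON object whose data array holds entries with an id field. The function names are illustrative.

```python
import json
import urllib.request


def parse_model_ids(body: bytes) -> list[str]:
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    payload = json.loads(body)
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url: str = "http://localhost:8000") -> list[str]:
    """Fetch the models served at base_url (needs the pod to be running)."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as resp:
        return parse_model_ids(resp.read())


# Parsing a sample body offline; call list_models() against a running pod.
sample = b'{"object": "list", "data": [{"id": "Qwen/Qwen2.5-14B-Instruct"}]}'
print(parse_model_ids(sample))
```

Because vLLM serves a single model, the returned ID is the value to pass as model in the requests that follow.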

Chat Completion

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-14B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Streaming

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-14B-Instruct",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs"}],
    "stream": true
  }'
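
With "stream": true the server replies with server-sent events: each event line starts with data: and carries a JSON chunk, and the stream ends with data: [DONE]. The sketch below reads that stream with the standard library. The helper names are illustrative, and it assumes the chunk layout of the OpenAI chat completions API (choices[0].delta.content).

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_deltas(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from an OpenAI-style SSE chat stream."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return  # end-of-stream marker
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text


def stream_chat(prompt: str, model: str,
                base_url: str = "http://localhost:8000") -> Iterator[str]:
    """POST a streaming chat request and yield the reply as it arrives."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    # Iterating over the response yields one line of the stream at a time.
    with urllib.request.urlopen(request) as response:
        yield from iter_deltas(response)
```

Against a running pod, the fragments can be printed as they arrive with: for piece in stream_chat("Write a haiku about GPUs", "Qwen/Qwen2.5-14B-Instruct"): print(piece, end="", flush=True)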

Using with OpenAI SDK (Python)

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="vllm"  # Required but ignored
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-14B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

Using with LangChain

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="vllm",
    model="Qwen/Qwen2.5-14B-Instruct"
)

response = llm.invoke("Hello!")
print(response.content)

Serverless Deployment

Deploy as a serverless API endpoint that auto-scales and scales to zero:

gpu serverless deploy --gpu "NVIDIA GeForce RTX 4090"

This creates a RunPod Serverless endpoint using the official vLLM worker template. The endpoint exposes the same OpenAI-compatible API you've been using locally.

Configuration is read from the serverless section in gpu.jsonc (scaling, volume, env vars). CLI flags override config values when provided.

Invoke the endpoint

curl -X POST "https://api.runpod.ai/v2/<endpoint-id>/runsync" \
  -H "Authorization: Bearer $RUNPOD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "openai_route": "/v1/chat/completions",
      "openai_input": {
        "model": "Qwen/Qwen2.5-14B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 100
      }
    }
  }'
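
The curl call above can be mirrored in Python with the standard library. The sketch below only reproduces the URL, headers, and payload wrapper from that example; the helper names are hypothetical, and the structure of the returned JSON is not covered here.

```python
import json
import os
import urllib.request


def build_runsync_request(endpoint_id: str, api_key: str,
                          openai_input: dict,
                          route: str = "/v1/chat/completions",
                          ) -> urllib.request.Request:
    """Build the POST request shown in the curl example above."""
    payload = {"input": {"openai_route": route, "openai_input": openai_input}}
    return urllib.request.Request(
        f"https://api.runpod.ai/v2/{endpoint_id}/runsync",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def invoke(endpoint_id: str, openai_input: dict) -> dict:
    """Send the request; reads the API key from RUNPOD_API_KEY."""
    request = build_runsync_request(
        endpoint_id, os.environ["RUNPOD_API_KEY"], openai_input
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())
```

Keeping request construction separate from sending makes the payload easy to inspect before any network call is made.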

Manage the endpoint

gpu serverless status <endpoint-id>          # Check workers, queue, scaling
gpu serverless list                          # List all endpoints
gpu serverless warm <endpoint-id>            # Pre-warm a worker
gpu serverless delete <endpoint-id> --force  # Delete endpoint

Changing Models

Unlike Ollama, vLLM loads a single model at startup. To change models:

1. Edit models.json with the new model ID
2. Restart the pod:

gpu restart
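
If you switch models often, the edit step can be scripted. The following is a small sketch, assuming models.json is plain JSON without comments; set_model is a hypothetical helper, not part of the template.

```python
import json
from pathlib import Path


def set_model(config_path: str, model_id: str) -> dict:
    """Replace the 'model' field in models.json, keeping vllm_args as-is."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["model"] = model_id
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For example, set_model("models.json", "Qwen/Qwen2.5-7B-Instruct") followed by gpu restart loads the new model.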

GPU Selection

The template is configured to use GPUs in this priority order:

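1. RTX 4090 (24GB) - Great for 7B models; 14B models need quantization to fit
2. A40 (48GB) - Good for models up to about 20B at FP16, or about 30B quantized
3. L40S (48GB) - Newer, good availability
4. A100 80GB - For larger models; 70B needs quantization or multiple GPUs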

Edit gpu.jsonc to change GPU preferences:

"gpu_types": [
  { "type": "A100 PCIe 80GB" },
  { "type": "A40" }
],
"min_vram": 48

Configuration Reference

gpu.jsonc

Field	Description
`ports`	Ports to forward (default: `[8000, 8080]`)
`keep_alive_minutes`	Idle time before pod stops (default: `20`)
`gpu_types`	Preferred GPUs in priority order
`min_vram`	Minimum VRAM in GB (default: `24`)
`inputs`	Optional secrets like `hf_token`

models.json

Field	Description
`model`	HuggingFace model ID to load
`vllm_args`	vLLM server configuration

Troubleshooting

Model fails to load

Check whether the model fits in GPU memory. Approximate VRAM for FP16 weights alone, before the KV cache:

7B models: ~14GB VRAM
13B models: ~26GB VRAM
70B models: ~140GB VRAM (needs multi-GPU)

Try reducing max_model_len or gpu_memory_utilization in models.json.
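
The figures above follow from a simple rule of thumb: FP16 weights take about 2 bytes per parameter. The sketch below applies that rule; it is a rough estimate only, the function names are illustrative, and it ignores the KV cache and activations, which need room on top of the weights.

```python
def weight_memory_gb(params_billions: float,
                     bytes_per_param: float = 2.0) -> float:
    """Approximate memory for model weights alone, in GB.

    2 bytes per parameter corresponds to FP16/BF16 weights.
    """
    return params_billions * bytes_per_param


def fits(params_billions: float, vram_gb: float,
         gpu_memory_utilization: float = 0.9) -> bool:
    """Rough check: do the weights fit inside vLLM's share of the GPU?"""
    return weight_memory_gb(params_billions) < vram_gb * gpu_memory_utilization


for size in (7, 13, 70):
    print(f"{size}B: ~{weight_memory_gb(size):.0f}GB weights, "
          f"fits on 24GB: {fits(size, 24)}")
```

A model that passes this check can still fail to start if max_model_len asks for more KV cache than the remaining memory allows.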

"CUDA out of memory"

Lower the GPU memory utilization:

{
  "vllm_args": {
    "gpu_memory_utilization": 0.8
  }
}

Slow startup

Large models can take 30-120 seconds to load. This is normal - vLLM loads the full model into GPU memory at startup.
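
Scripts that call the API right after gpu run can poll until the server answers instead of sleeping for a fixed time. The following is a minimal sketch using the standard library; the function names and the timeout values are assumptions, and it treats an HTTP 200 from /v1/models as the signal that the model has finished loading.

```python
import time
import urllib.request
from typing import Callable


def server_ready(url: str) -> bool:
    """Return True once the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except OSError:  # connection refused, timeout, or HTTP error status
        return False


def wait_until_ready(url: str = "http://localhost:8000/v1/models",
                     timeout_s: float = 300.0,
                     interval_s: float = 5.0,
                     probe: Callable[[str], bool] = server_ready) -> bool:
    """Poll until the server responds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() + interval_s > deadline:
            return False  # the next attempt would land past the deadline
        time.sleep(interval_s)
```

The probe parameter exists so the loop can be exercised without a live server; in normal use, call wait_until_ready() with its defaults.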

Gated model access denied

Verify you've accepted the model license on HuggingFace
Check your token is set: gpu secret list
Ensure token has read access

Port already in use

If port 8000 is already in use locally, modify gpu.jsonc:

"ports": [8001, 8080]

Then update ui/app.js to use the new port.

Multi-GPU Support

For large models that don't fit on a single GPU, increase tensor parallelism:

{
  "model": "meta-llama/Llama-3.1-70B-Instruct",
  "vllm_args": {
    "tensor_parallel_size": 2,
    "gpu_memory_utilization": 0.9
  }
}

And update gpu.jsonc to request multiple GPUs:

"gpu_types": [
  { "type": "A100 PCIe 80GB", "count": 2 }
]
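
To pick a starting value for tensor_parallel_size, one rough approach is to divide the weight size by the usable VRAM per GPU and round up. The sketch below does that. It is a lower bound under stated assumptions, not vLLM's own logic: it ignores the KV cache and activations, and it rounds up to a power of two on the assumption that the model's attention heads divide evenly across that many GPUs.

```python
import math


def min_tensor_parallel_size(weights_gb: float, vram_per_gpu_gb: float,
                             gpu_memory_utilization: float = 0.9) -> int:
    """Smallest power-of-two GPU count whose usable VRAM covers the weights."""
    usable_per_gpu = vram_per_gpu_gb * gpu_memory_utilization
    needed = math.ceil(weights_gb / usable_per_gpu)
    size = 1
    while size < needed:
        size *= 2
    return size


# ~140GB of FP16 weights for a 70B model on 80GB GPUs.
print(min_tensor_parallel_size(140, 80))
```

For the 70B example this gives 2, matching the configuration above, though long contexts may call for more GPUs or a smaller max_model_len.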
