Commit b1c6fb6

adding vllm and trt llm inference examples
1 parent 1b0b898 commit b1c6fb6

5 files changed

Lines changed: 591 additions & 1 deletion

README.md

Lines changed: 43 additions & 1 deletion
@@ -107,6 +107,24 @@ cd marimo && bash setup.sh
107107
**Time:** ~2-3 minutes
108108
**Port:** 8080/tcp for web access
109109

110+
### ⚡ vLLM
111+
```bash
112+
cd vllm && bash setup.sh
113+
```
114+
**Installs:** High-performance LLM inference server (OpenAI-compatible API)
115+
**Time:** ~8-10 minutes
116+
**Port:** 8000/tcp for API access
117+
**Note:** Set `VLLM_MODEL` to change the model
118+
119+
### 🏎️ TensorRT-LLM
120+
```bash
121+
cd tensorrt-llm && bash setup.sh
122+
```
123+
**Installs:** NVIDIA's optimized LLM inference engine (OpenAI-compatible API)
124+
**Time:** ~8-10 minutes (engine building on first run)
125+
**Port:** 8000/tcp for API access
126+
**Note:** Set `TRTLLM_MODEL` to change the model
127+
110128
### 🛡️ earlyoom
111129
```bash
112130
cd earlyoom && bash setup.sh
@@ -229,6 +247,24 @@ docker exec -it postgres psql -U postgres
229247
docker exec -it redis redis-cli
230248
```
231249

250+
**High-performance LLM serving with vLLM:**
251+
```bash
252+
cd vllm && bash setup.sh
253+
# Then:
254+
curl http://localhost:8000/v1/models
255+
python3 ~/vllm-examples/chat.py
256+
bash ~/vllm-examples/test_api.sh
257+
```
258+
259+
**Optimized LLM inference with TensorRT-LLM:**
260+
```bash
261+
cd tensorrt-llm && bash setup.sh
262+
# Then:
263+
curl http://localhost:8000/v1/models
264+
python3 ~/trtllm-examples/chat.py
265+
bash ~/trtllm-examples/test_api.sh
266+
```
267+
232268
**OOM protection with earlyoom:**
233269
```bash
234270
cd earlyoom && bash setup.sh
@@ -281,6 +317,12 @@ brev-setup-scripts/
281317
├── earlyoom/
282318
│ ├── setup.sh # Early OOM daemon
283319
│ └── README.md
320+
├── vllm/
321+
│ ├── setup.sh # vLLM inference server
322+
│ └── README.md
323+
├── tensorrt-llm/
324+
│ ├── setup.sh # TensorRT-LLM inference server
325+
│ └── README.md
284326
└── rapids/
285327
├── setup.sh # RAPIDS GPU-accelerated data science
286328
└── README.md
@@ -298,4 +340,4 @@ Want to add a script? Keep it simple:
298340

299341
## License
300342

301-
Apache 2.0
343+
Apache 2.0

tensorrt-llm/README.md

Lines changed: 131 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,131 @@
1+
# TensorRT-LLM
2+
3+
NVIDIA's optimized LLM inference engine with OpenAI-compatible API.
4+
5+
TensorRT-LLM compiles models into highly optimized TensorRT engines that squeeze maximum throughput out of NVIDIA GPUs. It builds a GPU-specific execution plan on first run (which takes a while), but once built, inference is extremely fast — especially with NVIDIA's pre-quantized FP8 checkpoints on Ada/Hopper GPUs. Think of it as the "compiled binary" approach to LLM serving vs. vLLM's "interpreter" approach.
6+
7+
## What it installs
8+
9+
- **TensorRT-LLM** - GPU-optimized inference with automatic engine building (Docker container)
10+
- **Starter model** - TinyLlama-1.1B-Chat (pre-downloaded)
11+
- **Example scripts** - `~/trtllm-examples/chat.py` and `~/trtllm-examples/test_api.sh`
12+
13+
## ⚠️ Required Port
14+
15+
To access from outside Brev, open:
16+
- **8000/tcp** (TensorRT-LLM API endpoint)
17+
18+
## Usage
19+
20+
```bash
21+
bash setup.sh
22+
```
23+
24+
Takes ~8-10 minutes on first run (builds TensorRT engine); `setup.sh` itself warns that engine building plus model download can take 10-20+ minutes. Subsequent starts are much faster.
25+
26+
**Options (environment variables):**
27+
```bash
28+
TRTLLM_MODEL=nvidia/Llama-3.1-8B-Instruct-FP8 bash setup.sh # Different model
29+
TRTLLM_PORT=9000 bash setup.sh # Different port
30+
export HF_TOKEN={YOUR_HF_TOKEN} # Gated models (e.g., Llama)
31+
```
32+
33+
## Quick Start
34+
35+
**1. Wait for engine to build:**
36+
```bash
37+
docker logs -f trtllm
38+
# Wait until you see "Started server process"
39+
```
40+
41+
**2. Test with curl:**
42+
```bash
43+
curl http://localhost:8000/v1/chat/completions \
44+
-H "Content-Type: application/json" \
45+
-d '{
46+
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
47+
"messages": [{"role": "user", "content": "Hello!"}],
48+
"max_tokens": 100
49+
}' | jq
50+
```
51+
52+
**3. Use with Python (OpenAI SDK):**
53+
```python
54+
from openai import OpenAI
55+
56+
client = OpenAI(base_url="http://localhost:8000/v1", api_key="tensorrt_llm")
57+
58+
response = client.chat.completions.create(
59+
model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
60+
messages=[{"role": "user", "content": "Explain quantum computing simply."}]
61+
)
62+
print(response.choices[0].message.content)
63+
```
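If the `openai` package is not installed, the same request can be sent with the Python standard library alone. The following is a minimal sketch that mirrors the curl example above. The helper names `build_chat_request` and `chat` are illustrative and not part of TensorRT-LLM, and the response is assumed to follow the OpenAI chat completions shape.

```python
import json
import urllib.request


def build_chat_request(prompt, model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                       base_url="http://localhost:8000/v1", max_tokens=100):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text.

    Assumes the server answers in the OpenAI chat completions format.
    """
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Once the server is up, `print(chat("Hello!"))` should print the model's reply. Pass `base_url` if `TRTLLM_PORT` was changed.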
64+
65+
## Health Check
66+
67+
```bash
68+
curl http://localhost:8000/health | jq
69+
curl http://localhost:8000/v1/models | jq
70+
curl http://localhost:8000/metrics | jq # GPU memory + batching stats
71+
```
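Because the engine build can take a long time, scripts that depend on the API may want to wait for it rather than fail on a refused connection. Here is a minimal polling sketch using only the standard library. It assumes `/health` answers with HTTP 200 once the server is ready, and the function names and default timeouts are illustrative choices, not part of TensorRT-LLM.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused or an HTTP error: the server is not ready yet.
        return False


def wait_until_ready(url="http://localhost:8000/health", timeout_s=1800,
                     interval_s=10, probe=http_ok, sleep=time.sleep):
    """Poll `url` until `probe` succeeds; return False once `timeout_s` has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Calling `wait_until_ready()` before the first request blocks for up to 30 minutes by default. The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.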
72+
73+
## Popular Models
74+
75+
| Model | VRAM | Notes |
76+
|-------|------|-------|
77+
| `TinyLlama/TinyLlama-1.1B-Chat-v1.0` (default) | ~4GB | Fast to build, good for testing |
78+
| `nvidia/Llama-3.1-8B-Instruct-FP8` | ~10GB | Pre-quantized FP8, great perf |
79+
| `nvidia/Qwen3-8B-FP8` | ~10GB | Pre-quantized FP8 |
80+
| `meta-llama/Llama-3.1-8B-Instruct` | ~16GB | Requires HF_TOKEN |
81+
| `mistralai/Mistral-7B-Instruct-v0.3` | ~16GB | |
82+
| `meta-llama/Llama-3.1-70B-Instruct` | ~140GB | Multi-GPU with `--tp_size` |
83+
84+
**Tip:** NVIDIA's [pre-quantized FP8 models](https://huggingface.co/collections/nvidia/inference-optimized-checkpoints-with-model-optimizer) offer the best performance on FP8-capable GPUs (Ada/Hopper).
85+
86+
**Note:** Gated models (e.g., Llama) require `export HF_TOKEN={YOUR_HF_TOKEN}` before running.
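The VRAM column is broadly consistent with a weights-only estimate: parameter count times bytes per parameter. The sketch below is a rough rule of thumb, not a TensorRT-LLM API. The helper `weight_memory_gb` is hypothetical, it ignores KV cache, activations, and runtime overhead (which is why the table lists somewhat more than the raw weight size for the smaller models), and it assumes tensor parallelism splits weights evenly across GPUs.

```python
# Bytes stored per parameter for common weight formats.
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "fp8": 1.0}


def weight_memory_gb(params_billion, dtype="fp16", tp_size=1):
    """Approximate per-GPU memory for model weights only, in GB (1 GB = 1e9 bytes).

    params_billion: parameter count in billions (e.g. 8 for an 8B model).
    tp_size: number of GPUs the weights are split across (assumed even split).
    """
    if tp_size < 1:
        raise ValueError("tp_size must be >= 1")
    return params_billion * BYTES_PER_PARAM[dtype] / tp_size


print(weight_memory_gb(8, "fp16"))              # 8B model in FP16
print(weight_memory_gb(8, "fp8"))               # 8B model in FP8
print(weight_memory_gb(70, "fp16", tp_size=4))  # 70B model split across 4 GPUs
```

Under these assumptions an 8B FP16 model needs about 16 GB for weights, an 8B FP8 model about 8 GB, and a 70B FP16 model about 35 GB per GPU when split four ways. Leave headroom beyond these figures for the KV cache and engine building.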
87+
88+
## Advanced Options
89+
90+
**Tensor parallelism (multi-GPU):**
91+
```bash
92+
# Start the server with --tp_size set directly (remove any existing container first: docker rm -f trtllm)
93+
docker run -d --name trtllm --gpus all --ipc host \
94+
-p 8000:8000 nvcr.io/nvidia/tensorrt-llm/release:latest \
95+
trtllm-serve serve meta-llama/Llama-3.1-70B-Instruct \
96+
--host 0.0.0.0 --port 8000 --tp_size 4
97+
```
98+
99+
**Custom image tag:**
100+
```bash
101+
TRTLLM_IMAGE=nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc2 bash setup.sh
102+
```
103+
104+
## Manage Service
105+
106+
```bash
107+
docker logs -f trtllm # Watch startup / logs
108+
docker restart trtllm # Restart server
109+
docker stop trtllm # Stop server
110+
docker start trtllm # Start server
111+
```
112+
113+
## Troubleshooting
114+
115+
**Container exits immediately:** Check `docker logs trtllm`; the usual cause is running out of GPU memory
116+
117+
**Out of memory:** Try a smaller model or NVIDIA's FP8 quantized variants
118+
119+
**Engine build fails:** Ensure sufficient GPU memory — engine building needs more VRAM than inference
120+
121+
**Connection refused:** Engine may still be building — check `docker logs -f trtllm`
122+
123+
**Slow first start:** Normal — TensorRT compiles an optimized engine on first run. Subsequent starts reuse the cached engine
124+
125+
## Resources
126+
127+
- **Docs:** https://nvidia.github.io/TensorRT-LLM/
128+
- **GitHub:** https://github.com/NVIDIA/TensorRT-LLM
129+
- **Quick Start:** https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html
130+
- **Supported Models:** https://nvidia.github.io/TensorRT-LLM/models/supported-models.html
131+
- **Pre-quantized Models:** https://huggingface.co/collections/nvidia/inference-optimized-checkpoints-with-model-optimizer

tensorrt-llm/setup.sh

Lines changed: 160 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,160 @@
1+
#!/bin/bash
2+
set -e
3+
4+
# Detect Brev user (handles ubuntu, nvidia, shadeform, etc.)
5+
detect_brev_user() {
6+
if [ -n "${SUDO_USER:-}" ] && [ "$SUDO_USER" != "root" ]; then
7+
echo "$SUDO_USER"
8+
return
9+
fi
10+
# Check for Brev-specific markers
11+
for user_home in /home/*; do
12+
username=$(basename "$user_home")
13+
[ "$username" = "launchpad" ] && continue
14+
if ls "$user_home"/.lifecycle-script-ls-*.log 2>/dev/null | grep -q . || \
15+
[ -f "$user_home/.verb-setup.log" ] || \
16+
{ [ -L "$user_home/.cache" ] && [ "$(readlink "$user_home/.cache")" = "/ephemeral/cache" ]; }; then
17+
echo "$username"
18+
return
19+
fi
20+
done
21+
# Fallback to common users
22+
[ -d "/home/nvidia" ] && echo "nvidia" && return
23+
[ -d "/home/ubuntu" ] && echo "ubuntu" && return
24+
echo "ubuntu"
25+
}
26+
27+
# Set USER and HOME if running as root
28+
if [ "$(id -u)" -eq 0 ] || [ "${USER:-}" = "root" ]; then
29+
DETECTED_USER=$(detect_brev_user)
30+
export USER="$DETECTED_USER"
31+
export HOME="/home/$DETECTED_USER"
32+
fi
33+
34+
# Configuration (override with environment variables)
35+
MODEL="${TRTLLM_MODEL:-TinyLlama/TinyLlama-1.1B-Chat-v1.0}"
36+
PORT="${TRTLLM_PORT:-8000}"
37+
IMAGE="${TRTLLM_IMAGE:-nvcr.io/nvidia/tensorrt-llm/release:latest}"
38+
39+
echo "⚡ Setting up TensorRT-LLM inference server..."
40+
echo "User: $USER | Home: $HOME"
41+
echo "Model: $MODEL | Port: $PORT"
42+
echo "Image: $IMAGE"
43+
44+
# Note: Brev already has Docker and NVIDIA Container Toolkit installed
45+
echo "Using existing Docker installation..."
46+
47+
# Verify GPU is available
48+
if command -v nvidia-smi &> /dev/null; then
49+
echo "GPU detected: $(nvidia-smi --query-gpu=name --format=csv,noheader | head -1)"
50+
else
51+
echo "❌ No GPU detected - This script is meant to be run on an NVIDIA Brev GPU instance!"
52+
exit 1
53+
fi
54+
55+
# Create cache directory for HuggingFace models
56+
mkdir -p "$HOME/.cache/huggingface"
57+
58+
# Stop existing container if running
59+
if docker ps -a --format '{{.Names}}' | grep -q '^trtllm$'; then
60+
echo "Removing existing TensorRT-LLM container..."
61+
docker stop trtllm 2>/dev/null || true
62+
docker rm trtllm 2>/dev/null || true
63+
fi
64+
65+
# Run TensorRT-LLM container
66+
echo "Starting TensorRT-LLM server with $MODEL..."
67+
echo "This may take 10-20+ minutes on first run (engine building + model download)..."
68+
docker run -d \
69+
--name trtllm \
70+
--restart unless-stopped \
71+
--gpus all \
72+
--ipc host \
73+
--ulimit memlock=-1 \
74+
--ulimit stack=67108864 \
75+
-v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
76+
-p "$PORT:8000" \
77+
-e "HF_TOKEN=${HF_TOKEN:-}" \
78+
-e "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN:-}" \
79+
"$IMAGE" \
80+
trtllm-serve serve "$MODEL" --host 0.0.0.0 --port 8000
81+
82+
# Create examples directory
83+
mkdir -p "$HOME/trtllm-examples"
84+
85+
# Create example Python script
86+
cat > "$HOME/trtllm-examples/chat.py" << EOF
87+
#!/usr/bin/env python3
88+
"""Example: Chat with TensorRT-LLM using OpenAI SDK"""
89+
from openai import OpenAI
90+
91+
client = OpenAI(base_url="http://localhost:${PORT}/v1", api_key="tensorrt_llm")
92+
93+
response = client.chat.completions.create(
94+
model="${MODEL}",
95+
messages=[{"role": "user", "content": "Explain what TensorRT-LLM is in two sentences."}]
96+
)
97+
98+
print(response.choices[0].message.content)
99+
EOF
100+
chmod +x "$HOME/trtllm-examples/chat.py"
101+
102+
# Create curl example script
103+
cat > "$HOME/trtllm-examples/test_api.sh" << EOF
104+
#!/bin/bash
105+
# Test TensorRT-LLM API with curl
106+
curl -s http://localhost:${PORT}/v1/chat/completions \\
107+
-H "Content-Type: application/json" \\
108+
-d '{
109+
"model": "${MODEL}",
110+
"messages": [{"role": "user", "content": "Hello!"}],
111+
"max_tokens": 100
112+
}' | python3 -m json.tool
113+
EOF
114+
chmod +x "$HOME/trtllm-examples/test_api.sh"
115+
116+
# Fix permissions if running as root
117+
if [ "$(id -u)" -eq 0 ]; then
118+
chown -R "$USER:$USER" "$HOME/.cache/huggingface"
119+
chown -R "$USER:$USER" "$HOME/trtllm-examples"
120+
fi
121+
122+
# Wait for container to start
123+
echo "Waiting for TensorRT-LLM to initialize..."
124+
echo "(Engine building and model loading may take 10-20+ minutes on first run)"
125+
sleep 5
126+
127+
# Verify
128+
echo ""
129+
echo "Verifying installation..."
130+
docker ps --filter "name=trtllm" --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
131+
132+
echo ""
133+
echo "✅ TensorRT-LLM container started!"
134+
echo ""
135+
echo "⏳ The engine is still building — first run takes 10-20+ minutes."
136+
echo " Subsequent starts are much faster (engine is cached)."
137+
echo " Run this to watch progress:"
138+
echo " docker logs -f trtllm"
139+
echo ""
140+
echo " The API is ready when you see: \"Started server process\""
141+
echo ""
142+
echo "Model: $MODEL"
143+
echo "API Endpoint: http://localhost:$PORT"
144+
echo "OpenAI-compatible: http://localhost:$PORT/v1"
145+
echo ""
146+
echo "⚠️ To access from outside Brev, open port: ${PORT}/tcp"
147+
echo ""
148+
echo "Quick start (after engine finishes building):"
149+
echo " pip install openai"
150+
echo " python3 $HOME/trtllm-examples/chat.py"
151+
echo " bash $HOME/trtllm-examples/test_api.sh"
152+
echo ""
153+
echo "Manage:"
154+
echo " docker logs -f trtllm # Watch startup progress"
155+
echo " docker restart trtllm # Restart server"
156+
echo " docker stop trtllm # Stop server"
157+
echo ""
158+
echo "Run with a different model:"
159+
echo " export HF_TOKEN={YOUR_HF_TOKEN}"
160+
echo " TRTLLM_MODEL=nvidia/Llama-3.1-8B-Instruct-FP8 bash setup.sh"
