brevdev
diff --git a/README.md b/README.md
Lines changed: 43 additions & 1 deletion
diff --git a/tensorrt-llm/README.md b/tensorrt-llm/README.md
Lines changed: 131 additions & 0 deletions
diff --git a/tensorrt-llm/setup.sh b/tensorrt-llm/setup.sh
Lines changed: 160 additions & 0 deletions
@@ -107,6 +107,24 @@ cd marimo && bash setup.sh
 **Time:** ~2-3 minutes  
 **Port:** 8080/tcp for web access
 
+### ⚡ vLLM
+```bash
+cd vllm && bash setup.sh
+```
+**Installs:** High-performance LLM inference server (OpenAI-compatible API)  
+**Time:** ~8-10 minutes  
+**Port:** 8000/tcp for API access  
+**Note:** Set `VLLM_MODEL` to change model
+
+### 🏎️ TensorRT-LLM
+```bash
+cd tensorrt-llm && bash setup.sh
+```
+**Installs:** NVIDIA's optimized LLM inference engine (OpenAI-compatible API)  
+**Time:** 10-20+ minutes on first run (engine building and model download)  
+**Port:** 8000/tcp for API access  
+**Note:** Set `TRTLLM_MODEL` to change model
+
 ### 🛡️ earlyoom
 ```bash
 cd earlyoom && bash setup.sh
@@ -229,6 +247,24 @@ docker exec -it postgres psql -U postgres
 docker exec -it redis redis-cli
 ```
 
+**High-performance LLM serving with vLLM:**
+```bash
+cd vllm && bash setup.sh
+# Then:
+curl http://localhost:8000/v1/models
+python3 ~/vllm-examples/chat.py
+bash ~/vllm-examples/test_api.sh
+```
+
+**Optimized LLM inference with TensorRT-LLM:**
+```bash
+cd tensorrt-llm && bash setup.sh
+# Then:
+curl http://localhost:8000/v1/models
+python3 ~/trtllm-examples/chat.py
+bash ~/trtllm-examples/test_api.sh
+```
+
 **OOM protection with earlyoom:**
 ```bash
 cd earlyoom && bash setup.sh
@@ -281,6 +317,12 @@ brev-setup-scripts/
 ├── earlyoom/
 │   ├── setup.sh                 # Early OOM daemon
 │   └── README.md
+├── vllm/
+│   ├── setup.sh                 # vLLM inference server
+│   └── README.md
+├── tensorrt-llm/
+│   ├── setup.sh                 # TensorRT-LLM inference server
+│   └── README.md
 └── rapids/
     ├── setup.sh                 # RAPIDS GPU-accelerated data science
     └── README.md
@@ -298,4 +340,4 @@ Want to add a script? Keep it simple:
 
 ## License
 
-Apache 2.0
+Apache 2.0
@@ -0,0 +1,131 @@
+# TensorRT-LLM
+
+NVIDIA's optimized LLM inference engine with OpenAI-compatible API.
+
+TensorRT-LLM compiles models into optimized TensorRT engines tuned for the NVIDIA GPU they run on. The first run builds a GPU-specific execution plan, which takes several minutes; once the engine is built, inference is fast, especially with NVIDIA's pre-quantized FP8 checkpoints on Ada/Hopper GPUs. Think of it as the "compiled binary" approach to LLM serving, versus vLLM's "interpreter" approach.
+
+## What it installs
+
+- **TensorRT-LLM** - GPU-optimized inference with automatic engine building (Docker container)
+- **Starter model** - TinyLlama-1.1B-Chat (pre-downloaded)
+- **Example scripts** - `~/trtllm-examples/chat.py` and `~/trtllm-examples/test_api.sh`
+
+## ⚠️ Required Port
+
+To access from outside Brev, open:
+- **8000/tcp** (TensorRT-LLM API endpoint)
+
+## Usage
+
+```bash
+bash setup.sh
+```
+
+Takes 10-20+ minutes on first run (model download plus TensorRT engine build). Subsequent starts are much faster.
+
+**Options (environment variables):**
+```bash
+TRTLLM_MODEL=nvidia/Llama-3.1-8B-Instruct-FP8 bash setup.sh  # Different model
+TRTLLM_PORT=9000 bash setup.sh                                # Different port
+export HF_TOKEN={YOUR_HF_TOKEN}                               # Gated models (e.g., Llama)
+```
+
+## Quick Start
+
+**1. Wait for engine to build:**
+```bash
+docker logs -f trtllm
+# Wait until you see "Started server process"
+```
+
+**2. Test with curl:**
+```bash
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    "messages": [{"role": "user", "content": "Hello!"}],
+    "max_tokens": 100
+  }' | jq
+```
+
+**3. Use with Python (OpenAI SDK):**
+```python
+from openai import OpenAI
+
+client = OpenAI(base_url="http://localhost:8000/v1", api_key="tensorrt_llm")
+
+response = client.chat.completions.create(
+    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+    messages=[{"role": "user", "content": "Explain quantum computing simply."}]
+)
+print(response.choices[0].message.content)
+```
+
+## Health Check
+
+```bash
+curl http://localhost:8000/health | jq
+curl http://localhost:8000/v1/models | jq
+curl http://localhost:8000/metrics | jq    # GPU memory + batching stats
+```
+
+## Popular Models
+
+| Model | VRAM | Notes |
+|-------|------|-------|
+| `TinyLlama/TinyLlama-1.1B-Chat-v1.0` (default) | ~4GB | Fast to build, good for testing |
+| `nvidia/Llama-3.1-8B-Instruct-FP8` | ~10GB | Pre-quantized FP8, great perf |
+| `nvidia/Qwen3-8B-FP8` | ~10GB | Pre-quantized FP8 |
+| `meta-llama/Llama-3.1-8B-Instruct` | ~16GB | Requires HF_TOKEN |
+| `mistralai/Mistral-7B-Instruct-v0.3` | ~16GB | |
+| `meta-llama/Llama-3.1-70B-Instruct` | ~140GB | Multi-GPU with `--tp_size` |
+
+**Tip:** NVIDIA's [pre-quantized FP8 models](https://huggingface.co/collections/nvidia/inference-optimized-checkpoints-with-model-optimizer) offer the best performance on FP8-capable GPUs (Ada/Hopper).
+
+**Note:** Gated models (e.g., Llama) require `export HF_TOKEN={YOUR_HF_TOKEN}` before running.
+
+## Advanced Options
+
+**Tensor parallelism (multi-GPU):**
+```bash
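+# Run the container directly and pass --tp_size to trtllm-serve.
+# Llama 3.1 is gated, so HF_TOKEN must already be set in your shell.
+docker rm -f trtllm 2>/dev/null || true   # free the name if setup.sh already created it
+docker run -d --name trtllm --gpus all --ipc host \
+  -e "HF_TOKEN=${HF_TOKEN:-}" \
+  -p 8000:8000 nvcr.io/nvidia/tensorrt-llm/release:latest \
+  trtllm-serve serve meta-llama/Llama-3.1-70B-Instruct \
+  --host 0.0.0.0 --port 8000 --tp_size 4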
+```
+
+**Custom image tag:**
+```bash
+TRTLLM_IMAGE=nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc2 bash setup.sh
+```
+
+## Manage Service
+
+```bash
+docker logs -f trtllm          # Watch startup / logs
+docker restart trtllm          # Restart server
+docker stop trtllm             # Stop server
+docker start trtllm            # Start server
+```
+
+## Troubleshooting
+
+**Container exits immediately:** Check `docker logs trtllm`; the usual cause is running out of GPU memory
+
+**Out of memory:** Try a smaller model or NVIDIA's FP8 quantized variants
+
+**Engine build fails:** Ensure sufficient GPU memory — engine building needs more VRAM than inference
+
+**Connection refused:** Engine may still be building — check `docker logs -f trtllm`
+
+**Slow first start:** Normal — TensorRT compiles an optimized engine on first run. Subsequent starts reuse the cached engine
+
+## Resources
+
+- **Docs:** https://nvidia.github.io/TensorRT-LLM/
+- **GitHub:** https://github.com/NVIDIA/TensorRT-LLM
+- **Quick Start:** https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html
+- **Supported Models:** https://nvidia.github.io/TensorRT-LLM/models/supported-models.html
+- **Pre-quantized Models:** https://huggingface.co/collections/nvidia/inference-optimized-checkpoints-with-model-optimizer
@@ -0,0 +1,160 @@
+#!/bin/bash
+set -e
+
+# Detect Brev user (handles ubuntu, nvidia, shadeform, etc.)
+detect_brev_user() {
+    if [ -n "${SUDO_USER:-}" ] && [ "$SUDO_USER" != "root" ]; then
+        echo "$SUDO_USER"
+        return
+    fi
+    # Check for Brev-specific markers
+    for user_home in /home/*; do
+        username=$(basename "$user_home")
+        [ "$username" = "launchpad" ] && continue
+        if ls "$user_home"/.lifecycle-script-ls-*.log 2>/dev/null | grep -q . || \
+           [ -f "$user_home/.verb-setup.log" ] || \
+           { [ -L "$user_home/.cache" ] && [ "$(readlink "$user_home/.cache")" = "/ephemeral/cache" ]; }; then
+            echo "$username"
+            return
+        fi
+    done
+    # Fallback to common users
+    [ -d "/home/nvidia" ] && echo "nvidia" && return
+    [ -d "/home/ubuntu" ] && echo "ubuntu" && return
+    echo "ubuntu"
+}
+
+# Set USER and HOME if running as root
+if [ "$(id -u)" -eq 0 ] || [ "${USER:-}" = "root" ]; then
+    DETECTED_USER=$(detect_brev_user)
+    export USER="$DETECTED_USER"
+    export HOME="/home/$DETECTED_USER"
+fi
+
+# Configuration (override with environment variables)
+MODEL="${TRTLLM_MODEL:-TinyLlama/TinyLlama-1.1B-Chat-v1.0}"
+PORT="${TRTLLM_PORT:-8000}"
+IMAGE="${TRTLLM_IMAGE:-nvcr.io/nvidia/tensorrt-llm/release:latest}"
+
+echo "⚡ Setting up TensorRT-LLM inference server..."
+echo "User: $USER | Home: $HOME"
+echo "Model: $MODEL | Port: $PORT"
+echo "Image: $IMAGE"
+
+# Note: Brev already has Docker and NVIDIA Container Toolkit installed
+echo "Using existing Docker installation..."
+
+# Verify GPU is available
+if command -v nvidia-smi &> /dev/null; then
+    echo "GPU detected: $(nvidia-smi --query-gpu=name --format=csv,noheader | head -1)"
+else
+    echo "❌ No GPU detected - this script is meant to be run on an NVIDIA Brev GPU instance!"
+    exit 1
+fi
+
+# Create cache directory for HuggingFace models
+mkdir -p "$HOME/.cache/huggingface"
+
+# Stop existing container if running
+if docker ps -a --format '{{.Names}}' | grep -q '^trtllm$'; then
+    echo "Removing existing TensorRT-LLM container..."
+    docker stop trtllm 2>/dev/null || true
+    docker rm trtllm 2>/dev/null || true
+fi
+
+# Run TensorRT-LLM container
+echo "Starting TensorRT-LLM server with $MODEL..."
+echo "This may take 10-20+ minutes on first run (engine building + model download)..."
+docker run -d \
+    --name trtllm \
+    --restart unless-stopped \
+    --gpus all \
+    --ipc host \
+    --ulimit memlock=-1 \
+    --ulimit stack=67108864 \
+    -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
+    -p "$PORT:8000" \
+    -e "HF_TOKEN=${HF_TOKEN:-}" \
+    -e "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN:-}" \
+    "$IMAGE" \
+    trtllm-serve serve "$MODEL" --host 0.0.0.0 --port 8000
+
+# Create examples directory
+mkdir -p "$HOME/trtllm-examples"
+
+# Create example Python script
+cat > "$HOME/trtllm-examples/chat.py" << EOF
+#!/usr/bin/env python3
+"""Example: Chat with TensorRT-LLM using OpenAI SDK"""
+from openai import OpenAI
+
+client = OpenAI(base_url="http://localhost:${PORT}/v1", api_key="tensorrt_llm")
+
+response = client.chat.completions.create(
+    model="${MODEL}",
+    messages=[{"role": "user", "content": "Explain what TensorRT-LLM is in two sentences."}]
+)
+
+print(response.choices[0].message.content)
+EOF
+chmod +x "$HOME/trtllm-examples/chat.py"
+
+# Create curl example script
+cat > "$HOME/trtllm-examples/test_api.sh" << EOF
+#!/bin/bash
+# Test TensorRT-LLM API with curl
+curl -s http://localhost:${PORT}/v1/chat/completions \\
+  -H "Content-Type: application/json" \\
+  -d '{
+    "model": "${MODEL}",
+    "messages": [{"role": "user", "content": "Hello!"}],
+    "max_tokens": 100
+  }' | python3 -m json.tool
+EOF
+chmod +x "$HOME/trtllm-examples/test_api.sh"
+
+# Fix permissions if running as root
+if [ "$(id -u)" -eq 0 ]; then
+    chown -R "$USER:$USER" "$HOME/.cache/huggingface"
+    chown -R "$USER:$USER" "$HOME/trtllm-examples"
+fi
+
+# Wait for container to start
+echo "Waiting for TensorRT-LLM to initialize..."
+echo "(Engine building and model loading may take 10-20+ minutes on first run)"
+sleep 5
+
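+# Verify that the container is actually up before reporting success
+echo ""
+echo "Verifying installation..."
+docker ps --filter "name=trtllm" --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
+if [ "$(docker inspect -f '{{.State.Running}}' trtllm 2>/dev/null)" != "true" ]; then
+    echo "❌ TensorRT-LLM container is not running. Check: docker logs trtllm"
+    exit 1
+fi
+
+echo ""
+echo "✅ TensorRT-LLM container running!"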
+echo ""
+echo "⏳ The engine is still building — first run takes 10-20+ minutes."
+echo "   Subsequent starts are much faster (engine is cached)."
+echo "   Run this to watch progress:"
+echo "   docker logs -f trtllm"
+echo ""
+echo "   The API is ready when you see: \"Started server process\""
+echo ""
+echo "Model: $MODEL"
+echo "API Endpoint: http://localhost:$PORT"
+echo "OpenAI-compatible: http://localhost:$PORT/v1"
+echo ""
+echo "⚠️  To access from outside Brev, open port: ${PORT}/tcp"
+echo ""
+echo "Quick start (after engine finishes building):"
+echo "  pip install openai"
+echo "  python3 $HOME/trtllm-examples/chat.py"
+echo "  bash $HOME/trtllm-examples/test_api.sh"
+echo ""
+echo "Manage:"
+echo "  docker logs -f trtllm          # Watch startup progress"
+echo "  docker restart trtllm          # Restart server"
+echo "  docker stop trtllm             # Stop server"
+echo ""
+echo "Run with a different model:"
+echo "  export HF_TOKEN={YOUR_HF_TOKEN}"
+echo "  TRTLLM_MODEL=nvidia/Llama-3.1-8B-Instruct-FP8 bash setup.sh"
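
Review note on readiness: `setup.sh` sleeps for five seconds and then prints its summary, while the README asks the reader to watch `docker logs -f trtllm` for "Started server process". A small poller against the `/health` endpoint listed in the README's Health Check section could automate that wait. The following is a minimal sketch, not part of this PR. The helper names `http_ok` and `wait_until_ready`, the 30-minute timeout, and the 10-second interval are assumptions chosen for illustration, and it assumes `/health` answers HTTP 200 once the server is ready.

```python
import time
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if a GET to `url` answers with HTTP 200, else False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError and HTTPError subclass OSError, as do refused or
        # reset connections while the server is still starting.
        return False


def wait_until_ready(url, timeout_s=1800.0, interval_s=10.0,
                     probe=http_ok, sleep=time.sleep):
    """Poll `url` until `probe` succeeds or `timeout_s` seconds elapse.

    `probe` and `sleep` are injectable (hypothetical parameters) so the
    loop can be tested without a running server.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Calling `wait_until_ready("http://localhost:8000/health")` after `bash setup.sh` would block until the server answers or the timeout expires; change the port if `TRTLLM_PORT` was overridden.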