A modern web interface for managing and interacting with vLLM servers. It supports both GPU and CPU modes, with special optimizations for macOS on Apple Silicon.
Built-in LLM-Compressor integration for quantizing and compressing models directly from the UI!
Integrated GuideLLM for comprehensive performance benchmarking and analysis. Run load tests and get detailed metrics on throughput, latency, and token generation performance!
vllm-playground/
├── app.py                           # Main FastAPI backend application
├── run.py                           # Backend server launcher
├── index.html                       # Main HTML interface
├── requirements.txt                 # Python dependencies
├── env.example                      # Example environment variables
├── LICENSE                          # MIT License
├── README.md                        # This file
│
├── containers/                      # Container definitions 🐳
│   ├── Containerfile.cuda           # CUDA/GPU container (RHEL UBI9)
│   ├── Containerfile.vllm           # vLLM official base image
│   ├── Containerfile.mac            # macOS/CPU container
│   └── Containerfile.rhel9          # RHEL 9 container
│
├── deployments/                     # Kubernetes/OpenShift deployments ☸️
│   ├── kubernetes-deployment.yaml   # Kubernetes manifests
│   ├── openshift-deployment.yaml    # OpenShift manifests
│   └── deploy-to-openshift.sh       # OpenShift deployment script
│
├── static/                          # Frontend assets
│   ├── css/
│   │   └── style.css                # Main stylesheet
│   └── js/
│       └── app.js                   # Frontend JavaScript
│
├── scripts/                         # Utility scripts
│   ├── run_cpu.sh                   # Start vLLM in CPU mode (macOS compatible)
│   ├── start.sh                     # General start script
│   ├── install.sh                   # Installation script
│   └── verify_setup.py              # Setup verification
│
├── config/                          # Configuration files
│   ├── vllm_cpu.env                 # CPU mode environment variables
│   └── example_configs.json         # Example configurations
│
├── assets/                          # Images and assets
│   ├── vllm-playground.png          # WebUI screenshot
│   ├── llmcompressor.png            # Model compression UI screenshot
│   ├── guidellm.png                 # GuideLLM benchmark results screenshot
│   ├── vllm.png                     # vLLM logo
│   └── vllm.jpeg                    # vLLM logo (alternate)
│
└── docs/                            # Documentation
    ├── QUICKSTART.md                # Quick start guide
    ├── MACOS_CPU_GUIDE.md           # macOS CPU setup guide
    ├── CPU_MODELS_QUICKSTART.md     # CPU-optimized models guide
    ├── GATED_MODELS_GUIDE.md        # Guide for accessing Llama, Gemma, etc.
    ├── CHAT_TEMPLATES.md            # Model-specific chat templates
    ├── TROUBLESHOOTING.md           # Common issues and solutions
    ├── FEATURES.md                  # Feature documentation
    ├── PERFORMANCE_METRICS.md       # Performance metrics
    └── QUICK_REFERENCE.md           # Command reference
For macOS users, the container provides the easiest setup with everything pre-configured:
# 1. Build the container (one-time, ~15-30 min)
./scripts/build_container.sh
# 2. Run the container
./scripts/run_container.sh
# 3. Open http://localhost:7860
✨ Benefits:
- No complex installation
- Pre-built vLLM optimized for CPU
- Isolated environment
- Works out of the box
See CONTAINER-QUICKSTART.md for detailed instructions.
For local development or if you prefer not to use containers:
# For macOS/CPU mode
pip install vllm
pip install -r requirements.txt
python run.py
Then open http://localhost:7860 in your browser.
Option A: Using the WebUI
- Select CPU or GPU mode
- Click "Start Server"
Option B: Using the script (macOS/CPU)
./scripts/run_cpu.sh
For macOS users, vLLM runs in CPU mode. See docs/MACOS_CPU_GUIDE.md for detailed setup.
Quick CPU Mode Setup:
# Edit CPU configuration
nano config/vllm_cpu.env
# Run vLLM
./scripts/run_cpu.sh
- Model Compression: LLM-Compressor integration for quantizing and compressing models
- Performance Benchmarking: GuideLLM integration for comprehensive load testing with detailed metrics
  - Request statistics (success rate, duration, average times)
  - Token throughput analysis (mean/median tokens per second)
  - Latency percentiles (P50, P75, P90, P95, P99)
  - Configurable load patterns and request rates
- Server Management: Start/stop vLLM servers from the UI
- Chat Interface: Interactive chat with streaming responses
- Smart Chat Templates: Automatic model-specific template detection (Nov 2025)
- Performance Metrics: Real-time token counts and generation speed
- Model Support: Pre-configured popular models + custom model support
- Gated Model Access: Built-in HuggingFace token support for Llama, Gemma, etc.
- CPU & GPU Modes: Automatic detection and configuration
- macOS Optimized: Special support for Apple Silicon
- Resizable Panels: Customizable layout
- Command Preview: See exact commands before execution
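To make the benchmarking metrics listed above concrete, here is a minimal sketch of how success rate, mean/median tokens per second, and latency percentiles could be computed from raw request records. It is not GuideLLM's implementation; the `percentile` and `summarize` helpers, the tuple layout, and the sample data are all assumptions for illustration.

```python
import statistics


def percentile(values, pct):
    """Return the pct-th percentile (0-100) using linear interpolation."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def summarize(requests):
    """Summarize (latency_seconds, output_tokens, succeeded) tuples.

    Hypothetical record layout; real benchmark tools report richer data.
    """
    ok = [r for r in requests if r[2]]
    latencies = [r[0] for r in ok]
    tokens_per_sec = [r[1] / r[0] for r in ok if r[0] > 0]
    return {
        "success_rate": len(ok) / len(requests),
        "mean_tokens_per_sec": statistics.mean(tokens_per_sec),
        "median_tokens_per_sec": statistics.median(tokens_per_sec),
        "latency_percentiles": {
            f"P{p}": percentile(latencies, p) for p in (50, 75, 90, 95, 99)
        },
    }


# Made-up sample data, for illustration only.
sample = [(1.0, 50, True), (2.0, 120, True), (4.0, 160, True), (3.0, 0, False)]
summary = summarize(sample)
print(summary)
```

Failed requests count against the success rate but are left out of the latency and throughput figures, which is one common convention; a real tool may treat them differently.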
- Container Quick Start 🐳 - Easiest way for macOS users (RECOMMENDED)
- Container Full Guide - Complete container documentation
- Container Workflow - Step-by-step container workflow
- Quick Start Guide - Get up and running in minutes
- Command-Line Demo Guide - Full workflow demo with vLLM, LLMCompressor & GuideLLM
- macOS CPU Setup - Apple Silicon optimization guide
- CPU Models Quickstart - Best models for CPU
- Gated Models Guide (Llama, Gemma) - Access restricted models
- Chat Templates Explained - Model-specific templates
- Feature Overview - Complete feature list
- Performance Metrics - Benchmarking and metrics
- Command Reference - Command cheat sheet
- CLI Quick Reference - Command-line demo quick reference
- Troubleshooting - Common issues and solutions
Edit config/vllm_cpu.env:
export VLLM_CPU_KVCACHE_SPACE=40
export VLLM_CPU_OMP_THREADS_BIND=auto
CPU-Optimized Models (Recommended for macOS):
- TinyLlama/TinyLlama-1.1B-Chat-v1.0 (default) - Fast, no token required
- meta-llama/Llama-3.2-1B - Latest Llama, requires HF token (gated)
- google/gemma-2-2b - High quality, requires HF token (gated)
- facebook/opt-125m - Tiny test model
Larger Models (Slow on CPU, better on GPU):
- meta-llama/Llama-2-7b-chat-hf (requires HF token)
- mistralai/Mistral-7B-Instruct-v0.2
- Custom models via text input
Note: Gated models (Llama, Gemma) require a HuggingFace token. See Gated Models Guide for setup.
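As a rough illustration of the note above, the following sketch flags a gated model that is selected without a token. The gated flags simply mirror the model lists in this section, `check_model_access` is a hypothetical helper, and reading the token from the `HF_TOKEN` environment variable is an assumption about setup; the playground itself may pass the token differently (see the Gated Models Guide).

```python
import os

# Gated flags copied from the model lists above; extend as needed.
GATED = {
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0": False,
    "meta-llama/Llama-3.2-1B": True,
    "google/gemma-2-2b": True,
    "facebook/opt-125m": False,
    "meta-llama/Llama-2-7b-chat-hf": True,
    "mistralai/Mistral-7B-Instruct-v0.2": False,
}


def check_model_access(model_id, env=None):
    """Return a warning string if a gated model has no token, else None.

    Unknown (custom) model IDs are treated as not gated in this sketch.
    """
    env = os.environ if env is None else env
    if GATED.get(model_id, False) and not env.get("HF_TOKEN"):
        return f"{model_id} is gated: set HF_TOKEN before starting the server"
    return None


print(check_model_access("meta-llama/Llama-3.2-1B", env={}))
print(check_model_access("facebook/opt-125m", env={}))
```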
- Backend: FastAPI (app.py)
- Frontend: Vanilla JavaScript (static/js/app.js)
- Styling: Custom CSS (static/css/style.css)
- Scripts: Bash scripts in scripts/
- Config: Environment files in config/
# Start backend with auto-reload
uvicorn app:app --reload --port 7860
# Or use the run script
python run.py
MIT License - See LICENSE file for details
Contributions welcome! Please feel free to submit issues and pull requests.
Use CPU mode with proper environment variables. See docs/MACOS_CPU_GUIDE.md.
- Check if vLLM is installed: python -c "import vllm; print(vllm.__version__)"
- Check port availability: lsof -i :8000
- Review server logs in the WebUI
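The first two checks above can also be scripted. This is a minimal sketch using only the Python standard library; the helper names are hypothetical and are not taken from scripts/verify_setup.py, and port 8000 is just the port used in the `lsof` command above.

```python
import importlib.util
import socket


def vllm_installed():
    """Return True if the vllm package can be located, without importing it."""
    return importlib.util.find_spec("vllm") is not None


def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 when the connection succeeds.
        return sock.connect_ex((host, port)) == 0


print("vLLM installed:", vllm_installed())
print("Port 8000 in use:", port_in_use(8000))
```

If the port is reported as in use before you start the server, another process is already bound to it; `lsof -i :8000` will show which one.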
Check browser console (F12) for errors and ensure the server is running.
Made with ❤️ for the vLLM community


