Skip to content

bryonbaker/vllm-webui

Β 
Β 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

39 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

vLLM Playground

A modern web interface for managing and interacting with vLLM (Very Large Language Model) servers. Supports both GPU and CPU modes, with special optimizations for macOS Apple Silicon.

vLLM Playground Interface

πŸ†• New: Model Compression Support

Built-in LLM-Compressor integration for quantizing and compressing models directly from the UI!

Model Compression Interface

πŸ“Š New: GuideLLM Benchmarking

Integrated GuideLLM for comprehensive performance benchmarking and analysis. Run load tests and get detailed metrics on throughput, latency, and token generation performance!

GuideLLM Benchmark Results

πŸ“ Project Structure

vllm-playground/
β”œβ”€β”€ app.py                       # Main FastAPI backend application
β”œβ”€β”€ run.py                       # Backend server launcher
β”œβ”€β”€ index.html                   # Main HTML interface
β”œβ”€β”€ requirements.txt             # Python dependencies
β”œβ”€β”€ env.example                  # Example environment variables
β”œβ”€β”€ LICENSE                      # MIT License
β”œβ”€β”€ README.md                    # This file
β”‚
β”œβ”€β”€ containers/                  # Container definitions 🐳
β”‚   β”œβ”€β”€ Containerfile.cuda      # CUDA/GPU container (RHEL UBI9)
β”‚   β”œβ”€β”€ Containerfile.vllm      # vLLM official base image
β”‚   β”œβ”€β”€ Containerfile.mac       # macOS/CPU container
β”‚   └── Containerfile.rhel9     # RHEL 9 container
β”‚
β”œβ”€β”€ deployments/                 # Kubernetes/OpenShift deployments ☸️
β”‚   β”œβ”€β”€ kubernetes-deployment.yaml    # Kubernetes manifests
β”‚   β”œβ”€β”€ openshift-deployment.yaml     # OpenShift manifests
β”‚   └── deploy-to-openshift.sh       # OpenShift deployment script
β”‚
β”œβ”€β”€ static/                      # Frontend assets
β”‚   β”œβ”€β”€ css/
β”‚   β”‚   └── style.css           # Main stylesheet
β”‚   └── js/
β”‚       └── app.js              # Frontend JavaScript
β”‚
β”œβ”€β”€ scripts/                     # Utility scripts
β”‚   β”œβ”€β”€ run_cpu.sh              # Start vLLM in CPU mode (macOS compatible)
β”‚   β”œβ”€β”€ start.sh                # General start script
β”‚   β”œβ”€β”€ install.sh              # Installation script
β”‚   └── verify_setup.py         # Setup verification
β”‚
β”œβ”€β”€ config/                      # Configuration files
β”‚   β”œβ”€β”€ vllm_cpu.env            # CPU mode environment variables
β”‚   └── example_configs.json    # Example configurations
β”‚
β”œβ”€β”€ assets/                      # Images and assets
β”‚   β”œβ”€β”€ vllm-playground.png          # WebUI screenshot
β”‚   β”œβ”€β”€ llmcompressor.png       # Model compression UI screenshot
β”‚   β”œβ”€β”€ guidellm.png            # GuideLLM benchmark results screenshot
β”‚   β”œβ”€β”€ vllm.png                # vLLM logo
β”‚   └── vllm.jpeg               # vLLM logo (alternate)
β”‚
└── docs/                        # Documentation
    β”œβ”€β”€ QUICKSTART.md            # Quick start guide
    β”œβ”€β”€ MACOS_CPU_GUIDE.md       # macOS CPU setup guide
    β”œβ”€β”€ CPU_MODELS_QUICKSTART.md # CPU-optimized models guide
    β”œβ”€β”€ GATED_MODELS_GUIDE.md    # Guide for accessing Llama, Gemma, etc.
    β”œβ”€β”€ CHAT_TEMPLATES.md        # Model-specific chat templates
    β”œβ”€β”€ TROUBLESHOOTING.md       # Common issues and solutions
    β”œβ”€β”€ FEATURES.md              # Feature documentation
    β”œβ”€β”€ PERFORMANCE_METRICS.md   # Performance metrics
    └── QUICK_REFERENCE.md       # Command reference

πŸš€ Quick Start

🐳 Option 1: Container (Easiest for macOS) RECOMMENDED

For macOS users, the container provides the easiest setup with everything pre-configured:

# 1. Build the container (one-time, ~15-30 min)
./scripts/build_container.sh

# 2. Run the container
./scripts/run_container.sh

# 3. Open http://localhost:7860

✨ Benefits:

  • βœ… No complex installation
  • βœ… Pre-built vLLM optimized for CPU
  • βœ… Isolated environment
  • βœ… Works out of the box

πŸ“– See CONTAINER-QUICKSTART.md for detailed instructions.


πŸ’» Option 2: Local Installation

For local development or if you prefer not to use containers:

1. Install vLLM

# For macOS/CPU mode
pip install vllm

2. Install Dependencies

pip install -r requirements.txt

3. Start the WebUI

python run.py

Then open http://localhost:7860 in your browser.

4. Start vLLM Server

Option A: Using the WebUI

  • Select CPU or GPU mode
  • Click "Start Server"

Option B: Using the script (macOS/CPU)

./scripts/run_cpu.sh

πŸ’» macOS Apple Silicon Support

For macOS users, vLLM runs in CPU mode. See docs/MACOS_CPU_GUIDE.md for detailed setup.

Quick CPU Mode Setup:

# Edit CPU configuration
nano config/vllm_cpu.env

# Run vLLM
./scripts/run_cpu.sh

✨ Features

  • Model Compression: LLM-Compressor integration for quantizing and compressing models πŸ†•
  • Performance Benchmarking: GuideLLM integration for comprehensive load testing with detailed metrics πŸ†•
    • Request statistics (success rate, duration, avg times)
    • Token throughput analysis (mean/median tokens per second)
    • Latency percentiles (P50, P75, P90, P95, P99)
    • Configurable load patterns and request rates
  • Server Management: Start/stop vLLM servers from the UI
  • Chat Interface: Interactive chat with streaming responses
  • Smart Chat Templates: Automatic model-specific template detection (Nov 2025) πŸ†•
  • Performance Metrics: Real-time token counts and generation speed
  • Model Support: Pre-configured popular models + custom model support
  • Gated Model Access: Built-in HuggingFace token support for Llama, Gemma, etc.
  • CPU & GPU Modes: Automatic detection and configuration
  • macOS Optimized: Special support for Apple Silicon
  • Resizable Panels: Customizable layout
  • Command Preview: See exact commands before execution

πŸ“– Documentation

Getting Started

Model Configuration

Reference

πŸ”§ Configuration

CPU Mode (macOS)

Edit config/vllm_cpu.env:

export VLLM_CPU_KVCACHE_SPACE=40
export VLLM_CPU_OMP_THREADS_BIND=auto

Supported Models

CPU-Optimized Models (Recommended for macOS):

  • TinyLlama/TinyLlama-1.1B-Chat-v1.0 (default) - Fast, no token required
  • meta-llama/Llama-3.2-1B - Latest Llama, requires HF token (gated)
  • google/gemma-2-2b - High quality, requires HF token (gated)
  • facebook/opt-125m - Tiny test model

Larger Models (Slow on CPU, better on GPU):

  • meta-llama/Llama-2-7b-chat-hf (requires HF token)
  • mistralai/Mistral-7B-Instruct-v0.2
  • Custom models via text input

πŸ“Œ Note: Gated models (Llama, Gemma) require a HuggingFace token. See Gated Models Guide for setup.

πŸ› οΈ Development

Project Structure

  • Backend: FastAPI (app.py)
  • Frontend: Vanilla JavaScript (static/js/app.js)
  • Styling: Custom CSS (static/css/style.css)
  • Scripts: Bash scripts in scripts/
  • Config: Environment files in config/

Running in Development

# Start backend with auto-reload
uvicorn app:app --reload --port 7860

# Or use the run script
python run.py

πŸ“ License

MIT License - See LICENSE file for details

🀝 Contributing

Contributions welcome! Please feel free to submit issues and pull requests.

πŸ”— Links

πŸ†˜ Troubleshooting

macOS Segmentation Fault

Use CPU mode with proper environment variables. See docs/MACOS_CPU_GUIDE.md.

Server Won't Start

  1. Check if vLLM is installed: python -c "import vllm; print(vllm.__version__)"
  2. Check port availability: lsof -i :8000
  3. Review server logs in the WebUI

Chat Not Streaming

Check browser console (F12) for errors and ensure the server is running.


Made with ❀️ for the vLLM community

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • JavaScript 35.8%
  • Python 26.2%
  • Shell 13.4%
  • HTML 12.0%
  • CSS 10.6%
  • Dockerfile 2.0%