Troubleshooting Guide for vLLM Playground

Common Issues and Solutions

1. Engine Core Initialization Failed with "Torch not compiled with CUDA enabled"

Error:

AssertionError: Torch not compiled with CUDA enabled
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {'EngineCore_DP0': 1}

Root Cause: vLLM is trying to use CUDA/GPU mode on macOS where CUDA is not available. This happens when the --device cpu flag is not explicitly set.

Solution: The WebUI now handles this automatically: it detects macOS and adds the --device cpu flag. If you are running vLLM manually, or still see this error:

Ensure you're using CPU mode (the WebUI auto-detects macOS)
Verify that the command includes --device cpu
Check your vLLM version and make sure it includes CPU backend support

Confirm that these environment variables are set:

export VLLM_CPU_KVCACHE_SPACE=4
export VLLM_CPU_OMP_THREADS_BIND=auto
export VLLM_CPU_MOE_PREPACK=0
export VLLM_CPU_SGL_KERNEL=0
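
For manual runs, the macOS check described above can be sketched in Python. This is a minimal illustration, not the WebUI's actual code: `build_launch` is a hypothetical helper, and it assumes vLLM is started through the `vllm.entrypoints.openai.api_server` module. The flag and the variable values are the ones listed in this section.

```python
import platform
import sys

# Environment variables from this guide, applied only for CPU mode.
CPU_ENV = {
    "VLLM_CPU_KVCACHE_SPACE": "4",
    "VLLM_CPU_OMP_THREADS_BIND": "auto",
    "VLLM_CPU_MOE_PREPACK": "0",
    "VLLM_CPU_SGL_KERNEL": "0",
}


def build_launch(model, system=None):
    """Return (command, extra_env) for starting vLLM (hypothetical helper).

    On macOS (platform.system() == "Darwin") CUDA is unavailable, so the
    command gets --device cpu and the CPU variables are added.
    """
    system = system or platform.system()
    cmd = [sys.executable, "-m", "vllm.entrypoints.openai.api_server",
           "--model", model]
    env = {}
    if system == "Darwin":
        cmd += ["--device", "cpu"]
        env.update(CPU_ENV)
    return cmd, env


if __name__ == "__main__":
    command, extra_env = build_launch("facebook/opt-125m")
    print(" ".join(command))
    print(extra_env)
```

To launch for real, one option is to pass the result to `subprocess.run(command, env={**os.environ, **extra_env})`.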

2. Engine Core Initialization Failed (Memory Issues)

Error:

RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {'EngineCore_DP0': 1}

This is the same final error as in issue 1, but without the CUDA assertion above it.

A. Model Too Large for Available Memory

Solution:

Use a smaller model (e.g., facebook/opt-125m or facebook/opt-350m for testing)
Reduce max_model_len to 512, 1024, or 2048
Reduce KV cache space in settings (try 2-4 GB instead of 40 GB)
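
To judge whether a model and KV cache setting are likely to fit, a back-of-the-envelope estimate can help. The sketch below is a rough heuristic under stated assumptions, not how vLLM accounts for memory: it counts only the model weights plus the reserved KV cache space (treated as GiB), and the 80% headroom factor is an arbitrary illustrative choice.

```python
# Bytes per parameter for common dtypes.
DTYPE_BYTES = {"float32": 4, "float16": 2, "bfloat16": 2}


def estimate_ram_gib(num_params, dtype="bfloat16", kv_cache_gib=4):
    """Rough lower bound: model weights plus reserved KV cache space."""
    weights_gib = num_params * DTYPE_BYTES[dtype] / 2**30
    return weights_gib + kv_cache_gib


def fits_in_ram(num_params, total_ram_gib, dtype="bfloat16",
                kv_cache_gib=4, headroom=0.8):
    """True if the estimate stays under a fraction of total RAM.

    `headroom` is an assumed safety margin for the OS and other processes.
    """
    needed = estimate_ram_gib(num_params, dtype, kv_cache_gib)
    return needed <= total_ram_gib * headroom


if __name__ == "__main__":
    # A 125M-parameter model with 2 GiB of KV cache on a 16 GiB machine.
    print(round(estimate_ram_gib(125_000_000, kv_cache_gib=2), 2))
    print(fits_in_ram(125_000_000, total_ram_gib=16, kv_cache_gib=2))
    # A 7B-parameter model with 40 GiB of KV cache on the same machine.
    print(fits_in_ram(7_000_000_000, total_ram_gib=16, kv_cache_gib=40))
```

Real usage is higher than this estimate because activations and runtime overhead are ignored, so treat a "fits" result as necessary rather than sufficient.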

B. Model Not Compatible with CPU Backend

Some models may not work well with vLLM's CPU backend.

Solution:

Try a different model from the OPT family first: facebook/opt-125m, facebook/opt-350m
Check if the model supports CPU inference

C. CPU Optimization Issues on Apple Silicon

Some CPU optimizations can cause issues on M1/M2/M3 Macs.

Solution: The WebUI now disables the problematic optimizations automatically. If you are running vLLM manually, add these lines to your environment or to config/vllm_cpu.env:

export VLLM_CPU_MOE_PREPACK=0
export VLLM_CPU_SGL_KERNEL=0
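
If you start vLLM from your own Python script, the same settings can be read from the env file and applied before launch. The sketch below assumes the file contains simple `export KEY=VALUE` lines like the ones above; `parse_env_file` and `apply_env_file` are hypothetical helpers, not part of the WebUI.

```python
import os


def parse_env_file(text):
    """Parse `export KEY=VALUE` or `KEY=VALUE` lines; skip blanks and comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        key, sep, value = line.partition("=")
        if sep:
            # Drop optional surrounding quotes around the value.
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def apply_env_file(path, environ=os.environ):
    """Load the file at `path` and copy its variables into `environ`."""
    with open(path, encoding="utf-8") as handle:
        values = parse_env_file(handle.read())
    environ.update(values)
    return values


if __name__ == "__main__":
    sample = "export VLLM_CPU_MOE_PREPACK=0\nexport VLLM_CPU_SGL_KERNEL=0\n"
    print(parse_env_file(sample))
```

The parser does not handle shell features such as variable expansion or inline comments; for those, source the file in the shell instead.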

3. max_num_batched_tokens Error

Error:

Value error, max_num_batched_tokens (2048) is smaller than max_model_len (131072)

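Solution: The WebUI now handles this automatically. The error means the model's default context length (131072 tokens in the example) is larger than the per-batch token budget (2048). If you still see it: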

Explicitly set max_model_len to a reasonable value (2048, 4096, or 8192)
The WebUI will automatically set max_num_batched_tokens to match
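
The rule behind this fix is simply that max_num_batched_tokens must not be smaller than max_model_len. A minimal sketch of that rule follows; `resolve_batched_tokens` is a hypothetical helper, and keeping a larger existing budget instead of lowering it to match is an assumption, not confirmed WebUI behavior.

```python
def resolve_batched_tokens(max_model_len, max_num_batched_tokens=2048):
    """Return a max_num_batched_tokens value that is at least max_model_len."""
    if max_model_len <= 0:
        raise ValueError("max_model_len must be positive")
    return max(max_num_batched_tokens, max_model_len)


if __name__ == "__main__":
    # The failing case from the error message: budget 2048, context 131072.
    print(resolve_batched_tokens(131072, 2048))
    # A reduced context length that already fits the default budget.
    print(resolve_batched_tokens(1024, 2048))
```

Lowering max_model_len is usually the better lever on CPU, because a large token budget also increases memory use.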

3. Memory Issues - OOM (Out of Memory)

Symptoms:

Process crashes
System becomes unresponsive
"Cannot allocate memory" errors

Solution:

Reduce model size: Use smaller models
Reduce max_model_len: Try 512, 1024, or 2048
Reduce KV cache: Set to 2-4 GB
Close other applications: Free up RAM

Conservative Settings for CPU:

{
  "model": "facebook/opt-125m",
  "max_model_len": 1024,
  "cpu_kvcache_space": 2,
  "dtype": "bfloat16"
}

4. Model Download Issues

Symptoms:

Timeout errors
Connection refused
Model not found

Solution:

Check your internet connection
For gated models (Llama 2, etc.), ensure you have:
- A Hugging Face account
- Accepted the model's terms
- Set the HF_TOKEN environment variable

Pre-download models:

python -c "from transformers import AutoModelForCausalLM; AutoModelForCausalLM.from_pretrained('facebook/opt-125m')"
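
For gated models, the pre-download step also needs the token. The sketch below is one way to do it, assuming the `huggingface_hub` package is installed (it is a dependency of `transformers`); `resolve_hf_token` and `predownload` are hypothetical helper names.

```python
import os


def resolve_hf_token(environ=os.environ):
    """Return the HF_TOKEN value, or None if it is unset or blank."""
    token = environ.get("HF_TOKEN", "").strip()
    return token or None


def predownload(model_id, environ=os.environ):
    """Download all files of `model_id` into the local Hugging Face cache."""
    # Imported lazily so the token check works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=model_id, token=resolve_hf_token(environ))


if __name__ == "__main__":
    if resolve_hf_token() is None:
        print("HF_TOKEN is not set; gated models will fail to download.")
    print(predownload("facebook/opt-125m"))
```

`facebook/opt-125m` is not gated, so it downloads without a token; substitute a gated model ID only after accepting its terms.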

5. Server Won't Start

Solution:

Check whether the port is already in use:
```
lsof -i :8000
```
Try a different port in settings
Check logs in WebUI for specific errors
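
The port check can also be done from Python, which helps when picking an alternative port. This is a small sketch using only the standard library; the starting port 8000 matches the `lsof` example above, and the helper names are illustrative.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        return sock.connect_ex((host, port)) == 0


def first_free_port(start=8000, attempts=20):
    """Return the first port in [start, start + attempts) with no listener."""
    for port in range(start, start + attempts):
        if not port_in_use(port):
            return port
    return None


if __name__ == "__main__":
    print("8000 in use:", port_in_use(8000))
    print("first free port:", first_free_port(8000))
```

A port reported as free can still be taken by another process before the server binds it, so treat the result as a hint.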

6. Slow Performance on CPU

Expected Behavior: CPU inference is inherently slower than GPU inference. Rough speeds, which vary with hardware and settings:

Small models (125M-350M): 10-50 tokens/second
Medium models (1B-3B): 1-10 tokens/second
Large models (7B+): 0.1-2 tokens/second

Optimization:

Increase KV cache space if you have spare RAM: 10-40 GB
Reduce max_tokens in generation
Use smaller models
Ensure no other heavy processes are running
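
To compare your setup against the ranges above, it helps to measure throughput directly. The sketch below times a single generation call; `generate` is a hypothetical callable that returns the number of tokens it produced, and `fake_generate` is a stand-in so the example runs without a server.

```python
import time


def measure_tokens_per_second(generate, prompt, max_tokens=64):
    """Time one call to `generate` and return generated tokens per second."""
    start = time.perf_counter()
    produced = generate(prompt, max_tokens)
    elapsed = time.perf_counter() - start
    return produced / elapsed if elapsed > 0 else float("inf")


def fake_generate(prompt, max_tokens):
    """Stand-in for a real client call; pretends generation takes 50 ms."""
    time.sleep(0.05)
    return max_tokens


if __name__ == "__main__":
    rate = measure_tokens_per_second(fake_generate, "Hello", max_tokens=10)
    print(f"{rate:.1f} tokens/second")
```

Replace `fake_generate` with a function that calls your running server and returns the completion token count. Run it a few times, since the first request often includes warm-up cost.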

Recommended Starting Configuration

For Testing (Minimal Resources)

{
  "model": "facebook/opt-125m",
  "max_model_len": 1024,
  "cpu_kvcache_space": 2,
  "dtype": "bfloat16"
}

For Development (Moderate Resources)

{
  "model": "facebook/opt-350m",
  "max_model_len": 2048,
  "cpu_kvcache_space": 4,
  "dtype": "bfloat16"
}

For Production (High Resources)

{
  "model": "facebook/opt-1.3b",
  "max_model_len": 4096,
  "cpu_kvcache_space": 10,
  "dtype": "bfloat16"
}
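
When running vLLM manually, the settings above have to be translated into a command line. The sketch below shows one plausible mapping and is not the WebUI's own logic: `--max-model-len` and `--dtype` are vLLM server options, `--device cpu` is the flag from issue 1, and mapping `cpu_kvcache_space` to the `VLLM_CPU_KVCACHE_SPACE` variable is an assumption based on the names used in this guide.

```python
import json
import sys


def config_to_launch(config):
    """Turn a settings dict into (command, extra_env) for a manual CPU run."""
    cmd = [
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",
        "--model", config["model"],
        "--max-model-len", str(config["max_model_len"]),
        "--dtype", config["dtype"],
        "--device", "cpu",
    ]
    # Assumed mapping: KV cache size is passed through the environment.
    env = {"VLLM_CPU_KVCACHE_SPACE": str(config["cpu_kvcache_space"])}
    return cmd, env


if __name__ == "__main__":
    testing = json.loads("""
    {
      "model": "facebook/opt-125m",
      "max_model_len": 1024,
      "cpu_kvcache_space": 2,
      "dtype": "bfloat16"
    }
    """)
    command, extra_env = config_to_launch(testing)
    print(" ".join(command))
    print(extra_env)
```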

Getting More Debug Information

To see detailed error messages:

In WebUI: Check the Server Logs panel
From command line:
```
python app.py
```
Then check terminal output

Enable verbose logging:

export VLLM_LOGGING_LEVEL=DEBUG
python app.py

macOS-Specific Issues

Apple Silicon (M1/M2/M3) Compatibility

The vLLM CPU backend works, but expect lower throughput than on a GPU
Some optimizations may need to be disabled (see VLLM_CPU_MOE_PREPACK and VLLM_CPU_SGL_KERNEL under issue 2, part C)
Intel-based Macs may behave differently

Environment Setup

Verify your setup with these commands:

# Check Python version (3.8+)
python --version

# Check if vLLM is installed
python -c "import vllm; print(vllm.__version__)"

# Check available memory
sysctl hw.memsize
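
The same three checks can be combined into one Python script, which is convenient for attaching to a bug report. This is a sketch using only the standard library; the 3.8 minimum comes from the comment above, and total RAM is read through `os.sysconf`, which is available on macOS and Linux but not on Windows.

```python
import importlib.util
import os
import sys


def environment_report(min_python=(3, 8)):
    """Collect Python version, vLLM availability, and total RAM."""
    report = {
        "python_version": ".".join(map(str, sys.version_info[:3])),
        "python_ok": sys.version_info[:2] >= min_python,
        "vllm_installed": importlib.util.find_spec("vllm") is not None,
        "total_ram_gib": None,
    }
    try:
        total = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
        report["total_ram_gib"] = round(total / 2**30, 1)
    except (AttributeError, ValueError, OSError):
        # os.sysconf or these names are unavailable on this platform.
        pass
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```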

Still Having Issues?

Check the full error log - The root cause is usually shown above the final error
Try the smallest model first - facebook/opt-125m with minimal settings
Monitor system resources - Use Activity Monitor to check RAM usage
Check vLLM compatibility - Some features may not work on CPU backend
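
Because the root cause sits above the final error, a long log can be searched backwards from the "Engine core initialization failed" line. The sketch below does that with a simple pattern for exception lines; `find_root_cause` is a hypothetical helper, and the pattern is a heuristic that may miss unusual log formats.

```python
import re

# Matches lines such as "AssertionError: Torch not compiled with CUDA enabled".
EXCEPTION_LINE = re.compile(r"^\s*[A-Za-z_][\w.]*(?:Error|Exception):\s*.*$")


def find_root_cause(log_text, final_marker="Engine core initialization failed"):
    """Return the closest exception line above the last final-error line."""
    lines = log_text.splitlines()
    final_index = None
    for index in range(len(lines) - 1, -1, -1):
        if final_marker in lines[index]:
            final_index = index
            break
    if final_index is None:
        return None
    for line in reversed(lines[:final_index]):
        if EXCEPTION_LINE.match(line):
            return line.strip()
    return None


if __name__ == "__main__":
    log = (
        "AssertionError: Torch not compiled with CUDA enabled\n"
        "RuntimeError: Engine core initialization failed. "
        "See root cause above.\n"
    )
    print(find_root_cause(log))
```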

Reporting Issues

When reporting issues, please include:

Full error log from WebUI
Your system specs (macOS version, RAM, CPU)
Model and configuration you're trying to use
Steps to reproduce the error
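
Gathering these details by hand is error-prone, so a small script can assemble them. The sketch below builds a plain-text report from the standard library; `build_issue_report` and the section labels are illustrative, not a required template, and RAM is left for you to add (for example from `sysctl hw.memsize`).

```python
import os
import platform


def build_issue_report(error_log, config, steps):
    """Assemble system specs, configuration, steps, and the error log."""
    mac_version = platform.mac_ver()[0] or "n/a (not macOS)"
    lines = [
        "System",
        f"- OS: {platform.platform()}",
        f"- macOS version: {mac_version}",
        f"- CPU architecture: {platform.machine()}",
        f"- Logical CPUs: {os.cpu_count()}",
        "",
        "Configuration",
        *[f"- {key}: {value}" for key, value in config.items()],
        "",
        "Steps to reproduce",
        *[f"{number}. {step}" for number, step in enumerate(steps, 1)],
        "",
        "Error log",
        error_log.strip(),
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_issue_report(
        "RuntimeError: Engine core initialization failed.",
        {"model": "facebook/opt-125m", "max_model_len": 1024},
        ["Start the WebUI", "Select the model", "Click start"],
    ))
```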
