Error:
AssertionError: Torch not compiled with CUDA enabled
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {'EngineCore_DP0': 1}
Root Cause:
vLLM is trying to use CUDA/GPU mode on macOS where CUDA is not available. This happens when the --device cpu flag is not explicitly set.
Solution:
The WebUI now fixes this automatically: it detects macOS and adds the --device cpu flag. If you're running vLLM manually, or you still see this error:
- Ensure you're using CPU mode - The WebUI auto-detects macOS
- Verify the command includes --device cpu
- Check your vLLM version - Make sure you have vLLM with CPU backend support
- Check that these environment variables are set:
export VLLM_CPU_KVCACHE_SPACE=4
export VLLM_CPU_OMP_THREADS_BIND=auto
export VLLM_CPU_MOE_PREPACK=0
export VLLM_CPU_SGL_KERNEL=0
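For manual launches, the checklist above can be sketched in Python. This is an illustration only, not the WebUI's actual code: `build_launch` and `CPU_ENV_DEFAULTS` are hypothetical names, and the sketch assumes vLLM is started through its `vllm.entrypoints.openai.api_server` module and that your vLLM version accepts the `--device cpu` flag described above.

```python
import os
import platform
import sys

# Environment variables from the checklist above, with the values this guide suggests.
CPU_ENV_DEFAULTS = {
    "VLLM_CPU_KVCACHE_SPACE": "4",
    "VLLM_CPU_OMP_THREADS_BIND": "auto",
    "VLLM_CPU_MOE_PREPACK": "0",
    "VLLM_CPU_SGL_KERNEL": "0",
}


def build_launch(model, system=None, base_env=None):
    """Return (command, env) for starting vLLM. Hypothetical helper."""
    system = system or platform.system()
    env = dict(os.environ if base_env is None else base_env)
    cmd = [sys.executable, "-m", "vllm.entrypoints.openai.api_server",
           "--model", model]
    if system == "Darwin":  # platform.system() reports "Darwin" on macOS
        cmd += ["--device", "cpu"]
        for key, value in CPU_ENV_DEFAULTS.items():
            env.setdefault(key, value)  # keep values the user already exported
    return cmd, env


cmd, env = build_launch("facebook/opt-125m", system="Darwin", base_env={})
print(" ".join(cmd[1:]))
```

Passing the resulting `cmd` and `env` to `subprocess.Popen(cmd, env=env)` would start the server. Using `setdefault` means anything you exported yourself takes precedence over the suggested defaults.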
Error:
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {'EngineCore_DP0': 1}
(Without the CUDA error)
Solution:
- Use a smaller model (e.g., facebook/opt-125m or facebook/opt-350m for testing)
- Reduce max_model_len to 512, 1024, or 2048
- Reduce KV cache space in settings (try 2-4 GB instead of 40 GB)
Some models may not work well with vLLM's CPU backend.
Solution:
- Try a different model from the OPT family first: facebook/opt-125m, facebook/opt-350m
- Check if the model supports CPU inference
Some CPU optimizations may cause issues on M1/M2/M3 Macs (now automatically disabled).
Solution:
The WebUI now automatically disables problematic optimizations. If running manually, add these to your environment or config/vllm_cpu.env:
export VLLM_CPU_MOE_PREPACK=0
export VLLM_CPU_SGL_KERNEL=0
Error:
Value error, max_num_batched_tokens (2048) is smaller than max_model_len (131072)
Solution: This is now fixed automatically, but if you still see it:
- Explicitly set max_model_len to a reasonable value (2048, 4096, or 8192)
- The WebUI will automatically set max_num_batched_tokens to match
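The constraint behind this error is that max_num_batched_tokens must not be smaller than max_model_len. A minimal sketch of reconciling the two values before launch follows; `resolve_batched_tokens` is a hypothetical helper written for illustration, not the WebUI's actual logic.

```python
def resolve_batched_tokens(max_model_len, max_num_batched_tokens=None):
    """Return a max_num_batched_tokens value that is at least max_model_len."""
    if max_model_len <= 0:
        raise ValueError("max_model_len must be positive")
    # Raise the batch budget to the context length when it is unset or too small.
    if max_num_batched_tokens is None or max_num_batched_tokens < max_model_len:
        return max_model_len
    return max_num_batched_tokens


# The failing combination from the error message above: the budget is raised to match.
print(resolve_batched_tokens(131072, 2048))
# With an explicit, smaller context length the original budget can stay as it is.
print(resolve_batched_tokens(2048, 2048))
```

Raising the batch budget to a very long context length costs memory, which is why lowering max_model_len is the first suggestion above.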
Symptoms:
- Process crashes
- System becomes unresponsive
- "Cannot allocate memory" errors
Solution:
- Reduce model size: Use smaller models
- Reduce max_model_len: Try 512, 1024, or 2048
- Reduce KV cache: Set to 2-4 GB
- Close other applications: Free up RAM
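A rough back-of-the-envelope check can show whether a configuration is likely to fit in RAM. The sketch below counts only model weights (parameters times bytes per value) plus the KV cache reservation; it ignores activations and runtime overhead, and the 50% headroom factor is an arbitrary assumption rather than a vLLM rule. All function names are hypothetical.

```python
# Bytes per parameter for common dtypes.
DTYPE_BYTES = {"bfloat16": 2, "float16": 2, "float32": 4}


def estimate_memory_gb(num_params, dtype="bfloat16", kvcache_gb=2.0):
    """Approximate RAM needed: model weights plus the KV cache reservation."""
    weights_gb = num_params * DTYPE_BYTES[dtype] / 1024**3
    return weights_gb + kvcache_gb


def fits_in_ram(num_params, total_ram_gb, dtype="bfloat16",
                kvcache_gb=2.0, headroom=0.5):
    """True if the estimate stays under a fraction (headroom) of total RAM."""
    needed = estimate_memory_gb(num_params, dtype, kvcache_gb)
    return needed <= total_ram_gb * headroom


# A 125M-parameter model with a 2 GB KV cache on an 8 GB machine.
print(round(estimate_memory_gb(125_000_000), 2), fits_in_ram(125_000_000, 8))
# A 7B-parameter model with a 4 GB KV cache on a 16 GB machine.
print(fits_in_ram(7_000_000_000, 16, kvcache_gb=4.0))
```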
Conservative Settings for CPU:
{
"model": "facebook/opt-125m",
"max_model_len": 1024,
"cpu_kvcache_space": 2,
"dtype": "bfloat16"
}
Symptoms:
- Timeout errors
- Connection refused
- Model not found
Solution:
- Check your internet connection
- For gated models (Llama 2, etc.), ensure you have:
  - Created a Hugging Face account
  - Accepted the model's terms
  - Set the HF_TOKEN environment variable
- Pre-download models:
python -c "from transformers import AutoModelForCausalLM; AutoModelForCausalLM.from_pretrained('facebook/opt-125m')"
Solution:
- Check if port is already in use:
lsof -i :8000
- Try a different port in settings
- Check logs in WebUI for specific errors
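The port check can also be done from Python with the standard library instead of lsof. The sketch below is illustrative: `port_in_use` and `first_free_port` are hypothetical helpers, and port 8000 is assumed only because it is the port used in the lsof example above.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


def first_free_port(start=8000, attempts=20):
    """Scan upward from start and return the first port nothing listens on."""
    for port in range(start, start + attempts):
        if not port_in_use(port):
            return port
    return None


print("port 8000 in use:", port_in_use(8000))
print("suggested port:", first_free_port(8000))
```

Note that this only detects listeners reachable on the loopback address; lsof remains the better tool for finding out which process holds the port.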
Expected Behavior: CPU inference is inherently slower than GPU. Typical speeds:
- Small models (125M-350M): 10-50 tokens/second
- Medium models (1B-3B): 1-10 tokens/second
- Large models (7B+): 0.1-2 tokens/second
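To translate these throughput figures into waiting time, divide the number of tokens to generate by the speed. The sketch below does that arithmetic with the ranges listed above; the size-class labels and the helper name `generation_time_range` are made up for illustration, and actual speeds depend on your hardware.

```python
# Tokens-per-second ranges from the list above: (slowest, fastest).
SPEED_RANGES = {
    "small": (10, 50),    # 125M-350M parameters
    "medium": (1, 10),    # 1B-3B parameters
    "large": (0.1, 2),    # 7B+ parameters
}


def generation_time_range(num_tokens, size_class):
    """Return (best_case_seconds, worst_case_seconds) for num_tokens."""
    slowest, fastest = SPEED_RANGES[size_class]
    return num_tokens / fastest, num_tokens / slowest


best, worst = generation_time_range(256, "medium")
print(f"256 tokens on a medium model: {best:.1f}s to {worst:.1f}s")
```

This is why reducing max_tokens is one of the most direct ways to shorten response times on CPU.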
Optimization:
- Increase KV cache (if you have RAM): 10-40 GB
- Reduce max_tokens in generation
- Use smaller models
- Ensure no other heavy processes are running
{
"model": "facebook/opt-125m",
"max_model_len": 1024,
"cpu_kvcache_space": 2,
"dtype": "bfloat16"
}
{
"model": "facebook/opt-350m",
"max_model_len": 2048,
"cpu_kvcache_space": 4,
"dtype": "bfloat16"
}
{
"model": "facebook/opt-1.3b",
"max_model_len": 4096,
"cpu_kvcache_space": 10,
"dtype": "bfloat16"
}
To see detailed error messages:
- In WebUI: Check the Server Logs panel
- From command line:
python app.py
Then check terminal output
- Enable verbose logging:
export VLLM_LOGGING_LEVEL=DEBUG
python app.py
- vLLM CPU backend works but may be slower
- Some optimizations may need to be disabled
- Intel-based Macs may have different behavior
Make sure you have:
# Check Python version (3.8+)
python --version
# Check if vLLM is installed
python -c "import vllm; print(vllm.__version__)"
# Check available memory
sysctl hw.memsize
- Check the full error log - The root cause is usually shown above the final error
- Try the smallest model first - facebook/opt-125m with minimal settings
- Monitor system resources - Use Activity Monitor to check RAM usage
- Check vLLM compatibility - Some features may not work on CPU backend
When reporting issues, please include:
- Full error log from WebUI
- Your system specs (macOS version, RAM, CPU)
- Model and configuration you're trying to use
- Steps to reproduce the error
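Collecting the system details for a report can be scripted. The following is a minimal sketch using only the standard library; `collect_system_info` is a hypothetical helper, and the RAM lookup assumes the macOS `sysctl -n hw.memsize` command, so it is skipped on other systems.

```python
import importlib.metadata
import platform
import subprocess
import sys


def collect_system_info():
    """Gather OS, CPU architecture, Python, vLLM and RAM details for a bug report."""
    info = {
        "os": platform.system(),
        # mac_ver() returns an empty version string on non-macOS systems.
        "os_version": platform.mac_ver()[0] or platform.release(),
        "cpu_arch": platform.machine(),
        "python": sys.version.split()[0],
    }
    try:
        info["vllm"] = importlib.metadata.version("vllm")
    except importlib.metadata.PackageNotFoundError:
        info["vllm"] = "not installed"
    if info["os"] == "Darwin":
        result = subprocess.run(["sysctl", "-n", "hw.memsize"],
                                capture_output=True, text=True, check=False)
        if result.returncode == 0 and result.stdout.strip().isdigit():
            info["ram_gb"] = round(int(result.stdout) / 1024**3, 1)
    return info


for key, value in collect_system_info().items():
    print(f"{key}: {value}")
```

Paste the printed lines into the report alongside the error log and the configuration you used.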