
llama-server hangs during warmup on turboquant branch #95


Description


Environment

  • GPU: NVIDIA GeForce RTX 5090 Laptop GPU (24GB VRAM)
  • Compute capability: 12.0 (Blackwell)
  • OS: WSL2
  • llama.cpp branch: turboquant (feature/turboquant-kv-cache)
  • Model: Qwen3.6-35B-A3B-Q4_K_M.gguf

Steps to reproduce

  1. Build llama-server with CUDA support (see the build sketch after these steps)
  2. Run: ./llama-server -m Qwen3.6-35B-A3B-Q4_K_M.gguf --port 1234 --host 0.0.0.0 --ctx-size 2048
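
For reference, a minimal build sketch. It assumes the upstream repository URL and the CUDA CMake flag used on current master (GGML_CUDA); the exact options on the turboquant branch may differ, and the branch name below is taken from this report.

    # clone upstream and check out the feature branch named in this report
    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp
    git checkout feature/turboquant-kv-cache

    # configure with CUDA enabled and build the Release binaries
    cmake -B build -DGGML_CUDA=ON
    cmake --build build --config Release -j

The resulting binary should land in build/bin/llama-server.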

Expected behavior

Server starts and responds to API requests.

Actual behavior

  • Model loads successfully
  • Server hangs during warmup phase
  • All API requests return HTTP 503 "Loading model" (see the probe sketch after this list)
  • Server process eventually crashes/exits
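
A quick way to confirm the 503 and capture where the process is stuck, assuming the host/port from the command above (upstream llama-server exposes a /health endpoint that returns 503 while the model is loading; gdb and pidof assume a single server instance inside WSL2):

    # probe the server during the hang; expect HTTP 503 until warmup completes
    curl -i http://localhost:1234/health

    # dump backtraces of all threads in the hung process
    gdb --batch -p "$(pidof llama-server)" -ex 'thread apply all bt'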

Key warnings from logs

  • "llama_kv_cache: upstream attention rotation disabled (TurboQuant uses kernel-level WHT)"
  • "sched_reserve: fused Gated Delta Net (autoregressive) enabled"

Notes

  • Same model works fine in LM Studio on Windows 11
  • Issue occurs even with -ngl 0 (CPU-only mode; see the command after this list)
  • Issue occurs even with --ctx-size 2048 (small context)
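
For completeness, the CPU-only invocation used to rule out the GPU path, using the same binary, model, and flags as in the steps above:

    # same run with all layers kept on the CPU (-ngl 0) and the small 2048-token context
    ./llama-server -m Qwen3.6-35B-A3B-Q4_K_M.gguf --port 1234 --host 0.0.0.0 \
        --ctx-size 2048 -ngl 0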
