
llama-server hangs during warmup on turboquant branch #95


Description


Environment

  • GPU: NVIDIA GeForce RTX 5090 Laptop GPU (24GB VRAM)
  • Compute capability: 12.0 (Blackwell)
  • OS: WSL2
  • llama.cpp branch: turboquant (feature/turboquant-kv-cache)
  • Model: Qwen3.6-35B-A3B-Q4_K_M.gguf

Steps to reproduce

  1. Build llama-server with CUDA support (see the build sketch after these steps)
  2. Run: ./llama-server -m Qwen3.6-35B-A3B-Q4_K_M.gguf --port 1234 --host 0.0.0.0 --ctx-size 2048
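
For reference, a minimal build sketch. It assumes the upstream repository URL and the CUDA CMake flag used on current master (GGML_CUDA); the exact options on the turboquant branch may differ, and the branch name below is taken from this report.

    # clone upstream and check out the feature branch named in this report
    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp
    git checkout feature/turboquant-kv-cache

    # configure with CUDA enabled and build the Release binaries
    cmake -B build -DGGML_CUDA=ON
    cmake --build build --config Release -j

The resulting binary should land in build/bin/llama-server.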

Expected behavior

Server starts and responds to API requests.

Actual behavior

  • Model loads successfully
  • Server hangs during warmup phase
  • All API requests return HTTP 503 "Loading model" (see the probe sketch after this list)
  • Server process eventually crashes/exits
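
A quick way to confirm the 503 and capture where the process is stuck, assuming the host/port from the command above (upstream llama-server exposes a /health endpoint that returns 503 while the model is loading; gdb and pidof assume a single server instance inside WSL2):

    # probe the server during the hang; expect HTTP 503 until warmup completes
    curl -i http://localhost:1234/health

    # dump backtraces of all threads in the hung process
    gdb --batch -p "$(pidof llama-server)" -ex 'thread apply all bt'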

Key warnings from logs

  • "llama_kv_cache: upstream attention rotation disabled (TurboQuant uses kernel-level WHT)"
  • "sched_reserve: fused Gated Delta Net (autoregressive) enabled"

Notes

  • Same model works fine in LM Studio on Windows 11
  • Issue occurs even with -ngl 0 (CPU-only mode; see the command after this list)
  • Issue occurs even with --ctx-size 2048 (small context)
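
For completeness, the CPU-only invocation used to rule out the GPU path, using the same binary, model, and flags as in the steps above:

    # same run with all layers kept on the CPU (-ngl 0) and the small 2048-token context
    ./llama-server -m Qwen3.6-35B-A3B-Q4_K_M.gguf --port 1234 --host 0.0.0.0 \
        --ctx-size 2048 -ngl 0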
