DFlash server.py runs out of memory on a single RTX 3090 #114

@Cinderella-Man

Hello,

I'm trying to run DFlash today, and it runs out of memory after a few messages:

$ python scripts/server.py   --tokenizer Qwen/Qwen3.6-27B   --port 8000 --max-ctx 16000 --fa-window 2048 --daemon
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
config.json: 4.31kB [00:00, 15.2MB/s]
tokenizer_config.json: 16.7kB [00:00, 61.5MB/s]
vocab.json: 6.72MB [00:00, 35.7MB/s]
merges.txt: 3.35MB [00:00, 111MB/s]
tokenizer.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12.8M/12.8M [00:01<00:00, 12.6MB/s]
chat_template.jinja: 7.76kB [00:00, 26.8MB/s]
/home/wolf/lucebox-hub/dflash/scripts/server.py:233: DeprecationWarning: 
        on_event is deprecated, use lifespan event handlers instead.

        Read more about it in the
        [FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
        
  @app.on_event("startup")
Luce DFlash OpenAI server on http://0.0.0.0:8000
  target    = /home/wolf/lucebox-hub/dflash/models/Qwen3.6-27B-Q4_K_M.gguf
  draft     = /home/wolf/lucebox-hub/dflash/models/draft/model.safetensors
  bin       = /home/wolf/lucebox-hub/dflash/build/test_dflash
  budget    = 22
  max_ctx   = 16000
  tokenizer = Qwen/Qwen3.6-27B
  pflash    = off
INFO:     Started server process [53012]
INFO:     Waiting for application startup.
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24117 MiB):
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24117 MiB
  [daemon] [cfg] seq_verify=0 fast_rollback=1 ddtree=1 budget=22 temp=1.00 chain_seed=1 fa_window=2048 draft_feature_mirror=0 target_gpu=0 draft_gpu=0
  [daemon] [loader] eos_id=248046 eos_chat_id=-1
  [daemon] [target] target loaded: 851 tensors on GPU 14.99 GiB, tok_embd 682 MiB CPU-only (q4_K)
  [daemon] [draft]  loaded
  [daemon] [daemon] ready
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
  [daemon] [snap] inline slot=0 cur_pos=383
[pc] inline-snap committed slot=0 prefix_len=383
INFO:     127.0.0.1:59856 - "POST /v1/chat/completions HTTP/1.1" 200 OK
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 570.13 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 597823488
[snap] inline snap failed slot=1: ggml_backend_alloc_ctx_tensors failed for PrefixSnapshot
[pc] inline-snap committed slot=1 prefix_len=86
INFO:     127.0.0.1:51444 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[pc] lookup hit slot=0 prefix_len=383 (of 556 total)
  [daemon] [snap] restored slot=0 cur_pos=383
INFO:     127.0.0.1:51444 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[pc] lookup hit slot=0 prefix_len=383 (of 1300 total)
  [daemon] [snap] restored slot=0 cur_pos=383
INFO:     127.0.0.1:51444 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[pc] lookup hit slot=1 prefix_len=86 (of 674 total)
[snap] RESTORE bad args or empty slot 1
INFO:     127.0.0.1:43622 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[pc] lookup hit slot=0 prefix_len=383 (of 615 total)
  [daemon] [snap] restored slot=0 cur_pos=383
INFO:     127.0.0.1:43622 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[pc] lookup hit slot=0 prefix_len=383 (of 1213 total)
  [daemon] [snap] restored slot=0 cur_pos=383
INFO:     127.0.0.1:43622 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[pc] lookup hit slot=0 prefix_len=383 (of 1360 total)
  [daemon] [snap] restored slot=0 cur_pos=383
INFO:     127.0.0.1:43622 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[pc] lookup hit slot=0 prefix_len=383 (of 1213 total)
  [daemon] [snap] restored slot=0 cur_pos=383
INFO:     127.0.0.1:43622 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[pc] lookup hit slot=0 prefix_len=383 (of 1360 total)
  [daemon] [snap] restored slot=0 cur_pos=383
INFO:     127.0.0.1:43622 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[pc] lookup hit slot=0 prefix_len=383 (of 586 total)
  [daemon] [snap] restored slot=0 cur_pos=383
INFO:     127.0.0.1:43622 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[pc] lookup hit slot=0 prefix_len=383 (of 1417 total)
  [daemon] [snap] restored slot=0 cur_pos=383
INFO:     127.0.0.1:41362 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[pc] lookup hit slot=0 prefix_len=383 (of 2153 total)
  [daemon] [snap] restored slot=0 cur_pos=383
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1215.57 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n_impl: failed to allocate CUDA0 buffer of size 1274615808
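
For scale, the two failed allocations can be set against the loaded weights. This is a minimal sketch that only converts figures copied from the log above; the variable and function names are illustrative and nothing here reflects DFlash internals. The headroom it prints is an upper bound, because the log does not report how much the draft model, KV cache, or earlier scratch buffers occupy.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def bytes_to_mib(n_bytes: int) -> float:
    """Convert a byte count (as printed by ggml) to MiB."""
    return n_bytes / MIB


# Figures copied verbatim from the log.
TOTAL_VRAM_MIB = 24117            # "Total VRAM: 24117 MiB"
TARGET_WEIGHTS_GIB = 14.99        # "851 tensors on GPU 14.99 GiB"
SNAPSHOT_ALLOC_BYTES = 597823488  # failed PrefixSnapshot buffer (slot=1)
COMPUTE_ALLOC_BYTES = 1274615808  # failed ggml_gallocr_reserve_n_impl buffer


def headroom_after_weights_mib(total_mib: float, weights_gib: float) -> float:
    """VRAM left once only the target weights are resident (upper bound)."""
    return total_mib - weights_gib * GIB / MIB


if __name__ == "__main__":
    print(f"snapshot alloc : {bytes_to_mib(SNAPSHOT_ALLOC_BYTES):8.2f} MiB")
    print(f"compute alloc  : {bytes_to_mib(COMPUTE_ALLOC_BYTES):8.2f} MiB")
    print(
        "headroom (max) : "
        f"{headroom_after_weights_mib(TOTAL_VRAM_MIB, TARGET_WEIGHTS_GIB):8.2f} MiB"
    )
```

The byte counts match the MiB values ggml prints (570.13 and 1215.57), so both failures are for buffers far smaller than the space nominally left after the weights. That suggests the remaining VRAM is already taken by allocations the log does not itemize.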

I'm not sure what other information would help. Please let me know, and I'll happily provide more details.
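
One way to gather more detail would be to log VRAM usage in a second terminal while the server handles requests. This is a rough sketch, assuming `nvidia-smi` is on the PATH; the function names and the one-second interval are illustrative choices, not part of DFlash.

```python
import subprocess
import time

# Standard nvidia-smi query flags; values are reported in MiB when units are suppressed.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_line(line: str) -> tuple[int, int]:
    """Parse one CSV line such as '15432, 24576' into (used_mib, total_mib)."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def read_gpu_memory() -> list[tuple[int, int]]:
    """Return (used_mib, total_mib) for each GPU that nvidia-smi reports."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return [parse_memory_line(ln) for ln in result.stdout.splitlines() if ln.strip()]


if __name__ == "__main__":
    # Poll once per second; stop with Ctrl+C.
    while True:
        stamp = time.strftime("%H:%M:%S")
        for idx, (used, total) in enumerate(read_gpu_memory()):
            print(f"{stamp} gpu{idx} {used}/{total} MiB")
        time.sleep(1.0)
```

Lining these timestamps up with the server log would show how much VRAM is in use just before each `cudaMalloc failed` line.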

Labels: bug, rtx 3090
