
PFlash compression falls back to extract at 180K-200K source on 24GB despite explicit park target #147

@YanissAmz


Summary

On an RTX 3090 (24GB), Qwen3.6-27B Q4_K_XL with PFlash/BSA compresses prompts of up to 150K source tokens, but 180K-200K source prompts fail PFlash compression and the server falls back to its excerpt path. The final NIAH answer can still be exact because the fallback extracts the relevant text, but the logs show that PFlash compression itself returned 0 tokens.

Environment

  • Repo: Luce-Org/lucebox-hub
  • Commit tested: 7f96d4c
  • GPU: NVIDIA RTX 3090 24GB
  • Target: Qwen3.6-27B-UD-Q4_K_XL.gguf
  • Draft: z-lab/Qwen3.6-27B-DFlash safetensors
  • PFlash drafter: Qwen3-0.6B-BF16.gguf
  • Mode: chain
  • max_ctx: 8192 (compressed prompt budget)
  • fa_window: 0
  • cache_type_k/cache_type_v: tq3_0/tq3_0
  • DFLASH_FP_USE_BSA=1
  • DFLASH_FP_ALPHA tested: 0.85 and 0.90
  • DFLASH27B_PREFILL_UBATCH=256
  • keep_ratio=0.035

Park/unpark sequence

I patched the Python server hook locally to issue an explicit park target (followed by park draft) before compress, then free the drafter, then unpark target and unpark draft. This ordering ensures the drafter is freed before the target weights are restored. The logs confirm the sequence:

[park] target released
[park] draft released
... drafter scoring ...
[drafter] freed
[unpark] target restored
[unpark] draft restored
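
The ordering above can be sketched as a small context manager. This is only an illustration of the sequence, not the actual patch: `parked_models`, `RecordingBackend`, and all of its methods are hypothetical names, and the stand-in backend only records the order of calls instead of touching the GPU.

```python
from contextlib import contextmanager


@contextmanager
def parked_models(backend):
    """Park target and draft, run the body, then free the drafter before restoring."""
    backend.park("target")
    backend.park("draft")
    try:
        yield
    finally:
        # Free the drafter first so its memory is released
        # before the target weights are restored.
        backend.free_drafter()
        backend.unpark("target")
        backend.unpark("draft")


class RecordingBackend:
    """Hypothetical stand-in that records the order of calls."""

    def __init__(self):
        self.calls = []

    def park(self, name):
        self.calls.append(f"park {name}")

    def unpark(self, name):
        self.calls.append(f"unpark {name}")

    def free_drafter(self):
        self.calls.append("free drafter")

    def score(self):
        self.calls.append("drafter scoring")


backend = RecordingBackend()
with parked_models(backend):
    backend.score()
print(backend.calls)
```

Putting the release and restore steps in `finally` means the target and draft are restored even if drafter scoring raises, for example on an out-of-memory error.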

Results

130K and 150K pass with real PFlash compression:

[qwen3-0.6b-fp] forward 7.68s (S=130197, A=1.24s FP=4.40s B=2.03s) tail-score 2.46s total 10.15s
[compress] 130197 -> 4533 tokens (keep_ratio=0.035)
NIAH exact: true

[qwen3-0.6b-fp] forward 9.66s (S=150038, A=1.41s FP=5.89s B=2.35s) tail-score 1.97s total 11.63s
[compress] 150038 -> 5238 tokens (keep_ratio=0.035)
NIAH exact: true

180K and 200K fail PFlash compression and fall back to excerpts:

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 350.12 MiB on device 0: cudaMalloc failed: out of memory
[compress] 179262 -> 0 tokens (keep_ratio=0.035)
# HTTP answer exact only because fallback prompt was ~1801 tokens

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 389.71 MiB on device 0: cudaMalloc failed: out of memory
[compress] 199530 -> 0 tokens (keep_ratio=0.035)
# HTTP answer exact only because fallback prompt was ~1803 tokens

Raising DFLASH_FP_ALPHA to 0.90 did not change the failure:

alpha=0.90 180K: allocating 350.12 MiB ... OOM; [compress] 179262 -> 0 tokens
alpha=0.90 200K: allocating 389.71 MiB ... OOM; [compress] 199530 -> 0 tokens
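
A quick arithmetic check on the two OOM lines suggests the failing buffer grows linearly with source length, at about 2048 bytes per source token. That would fit 1024 two-byte values per token, but which tensor it is cannot be told from the log, so treat that reading as an assumption. The sketch below uses only the numbers quoted above; `predicted_mib` is a hypothetical helper.

```python
MIB = 1024 * 1024

# (source tokens, failed allocation in MiB) copied from the OOM log lines
failed = [(179262, 350.12), (199530, 389.71)]

for tokens, mib in failed:
    bytes_per_token = mib * MIB / tokens
    print(f"S={tokens}: {bytes_per_token:.1f} bytes per source token")


def predicted_mib(tokens, bytes_per_token=2048):
    """Size of the same buffer at another source length, assuming linear scaling."""
    return tokens * bytes_per_token / MIB


# Under the same assumption, the passing 150K run would need about 293 MiB here.
print(f"S=150038: {predicted_mib(150038):.2f} MiB")
```

Note that this is only the allocation that happened to fail, not the total drafter/BSA footprint, so it says how close to the limit the 150K run was rather than how much memory is needed overall.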

Expected

After explicit park target and park draft, 180K-200K source prompts should either compress successfully or expose a tunable/chunking path that keeps the drafter/BSA scratch below the 24GB limit.

Actual

150K compresses successfully, but 180K+ fails with a CUDA out-of-memory error during drafter/BSA allocation and returns an empty compressed stream. The OpenAI server then uses its fallback excerpt path, so the user-visible answer may still be correct even though PFlash compression never completed.
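
Because the HTTP answer alone does not reveal the fallback, one way to catch it is to check the `[compress]` log line. The following is a minimal sketch that assumes the log format shown in the Results section; `parse_compress_line` and the `pflash_ran` field are hypothetical names.

```python
import re

# Matches e.g. "[compress] 179262 -> 0 tokens (keep_ratio=0.035)".
# The keep_ratio suffix is optional because some log lines omit it.
COMPRESS_RE = re.compile(
    r"\[compress\] (\d+) -> (\d+) tokens(?: \(keep_ratio=([0-9.]+)\))?"
)


def parse_compress_line(line):
    """Return compression stats for a log line, or None if it is not a [compress] line."""
    m = COMPRESS_RE.search(line)
    if m is None:
        return None
    source = int(m.group(1))
    kept = int(m.group(2))
    ratio = float(m.group(3)) if m.group(3) else None
    return {
        "source": source,
        "kept": kept,
        "keep_ratio": ratio,
        # Zero kept tokens means compression produced nothing
        # and the server would have to fall back.
        "pflash_ran": kept > 0,
    }


ok = parse_compress_line("[compress] 130197 -> 4533 tokens (keep_ratio=0.035)")
bad = parse_compress_line("[compress] 179262 -> 0 tokens (keep_ratio=0.035)")
print(ok["pflash_ran"], bad["pflash_ran"])
```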

Ask

Is there a recommended chunking/scoring-window setting for Qwen3-0.6B PFlash at 180K-200K on 24GB cards, or should the BSA scratch allocation be freed/chunked differently between long compressions?
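
To make the chunking question concrete, here is a rough sketch of splitting a long source into near-equal spans no larger than the largest size that passed (150038 tokens). It only computes token ranges. `chunk_spans` is a hypothetical helper, and whether PFlash scoring gives comparable results when run per chunk is exactly what is being asked.

```python
import math


def chunk_spans(n_tokens, max_chunk):
    """Split [0, n_tokens) into the fewest near-equal spans of at most max_chunk tokens."""
    if n_tokens <= 0 or max_chunk <= 0:
        raise ValueError("n_tokens and max_chunk must be positive")
    n_chunks = math.ceil(n_tokens / max_chunk)
    base, extra = divmod(n_tokens, n_chunks)
    spans = []
    start = 0
    for i in range(n_chunks):
        # Spread the remainder over the first `extra` chunks.
        size = base + (1 if i < extra else 0)
        spans.append((start, start + size))
        start += size
    return spans


# Largest source length that compressed successfully in the runs above.
KNOWN_GOOD = 150038

for n in (150038, 179262, 199530):
    print(n, chunk_spans(n, KNOWN_GOOD))
```

Near-equal spans keep every chunk well under the known-good size instead of leaving one full chunk plus a small remainder.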

Labels: bug, question
