Summary
On an RTX 3090 24GB, Qwen3.6-27B Q4_K_XL + PFlash/BSA compresses source prompts of up to 150K tokens, but 180K-200K source prompts fail PFlash compression and the server falls back to its excerpt path. The final NIAH answer can still be exact because the fallback extracts the relevant text, but the logs show that PFlash compression itself returned 0 tokens.
Environment
- Repo: Luce-Org/lucebox-hub
- Commit tested: 7f96d4c
- GPU: NVIDIA RTX 3090 24GB
- Target: Qwen3.6-27B-UD-Q4_K_XL.gguf
- Draft: z-lab/Qwen3.6-27B-DFlash safetensors
- PFlash drafter: Qwen3-0.6B-BF16.gguf
- Mode: chain
- max_ctx: 8192 (compressed prompt budget)
- fa_window: 0
- cache_type_k/cache_type_v: tq3_0/tq3_0
- DFLASH_FP_USE_BSA=1
- DFLASH_FP_ALPHA tested: 0.85 and 0.90
- DFLASH27B_PREFILL_UBATCH=256
- keep_ratio=0.035
Park/unpark sequence
I patched the Python server hook locally to issue an explicit park target (and park draft) before compress, then free drafter, then unpark target / unpark draft. This way the drafter is freed before the target weights are restored, so the two never need to be resident at the same time. The logs confirm the sequence:
[park] target released
[park] draft released
... drafter scoring ...
[drafter] freed
[unpark] target restored
[unpark] draft restored
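For anyone reproducing this, the ordering can be checked mechanically instead of by eye. The following is a minimal sketch that only assumes the marker strings shown in the log excerpt above; the function name and the sample log are illustrative and are not part of lucebox-hub.

```python
# Marker strings copied from the log excerpt above, in the expected order.
EXPECTED_ORDER = [
    "[park] target released",
    "[park] draft released",
    "[drafter] freed",
    "[unpark] target restored",
    "[unpark] draft restored",
]


def markers_in_order(log_text, markers=EXPECTED_ORDER):
    """Return True if every marker occurs in log_text, in the given order."""
    pos = 0
    for marker in markers:
        found = log_text.find(marker, pos)
        if found == -1:
            return False  # marker missing, or it appears before an earlier one
        pos = found + len(marker)
    return True


sample_log = """\
[park] target released
[park] draft released
... drafter scoring ...
[drafter] freed
[unpark] target restored
[unpark] draft restored
"""

# A log where the target is restored before the drafter is freed.
bad_log = """\
[park] target released
[park] draft released
[unpark] target restored
[drafter] freed
[unpark] draft restored
"""

print(markers_in_order(sample_log))
print(markers_in_order(bad_log))
```

The check is order-only: it does not verify that memory was actually released, only that the server printed the markers in the intended sequence.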
Results
130K and 150K pass with real PFlash compression:
[qwen3-0.6b-fp] forward 7.68s (S=130197, A=1.24s FP=4.40s B=2.03s) tail-score 2.46s total 10.15s
[compress] 130197 -> 4533 tokens (keep_ratio=0.035)
NIAH exact: true
[qwen3-0.6b-fp] forward 9.66s (S=150038, A=1.41s FP=5.89s B=2.35s) tail-score 1.97s total 11.63s
[compress] 150038 -> 5238 tokens (keep_ratio=0.035)
NIAH exact: true
180K and 200K fail PFlash compression and fall back to excerpts:
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 350.12 MiB on device 0: cudaMalloc failed: out of memory
[compress] 179262 -> 0 tokens (keep_ratio=0.035)
# HTTP answer exact only because fallback prompt was ~1801 tokens
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 389.71 MiB on device 0: cudaMalloc failed: out of memory
[compress] 199530 -> 0 tokens (keep_ratio=0.035)
# HTTP answer exact only because fallback prompt was ~1803 tokens
Raising DFLASH_FP_ALPHA to 0.90 did not change the failure:
alpha=0.90 180K: allocating 350.12 MiB ... OOM; [compress] 179262 -> 0 tokens
alpha=0.90 200K: allocating 389.71 MiB ... OOM; [compress] 199530 -> 0 tokens
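One observation from the two failing allocations: both sizes are consistent with roughly 2048 bytes per source token, which suggests the buffer scales linearly with the source length S. The sketch below only redoes that arithmetic from the numbers in the logs above. The 2048 bytes/token figure is inferred from two data points, not confirmed against the code, and which tensor the buffer holds is not known from the logs.

```python
MIB = 1024 * 1024

# (source tokens, size in MiB of the allocation that failed), from the logs above.
failed_allocations = [(179262, 350.12), (199530, 389.71)]


def bytes_per_token(tokens, size_mib):
    """Size of the failed allocation divided by the source token count."""
    return size_mib * MIB / tokens


def predicted_mib(tokens, per_token_bytes=2048):
    """Buffer size in MiB if it scales at per_token_bytes per source token.

    The 2048 default is an inference from the two failures, not a known constant.
    """
    return tokens * per_token_bytes / MIB


for tokens, size_mib in failed_allocations:
    print(tokens, round(bytes_per_token(tokens, size_mib), 1),
          round(predicted_mib(tokens), 2))
```

If that scaling holds, the same buffer at 150K would be only about 57 MiB smaller than at 180K, so the 180K failure probably reflects total memory pressure at that point rather than this one allocation being unusually large.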
Expected
After explicit park target and park draft, 180K-200K source prompts should either compress successfully or expose a tunable/chunking path that keeps the drafter/BSA scratch below the 24GB limit.
Actual
150K compresses successfully, but 180K+ fails during drafter/BSA allocation and returns an empty compressed stream. The OpenAI server then uses its fallback excerpt path, so the user-visible answer may still be correct while PFlash did not actually run.
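Because the fallback hides the failure from the HTTP response, the only reliable signal is the [compress] log line. A minimal sketch of a log check that flags empty compressions follows; it assumes only the "[compress] N -> M tokens" format shown in the logs above, and the function name is illustrative.

```python
import re

# Matches lines such as: [compress] 179262 -> 0 tokens (keep_ratio=0.035)
COMPRESS_RE = re.compile(r"\[compress\]\s+(\d+)\s+->\s+(\d+)\s+tokens")


def empty_compressions(log_text):
    """Return the source token counts of compressions that kept 0 tokens."""
    return [
        int(source)
        for source, kept in COMPRESS_RE.findall(log_text)
        if int(kept) == 0
    ]


sample_log = """\
[compress] 130197 -> 4533 tokens (keep_ratio=0.035)
[compress] 150038 -> 5238 tokens (keep_ratio=0.035)
[compress] 179262 -> 0 tokens (keep_ratio=0.035)
[compress] 199530 -> 0 tokens (keep_ratio=0.035)
"""

print(empty_compressions(sample_log))
```

A benchmark harness could fail the run whenever this list is non-empty, so that an exact NIAH answer produced by the excerpt fallback is not counted as a PFlash pass.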
Ask
Is there a recommended chunking/scoring-window setting for Qwen3-0.6B PFlash at 180K-200K on 24GB cards, or should the BSA scratch allocation be freed/chunked differently between long compressions?