PFlash compression falls back to excerpt at 180K-200K source tokens on 24GB despite explicit park target

### Summary

On an RTX 3090 24GB, Qwen3.6-27B Q4_K_XL + PFlash/BSA works up to 150K source tokens. At 180K-200K source tokens, PFlash compression fails with a CUDA out-of-memory error and the server uses its excerpt fallback instead. The final NIAH answer can still be exact because the fallback excerpt happens to contain the relevant text, but the logs show that PFlash compression itself returned 0 tokens.

### Environment

- Repo: Luce-Org/lucebox-hub
- Commit tested: 7f96d4c
- GPU: NVIDIA RTX 3090 24GB
- Target: Qwen3.6-27B-UD-Q4_K_XL.gguf
- Draft: z-lab/Qwen3.6-27B-DFlash safetensors
- PFlash drafter: Qwen3-0.6B-BF16.gguf
- Mode: chain
- max_ctx: 8192 (compressed prompt budget)
- fa_window: 0
- cache_type_k/cache_type_v: tq3_0/tq3_0
- DFLASH_FP_USE_BSA=1
- DFLASH_FP_ALPHA tested: 0.85 and 0.90
- DFLASH27B_PREFILL_UBATCH=256
- keep_ratio=0.035

### Park/unpark sequence

I patched the Python server hook locally to issue explicit `park target` and `park draft` before `compress`, then `free drafter`, then `unpark target` / `unpark draft`. This ordering ensures the target weights are not restored until the drafter has been freed. The logs confirm the sequence:

```text
[park] target released
[park] draft released
... drafter scoring ...
[drafter] freed
[unpark] target restored
[unpark] draft restored
```
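For reference, the ordering above can be sketched roughly as follows. This is not the actual patch: the command strings are the ones named in this issue, while the `send` callable and the `compress_with_parking` helper are hypothetical stand-ins for however the hook talks to the daemon.

```python
from typing import Callable

# Command names are copied from this issue. The transport (`send`) is a
# hypothetical stand-in, not the real server hook API.
PARK = ["park target", "park draft"]
UNPARK = ["unpark target", "unpark draft"]


def compress_with_parking(send: Callable[[str], str],
                          compress_cmd: str = "compress") -> str:
    """Run one compression with target and draft parked for its duration."""
    for cmd in PARK:
        send(cmd)
    try:
        result = send(compress_cmd)
    finally:
        # Free the drafter first so the target weights are never restored
        # while drafter buffers are still resident on the GPU.
        send("free drafter")
        for cmd in UNPARK:
            send(cmd)
    return result
```

The `try`/`finally` is a design choice in this sketch so that a failed compression still frees the drafter and restores the target and draft in the same order.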

### Results

130K and 150K pass with real PFlash compression:

```text
[qwen3-0.6b-fp] forward 7.68s (S=130197, A=1.24s FP=4.40s B=2.03s) tail-score 2.46s total 10.15s
[compress] 130197 -> 4533 tokens (keep_ratio=0.035)
NIAH exact: true

[qwen3-0.6b-fp] forward 9.66s (S=150038, A=1.41s FP=5.89s B=2.35s) tail-score 1.97s total 11.63s
[compress] 150038 -> 5238 tokens (keep_ratio=0.035)
NIAH exact: true
```
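As a quick sanity check on the numbers above, the snippet below computes the effective keep ratio of the two passing runs (about 0.0348 and 0.0349) and projects the compressed size for the longer prompts. The projection assumes the same nominal `keep_ratio=0.035` would apply at 180K-200K, which these logs cannot confirm because those runs never completed. Under that assumption the compressed prompt would still fit inside `max_ctx`, so the 8192-token budget does not appear to be the limiting factor.

```python
KEEP_RATIO = 0.035
MAX_CTX = 8192  # compressed prompt budget from the Environment section

# (source tokens, compressed tokens) from the two passing runs
passing = [(130197, 4533), (150038, 5238)]
for src, kept in passing:
    print(f"S={src}: kept {kept}, effective ratio {kept / src:.4f}")

# Projected compressed size for the longer source lengths, assuming the
# nominal ratio holds (an assumption; those runs returned 0 tokens).
longer = [179262, 199530]
projected = {src: int(src * KEEP_RATIO) for src in longer}
for src, est in projected.items():
    print(f"S={src}: projected ~{est} tokens, fits max_ctx: {est <= MAX_CTX}")
```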

180K and 200K fail PFlash compression and fall back to excerpts:

```text
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 350.12 MiB on device 0: cudaMalloc failed: out of memory
[compress] 179262 -> 0 tokens (keep_ratio=0.035)
# HTTP answer exact only because fallback prompt was ~1801 tokens

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 389.71 MiB on device 0: cudaMalloc failed: out of memory
[compress] 199530 -> 0 tokens (keep_ratio=0.035)
# HTTP answer exact only because fallback prompt was ~1803 tokens
```
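The two failed allocation sizes scale linearly with the source length. Dividing them out gives roughly 2048 bytes per source token in both cases, which would be consistent with a single per-token buffer (for example 1024 BF16 values per token). That interpretation is a guess from the numbers, not something confirmed from the code. It does suggest the failing buffer is small on its own and that the device is already nearly full when it is requested.

```python
MIB = 1024 * 1024

# (source tokens, failed allocation in MiB) from the OOM log lines
failed = [(179262, 350.12), (199530, 389.71)]

# Size of the failed allocation divided by the number of source tokens
bytes_per_token = [mib * MIB / src for src, mib in failed]

for (src, mib), bpt in zip(failed, bytes_per_token):
    print(f"S={src}: {mib} MiB -> {bpt:.1f} bytes per source token")
```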

Raising `DFLASH_FP_ALPHA` to 0.90 did not change the failure:

```text
alpha=0.90 180K: allocating 350.12 MiB ... OOM; [compress] 179262 -> 0 tokens
alpha=0.90 200K: allocating 389.71 MiB ... OOM; [compress] 199530 -> 0 tokens
```

### Expected

After explicit `park target` and `park draft`, 180K-200K source prompts should either compress successfully or expose a tunable/chunking path that keeps the drafter/BSA scratch below the 24GB limit.

### Actual

150K compresses successfully, but 180K+ fails during drafter/BSA allocation and returns an empty compressed stream. The OpenAI server then uses its fallback excerpt path, so the user-visible answer may still be correct even though PFlash compressed nothing. This makes the failure easy to miss without reading the logs.
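Because the HTTP answer alone does not reveal the fallback, a benchmark harness can check the `[compress]` log line instead. The sketch below parses the line format shown in the logs in this issue. The helper names `parse_compress` and `pflash_ran` are illustrative and not part of the repo.

```python
import re
from typing import Optional, Tuple

# Matches the "[compress] N -> M tokens" lines shown in the logs above.
COMPRESS_RE = re.compile(r"\[compress\]\s+(\d+)\s+->\s+(\d+)\s+tokens")


def parse_compress(line: str) -> Optional[Tuple[int, int]]:
    """Return (source_tokens, compressed_tokens), or None if no match."""
    m = COMPRESS_RE.search(line)
    return (int(m.group(1)), int(m.group(2))) if m else None


def pflash_ran(line: str) -> Optional[bool]:
    """True if compression produced tokens, False if it returned 0,
    None if the line is not a [compress] line."""
    parsed = parse_compress(line)
    if parsed is None:
        return None
    return parsed[1] > 0
```

A run would then count as a real PFlash pass only when `pflash_ran(...)` is `True` and the NIAH answer is exact.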

### Ask

Is there a recommended chunking/scoring-window setting for Qwen3-0.6B PFlash at 180K-200K on 24GB cards, or should the BSA scratch allocation be freed/chunked differently between long compressions?

