perf(pflash): add SM75 target-resident TTFT path #72
Conversation
I tried this path as well, converting the draft model from BF16 to FP16 to leverage the 2080 Ti's Tensor Cores. Based on some experiments, I prefer Q8_0; see PR #71.
3 issues found across 18 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="dflash/src/qwen3_0p6b_loader.cpp">
<violation number="1" location="dflash/src/qwen3_0p6b_loader.cpp:92">
P2: BF16 capability check uses compile-time minimum SM instead of runtime GPU capability, causing false negatives on SM80+ devices in mixed-arch builds.</violation>
</file>
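For illustration, a minimal sketch of a runtime check of this kind (the helper name and placement are assumptions; the loader's actual fix may look different):

```cpp
#include <cuda_runtime.h>

// Query the *active* device's compute capability at runtime rather than
// relying on the compile-time minimum SM the binary was built for.
static bool device_supports_bf16() {
    int dev = 0;
    cudaDeviceProp prop{};
    if (cudaGetDevice(&dev) != cudaSuccess ||
        cudaGetDeviceProperties(&prop, dev) != cudaSuccess)
        return false;
    return prop.major >= 8;  // BF16 tensor cores start with SM80 (Ampere)
}
```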
<file name="dflash/test/test_dflash.cpp">
<violation number="1" location="dflash/test/test_dflash.cpp:1452">
P2: Target-only/decode gating does not validate that target weights are resident, allowing generation paths to run after `park target` freed them.</violation>
</file>
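A minimal sketch of the kind of guard this implies, assuming a session flag that is cleared when the target is parked (all names here are hypothetical, not the PR's API):

```cpp
#include <cstdio>

struct Session {
    bool target_weights_resident = false;  // cleared when `park target` frees them
};

// Refuse to enter target-only generation once the weights have been freed.
static bool can_run_target_decode(const Session &s) {
    if (!s.target_weights_resident) {
        std::fprintf(stderr, "target weights are parked; reload before decode\n");
        return false;
    }
    return true;
}
```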
<file name="dflash/test/test_flashprefill_kernels.cpp">
<violation number="1" location="dflash/test/test_flashprefill_kernels.cpp:233">
P1: Numerical validation can pass even when GPU outputs are NaN, because the max-diff accumulation ignores non-finite values.</violation>
</file>
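A minimal sketch of a NaN-aware comparison, assuming the test reduces a flat float buffer to a max diff (the actual loop in test_flashprefill_kernels may differ):

```cpp
#include <cmath>
#include <cstddef>

// Returns false if any compared value is non-finite, so NaN/Inf in the GPU
// output fails the check instead of being ignored by the max reduction.
static bool max_abs_diff(const float *ref, const float *gpu, std::size_t n,
                         float &max_diff) {
    max_diff = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        const float d = std::fabs(ref[i] - gpu[i]);
        if (!std::isfinite(gpu[i]) || !std::isfinite(d)) return false;
        if (d > max_diff) max_diff = d;
    }
    return true;
}
```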
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Force-pushed from 33b7869 to 4645fa8
Thanks. I pushed an update in 4645fa8 with fixes for the three issues identified by cubic.
I also agree Q8_0 is a good direction for the 2080 Ti path, especially for the VRAM and perf reasons in #71. I kept this PR scoped to the FP16/BF16->F16 SM75 enablement plus the PFlash TTFT/residency path so it does not collide with #71; happy to either rebase onto the Q8_0 draft path after #71 lands or split out a follow-up patch for that.
@weicj I have started the AI code review. It will take a few minutes to complete.
Summary
This PR adds an opt-in SM75 / RTX 2080 Ti path for PFlash TTFT-oriented use cases.
- DFLASH_FP_K_TILE=32.
- test_pflash_chunk_select.
- DFLASH_PFLASH_KEEP_TARGET=1 keeps the target weights resident (opt-in; see the sketch after this list).
- DFLASH_PFLASH_SKIP_DRAFT_RELOAD=1 skips migrate_prefill_cache in that fallback because rollback tensors are not used.
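Both flags are default-off environment switches. A minimal sketch of how this kind of opt-in gating can be read, assuming a getenv-based helper (env_flag and the exact semantics are illustrative, not the PR's actual code):

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: treat an environment variable as an opt-in flag.
// Default-off: only an explicit "1" enables the path.
static bool env_flag(const char *name) {
    const char *v = std::getenv(name);
    return v != nullptr && std::strcmp(v, "1") == 0;
}

// Example: park the target after prefill unless residency was requested.
// if (!env_flag("DFLASH_PFLASH_KEEP_TARGET")) park_target_weights();
```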
Benchmark
Hardware and environment:
Configurations compared:
- test_dflash
- DFLASH_PFLASH_KEEP_TARGET=1
- KEEP_TARGET=1 + SKIP_DRAFT_RELOAD=1
Correctness / Validation
- cmake --build build-sm75-f16 --target test_dflash test_pflash_chunk_select test_flashprefill_kernels -j$(nproc) passes.
- test_pflash_chunk_select passes.
- test_flashprefill_kernels passes on SM75: max diffs 0.000000 and 0.000431, 3.8 ms / iter.
- _prefill_hook.py passes python -m py_compile.
- cmake --build dflash/build-sm75-pr --target test_dflash test_pflash_chunk_select test_flashprefill_kernels -j24.
Caveats
- With DFLASH_PFLASH_SKIP_DRAFT_RELOAD=1, the draft remains parked after compressed prefill. This is a default-off fallback for TTFT / very short output, not a decode-speed path (see the sketch below).
- max_ctx=17000 hit a 106 MiB CUDA allocation OOM in the draft graph on RTX 2080 Ti.
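For reference, a sketch of the control flow the first caveat describes, under the assumption of a simple context struct and reload helper (only migrate_prefill_cache is named by the PR; Context, reload_draft_weights, and finish_compressed_prefill are illustrative):

```cpp
struct Context { /* model and cache state (stub for illustration) */ };

static void reload_draft_weights(Context &) {}   // assumed helper (stub)
static void migrate_prefill_cache(Context &) {}  // named in the PR; stubbed here

// Default-off fallback: after compressed prefill, either bring the draft back
// for decode, or leave it parked for TTFT / very-short-output runs.
static void finish_compressed_prefill(Context &ctx, bool skip_draft_reload) {
    if (skip_draft_reload) {
        // Draft stays parked; migrate_prefill_cache is skipped because the
        // rollback tensors it would populate are never consumed on this path.
        return;
    }
    reload_draft_weights(ctx);
    migrate_prefill_cache(ctx);
}
```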