perf(pflash): add SM75 target-resident TTFT path #72
Conversation
I tried this path as well, converting the draft model from BF16 to FP16 to leverage the 2080 Ti's Tensor Cores. Based on some experiments, I prefer Q8_0; see PR #71.
3 issues found across 18 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="dflash/src/qwen3_0p6b_loader.cpp">
<violation number="1" location="dflash/src/qwen3_0p6b_loader.cpp:92">
P2: BF16 capability check uses compile-time minimum SM instead of runtime GPU capability, causing false negatives on SM80+ devices in mixed-arch builds.</violation>
</file>
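For illustration, a minimal sketch of a runtime check of this kind (the helper name and placement are assumptions; the loader's actual fix may look different):

```cpp
#include <cuda_runtime.h>

// Query the *active* device's compute capability at runtime rather than
// relying on the compile-time minimum SM the binary was built for.
static bool device_supports_bf16() {
    int dev = 0;
    cudaDeviceProp prop{};
    if (cudaGetDevice(&dev) != cudaSuccess ||
        cudaGetDeviceProperties(&prop, dev) != cudaSuccess)
        return false;
    return prop.major >= 8;  // BF16 tensor cores start with SM80 (Ampere)
}
```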
<file name="dflash/test/test_dflash.cpp">
<violation number="1" location="dflash/test/test_dflash.cpp:1452">
P2: Target-only/decode gating does not validate that target weights are resident, allowing generation paths to run after `park target` freed them.</violation>
</file>
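A minimal sketch of the kind of guard this implies, assuming a session flag that is cleared when the target is parked (all names here are hypothetical, not the PR's API):

```cpp
#include <cstdio>

struct Session {
    bool target_weights_resident = false;  // cleared when `park target` frees them
};

// Refuse to enter target-only generation once the weights have been freed.
static bool can_run_target_decode(const Session &s) {
    if (!s.target_weights_resident) {
        std::fprintf(stderr, "target weights are parked; reload before decode\n");
        return false;
    }
    return true;
}
```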
<file name="dflash/test/test_flashprefill_kernels.cpp">
<violation number="1" location="dflash/test/test_flashprefill_kernels.cpp:233">
P1: Numerical validation can pass even when GPU outputs are NaN, because the max-diff accumulation ignores non-finite values.</violation>
</file>
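A minimal sketch of a NaN-aware comparison, assuming the test reduces a flat float buffer to a max diff (the actual loop in test_flashprefill_kernels may differ):

```cpp
#include <cmath>
#include <cstddef>

// Returns false if any compared value is non-finite, so NaN/Inf in the GPU
// output fails the check instead of being ignored by the max reduction.
static bool max_abs_diff(const float *ref, const float *gpu, std::size_t n,
                         float &max_diff) {
    max_diff = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        const float d = std::fabs(ref[i] - gpu[i]);
        if (!std::isfinite(gpu[i]) || !std::isfinite(d)) return false;
        if (d > max_diff) max_diff = d;
    }
    return true;
}
```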
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Force-pushed from 33b7869 to 4645fa8
Thanks. I pushed an update in 4645fa8 with fixes for the three issues identified by cubic.
I also agree Q8_0 is a good direction for the 2080 Ti path, especially for the VRAM and perf reasons in #71. I kept this PR scoped to the FP16/BF16->F16 SM75 enablement plus the PFlash TTFT/residency path so it does not collide with #71; happy to either rebase onto the Q8_0 draft path after #71 lands or split out a follow-up patch for that.
@weicj I have started the AI code review. It will take a few minutes to complete.
Summary
This PR adds an opt-in SM75 / RTX 2080 Ti path for PFlash TTFT-oriented use cases.
- DFLASH_FP_K_TILE=32.
- test_pflash_chunk_select.
- DFLASH_PFLASH_KEEP_TARGET=1 keeps the target weights resident (opt-in; see the sketch after this list).
- DFLASH_PFLASH_SKIP_DRAFT_RELOAD=1 skips migrate_prefill_cache in that fallback because rollback tensors are not used.
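Both flags are default-off environment switches. A minimal sketch of how this kind of opt-in gating can be read, assuming a getenv-based helper (env_flag and the exact semantics are illustrative, not the PR's actual code):

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: treat an environment variable as an opt-in flag.
// Default-off: only an explicit "1" enables the path.
static bool env_flag(const char *name) {
    const char *v = std::getenv(name);
    return v != nullptr && std::strcmp(v, "1") == 0;
}

// Example: park the target after prefill unless residency was requested.
// if (!env_flag("DFLASH_PFLASH_KEEP_TARGET")) park_target_weights();
```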
Benchmark
Hardware and environment:
Configurations compared:
- test_dflash
- DFLASH_PFLASH_KEEP_TARGET=1
- KEEP_TARGET=1 + SKIP_DRAFT_RELOAD=1
Correctness / Validation
- cmake --build build-sm75-f16 --target test_dflash test_pflash_chunk_select test_flashprefill_kernels -j$(nproc) passes.
- test_pflash_chunk_select passes.
- test_flashprefill_kernels passes on SM75: max diffs 0.000000 and 0.000431, 3.8 ms / iter.
- _prefill_hook.py passes python -m py_compile.
- cmake --build dflash/build-sm75-pr --target test_dflash test_pflash_chunk_select test_flashprefill_kernels -j24.
Caveats
- With DFLASH_PFLASH_SKIP_DRAFT_RELOAD=1, the draft remains parked after compressed prefill. This is a default-off fallback for TTFT / very short output, not a decode-speed path (see the sketch below).
- max_ctx=17000 hit a 106 MiB CUDA allocation OOM in the draft graph on RTX 2080 Ti.
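For reference, a sketch of the control flow the first caveat describes, under the assumption of a simple context struct and reload helper (only migrate_prefill_cache is named by the PR; Context, reload_draft_weights, and finish_compressed_prefill are illustrative):

```cpp
struct Context { /* model and cache state (stub for illustration) */ };

static void reload_draft_weights(Context &) {}   // assumed helper (stub)
static void migrate_prefill_cache(Context &) {}  // named in the PR; stubbed here

// Default-off fallback: after compressed prefill, either bring the draft back
// for decode, or leave it parked for TTFT / very-short-output runs.
static void finish_compressed_prefill(Context &ctx, bool skip_draft_reload) {
    if (skip_draft_reload) {
        // Draft stays parked; migrate_prefill_cache is skipped because the
        // rollback tensors it would populate are never consumed on this path.
        return;
    }
    reload_draft_weights(ctx);
    migrate_prefill_cache(ctx);
}
```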