[https://nvbugs/6276981][fix] Force the q-split + allgather code path whenever q_split_eligible=True (drop…#15474

Open
tensorrt-cicd wants to merge 1 commit into NVIDIA:main from tensorrt-cicd:repair-bot-bug6276981

Conversation

@tensorrt-cicd tensorrt-cicd commented Jun 18, 2026

Collaborator

Summary

  • Root cause: When q_split_eligible=True but apply_q_split=False (chunk smaller than threshold), each TP rank ran fp8_mqa_logits independently on the full chunk; the DeepGEMM kernel is not bit-exact across launches, so per-rank topk indices diverged and downstream MLA attention attended to different KV positions on different ranks, corrupting KV-cache writes.
  • Fix: Force the q-split + allgather code path whenever q_split_eligible=True (drop the chunk_num_token >= q_split_threshold gate). The per-token canonical owner from slice + allgather erases per-rank nondeterminism.
  • Automated fix generated by repair-bot
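The gating change described above can be sketched as follows. This is only an illustration of the before/after predicate, not the code in the patch: the function names and the `tp_size` / `attention_dp` parameters are hypothetical, and treating a threshold of exactly 0 as eligible is an assumption (the description only says that negative values disable eligibility).

```python
def q_split_eligible(tp_size: int, attention_dp: bool, q_split_threshold: int) -> bool:
    """Eligibility as described in this PR: TP > 1, no attention DP,
    and a negative threshold disables the feature entirely."""
    return tp_size > 1 and not attention_dp and q_split_threshold >= 0


def apply_q_split_before(eligible: bool, chunk_num_token: int, q_split_threshold: int) -> bool:
    # Old behaviour: small chunks fell back to redundant per-rank compute.
    return eligible and chunk_num_token >= q_split_threshold


def apply_q_split_after(eligible: bool) -> bool:
    # New behaviour: the chunk-size gate is dropped; eligible always means q-split.
    return eligible


# A small chunk (64 tokens) under a hypothetical threshold of 1024.
eligible = q_split_eligible(tp_size=4, attention_dp=False, q_split_threshold=1024)
print(apply_q_split_before(eligible, 64, 1024))  # False
print(apply_q_split_after(eligible))             # True
```

With a negative threshold, `q_split_eligible` returns False, so both variants still skip the q-split path.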

Test plan

  • Verify fix on the same GPU type as the original failure
  • Check for regressions in related tests

Links

Summary by CodeRabbit

Release Notes

  • Performance

    • Refined sparse attention path selection logic during prefill operations with distributed tensor configurations. Chunk processing now consistently applies distributed synchronization when specific eligibility criteria are met.
  • Tests

    • Re-enabled a previously skipped test case covering multi-GPU inference with chunked prefill, strengthening quality assurance for distributed configurations.

coderabbitai Bot commented Jun 18, 2026

Contributor

Caution

Review failed

An error occurred during the review process. Please try again later.


@tensorrt-cicd tensorrt-cicd force-pushed the repair-bot-bug6276981 branch 3 times, most recently from 290fdfd to 9b45f57 Compare June 25, 2026 11:00
…n eligible

When the indexer chunked-prefill is gated by q_split_eligible (TP > 1, no
attention DP) but apply_q_split is False (chunk smaller than
q_split_threshold), every TP rank computes the full chunk's topk indices
independently via fp8_mqa_logits / fp8_fp4_mqa_logits. Those DeepGEMM
kernels are not bit-exact across launches, so per-rank topk indices
diverge for the same tokens. The downstream MLA attention then attends
to different KV positions on different ranks, corrupting KV-cache
writes. Short generations (MMLU's 2-token answers) hide it; long ones
(GSM8K's 256 tokens) compound it into garbage and 0% accuracy.

Force the q-split + allgather path whenever eligible: small chunks pay
a microscopic allgather instead of redundant per-rank logits compute,
and the per-token canonical owner from the slice + allgather erases
any rank-local nondeterminism before downstream layers read the
indices. q_split_threshold < 0 still fully disables eligibility.
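
To illustrate why slice + allgather removes the divergence, here is a toy single-process simulation. It is a sketch under stated assumptions, not the indexer code: `launch_dependent_topk` is a hypothetical stand-in for a top-k over logits from a kernel that is not bit-exact, with the nondeterminism modelled as a launch-dependent tie-break, and `all_gather` is a plain-Python stand-in for the collective.

```python
from typing import List


def launch_dependent_topk(scores: List[float], k: int, launch_id: int) -> List[int]:
    """Hypothetical stand-in for top-k over non-bit-exact logits:
    ties between equal scores are broken in a launch-dependent order."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: (-scores[i], (i - launch_id) % n))
    return order[:k]


def redundant_path(tokens: List[List[float]], k: int, tp_size: int) -> List[List[List[int]]]:
    # Every rank computes top-k for the full chunk on its own launch.
    return [[launch_dependent_topk(t, k, rank) for t in tokens] for rank in range(tp_size)]


def all_gather(per_rank_slices: List[List[List[int]]]) -> List[List[List[int]]]:
    # Concatenate slices in rank order and hand every rank the same result.
    gathered = [row for rank_slice in per_rank_slices for row in rank_slice]
    return [list(gathered) for _ in per_rank_slices]


def q_split_path(tokens: List[List[float]], k: int, tp_size: int) -> List[List[List[int]]]:
    # Each token has exactly one owning rank; only the owner computes its top-k.
    per_rank = -(-len(tokens) // tp_size)  # ceiling division
    slices = []
    for rank in range(tp_size):
        owned = tokens[rank * per_rank:(rank + 1) * per_rank]
        slices.append([launch_dependent_topk(t, k, rank) for t in owned])
    return all_gather(slices)


# Four tokens, each with three tied candidates among four KV positions.
tokens = [[0.9, 0.9, 0.9, 0.1] for _ in range(4)]

redundant = redundant_path(tokens, k=2, tp_size=2)
print(redundant[0] == redundant[1])  # False: ranks disagree on the indices

split = q_split_path(tokens, k=2, tp_size=2)
print(split[0] == split[1])          # True: every rank holds the same indices
```

Note that the gathered indices still depend on which rank owned each token; the point is only that all ranks end up reading the same values.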

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
@tensorrt-cicd tensorrt-cicd force-pushed the repair-bot-bug6276981 branch from 9b45f57 to 69844b1 Compare June 28, 2026 23:06