Fa4 support by blueswhen · Pull Request #1327 · ModelTC/LightLLM

blueswhen · 2026-06-02T02:47:49Z

No description provided.

gemini-code-assist

Code Review

This pull request introduces FlashAttention-4 (FA4) support and implements paged attention backends for FlashAttention-3 (FA3) and FlashInfer (FP and MLA variants) to support page sizes greater than 1. It also updates the memory management and request queue components to handle paged memory allocations. The review identified four kinds of issues: in-place modification of kv_starts in the FlashInfer states can corrupt the original cumulative sequence length tensors and should be avoided by cloning; calling .view() on sliced, non-contiguous tensors can raise a RuntimeError and should be replaced with .reshape(); hardcoded dimensions such as qk_rope_head_dim should be read from the configuration; and tensors that are allocated on the CPU and then moved to the GPU should instead be allocated directly on the target device.

gemini-code-assist · 2026-06-02T02:50:04Z

+        batch_size = self.infer_state.batch_size
+        device = self.infer_state.input_ids.device
+        q_starts = self.infer_state.b1_cu_q_seq_len.int()
+        kv_starts = self.infer_state.b1_cu_kv_seq_len.int()


Modifying kv_starts in-place with kv_starts[1:] = ... will corrupt the original self.infer_state.b1_cu_kv_seq_len tensor whenever that tensor is already int32, because .int() then returns the same tensor rather than a copy. Use .clone() to avoid side effects.

Suggested change

-        kv_starts = self.infer_state.b1_cu_kv_seq_len.int()
+        kv_starts = self.infer_state.b1_cu_kv_seq_len.int().clone()
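The aliasing behind this comment can be reproduced in isolation. The sketch below uses a made-up tensor as a stand-in for b1_cu_kv_seq_len (the values and variable names are assumptions, not taken from the PR) and shows that .int() on an int32 tensor hands back the same storage, so an in-place write reaches the original unless a clone is taken first.

```python
import torch

# Stand-in for infer_state.b1_cu_kv_seq_len; the values are illustrative only.
cu_kv_seq_len = torch.tensor([0, 3, 7, 12], dtype=torch.int32)

# .int() does not copy when the dtype is already int32: same storage.
aliased = cu_kv_seq_len.int()
shares_storage = aliased.data_ptr() == cu_kv_seq_len.data_ptr()

# An in-place write through the alias also changes the original tensor.
aliased[1:] = 0
original_after_alias_write = cu_kv_seq_len.tolist()

# With .clone(), the same write stays local to the private copy.
cu_kv_seq_len = torch.tensor([0, 3, 7, 12], dtype=torch.int32)
private = cu_kv_seq_len.int().clone()
private[1:] = 0
original_after_clone_write = cu_kv_seq_len.tolist()
```

If the source tensor were int64 instead, .int() would already produce a new tensor and the corruption would not occur, which is why this kind of bug can stay hidden until a dtype changes upstream.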

gemini-code-assist · 2026-06-02T02:50:04Z

+        else:
+            self.kv_indices = torch.empty(buffer_len, dtype=torch.int32, device=device)
+
+        self.kv_starts = self.infer_state.b1_cu_kv_seq_len.int()


Modifying self.kv_starts in-place will corrupt the original self.infer_state.b1_cu_kv_seq_len tensor whenever that tensor is already int32, since .int() then returns it without copying. Use .clone() to avoid side effects.

Suggested change

-        self.kv_starts = self.infer_state.b1_cu_kv_seq_len.int()
+        self.kv_starts = self.infer_state.b1_cu_kv_seq_len.int().clone()
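One possible alternative to .int().clone() is Tensor.to with copy=True, which always returns a fresh tensor and copies only once even when a dtype conversion is needed. The helper name private_int32 below is hypothetical and is not part of the PR; this is a minimal sketch of the idea rather than a recommendation for the actual code.

```python
import torch


def private_int32(t: torch.Tensor) -> torch.Tensor:
    """Return an int32 tensor that never shares storage with `t` (hypothetical helper)."""
    # copy=True forces a new tensor even when `t` is already int32.
    return t.to(torch.int32, copy=True)


already_int32 = torch.tensor([0, 3, 7, 12], dtype=torch.int32)
from_int64 = torch.tensor([0, 3, 7, 12], dtype=torch.int64)

a = private_int32(already_int32)
b = private_int32(from_int64)

# In-place writes on the results leave both sources untouched.
a[1:] = 0
b[1:] = 0
```

With .int().clone(), an int64 source is copied twice (once for the conversion, once for the clone); the single call above avoids the second copy.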

gemini-code-assist · 2026-06-02T02:50:04Z

+        model = self.backend.model
+        device = self.infer_state.input_ids.device
+        batch_size = self.infer_state.batch_size
+        self.kv_starts = self.infer_state.b1_cu_kv_seq_len


Modifying self.kv_starts in-place will corrupt the original self.infer_state.b1_cu_kv_seq_len tensor. Use .clone() to avoid side effects.

Suggested change

-        self.kv_starts = self.infer_state.b1_cu_kv_seq_len
+        self.kv_starts = self.infer_state.b1_cu_kv_seq_len.clone()

gemini-code-assist · 2026-06-02T02:50:04Z

+            k_cache=k[:, :, -qk_rope_head_dim:].view(-1, self.backend.page_size, 1, qk_rope_head_dim),
+            v_cache=k[:, :, :-qk_rope_head_dim].view(-1, self.backend.page_size, 1, kv_lora_rank),


Slicing k along the last dimension makes it non-contiguous. Calling .view() on a non-contiguous tensor raises a RuntimeError whenever the requested shape is incompatible with the tensor's strides. Use .reshape() instead, which returns a view when one is possible and falls back to a copy otherwise.

Suggested change

-            k_cache=k[:, :, -qk_rope_head_dim:].view(-1, self.backend.page_size, 1, qk_rope_head_dim),
-            v_cache=k[:, :, :-qk_rope_head_dim].view(-1, self.backend.page_size, 1, kv_lora_rank),
+            k_cache=k[:, :, -qk_rope_head_dim:].reshape(-1, self.backend.page_size, 1, qk_rope_head_dim),
+            v_cache=k[:, :, :-qk_rope_head_dim].reshape(-1, self.backend.page_size, 1, kv_lora_rank),
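Whether .view() raises depends on whether the requested shape is compatible with the tensor's strides, not on contiguity alone. The sketch below uses toy sizes and assumes k has a single KV head, i.e. shape (tokens, 1, kv_lora_rank + qk_rope_head_dim); that layout is an assumption made for illustration, not something stated in the PR.

```python
import torch

page_size, rope_dim, lora_rank = 2, 2, 4  # toy sizes, not the model's real ones
num_tokens = 4

k = torch.arange(num_tokens * (lora_rank + rope_dim), dtype=torch.float32)
k = k.reshape(num_tokens, 1, lora_rank + rope_dim)

# Slicing the last dimension leaves the strides of k in place: non-contiguous.
rope = k[:, :, -rope_dim:]
rope_is_contiguous = rope.is_contiguous()

# Splitting only the leading (token) dimension into pages is stride-compatible,
# so this particular view succeeds and still aliases k's memory.
paged = rope.view(-1, page_size, 1, rope_dim)
paged_shares_memory = paged.data_ptr() == rope.data_ptr()

# Merging the token dimension with the sliced dimension is not stride-compatible.
try:
    rope.view(-1)
    flat_view_failed = False
except RuntimeError:
    flat_view_failed = True

# reshape handles that case by materializing a copy.
flat = rope.reshape(-1)
flat_is_copy = flat.data_ptr() != rope.data_ptr()
```

The design trade-off is that .reshape() never fails here but may copy silently. That is harmless for read-only inputs apart from the extra memory traffic, whereas a kernel that is expected to write into the cache through the reshaped tensor would need a guaranteed view.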

gemini-code-assist · 2026-06-02T02:50:04Z

+            k[:, :, :-qk_rope_head_dim].view(-1, self.backend.page_size, 1, k.shape[-1] - qk_rope_head_dim),
+            k[:, :, -qk_rope_head_dim:].view(-1, self.backend.page_size, 1, qk_rope_head_dim),


Slicing k along the last dimension makes it non-contiguous. Calling .view() on a non-contiguous tensor raises a RuntimeError whenever the requested shape is incompatible with the tensor's strides. Use .reshape() instead, which returns a view when one is possible and falls back to a copy otherwise.

Suggested change

-            k[:, :, :-qk_rope_head_dim].view(-1, self.backend.page_size, 1, k.shape[-1] - qk_rope_head_dim),
-            k[:, :, -qk_rope_head_dim:].view(-1, self.backend.page_size, 1, qk_rope_head_dim),
+            k[:, :, :-qk_rope_head_dim].reshape(-1, self.backend.page_size, 1, k.shape[-1] - qk_rope_head_dim),
+            k[:, :, -qk_rope_head_dim:].reshape(-1, self.backend.page_size, 1, qk_rope_head_dim),

gemini-code-assist · 2026-06-02T02:50:04Z

+        self, q: Tuple[torch.Tensor, torch.Tensor], k, v, att_control: AttControl, alloc_func=torch.empty
+    ):
+        q_nope, q_rope = q
+        qk_rope_head_dim = 64


Avoid hardcoding qk_rope_head_dim = 64. Use self.backend.model.qk_rope_head_dim to dynamically retrieve the dimension from the model configuration.

Suggested change

-        qk_rope_head_dim = 64
+        qk_rope_head_dim = self.backend.model.qk_rope_head_dim
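To illustrate why a configured dimension is safer than a literal 64, the sketch below splits a compressed KV tensor using sizes taken from a config object and checks them against the tensor's actual last dimension. ToyMLAConfig and split_compressed_kv are hypothetical names invented for this example; the real LightLLM attribute layout may differ.

```python
from dataclasses import dataclass

import torch


@dataclass
class ToyMLAConfig:
    """Hypothetical stand-in for the model configuration."""
    kv_lora_rank: int
    qk_rope_head_dim: int


def split_compressed_kv(k: torch.Tensor, cfg: ToyMLAConfig):
    """Split k's last dimension into (nope, rope) parts using configured sizes."""
    expected = cfg.kv_lora_rank + cfg.qk_rope_head_dim
    if k.shape[-1] != expected:
        # A mismatch surfaces here instead of producing silently wrong slices.
        raise ValueError(f"last dim is {k.shape[-1]}, expected {expected}")
    return k[..., : cfg.kv_lora_rank], k[..., cfg.kv_lora_rank :]


cfg = ToyMLAConfig(kv_lora_rank=8, qk_rope_head_dim=4)
k = torch.zeros(6, 1, 12)
nope, rope = split_compressed_kv(k, cfg)
```

With a hardcoded 64, a model whose rope dimension differs would still slice without error and feed misaligned data to the kernel; the shape check turns that into an immediate failure.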

gemini-code-assist · 2026-06-02T02:50:04Z

+        )
+        assert v is None
+        q_nope, q_rope = q
+        qk_rope_head_dim = 64


Avoid hardcoding qk_rope_head_dim = 64. Use self.backend.qk_rope_head_dim to dynamically retrieve the dimension from the backend configuration.

Suggested change

-        qk_rope_head_dim = 64
+        qk_rope_head_dim = self.backend.qk_rope_head_dim

gemini-code-assist · 2026-06-02T02:50:04Z

+                torch.empty(shared_len, dtype=torch.int32).to(get_current_device_id()),
+                torch.empty(shared_len, dtype=torch.int32).to(get_current_device_id()),


Allocating a tensor on the CPU and then moving it to the GPU with .to() costs an extra host allocation and a host-to-device copy. Allocate the tensor directly on the target device using the device argument of torch.empty.

Suggested change

-                torch.empty(shared_len, dtype=torch.int32).to(get_current_device_id()),
-                torch.empty(shared_len, dtype=torch.int32).to(get_current_device_id()),
+                torch.empty(shared_len, dtype=torch.int32, device=get_current_device_id()),
+                torch.empty(shared_len, dtype=torch.int32, device=get_current_device_id()),
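The two allocation styles produce tensors with the same shape, dtype, and device; only the path taken to get there differs. The sketch below is a minimal comparison that falls back to the CPU when no GPU is present. The device selection logic and the value of shared_len are assumptions for illustration and do not reproduce get_current_device_id from the PR.

```python
import torch

# Pick the current CUDA device when available; otherwise stay on the CPU
# so the sketch still runs (the two styles are then trivially equivalent).
if torch.cuda.is_available():
    device = torch.device("cuda", torch.cuda.current_device())
else:
    device = torch.device("cpu")

shared_len = 1024  # illustrative size

# Two-step: allocate on the host, then transfer to the target device.
two_step = torch.empty(shared_len, dtype=torch.int32).to(device)

# Direct: allocate on the target device; no host buffer, no transfer.
direct = torch.empty(shared_len, dtype=torch.int32, device=device)
```

Because torch.empty leaves its contents uninitialized, the transfer in the two-step form copies data that carries no meaning, so the direct form loses nothing.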

gemini-code-assist · 2026-06-02T02:50:04Z

+                torch.empty(shared_len, dtype=torch.int32).to(get_current_device_id()),
+                torch.empty(shared_len, dtype=torch.int32).to(get_current_device_id()),


Allocating a tensor on the CPU and then moving it to the GPU with .to() costs an extra host allocation and a host-to-device copy. Allocate the tensor directly on the target device using the device argument of torch.empty.

Suggested change

-                torch.empty(shared_len, dtype=torch.int32).to(get_current_device_id()),
-                torch.empty(shared_len, dtype=torch.int32).to(get_current_device_id()),
+                torch.empty(shared_len, dtype=torch.int32, device=get_current_device_id()),
+                torch.empty(shared_len, dtype=torch.int32, device=get_current_device_id()),

feat: add flash attention 4

fd99194

gemini-code-assist Bot reviewed Jun 2, 2026

blueswhen force-pushed the fa4 branch from 1f5db06 to 8273755 on June 2, 2026 02:51

feat: page size > 1 support

db91dd0

blueswhen force-pushed the fa4 branch from 8273755 to db91dd0 on June 2, 2026 03:10

blueswhen wants to merge 2 commits into main from fa4
