Replace int with size_t for size/count/stride/index variables across csrc by jikunshang · Pull Request #291 · vllm-project/vllm-xpu-kernels

jikunshang · 2026-04-20T05:46:48Z

Replace int with size_t for size/count/stride/index variables in cache.cpp

Change local variables assigned from .size() calls, kernel class constructor parameters, kernel class member variables, and loop variables to use size_t instead of int for variables representing sizes, dimensions, counts, strides, indices, block sizes, and head sizes. This avoids signed/unsigned comparison bugs.

Replace int with size_t for size/dimension/count variables in layernorm kernels

Change variables representing sizes, dimensions, counts, and loop indices from int to size_t in csrc/layernorm.cpp and csrc/layernorm_quant.cpp.

This affects:

Kernel class constructor parameters and member variables
Local variables (hidden_size, num_tokens, num_dims, etc.)
Loop variables iterating over sizes/dimensions
Fix std::min calls with static_cast<size_t>(1024)
Fix std::gcd calls with static_cast<size_t>(...)

Preserved: template non-type params (int NUM_DIMS, int VEC_SIZE), int64_t variables, and inner j-loops bounded by VEC_SIZE.

Replace int with size_t for size/dimension/count variables in pos_encoding and fused_qknorm_rope kernels

pos_encoding_kernels.cpp: Changed positions_ndim, query_hidden_size, key_hidden_size, num_heads, num_kv_heads, rot_dim, seq_dim_idx, query_ndim, and kernel class member vars (head_size, num_heads, num_kv_heads, rot_dim) from int to size_t. Updated loop/index vars (embed_dim, nq, nk, head_idx, rot_offset, x_index, y_index) and fixed std::min<int64_t> to std::min<size_t>.
fused_qknorm_rope.cpp: Changed kernel class constructor params and member vars (num_heads_q, num_heads_k, num_heads_v, num_tokens, rotary_dim) from int to size_t. Updated all loop vars, index vars (total_qk_heads, token_idx, local_head_idx, head_idx, num_heads, offset_warp, offset_thread, embed_dim, rotary_lanes, pair_offset, dim_idx, half_dim), and launch function params. Changed static_cast to static_cast<size_t> at call site.

Preserved: int token_idx in rotary_embedding (used with int64_t strides), template non-type params (int head_dim), int64_t variables, and constexpr int constants (block_size, sgs_per_wg, SUB_GROUP_SIZE, NUM_ELEMS_PER_THREAD).

Replace int with size_t for size/dimension/count variables in moe kernels (partial)

Changed int → size_t for variables representing sizes, dimensions, counts, strides, and indices in 4 files:

fused_moe_prologue.cpp: experts_per_token
init_expert_map.cpp: num_experts, ep_rank, ep_size (params, members, launcher)
moe_gather.cpp: num_experts, num_tokens, hidden_size, topk, loop vars
remap_hidden_states.cpp: RowsPerExpertCount get_nd_range params (partial)

Note: This is a partial change. The following files still need updates:

topk.cpp
moe_align_sum_kernels.cpp
remap_hidden_states.cpp (remaining changes)
fused_grouped_topk.cpp
grouped_topk.cpp
fused_moe_prologue.hpp

Replace int with size_t for size/count/stride/index variables across csrc/ to fix signed/unsigned comparison risks

Agent-Logs-Url: https://github.com/jikunshang/vllm-xpu-kernels/sessions/77ac7b1b-2a91-4f87-9031-10fbf67b346f

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS ABOVE HAVE BEEN CONSIDERED.

Purpose

Test Plan

Test Result

(Optional) Documentation Update

BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing (anything written below this line will be removed by GitHub Actions)

…csrc/ (#6) * Replace int with size_t for size/count/stride/index variables in cache.cpp Change local variables assigned from .size() calls, kernel class constructor parameters, kernel class member variables, and loop variables to use size_t instead of int for variables representing sizes, dimensions, counts, strides, indices, block sizes, and head sizes. This avoids signed/unsigned comparison bugs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: jikunshang <50897707+jikunshang@users.noreply.github.com> * Replace int with size_t for size/dimension/count variables in layernorm kernels Change variables representing sizes, dimensions, counts, and loop indices from int to size_t in csrc/layernorm.cpp and csrc/layernorm_quant.cpp. This affects: - Kernel class constructor parameters and member variables - Local variables (hidden_size, num_tokens, num_dims, etc.) - Loop variables iterating over sizes/dimensions - Fix std::min calls with static_cast<size_t>(1024) - Fix std::gcd calls with static_cast<size_t>(...) Preserved: template non-type params (int NUM_DIMS, int VEC_SIZE), int64_t variables, and inner j-loops bounded by VEC_SIZE. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: jikunshang <50897707+jikunshang@users.noreply.github.com> * Replace int with size_t for size/dimension/count variables in pos_encoding and fused_qknorm_rope kernels - pos_encoding_kernels.cpp: Changed positions_ndim, query_hidden_size, key_hidden_size, num_heads, num_kv_heads, rot_dim, seq_dim_idx, query_ndim, and kernel class member vars (head_size, num_heads, num_kv_heads, rot_dim) from int to size_t. Updated loop/index vars (embed_dim, nq, nk, head_idx, rot_offset, x_index, y_index) and fixed std::min<int64_t> to std::min<size_t>. - fused_qknorm_rope.cpp: Changed kernel class constructor params and member vars (num_heads_q, num_heads_k, num_heads_v, num_tokens, rotary_dim) from int to size_t. Updated all loop vars, index vars (total_qk_heads, token_idx, local_head_idx, head_idx, num_heads, offset_warp, offset_thread, embed_dim, rotary_lanes, pair_offset, dim_idx, half_dim), and launch function params. Changed static_cast<int> to static_cast<size_t> at call site. Preserved: int token_idx in rotary_embedding (used with int64_t strides), template non-type params (int head_dim), int64_t variables, and constexpr int constants (block_size, sgs_per_wg, SUB_GROUP_SIZE, NUM_ELEMS_PER_THREAD). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: jikunshang <50897707+jikunshang@users.noreply.github.com> * Replace int with size_t for size/dimension/count variables in moe kernels (partial) Changed int → size_t for variables representing sizes, dimensions, counts, strides, and indices in 4 files: - fused_moe_prologue.cpp: experts_per_token - init_expert_map.cpp: num_experts, ep_rank, ep_size (params, members, launcher) - moe_gather.cpp: num_experts, num_tokens, hidden_size, topk, loop vars - remap_hidden_states.cpp: RowsPerExpertCount get_nd_range params (partial) Note: This is a partial change. The following files still need updates: - topk.cpp - moe_align_sum_kernels.cpp - remap_hidden_states.cpp (remaining changes) - fused_grouped_topk.cpp - grouped_topk.cpp - fused_moe_prologue.hpp Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: jikunshang <50897707+jikunshang@users.noreply.github.com> * Replace int with size_t for size/count/stride/index variables across csrc/ to fix signed/unsigned comparison risks Agent-Logs-Url: https://github.com/jikunshang/vllm-xpu-kernels/sessions/77ac7b1b-2a91-4f87-9031-10fbf67b346f Co-authored-by: jikunshang <50897707+jikunshang@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: jikunshang <50897707+jikunshang@users.noreply.github.com>

Copilot

Pull request overview

This PR updates XPU/SYCL kernel code in csrc/ to use size_t (instead of int) for variables representing sizes, counts, strides, and indices, aiming to reduce signed/unsigned comparison issues.

Changes:

Converted many kernel parameters/member variables and loop indices from int to size_t across cache, layernorm, rotary/pos-encoding, activation, and quantization code.
Updated various launch-time calculations (e.g., std::min(...) and std::gcd(...)) to use size_t-consistent types.
Partially migrated MoE-related kernels to size_t for size/count/index variables.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
csrc/quantization/fp8/quant_utils.h	Updated vector conversion helper signatures/loops to `size_t`.
csrc/quantization/fp8/fp8_quant.h	Switched some per-row scan/reduction loop counters to `size_t`.
csrc/quantization/fp8/fp8_quant.cpp	Updated hidden/group sizing and launch computations to `size_t`.
csrc/quantization/fp4/mxfp4_quant.h	Updated MXFP4 kernel size/count fields to `size_t`.
csrc/quantization/fp4/mxfp4_quant.cpp	Updated group/thread/block sizing and stride handling to `size_t`.
csrc/pos_encoding_kernels.cpp	Converted rotary embedding sizes/indices to `size_t` and adjusted launch bounds.
csrc/moe/topk.cpp	Converted expert/topk sizing/indexing to `size_t` in softmax/sigmoid/topk paths.
csrc/moe/remap_hidden_states.cpp	Partially converted remap sizing parameters to `size_t`.
csrc/moe/moe_gather.cpp	Converted token/expert/hidden/topk sizing and indexing to `size_t`.
csrc/moe/init_expert_map.cpp	Converted expert-parallel sizing and nd_range calculation to `size_t`.
csrc/moe/fused_moe_prologue.cpp	Converted `experts_per_token` to `size_t`.
csrc/layernorm_quant.cpp	Converted hidden/group/token sizing and launch math to `size_t`.
csrc/layernorm.cpp	Converted hidden/token/dim sizing and vectorization-related indices to `size_t`.
csrc/fused_qknorm_rope.cpp	Converted head/token/rotary sizing and indexing to `size_t`.
csrc/cache.cpp	Converted cache reshape/concat sizing and some stride/index variables to `size_t`.
csrc/activation.cpp	Converted activation dimension `d` and vector loop indices to `size_t`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-20T05:54:24Z

+      const size_t num_rows,
+      const size_t TopK)
      : expert_map(expert_map),
        expert_first_token_offset(expert_first_token_offset),
        topk_ids(topk_ids),


RowsPerExpertCount now takes size_t num_rows/size_t TopK, but the class still stores these as const int (see member declarations below) and get_nd_range still uses int parameters. This creates narrowing/truncation risk and inconsistent types. Make the stored members and get_nd_range parameters size_t as well (or revert the ctor params to int consistently).

Copilot · 2026-04-20T05:54:24Z

+  const size_t num_groups = input.numel() / group_size;

  TORCH_CHECK(input.numel() % group_size == 0);


num_groups is computed as size_t from input.numel() / group_size but group_size is an int64_t parameter and isn’t validated to be > 0. A negative or zero group_size would lead to division-by-zero or a negative result being converted to a huge size_t. Add a TORCH_CHECK(group_size > 0) (and ideally cast group_size to size_t only after validation).

Copilot · 2026-04-20T05:54:24Z

    const bool renormalize,
    std::optional<torch::Tensor> bias) {
-  const int num_experts = gating_output.size(-1);
+  const size_t num_experts = gating_output.size(-1);


num_tokens is computed as gating_output.numel() / num_experts before checking num_experts != 0, so an empty num_experts dimension would trigger division-by-zero. Guard num_experts == 0 (and/or gating_output.numel() == 0) with an early return or TORCH_CHECK before this division.

Suggested change

const size_t num_experts = gating_output.size(-1);

const size_t num_experts = gating_output.size(-1);

TORCH_CHECK(

num_experts != 0,

"topk_softmax requires gating_output.size(-1) > 0, but got 0 experts");

Copilot · 2026-04-20T05:54:25Z

    const bool renormalize,
    std::optional<torch::Tensor> bias) {
-  const int num_experts = gating_output.size(-1);
+  const size_t num_experts = gating_output.size(-1);


num_tokens is computed as gating_output.numel() / num_experts before checking num_experts != 0, so an empty num_experts dimension would trigger division-by-zero. Guard num_experts == 0 (and/or gating_output.numel() == 0) with an early return or TORCH_CHECK before this division.

Suggested change

const size_t num_experts = gating_output.size(-1);

const size_t num_experts = gating_output.size(-1);

TORCH_CHECK(

num_experts != 0,

"topk_sigmoid expects gating_output.size(-1) (num_experts) to be non-zero");

Copilot · 2026-04-20T05:54:25Z

+    const size_t head_idx = i / embed_dim;
    const int64_t token_head =
        token_idx * query_stride + head_idx * head_stride;
-    const int rot_offset = i % embed_dim;
+    const size_t rot_offset = i % embed_dim;


token_head is computed from head_idx (size_t) and head_stride (int64_t), which forces unsigned arithmetic and can yield incorrect offsets (and compiler warnings) due to signed/unsigned conversion. Cast head_idx (and/or embed_dim) to int64_t before multiplying by head_stride/strides so the offset math stays in the signed stride domain.

Copilot · 2026-04-20T05:54:25Z

+  size_t query_hidden_size = query.numel() / num_tokens;
+  size_t key_hidden_size = key.has_value() ? key->numel() / num_tokens : 0;


query_hidden_size/key_hidden_size divide by num_tokens, but num_tokens = positions.numel() can be 0 for empty inputs, leading to division-by-zero before any kernel launch. Add an early return (or TORCH_CHECK) when num_tokens == 0 before these computations.

Copilot · 2026-04-20T05:54:25Z

+  size_t key_stride = key.stride(0);
+  size_t value_stride = value.stride(0);


key.stride(0)/value.stride(0) are int64_t in PyTorch; storing them in size_t can silently wrap negative strides (or other unexpected stride values) into huge positives and produce out-of-bounds indexing. Keep strides as int64_t here (and in the kernel) or explicitly validate stride >= 0 before converting.

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

Copilot AI review requested due to automatic review settings April 20, 2026 05:46

Copilot started reviewing on behalf of jikunshang April 20, 2026 05:47 View session

Copilot AI reviewed Apr 20, 2026

View reviewed changes

jikunshang changed the title ~~Replace int with size_t for size/count/stride/index variables across…csrc/~~ Replace int with size_t for size/count/stride/index variables across csrc Apr 20, 2026

jikunshang added 2 commits April 21, 2026 19:39

fix

d2a704b

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

pre commit

8f2e1db

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace int with size_t for size/count/stride/index variables across csrc #291

Replace int with size_t for size/count/stride/index variables across csrc #291
jikunshang wants to merge 3 commits into
vllm-project:mainfrom
jikunshang:main

jikunshang commented Apr 20, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 20, 2026

Uh oh!

Copilot AI Apr 20, 2026

Uh oh!

Copilot AI Apr 20, 2026

Uh oh!

Copilot AI Apr 20, 2026

Uh oh!

Copilot AI Apr 20, 2026

Uh oh!

Copilot AI Apr 20, 2026

Uh oh!

Copilot AI Apr 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		const size_t num_groups = input.numel() / group_size;

		TORCH_CHECK(input.numel() % group_size == 0);

		size_t query_hidden_size = query.numel() / num_tokens;
		size_t key_hidden_size = key.has_value() ? key->numel() / num_tokens : 0;

		size_t key_stride = key.stride(0);
		size_t value_stride = value.stride(0);

Conversation

jikunshang commented Apr 20, 2026

Purpose

Test Plan

Test Result

(Optional) Documentation Update

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants