Skip to content

Replace int with size_t for size/count/stride/index variables across csrc #291

Open
jikunshang wants to merge 3 commits into
vllm-project:mainfrom
jikunshang:main
Open

Replace int with size_t for size/count/stride/index variables across csrc #291
jikunshang wants to merge 3 commits into
vllm-project:mainfrom
jikunshang:main

Conversation

@jikunshang
Copy link
Copy Markdown
Member

  • Replace int with size_t for size/count/stride/index variables in cache.cpp

Change local variables assigned from .size() calls, kernel class constructor parameters, kernel class member variables, and loop variables to use size_t instead of int for variables representing sizes, dimensions, counts, strides, indices, block sizes, and head sizes. This avoids signed/unsigned comparison bugs.

  • Replace int with size_t for size/dimension/count variables in layernorm kernels

Change variables representing sizes, dimensions, counts, and loop indices from int to size_t in csrc/layernorm.cpp and csrc/layernorm_quant.cpp.

This affects:

  • Kernel class constructor parameters and member variables
  • Local variables (hidden_size, num_tokens, num_dims, etc.)
  • Loop variables iterating over sizes/dimensions
  • Fix std::min calls with static_cast<size_t>(1024)
  • Fix std::gcd calls with static_cast<size_t>(...)

Preserved: template non-type params (int NUM_DIMS, int VEC_SIZE), int64_t variables, and inner j-loops bounded by VEC_SIZE.

  • Replace int with size_t for size/dimension/count variables in pos_encoding and fused_qknorm_rope kernels
  • pos_encoding_kernels.cpp: Changed positions_ndim, query_hidden_size, key_hidden_size, num_heads, num_kv_heads, rot_dim, seq_dim_idx, query_ndim, and kernel class member vars (head_size, num_heads, num_kv_heads, rot_dim) from int to size_t. Updated loop/index vars (embed_dim, nq, nk, head_idx, rot_offset, x_index, y_index) and fixed std::min<int64_t> to std::min<size_t>.

  • fused_qknorm_rope.cpp: Changed kernel class constructor params and member vars (num_heads_q, num_heads_k, num_heads_v, num_tokens, rotary_dim) from int to size_t. Updated all loop vars, index vars (total_qk_heads, token_idx, local_head_idx, head_idx, num_heads, offset_warp, offset_thread, embed_dim, rotary_lanes, pair_offset, dim_idx, half_dim), and launch function params. Changed static_cast to static_cast<size_t> at call site.

Preserved: int token_idx in rotary_embedding (used with int64_t strides), template non-type params (int head_dim), int64_t variables, and constexpr int constants (block_size, sgs_per_wg, SUB_GROUP_SIZE, NUM_ELEMS_PER_THREAD).

  • Replace int with size_t for size/dimension/count variables in moe kernels (partial)

Changed int → size_t for variables representing sizes, dimensions, counts, strides, and indices in 4 files:

  • fused_moe_prologue.cpp: experts_per_token
  • init_expert_map.cpp: num_experts, ep_rank, ep_size (params, members, launcher)
  • moe_gather.cpp: num_experts, num_tokens, hidden_size, topk, loop vars
  • remap_hidden_states.cpp: RowsPerExpertCount get_nd_range params (partial)

Note: This is a partial change. The following files still need updates:

  • topk.cpp
  • moe_align_sum_kernels.cpp
  • remap_hidden_states.cpp (remaining changes)
  • fused_grouped_topk.cpp
  • grouped_topk.cpp
  • fused_moe_prologue.hpp
  • Replace int with size_t for size/count/stride/index variables across csrc/ to fix signed/unsigned comparison risks

Agent-Logs-Url: https://github.com/jikunshang/vllm-xpu-kernels/sessions/77ac7b1b-2a91-4f87-9031-10fbf67b346f


PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS ABOVE HAVE BEEN CONSIDERED.

Purpose

Test Plan

Test Result

(Optional) Documentation Update

BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing (anything written below this line will be removed by GitHub Actions)

…csrc/ (#6)

* Replace int with size_t for size/count/stride/index variables in cache.cpp

Change local variables assigned from .size() calls, kernel class constructor
parameters, kernel class member variables, and loop variables to use size_t
instead of int for variables representing sizes, dimensions, counts, strides,
indices, block sizes, and head sizes. This avoids signed/unsigned comparison
bugs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: jikunshang <50897707+jikunshang@users.noreply.github.com>

* Replace int with size_t for size/dimension/count variables in layernorm kernels

Change variables representing sizes, dimensions, counts, and loop indices
from int to size_t in csrc/layernorm.cpp and csrc/layernorm_quant.cpp.

This affects:
- Kernel class constructor parameters and member variables
- Local variables (hidden_size, num_tokens, num_dims, etc.)
- Loop variables iterating over sizes/dimensions
- Fix std::min calls with static_cast<size_t>(1024)
- Fix std::gcd calls with static_cast<size_t>(...)

Preserved: template non-type params (int NUM_DIMS, int VEC_SIZE),
int64_t variables, and inner j-loops bounded by VEC_SIZE.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: jikunshang <50897707+jikunshang@users.noreply.github.com>

* Replace int with size_t for size/dimension/count variables in pos_encoding and fused_qknorm_rope kernels

- pos_encoding_kernels.cpp: Changed positions_ndim, query_hidden_size,
  key_hidden_size, num_heads, num_kv_heads, rot_dim, seq_dim_idx,
  query_ndim, and kernel class member vars (head_size, num_heads,
  num_kv_heads, rot_dim) from int to size_t. Updated loop/index vars
  (embed_dim, nq, nk, head_idx, rot_offset, x_index, y_index) and
  fixed std::min<int64_t> to std::min<size_t>.

- fused_qknorm_rope.cpp: Changed kernel class constructor params and
  member vars (num_heads_q, num_heads_k, num_heads_v, num_tokens,
  rotary_dim) from int to size_t. Updated all loop vars, index vars
  (total_qk_heads, token_idx, local_head_idx, head_idx, num_heads,
  offset_warp, offset_thread, embed_dim, rotary_lanes, pair_offset,
  dim_idx, half_dim), and launch function params. Changed static_cast<int>
  to static_cast<size_t> at call site.

Preserved: int token_idx in rotary_embedding (used with int64_t strides),
template non-type params (int head_dim), int64_t variables, and
constexpr int constants (block_size, sgs_per_wg, SUB_GROUP_SIZE,
NUM_ELEMS_PER_THREAD).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: jikunshang <50897707+jikunshang@users.noreply.github.com>

* Replace int with size_t for size/dimension/count variables in moe kernels (partial)

Changed int → size_t for variables representing sizes, dimensions, counts,
strides, and indices in 4 files:

- fused_moe_prologue.cpp: experts_per_token
- init_expert_map.cpp: num_experts, ep_rank, ep_size (params, members, launcher)
- moe_gather.cpp: num_experts, num_tokens, hidden_size, topk, loop vars
- remap_hidden_states.cpp: RowsPerExpertCount get_nd_range params (partial)

Note: This is a partial change. The following files still need updates:
- topk.cpp
- moe_align_sum_kernels.cpp
- remap_hidden_states.cpp (remaining changes)
- fused_grouped_topk.cpp
- grouped_topk.cpp
- fused_moe_prologue.hpp

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: jikunshang <50897707+jikunshang@users.noreply.github.com>

* Replace int with size_t for size/count/stride/index variables across csrc/ to fix signed/unsigned comparison risks

Agent-Logs-Url: https://github.com/jikunshang/vllm-xpu-kernels/sessions/77ac7b1b-2a91-4f87-9031-10fbf67b346f

Co-authored-by: jikunshang <50897707+jikunshang@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: jikunshang <50897707+jikunshang@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 20, 2026 05:46
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates XPU/SYCL kernel code in csrc/ to use size_t (instead of int) for variables representing sizes, counts, strides, and indices, aiming to reduce signed/unsigned comparison issues.

Changes:

  • Converted many kernel parameters/member variables and loop indices from int to size_t across cache, layernorm, rotary/pos-encoding, activation, and quantization code.
  • Updated various launch-time calculations (e.g., std::min(...) and std::gcd(...)) to use size_t-consistent types.
  • Partially migrated MoE-related kernels to size_t for size/count/index variables.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
csrc/quantization/fp8/quant_utils.h Updated vector conversion helper signatures/loops to size_t.
csrc/quantization/fp8/fp8_quant.h Switched some per-row scan/reduction loop counters to size_t.
csrc/quantization/fp8/fp8_quant.cpp Updated hidden/group sizing and launch computations to size_t.
csrc/quantization/fp4/mxfp4_quant.h Updated MXFP4 kernel size/count fields to size_t.
csrc/quantization/fp4/mxfp4_quant.cpp Updated group/thread/block sizing and stride handling to size_t.
csrc/pos_encoding_kernels.cpp Converted rotary embedding sizes/indices to size_t and adjusted launch bounds.
csrc/moe/topk.cpp Converted expert/topk sizing/indexing to size_t in softmax/sigmoid/topk paths.
csrc/moe/remap_hidden_states.cpp Partially converted remap sizing parameters to size_t.
csrc/moe/moe_gather.cpp Converted token/expert/hidden/topk sizing and indexing to size_t.
csrc/moe/init_expert_map.cpp Converted expert-parallel sizing and nd_range calculation to size_t.
csrc/moe/fused_moe_prologue.cpp Converted experts_per_token to size_t.
csrc/layernorm_quant.cpp Converted hidden/group/token sizing and launch math to size_t.
csrc/layernorm.cpp Converted hidden/token/dim sizing and vectorization-related indices to size_t.
csrc/fused_qknorm_rope.cpp Converted head/token/rotary sizing and indexing to size_t.
csrc/cache.cpp Converted cache reshape/concat sizing and some stride/index variables to size_t.
csrc/activation.cpp Converted activation dimension d and vector loop indices to size_t.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +16 to 20
const size_t num_rows,
const size_t TopK)
: expert_map(expert_map),
expert_first_token_offset(expert_first_token_offset),
topk_ids(topk_ids),
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RowsPerExpertCount now takes size_t num_rows/size_t TopK, but the class still stores these as const int (see member declarations below) and get_nd_range still uses int parameters. This creates narrowing/truncation risk and inconsistent types. Make the stored members and get_nd_range parameters size_t as well (or revert the ctor params to int consistently).

Copilot uses AI. Check for mistakes.
Comment on lines +612 to 614
const size_t num_groups = input.numel() / group_size;

TORCH_CHECK(input.numel() % group_size == 0);
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

num_groups is computed as size_t from input.numel() / group_size but group_size is an int64_t parameter and isn’t validated to be > 0. A negative or zero group_size would lead to division-by-zero or a negative result being converted to a huge size_t. Add a TORCH_CHECK(group_size > 0) (and ideally cast group_size to size_t only after validation).

Copilot uses AI. Check for mistakes.
Comment thread csrc/moe/topk.cpp
const bool renormalize,
std::optional<torch::Tensor> bias) {
const int num_experts = gating_output.size(-1);
const size_t num_experts = gating_output.size(-1);
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

num_tokens is computed as gating_output.numel() / num_experts before checking num_experts != 0, so an empty num_experts dimension would trigger division-by-zero. Guard num_experts == 0 (and/or gating_output.numel() == 0) with an early return or TORCH_CHECK before this division.

Suggested change
const size_t num_experts = gating_output.size(-1);
const size_t num_experts = gating_output.size(-1);
TORCH_CHECK(
num_experts != 0,
"topk_softmax requires gating_output.size(-1) > 0, but got 0 experts");

Copilot uses AI. Check for mistakes.
Comment thread csrc/moe/topk.cpp
const bool renormalize,
std::optional<torch::Tensor> bias) {
const int num_experts = gating_output.size(-1);
const size_t num_experts = gating_output.size(-1);
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

num_tokens is computed as gating_output.numel() / num_experts before checking num_experts != 0, so an empty num_experts dimension would trigger division-by-zero. Guard num_experts == 0 (and/or gating_output.numel() == 0) with an early return or TORCH_CHECK before this division.

Suggested change
const size_t num_experts = gating_output.size(-1);
const size_t num_experts = gating_output.size(-1);
TORCH_CHECK(
num_experts != 0,
"topk_sigmoid expects gating_output.size(-1) (num_experts) to be non-zero");

Copilot uses AI. Check for mistakes.
Comment on lines +64 to +67
const size_t head_idx = i / embed_dim;
const int64_t token_head =
token_idx * query_stride + head_idx * head_stride;
const int rot_offset = i % embed_dim;
const size_t rot_offset = i % embed_dim;
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

token_head is computed from head_idx (size_t) and head_stride (int64_t), which forces unsigned arithmetic and can yield incorrect offsets (and compiler warnings) due to signed/unsigned conversion. Cast head_idx (and/or embed_dim) to int64_t before multiplying by head_stride/strides so the offset math stays in the signed stride domain.

Copilot uses AI. Check for mistakes.
Comment on lines +199 to +200
size_t query_hidden_size = query.numel() / num_tokens;
size_t key_hidden_size = key.has_value() ? key->numel() / num_tokens : 0;
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

query_hidden_size/key_hidden_size divide by num_tokens, but num_tokens = positions.numel() can be 0 for empty inputs, leading to division-by-zero before any kernel launch. Add an early return (or TORCH_CHECK) when num_tokens == 0 before these computations.

Copilot uses AI. Check for mistakes.
Comment thread csrc/cache.cpp
Comment on lines +748 to +749
size_t key_stride = key.stride(0);
size_t value_stride = value.stride(0);
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

key.stride(0)/value.stride(0) are int64_t in PyTorch; storing them in size_t can silently wrap negative strides (or other unexpected stride values) into huge positives and produce out-of-bounds indexing. Keep strides as int64_t here (and in the kernel) or explicitly validate stride >= 0 before converting.

Copilot uses AI. Check for mistakes.
@jikunshang jikunshang changed the title Replace int with size_t for size/count/stride/index variables across…csrc/ Replace int with size_t for size/count/stride/index variables across csrc Apr 20, 2026
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants