Replace int with size_t for size/count/stride/index variables across csrc #291
Replace int with size_t for size/count/stride/index variables across csrc #291jikunshang wants to merge 3 commits into
Conversation
…csrc/ (#6) * Replace int with size_t for size/count/stride/index variables in cache.cpp Change local variables assigned from .size() calls, kernel class constructor parameters, kernel class member variables, and loop variables to use size_t instead of int for variables representing sizes, dimensions, counts, strides, indices, block sizes, and head sizes. This avoids signed/unsigned comparison bugs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: jikunshang <50897707+jikunshang@users.noreply.github.com> * Replace int with size_t for size/dimension/count variables in layernorm kernels Change variables representing sizes, dimensions, counts, and loop indices from int to size_t in csrc/layernorm.cpp and csrc/layernorm_quant.cpp. This affects: - Kernel class constructor parameters and member variables - Local variables (hidden_size, num_tokens, num_dims, etc.) - Loop variables iterating over sizes/dimensions - Fix std::min calls with static_cast<size_t>(1024) - Fix std::gcd calls with static_cast<size_t>(...) Preserved: template non-type params (int NUM_DIMS, int VEC_SIZE), int64_t variables, and inner j-loops bounded by VEC_SIZE. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: jikunshang <50897707+jikunshang@users.noreply.github.com> * Replace int with size_t for size/dimension/count variables in pos_encoding and fused_qknorm_rope kernels - pos_encoding_kernels.cpp: Changed positions_ndim, query_hidden_size, key_hidden_size, num_heads, num_kv_heads, rot_dim, seq_dim_idx, query_ndim, and kernel class member vars (head_size, num_heads, num_kv_heads, rot_dim) from int to size_t. Updated loop/index vars (embed_dim, nq, nk, head_idx, rot_offset, x_index, y_index) and fixed std::min<int64_t> to std::min<size_t>. - fused_qknorm_rope.cpp: Changed kernel class constructor params and member vars (num_heads_q, num_heads_k, num_heads_v, num_tokens, rotary_dim) from int to size_t. Updated all loop vars, index vars (total_qk_heads, token_idx, local_head_idx, head_idx, num_heads, offset_warp, offset_thread, embed_dim, rotary_lanes, pair_offset, dim_idx, half_dim), and launch function params. Changed static_cast<int> to static_cast<size_t> at call site. Preserved: int token_idx in rotary_embedding (used with int64_t strides), template non-type params (int head_dim), int64_t variables, and constexpr int constants (block_size, sgs_per_wg, SUB_GROUP_SIZE, NUM_ELEMS_PER_THREAD). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: jikunshang <50897707+jikunshang@users.noreply.github.com> * Replace int with size_t for size/dimension/count variables in moe kernels (partial) Changed int → size_t for variables representing sizes, dimensions, counts, strides, and indices in 4 files: - fused_moe_prologue.cpp: experts_per_token - init_expert_map.cpp: num_experts, ep_rank, ep_size (params, members, launcher) - moe_gather.cpp: num_experts, num_tokens, hidden_size, topk, loop vars - remap_hidden_states.cpp: RowsPerExpertCount get_nd_range params (partial) Note: This is a partial change. The following files still need updates: - topk.cpp - moe_align_sum_kernels.cpp - remap_hidden_states.cpp (remaining changes) - fused_grouped_topk.cpp - grouped_topk.cpp - fused_moe_prologue.hpp Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: jikunshang <50897707+jikunshang@users.noreply.github.com> * Replace int with size_t for size/count/stride/index variables across csrc/ to fix signed/unsigned comparison risks Agent-Logs-Url: https://github.com/jikunshang/vllm-xpu-kernels/sessions/77ac7b1b-2a91-4f87-9031-10fbf67b346f Co-authored-by: jikunshang <50897707+jikunshang@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: jikunshang <50897707+jikunshang@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR updates XPU/SYCL kernel code in csrc/ to use size_t (instead of int) for variables representing sizes, counts, strides, and indices, aiming to reduce signed/unsigned comparison issues.
Changes:
- Converted many kernel parameters/member variables and loop indices from
inttosize_tacross cache, layernorm, rotary/pos-encoding, activation, and quantization code. - Updated various launch-time calculations (e.g.,
std::min(...)andstd::gcd(...)) to usesize_t-consistent types. - Partially migrated MoE-related kernels to
size_tfor size/count/index variables.
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| csrc/quantization/fp8/quant_utils.h | Updated vector conversion helper signatures/loops to size_t. |
| csrc/quantization/fp8/fp8_quant.h | Switched some per-row scan/reduction loop counters to size_t. |
| csrc/quantization/fp8/fp8_quant.cpp | Updated hidden/group sizing and launch computations to size_t. |
| csrc/quantization/fp4/mxfp4_quant.h | Updated MXFP4 kernel size/count fields to size_t. |
| csrc/quantization/fp4/mxfp4_quant.cpp | Updated group/thread/block sizing and stride handling to size_t. |
| csrc/pos_encoding_kernels.cpp | Converted rotary embedding sizes/indices to size_t and adjusted launch bounds. |
| csrc/moe/topk.cpp | Converted expert/topk sizing/indexing to size_t in softmax/sigmoid/topk paths. |
| csrc/moe/remap_hidden_states.cpp | Partially converted remap sizing parameters to size_t. |
| csrc/moe/moe_gather.cpp | Converted token/expert/hidden/topk sizing and indexing to size_t. |
| csrc/moe/init_expert_map.cpp | Converted expert-parallel sizing and nd_range calculation to size_t. |
| csrc/moe/fused_moe_prologue.cpp | Converted experts_per_token to size_t. |
| csrc/layernorm_quant.cpp | Converted hidden/group/token sizing and launch math to size_t. |
| csrc/layernorm.cpp | Converted hidden/token/dim sizing and vectorization-related indices to size_t. |
| csrc/fused_qknorm_rope.cpp | Converted head/token/rotary sizing and indexing to size_t. |
| csrc/cache.cpp | Converted cache reshape/concat sizing and some stride/index variables to size_t. |
| csrc/activation.cpp | Converted activation dimension d and vector loop indices to size_t. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| const size_t num_rows, | ||
| const size_t TopK) | ||
| : expert_map(expert_map), | ||
| expert_first_token_offset(expert_first_token_offset), | ||
| topk_ids(topk_ids), |
There was a problem hiding this comment.
RowsPerExpertCount now takes size_t num_rows/size_t TopK, but the class still stores these as const int (see member declarations below) and get_nd_range still uses int parameters. This creates narrowing/truncation risk and inconsistent types. Make the stored members and get_nd_range parameters size_t as well (or revert the ctor params to int consistently).
| const size_t num_groups = input.numel() / group_size; | ||
|
|
||
| TORCH_CHECK(input.numel() % group_size == 0); |
There was a problem hiding this comment.
num_groups is computed as size_t from input.numel() / group_size but group_size is an int64_t parameter and isn’t validated to be > 0. A negative or zero group_size would lead to division-by-zero or a negative result being converted to a huge size_t. Add a TORCH_CHECK(group_size > 0) (and ideally cast group_size to size_t only after validation).
| const bool renormalize, | ||
| std::optional<torch::Tensor> bias) { | ||
| const int num_experts = gating_output.size(-1); | ||
| const size_t num_experts = gating_output.size(-1); |
There was a problem hiding this comment.
num_tokens is computed as gating_output.numel() / num_experts before checking num_experts != 0, so an empty num_experts dimension would trigger division-by-zero. Guard num_experts == 0 (and/or gating_output.numel() == 0) with an early return or TORCH_CHECK before this division.
| const size_t num_experts = gating_output.size(-1); | |
| const size_t num_experts = gating_output.size(-1); | |
| TORCH_CHECK( | |
| num_experts != 0, | |
| "topk_softmax requires gating_output.size(-1) > 0, but got 0 experts"); |
| const bool renormalize, | ||
| std::optional<torch::Tensor> bias) { | ||
| const int num_experts = gating_output.size(-1); | ||
| const size_t num_experts = gating_output.size(-1); |
There was a problem hiding this comment.
num_tokens is computed as gating_output.numel() / num_experts before checking num_experts != 0, so an empty num_experts dimension would trigger division-by-zero. Guard num_experts == 0 (and/or gating_output.numel() == 0) with an early return or TORCH_CHECK before this division.
| const size_t num_experts = gating_output.size(-1); | |
| const size_t num_experts = gating_output.size(-1); | |
| TORCH_CHECK( | |
| num_experts != 0, | |
| "topk_sigmoid expects gating_output.size(-1) (num_experts) to be non-zero"); |
| const size_t head_idx = i / embed_dim; | ||
| const int64_t token_head = | ||
| token_idx * query_stride + head_idx * head_stride; | ||
| const int rot_offset = i % embed_dim; | ||
| const size_t rot_offset = i % embed_dim; |
There was a problem hiding this comment.
token_head is computed from head_idx (size_t) and head_stride (int64_t), which forces unsigned arithmetic and can yield incorrect offsets (and compiler warnings) due to signed/unsigned conversion. Cast head_idx (and/or embed_dim) to int64_t before multiplying by head_stride/strides so the offset math stays in the signed stride domain.
| size_t query_hidden_size = query.numel() / num_tokens; | ||
| size_t key_hidden_size = key.has_value() ? key->numel() / num_tokens : 0; |
There was a problem hiding this comment.
query_hidden_size/key_hidden_size divide by num_tokens, but num_tokens = positions.numel() can be 0 for empty inputs, leading to division-by-zero before any kernel launch. Add an early return (or TORCH_CHECK) when num_tokens == 0 before these computations.
| size_t key_stride = key.stride(0); | ||
| size_t value_stride = value.stride(0); |
There was a problem hiding this comment.
key.stride(0)/value.stride(0) are int64_t in PyTorch; storing them in size_t can silently wrap negative strides (or other unexpected stride values) into huge positives and produce out-of-bounds indexing. Keep strides as int64_t here (and in the kernel) or explicitly validate stride >= 0 before converting.
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Change local variables assigned from .size() calls, kernel class constructor parameters, kernel class member variables, and loop variables to use size_t instead of int for variables representing sizes, dimensions, counts, strides, indices, block sizes, and head sizes. This avoids signed/unsigned comparison bugs.
Change variables representing sizes, dimensions, counts, and loop indices from int to size_t in csrc/layernorm.cpp and csrc/layernorm_quant.cpp.
This affects:
Preserved: template non-type params (int NUM_DIMS, int VEC_SIZE), int64_t variables, and inner j-loops bounded by VEC_SIZE.
pos_encoding_kernels.cpp: Changed positions_ndim, query_hidden_size, key_hidden_size, num_heads, num_kv_heads, rot_dim, seq_dim_idx, query_ndim, and kernel class member vars (head_size, num_heads, num_kv_heads, rot_dim) from int to size_t. Updated loop/index vars (embed_dim, nq, nk, head_idx, rot_offset, x_index, y_index) and fixed std::min<int64_t> to std::min<size_t>.
fused_qknorm_rope.cpp: Changed kernel class constructor params and member vars (num_heads_q, num_heads_k, num_heads_v, num_tokens, rotary_dim) from int to size_t. Updated all loop vars, index vars (total_qk_heads, token_idx, local_head_idx, head_idx, num_heads, offset_warp, offset_thread, embed_dim, rotary_lanes, pair_offset, dim_idx, half_dim), and launch function params. Changed static_cast to static_cast<size_t> at call site.
Preserved: int token_idx in rotary_embedding (used with int64_t strides), template non-type params (int head_dim), int64_t variables, and constexpr int constants (block_size, sgs_per_wg, SUB_GROUP_SIZE, NUM_ELEMS_PER_THREAD).
Changed int → size_t for variables representing sizes, dimensions, counts, strides, and indices in 4 files:
Note: This is a partial change. The following files still need updates:
Agent-Logs-Url: https://github.com/jikunshang/vllm-xpu-kernels/sessions/77ac7b1b-2a91-4f87-9031-10fbf67b346f
PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS ABOVE HAVE BEEN CONSIDERED.
Purpose
Test Plan
Test Result
(Optional) Documentation Update
BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing (anything written below this line will be removed by GitHub Actions)