Fix config utils#1357
Conversation
There was a problem hiding this comment.
Code Review
This pull request optimizes recurrent processing in Qwen3next by reading q/k/v/a/b column views directly via per-token strides to avoid contiguous copies, removes profiling-related arguments and endpoints, updates Qwen3.5 EOS token ID retrieval logic, and adds corresponding unit tests. Feedback on the changes highlights three critical issues: removing the cleanup of req.linear_att_len_to_big_page_id in infer_batch.py can cause resource leaks and router crashes; calling int() on tokenizer.eos_token_id in config_utils.py may raise a TypeError if it is a list; and removing the root node check in linear_att_radix_cache.py could trigger assertion failures during eviction when the tree is empty.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if shared_kv_len <= req.cur_kv_len: | ||
| free_token_index.append(self.req_manager.req_to_token_indexs[req.req_idx][shared_kv_len : req.cur_kv_len]) | ||
| # 该分支不会把 prefill 阶段累积的 big page id 插入 radix cache(典型为 pause/abort | ||
| # 在 prefill 跨过 big page 边界后、到达末尾前触发),需在此显式释放,避免泄漏。 | ||
|
|
||
| # 释放本请求 prefill 阶段在 big page 边界上申请、但尚未插入 radix cache 的 big page | ||
| # state buffer。仅当请求未走 insert 分支(小页/大页插入)就被释放时才会有残留,典型场景: | ||
| # big page 模式下请求在 prefill 跨过 big page 边界后、到达末尾前被 pause / abort。 | ||
| # 若不释放,会泄漏 big page state slot,并触发 free_a_req_mem 中 dict 为空的断言。 | ||
| if req.linear_att_len_to_big_page_id: | ||
| self.radix_cache.linear_att_big_page_buffers.free_state_cache( | ||
| list(req.linear_att_len_to_big_page_id.values()) | ||
| ) | ||
| req.linear_att_len_to_big_page_id.clear() | ||
|
|
||
| req.cur_kv_len = shared_kv_len |
There was a problem hiding this comment.
Removing the code that clears req.linear_att_len_to_big_page_id in _linear_att_free_req will cause a resource leak of big page state slots if a request is aborted or paused during prefill. Furthermore, it will cause the assertion assert len(req.linear_att_len_to_big_page_id) == 0 at line 132 to fail, crashing the router process. We should restore the cleanup logic to free the big page state buffers.
| if shared_kv_len <= req.cur_kv_len: | |
| free_token_index.append(self.req_manager.req_to_token_indexs[req.req_idx][shared_kv_len : req.cur_kv_len]) | |
| # 该分支不会把 prefill 阶段累积的 big page id 插入 radix cache(典型为 pause/abort | |
| # 在 prefill 跨过 big page 边界后、到达末尾前触发),需在此显式释放,避免泄漏。 | |
| # 释放本请求 prefill 阶段在 big page 边界上申请、但尚未插入 radix cache 的 big page | |
| # state buffer。仅当请求未走 insert 分支(小页/大页插入)就被释放时才会有残留,典型场景: | |
| # big page 模式下请求在 prefill 跨过 big page 边界后、到达末尾前被 pause / abort。 | |
| # 若不释放,会泄漏 big page state slot,并触发 free_a_req_mem 中 dict 为空的断言。 | |
| if req.linear_att_len_to_big_page_id: | |
| self.radix_cache.linear_att_big_page_buffers.free_state_cache( | |
| list(req.linear_att_len_to_big_page_id.values()) | |
| ) | |
| req.linear_att_len_to_big_page_id.clear() | |
| req.cur_kv_len = shared_kv_len | |
| if shared_kv_len <= req.cur_kv_len: | |
| free_token_index.append(self.req_manager.req_to_token_indexs[req.req_idx][shared_kv_len : req.cur_kv_len]) | |
| if req.linear_att_len_to_big_page_id: | |
| self.radix_cache.linear_att_big_page_buffers.free_state_cache( | |
| list(req.linear_att_len_to_big_page_id.values()) | |
| ) | |
| req.linear_att_len_to_big_page_id.clear() | |
| req.cur_kv_len = shared_kv_len |
| if tokenizer.eos_token_id is not None: | ||
| return [int(tokenizer.eos_token_id)] | ||
| eos_token_ids.append(int(tokenizer.eos_token_id)) |
There was a problem hiding this comment.
If tokenizer.eos_token_id is a list (which is common for tokenizers with multiple end-of-sequence tokens), calling int(tokenizer.eos_token_id) will raise a TypeError. Since this block is wrapped in a try...except Exception block, the error will be silently caught, and the function will fall back to the default behavior, bypassing the Qwen3.5-specific logic entirely. We should handle both int and list/iterable types for tokenizer.eos_token_id.
| if tokenizer.eos_token_id is not None: | |
| return [int(tokenizer.eos_token_id)] | |
| eos_token_ids.append(int(tokenizer.eos_token_id)) | |
| if tokenizer.eos_token_id is not None: | |
| if isinstance(tokenizer.eos_token_id, list): | |
| eos_token_ids.extend(int(eid) for eid in tokenizer.eos_token_id) | |
| else: | |
| eos_token_ids.append(int(tokenizer.eos_token_id)) |
| if node.is_leaf(): | ||
| self._evict_tree_set.add(node) |
There was a problem hiding this comment.
Removing the and node is not self.root_node check allows the root_node to be added to _evict_tree_set when the tree is empty (as the root node then satisfies is_leaf()). This can contradict assertions in the eviction logic (e.g., assert node is not self.root_node in _evict). Unless there is a specific reason to allow the root node in the eviction set, it is safer to keep this guard to prevent potential assertion failures.
| if node.is_leaf(): | |
| self._evict_tree_set.add(node) | |
| if node.is_leaf() and node is not self.root_node: | |
| self._evict_tree_set.add(node) |
5a42f1f to
5ef948b
Compare
No description provided.