[bugfix] fix gemma4 31b (#9080)

Jintao-Huang · web-flow · commit c4b463bf1d11 · 2026-04-13T00:18:43.000+08:00
diff --git a/docs/source/Instruction/Command-line-parameters.md b/docs/source/Instruction/Command-line-parameters.md
@@ -215,8 +215,7 @@ ENV:
 
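 - 🔥output_dir: The output directory where model predictions and checkpoints are written. Defaults to None, in which case it is set to `'output/<model_name>'`.
 - 🔥gradient_checkpointing: Whether to use gradient_checkpointing. Defaults to True. This parameter significantly reduces GPU memory usage but slows down training.
-- 🔥vit_gradient_checkpointing: When training multimodal models, whether to enable gradient_checkpointing for the vit part. Defaults to None, i.e. it is set to the value of `gradient_checkpointing`. For an example, see [here](https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/vit_gradient_checkpointing.sh).
-  - Note: For a multimodal model trained with LoRA, if `--freeze_vit false` is set and the following warning appears on the command line: `UserWarning: None of the inputs have requires_grad=True. Gradients will be None`, set `--vit_gradient_checkpointing false` or open a related issue. Full-parameter training does not have this problem. (If the warning is raised by ref_model during RLHF LoRA training, it is normal.)
+- 🔥vit_gradient_checkpointing: When training multimodal models, whether to enable gradient_checkpointing for the vit part. Defaults to None, i.e. it is enabled when `--freeze_vit` is `false`. For an example, see [here](https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/vit_gradient_checkpointing.sh).
 - 🔥deepspeed: Defaults to None. Can be set to 'zero0', 'zero1', 'zero2', 'zero3', 'zero2_offload', 'zero3_offload' to use the deepspeed configuration files built into ms-swift. You can also pass the path to a custom deepspeed configuration file.
 - zero_hpz_partition_size: Defaults to None. This parameter is a ZeRO++ feature: model sharding within a node and data sharding across nodes. If you encounter grad_norm NaN, try `--torch_dtype float16`.
 - deepspeed_autotp_size: DeepSpeed tensor parallel size. Defaults to 1. When using DeepSpeed AutoTP, the `--deepspeed` parameter must be set to 'zero0', 'zero1', or 'zero2'. (Note: this feature only supports full-parameter training.)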
diff --git a/docs/source/Instruction/Supported-models-and-datasets.md b/docs/source/Instruction/Supported-models-and-datasets.md
@@ -1098,10 +1098,10 @@
 |[google/gemma-3n-E4B](https://modelscope.cn/models/google/gemma-3n-E4B)|gemma3n|gemma3n|transformers>=4.53.1|&#x2718;|-|[google/gemma-3n-E4B](https://huggingface.co/google/gemma-3n-E4B)|
 |[google/gemma-3n-E2B-it](https://modelscope.cn/models/google/gemma-3n-E2B-it)|gemma3n|gemma3n|transformers>=4.53.1|&#x2718;|-|[google/gemma-3n-E2B-it](https://huggingface.co/google/gemma-3n-E2B-it)|
 |[google/gemma-3n-E4B-it](https://modelscope.cn/models/google/gemma-3n-E4B-it)|gemma3n|gemma3n|transformers>=4.53.1|&#x2718;|-|[google/gemma-3n-E4B-it](https://huggingface.co/google/gemma-3n-E4B-it)|
-|[google/gemma-4-E2B](https://modelscope.cn/models/google/gemma-4-E2B)|gemma4|gemma4|transformers>=4.53|&#x2718;|-|[google/gemma-4-E2B](https://huggingface.co/google/gemma-4-E2B)|
-|[google/gemma-4-E2B-it](https://modelscope.cn/models/google/gemma-4-E2B-it)|gemma4|gemma4|transformers>=4.53|&#x2718;|-|[google/gemma-4-E2B-it](https://huggingface.co/google/gemma-4-E2B-it)|
-|[google/gemma-4-E4B](https://modelscope.cn/models/google/gemma-4-E4B)|gemma4|gemma4|transformers>=4.53|&#x2718;|-|[google/gemma-4-E4B](https://huggingface.co/google/gemma-4-E4B)|
-|[google/gemma-4-E4B-it](https://modelscope.cn/models/google/gemma-4-E4B-it)|gemma4|gemma4|transformers>=4.53|&#x2718;|-|[google/gemma-4-E4B-it](https://huggingface.co/google/gemma-4-E4B-it)|
+|[google/gemma-4-E2B](https://modelscope.cn/models/google/gemma-4-E2B)|gemma4|gemma4_nothinking|transformers>=4.53|&#x2718;|-|[google/gemma-4-E2B](https://huggingface.co/google/gemma-4-E2B)|
+|[google/gemma-4-E2B-it](https://modelscope.cn/models/google/gemma-4-E2B-it)|gemma4|gemma4_nothinking|transformers>=4.53|&#x2718;|-|[google/gemma-4-E2B-it](https://huggingface.co/google/gemma-4-E2B-it)|
+|[google/gemma-4-E4B](https://modelscope.cn/models/google/gemma-4-E4B)|gemma4|gemma4_nothinking|transformers>=4.53|&#x2718;|-|[google/gemma-4-E4B](https://huggingface.co/google/gemma-4-E4B)|
+|[google/gemma-4-E4B-it](https://modelscope.cn/models/google/gemma-4-E4B-it)|gemma4|gemma4_nothinking|transformers>=4.53|&#x2718;|-|[google/gemma-4-E4B-it](https://huggingface.co/google/gemma-4-E4B-it)|
 |[google/gemma-4-31B](https://modelscope.cn/models/google/gemma-4-31B)|gemma4|gemma4|transformers>=4.53|&#x2718;|-|[google/gemma-4-31B](https://huggingface.co/google/gemma-4-31B)|
 |[google/gemma-4-31B-it](https://modelscope.cn/models/google/gemma-4-31B-it)|gemma4|gemma4|transformers>=4.53|&#x2718;|-|[google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it)|
 |[google/gemma-4-26B-A4B](https://modelscope.cn/models/google/gemma-4-26B-A4B)|gemma4|gemma4|transformers>=4.53|&#x2718;|-|[google/gemma-4-26B-A4B](https://huggingface.co/google/gemma-4-26B-A4B)|
diff --git a/docs/source/Megatron-SWIFT/Command-line-parameters.md b/docs/source/Megatron-SWIFT/Command-line-parameters.md
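@@ -253,7 +253,7 @@ LoRA training:
 - 🔥offload_bridge: Store the HF-format weights that Megatron exports for vLLM updates in CPU main memory, to reduce GPU memory usage. Defaults to False. (Takes effect in the GRPO/GKD algorithms)
 
 **Multimodal parameters**:
-- vit_gradient_checkpointing: When training multimodal models, whether to enable gradient_checkpointing for the vit part. Defaults to True. (**The vit in Megatron-SWIFT uses the transformers implementation**)
+- vit_gradient_checkpointing: When training multimodal models, whether to enable gradient_checkpointing for the vit part. Defaults to None, i.e. it is enabled when `--freeze_vit` is `false`. (**The vit in Megatron-SWIFT uses the transformers implementation**)
 - vit_gradient_checkpointing_kwargs: Arguments passed to `torch.utils.checkpoint`. For example, set `--vit_gradient_checkpointing_kwargs '{"use_reentrant": false}'`. Defaults to None. This parameter only takes effect for `vit_gradient_checkpointing`.
 - vit_attn_impl: When training multimodal models, sets the attn_impl implementation of the vit part. Defaults to 'flash_attn'.
 - vit_lr: When training multimodal large models, this parameter specifies the learning rate of the vit. Defaults to None, which equals learning_rate. Usually used together with the `--freeze_vit` and `--freeze_aligner` parameters.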
diff --git a/docs/source_en/Instruction/Command-line-parameters.md b/docs/source_en/Instruction/Command-line-parameters.md
@@ -220,8 +220,7 @@ This list inherits from the Transformers `Seq2SeqTrainingArguments`, with ms-swi
 
 - 🔥output_dir: The output directory where the model predictions and checkpoints will be written. Default is `None`, automatically set to `'output/<model_name>'`.
 - 🔥gradient_checkpointing: Whether to use gradient checkpointing. Default is `True`. This significantly reduces GPU memory usage but slows down training.
-- 🔥vit_gradient_checkpointing: For multimodal model training, whether to enable gradient checkpointing for the ViT (Vision Transformer) component. Default is `None`, meaning it follows the value of `gradient_checkpointing`. For an example, please refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/vit_gradient_checkpointing.sh).
-  - Note: When training multimodal models with LoRA and `--freeze_vit false`, if you see the warning: `UserWarning: None of the inputs have requires_grad=True. Gradients will be None`, try setting `--vit_gradient_checkpointing false` or open an issue. This issue does not occur in full-parameter training. (If this warning comes from the `ref_model` during RLHF LoRA training, it is normal.)
+- 🔥vit_gradient_checkpointing: Whether to enable gradient checkpointing for the ViT component during multimodal model training. Defaults to `None`, which means it is enabled when `--freeze_vit` is `false`. For an example, please refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/vit_gradient_checkpointing.sh).
 - 🔥deepspeed: Default is `None`. Can be set to `'zero0'`, `'zero1'`, `'zero2'`, `'zero3'`, `'zero2_offload'`, `'zero3_offload'` to use built-in DeepSpeed configurations in ms-swift. You can also pass a path to a custom DeepSpeed config file.
 - zero_hpz_partition_size: Default is `None`. This enables ZeRO++ functionality—model sharding within nodes and data sharding across nodes. If encountering `grad_norm NaN`, try using `--torch_dtype float16`.
 - deepspeed_autotp_size: DeepSpeed tensor parallelism size. Default is 1. To use DeepSpeed AutoTP, set `--deepspeed` to `'zero0'`, `'zero1'`, or `'zero2'`. (Note: Only supports full-parameter training)
diff --git a/docs/source_en/Instruction/Supported-models-and-datasets.md b/docs/source_en/Instruction/Supported-models-and-datasets.md
@@ -1099,10 +1099,10 @@ The table below introduces the models integrated with ms-swift:
 |[google/gemma-3n-E4B](https://modelscope.cn/models/google/gemma-3n-E4B)|gemma3n|gemma3n|transformers>=4.53.1|&#x2718;|-|[google/gemma-3n-E4B](https://huggingface.co/google/gemma-3n-E4B)|
 |[google/gemma-3n-E2B-it](https://modelscope.cn/models/google/gemma-3n-E2B-it)|gemma3n|gemma3n|transformers>=4.53.1|&#x2718;|-|[google/gemma-3n-E2B-it](https://huggingface.co/google/gemma-3n-E2B-it)|
 |[google/gemma-3n-E4B-it](https://modelscope.cn/models/google/gemma-3n-E4B-it)|gemma3n|gemma3n|transformers>=4.53.1|&#x2718;|-|[google/gemma-3n-E4B-it](https://huggingface.co/google/gemma-3n-E4B-it)|
-|[google/gemma-4-E2B](https://modelscope.cn/models/google/gemma-4-E2B)|gemma4|gemma4|transformers>=4.53|&#x2718;|-|[google/gemma-4-E2B](https://huggingface.co/google/gemma-4-E2B)|
-|[google/gemma-4-E2B-it](https://modelscope.cn/models/google/gemma-4-E2B-it)|gemma4|gemma4|transformers>=4.53|&#x2718;|-|[google/gemma-4-E2B-it](https://huggingface.co/google/gemma-4-E2B-it)|
-|[google/gemma-4-E4B](https://modelscope.cn/models/google/gemma-4-E4B)|gemma4|gemma4|transformers>=4.53|&#x2718;|-|[google/gemma-4-E4B](https://huggingface.co/google/gemma-4-E4B)|
-|[google/gemma-4-E4B-it](https://modelscope.cn/models/google/gemma-4-E4B-it)|gemma4|gemma4|transformers>=4.53|&#x2718;|-|[google/gemma-4-E4B-it](https://huggingface.co/google/gemma-4-E4B-it)|
+|[google/gemma-4-E2B](https://modelscope.cn/models/google/gemma-4-E2B)|gemma4|gemma4_nothinking|transformers>=4.53|&#x2718;|-|[google/gemma-4-E2B](https://huggingface.co/google/gemma-4-E2B)|
+|[google/gemma-4-E2B-it](https://modelscope.cn/models/google/gemma-4-E2B-it)|gemma4|gemma4_nothinking|transformers>=4.53|&#x2718;|-|[google/gemma-4-E2B-it](https://huggingface.co/google/gemma-4-E2B-it)|
+|[google/gemma-4-E4B](https://modelscope.cn/models/google/gemma-4-E4B)|gemma4|gemma4_nothinking|transformers>=4.53|&#x2718;|-|[google/gemma-4-E4B](https://huggingface.co/google/gemma-4-E4B)|
+|[google/gemma-4-E4B-it](https://modelscope.cn/models/google/gemma-4-E4B-it)|gemma4|gemma4_nothinking|transformers>=4.53|&#x2718;|-|[google/gemma-4-E4B-it](https://huggingface.co/google/gemma-4-E4B-it)|
 |[google/gemma-4-31B](https://modelscope.cn/models/google/gemma-4-31B)|gemma4|gemma4|transformers>=4.53|&#x2718;|-|[google/gemma-4-31B](https://huggingface.co/google/gemma-4-31B)|
 |[google/gemma-4-31B-it](https://modelscope.cn/models/google/gemma-4-31B-it)|gemma4|gemma4|transformers>=4.53|&#x2718;|-|[google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it)|
 |[google/gemma-4-26B-A4B](https://modelscope.cn/models/google/gemma-4-26B-A4B)|gemma4|gemma4|transformers>=4.53|&#x2718;|-|[google/gemma-4-26B-A4B](https://huggingface.co/google/gemma-4-26B-A4B)|
diff --git a/docs/source_en/Megatron-SWIFT/Command-line-parameters.md b/docs/source_en/Megatron-SWIFT/Command-line-parameters.md
@@ -269,7 +269,7 @@ LoRA Training:
 - 🔥offload_bridge: Use CPU main memory to store HF format weights exported by Megatron for vLLM updates, to reduce GPU memory usage. Defaults to False. (Takes effect in GRPO/GKD algorithms)
 
 **Multimodal Parameters**:
-- vit_gradient_checkpointing: Whether to enable gradient checkpointing for the ViT (Vision Transformer) component during multimodal model training. Defaults to `True`. (**The ViT implementation in Megatron-SWIFT uses the Hugging Face `transformers` library.**)
+- vit_gradient_checkpointing: Whether to enable gradient checkpointing for the ViT component during multimodal model training. Defaults to `None`, which means it is enabled when `--freeze_vit` is `false`. (**The ViT implementation in Megatron-SWIFT uses the Hugging Face `transformers` library.**)
 - vit_gradient_checkpointing_kwargs: Arguments passed to `torch.utils.checkpoint`. For example: set `--vit_gradient_checkpointing_kwargs '{"use_reentrant": false}'`. Defaults to `None`. This parameter only takes effect when `vit_gradient_checkpointing` is enabled.
 - vit_attn_impl: When training a multimodal model, sets the `attn_impl` implementation used for the ViT part. Defaults to `'flash_attn'`.
 - vit_lr: Specifies the learning rate for the ViT module when training multimodal models. Default is `None`, same as `learning_rate`. Typically used together with `--freeze_vit` and `--freeze_aligner`.
diff --git a/swift/arguments/sft_args.py b/swift/arguments/sft_args.py
@@ -4,7 +4,7 @@
 from transformers.utils.versions import require_version
 from typing import Literal, Optional
 
-from swift.trainers import Seq2SeqTrainingArguments, TrainArgumentsMixin, TrainerFactory
+from swift.trainers import Seq2SeqTrainingArguments, TrainerFactory
 from swift.utils import (add_version_to_work_dir, get_device_count, get_logger, get_pai_tensorboard_dir, is_mp,
                          is_pai_training_job, is_swanlab_available, json_parse_to_dict, to_abspath)
 from .base_args import BaseArguments
@@ -124,7 +124,7 @@ class SftArguments(SwanlabArguments, TunerArguments, BaseArguments, Seq2SeqTrain
     """Arguments pertaining to the training process.
 
     SftArguments is a dataclass that inherits from multiple argument classes: SwanlabArguments, TunerArguments,
-    BaseArguments, TrainArgumentsMixin, Seq2SeqTrainingArguments.
+    BaseArguments, Seq2SeqTrainingArguments.
 
     Args:
         add_version (bool): Whether to add a versioned subdirectory like '<version>-<timestamp>' to the `output_dir` to
@@ -205,6 +205,8 @@ def __post_init__(self) -> None:
         self._init_override()
         TunerArguments.__post_init__(self)
         self._check_padding_free()
+        if self.vit_gradient_checkpointing is None:
+            self.vit_gradient_checkpointing = not self.freeze_vit
         if self.optimizer is None:
             if self.lorap_lr_ratio:
                 self.optimizer = 'lorap'
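The two added lines in `__post_init__` resolve an unset `vit_gradient_checkpointing` from `freeze_vit`. A minimal standalone sketch of that rule follows; the `VitArgsSketch` dataclass is hypothetical and stands in for `SftArguments`, which has many more fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VitArgsSketch:
    """Hypothetical stand-in for the two fields involved; not the real SftArguments."""
    freeze_vit: bool
    vit_gradient_checkpointing: Optional[bool] = None

    def __post_init__(self) -> None:
        # Same rule as the diff: only fill in the value when the user left it unset.
        if self.vit_gradient_checkpointing is None:
            self.vit_gradient_checkpointing = not self.freeze_vit


# Unset + trainable ViT -> checkpointing on; unset + frozen ViT -> off.
trainable = VitArgsSketch(freeze_vit=False)
frozen = VitArgsSketch(freeze_vit=True)
# An explicit value is never overridden.
explicit = VitArgsSketch(freeze_vit=False, vit_gradient_checkpointing=False)
```

Under this rule a frozen ViT no longer has checkpointing enabled by default, which matches the updated documentation text in this commit.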
diff --git a/swift/loss_scale/config/ignore_empty_think.json b/swift/loss_scale/config/ignore_empty_think.json
@@ -1,5 +1,6 @@
 {
     "^<think>\\s*</think>\\s*": [0.0],
-    "^<seed:think><seed:cot_budget_reflect>The current thinking budget is 0, so I will directly start answering the question.</seed:cot_budget_reflect>\n</seed:think>\\s*": [0.0],
-    "^</think>\\s*": [0.0]
+    "^<seed:think><seed:cot_budget_reflect>The current thinking budget is 0, so I will directly start answering the question.</seed:cot_budget_reflect>\\n</seed:think>\\s*": [0.0],
+    "^</think>\\s*": [0.0],
+    "^<\\|channel>thought\\n<channel\\|>": [0.0]
 }
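The keys in this config read as regular expressions anchored at the start of the response, each mapped to a loss weight. The sketch below assumes exactly that interpretation (the diff does not show the consuming code) and checks that the new Gemma 4 key matches an empty thought channel once the JSON escapes are decoded. The helper `leading_scale` is hypothetical.

```python
import json
import re

# A subset of the config, embedded as raw text so the JSON escapes stay intact.
CONFIG_TEXT = r'''
{
    "^<think>\\s*</think>\\s*": [0.0],
    "^</think>\\s*": [0.0],
    "^<\\|channel>thought\\n<channel\\|>": [0.0]
}
'''
PATTERNS = json.loads(CONFIG_TEXT)


def leading_scale(response: str):
    """Return (matched_prefix, weights) for the first pattern that matches, else None."""
    for pattern, weights in PATTERNS.items():
        match = re.match(pattern, response)
        if match:
            return match.group(0), weights
    return None


gemma4_hit = leading_scale('<|channel>thought\n<channel|>Hello')
think_hit = leading_scale('<think>\n\n</think>\n\nHi')
no_hit = leading_scale('Hi')
```

After JSON decoding, `\\|` becomes the regex `\|` (a literal pipe) and `\\n` becomes the regex escape `\n`, so the empty thought channel is matched as literal text.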
diff --git a/swift/model/models/gemma.py b/swift/model/models/gemma.py
@@ -216,14 +216,17 @@ def get_model(self, model_dir: str, *args, **kwargs) -> PreTrainedModel:
                 Model('google/gemma-4-E2B-it', 'google/gemma-4-E2B-it'),
                 Model('google/gemma-4-E4B', 'google/gemma-4-E4B'),
                 Model('google/gemma-4-E4B-it', 'google/gemma-4-E4B-it'),
+            ],
+                       template=TemplateType.gemma4_nothinking),
+            ModelGroup([
                 Model('google/gemma-4-31B', 'google/gemma-4-31B'),
                 Model('google/gemma-4-31B-it', 'google/gemma-4-31B-it'),
                 Model('google/gemma-4-26B-A4B', 'google/gemma-4-26B-A4B'),
                 Model('google/gemma-4-26B-A4B-it', 'google/gemma-4-26B-A4B-it'),
-            ], ),
+            ],
+                       template=TemplateType.gemma4),
         ],
         Gemma4Loader,
-        template=TemplateType.gemma4,
         architectures=['Gemma4ForConditionalGeneration'],
         model_arch=ModelArch.gemma3n,
         requires=['transformers>=4.53'],
diff --git a/swift/template/base.py b/swift/template/base.py
@@ -1050,7 +1050,7 @@ def _is_add_non_thinking_round(self, messages, i: int, start_idx: int):
         message = messages[i]
         return i >= start_idx and message['role'] == 'assistant'
 
-    def _add_non_thinking_prefix(self, inputs) -> None:
+    def _add_non_thinking_prefix(self, inputs, thinking_prefix='<think>') -> None:
         messages = inputs.messages
         non_thinking_prefix = self.template_meta.non_thinking_prefix
         if non_thinking_prefix:
@@ -1063,14 +1063,14 @@ def _add_non_thinking_prefix(self, inputs) -> None:
                 start_idx = -1
             for i, message in enumerate(messages):
                 if (self._is_add_non_thinking_round(messages, i, start_idx) and isinstance(message['content'], str)
-                        and not message['content'].startswith(('<think>', non_thinking_prefix))):
+                        and not message['content'].startswith((thinking_prefix, non_thinking_prefix))):
                     # During multi-turn SFT training/validation:
                     # If the message has no <think> block and does not start with the non_thinking_prefix,
                     # prepend the non_thinking_prefix to the content.
                     message['content'] = non_thinking_prefix + message['content']
 
-    def _remove_thinking_content(self, content: str) -> str:
-        content = content.split('</think>')[-1].strip()
+    def _remove_thinking_content(self, content: str, thinking_suffix='</think>') -> str:
+        content = content.split(thinking_suffix)[-1].strip()
         return self.template_meta.history_thinking_prefix + content
 
     def _remove_history_thinking(self, inputs) -> None:
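The change above turns the hard-coded `<think>` and `</think>` markers into parameters so a subclass can supply its own. A standalone sketch of the same string logic, written as free functions rather than `Template` methods (the function names and the `history_thinking_prefix` argument are illustrative stand-ins for the template's attributes):

```python
def add_non_thinking_prefix(content: str, non_thinking_prefix: str,
                            thinking_prefix: str = '<think>') -> str:
    """Prepend the non-thinking prefix unless the content already starts with a thinking marker."""
    if non_thinking_prefix and not content.startswith((thinking_prefix, non_thinking_prefix)):
        return non_thinking_prefix + content
    return content


def remove_thinking_content(content: str, thinking_suffix: str = '</think>',
                            history_thinking_prefix: str = '') -> str:
    """Keep only the text after the last thinking suffix."""
    return history_thinking_prefix + content.split(thinking_suffix)[-1].strip()


# Gemma 4 markers as they appear elsewhere in this commit.
GEMMA4_EMPTY_THOUGHT = '<|channel>thought\n<channel|>'

plain = add_non_thinking_prefix('Hi', GEMMA4_EMPTY_THOUGHT, thinking_prefix='<|channel>thought')
with_thought = add_non_thinking_prefix(
    '<|channel>thought\nreasoning<channel|>Answer', GEMMA4_EMPTY_THOUGHT,
    thinking_prefix='<|channel>thought')
stripped = remove_thinking_content('<|channel>thought\nreasoning<channel|>Answer',
                                   thinking_suffix='<channel|>')
```

With the default arguments both helpers behave as before the change, so templates that use `<think>` blocks are unaffected.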
diff --git a/swift/template/constant.py b/swift/template/constant.py
@@ -249,6 +249,7 @@ class MLLMTemplateType:
     gemma3_vision = 'gemma3_vision'
     gemma3n = 'gemma3n'
     gemma4 = 'gemma4'
+    gemma4_nothinking = 'gemma4_nothinking'
     mistral_2503 = 'mistral_2503'
     mistral_2506 = 'mistral_2506'
     mistral_2512 = 'mistral_2512'
diff --git a/swift/template/templates/gemma.py b/swift/template/templates/gemma.py
@@ -256,6 +256,19 @@ def replace_tag(self, media_type: Literal['image', 'video', 'audio'], index: int
         elif media_type == 'video':
             return ['\n\n<|video|>\n\n']
 
+    def _swift_encode(self, inputs: StdTemplateInputs):
+        if self.enable_thinking:
+            if inputs.system is None:
+                inputs.system = ''
+            inputs.system = '<|think|>\n' + inputs.system
+        return super()._swift_encode(inputs)
+
+    def _add_non_thinking_prefix(self, inputs: StdTemplateInputs, thinking_prefix: str = '<|channel>thought'):
+        return super()._add_non_thinking_prefix(inputs, thinking_prefix=thinking_prefix)
+
+    def _remove_thinking_content(self, content: str, thinking_suffix: str = '<channel|>') -> str:
+        return super()._remove_thinking_content(content, thinking_suffix=thinking_suffix)
+
     def _encode(self, inputs: StdTemplateInputs) -> Dict[str, Any]:
         encoded = super()._encode(inputs)
         split_token = self._tokenize('\n')
@@ -329,4 +342,11 @@ class Gemma4TemplateMeta(TemplateMeta):
     system_prefix: Optional[Prompt] = field(default_factory=lambda: ['<bos><|turn>system\n{{SYSTEM}}<turn|>\n'])
 
 
-register_template(Gemma4TemplateMeta(MLLMTemplateType.gemma4, template_cls=Gemma4Template))
+register_template(Gemma4TemplateMeta(MLLMTemplateType.gemma4_nothinking, template_cls=Gemma4Template))
+
+register_template(
+    Gemma4TemplateMeta(
+        MLLMTemplateType.gemma4,
+        template_cls=Gemma4Template,
+        is_thinking=True,
+        non_thinking_prefix='<|channel>thought\n<channel|>'))
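For orientation, the `_swift_encode` override prepends `<|think|>\n` to the system prompt when thinking is enabled, and the system prompt is then rendered through `system_prefix`. The sketch below reproduces only that string handling; the function names are hypothetical, and what the template emits when there is no system prompt is not shown in the diff, so the sketch returns `None` in that case.

```python
from typing import Optional

# Copied from Gemma4TemplateMeta.system_prefix in the diff.
SYSTEM_PREFIX = '<bos><|turn>system\n{{SYSTEM}}<turn|>\n'


def build_system(system: Optional[str], enable_thinking: bool) -> Optional[str]:
    """Mirror the override: an absent system prompt becomes '' before the flag is prepended."""
    if enable_thinking:
        return '<|think|>\n' + (system or '')
    return system


def render_system_block(system: Optional[str], enable_thinking: bool) -> Optional[str]:
    """Render the system turn, or return None when there is no system prompt to render."""
    system = build_system(system, enable_thinking)
    if system is None:
        return None
    return SYSTEM_PREFIX.replace('{{SYSTEM}}', system)


thinking_only = render_system_block(None, enable_thinking=True)
thinking_with_prompt = render_system_block('Be brief.', enable_thinking=True)
```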
diff --git a/swift/trainers/utils.py b/swift/trainers/utils.py
@@ -187,6 +187,8 @@ def dynamic_gradient_checkpointing(model, including_vit: bool = False) -> None:
             model_tower = model
         else:
             model_tower = deep_getattr(model, tower_name)
+        if model_tower is None:
+            continue
         model_tower.supports_gradient_checkpointing = True
         module_list = find_module_list(model_tower)
         if module_list is None:
diff --git a/swift/utils/transformers_utils.py b/swift/utils/transformers_utils.py
@@ -189,7 +189,8 @@ def find_all_linears(model, model_arch=None, extra_layers=None, sub_module=None)
     # 'v_head': reward model
     ignore_layers = [lm_head_name, 'score', 'v_head', 'classifier'] + ['lora_A', 'lora_B', 'base_layer']
     ignore_linear_cls = [
-        'glulinear'  # phi4-mm
+        'glulinear',  # phi4-mm
+        'gemma4clippablelinear',  # gemma4
     ]
 
     def _cond(name, module):
@@ -235,6 +236,9 @@ def get_multimodal_target_regex(
                     rejected_modules.append(aligner)
 
         sub_module = deep_getattr(model, module)
+        if sub_module is None:
+            logger.warning(f'module: {module} is None')
+            continue
         if isinstance(sub_module, nn.Linear) and module.endswith('lm_head'):
             target_modules = []
         else:
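Both `dynamic_gradient_checkpointing` and `get_multimodal_target_regex` now skip a module path that resolves to `None`. The diff does not state which attribute is `None` for the 31B model, so the sketch below only illustrates the guard pattern; `deep_getattr_sketch` is a hypothetical stand-in for swift's `deep_getattr`, and the model object is a dummy.

```python
from types import SimpleNamespace
from typing import Any, List, Optional


def deep_getattr_sketch(obj: Any, path: str) -> Optional[Any]:
    """Hypothetical stand-in: follow a dotted path, returning None if any step is missing or None."""
    for name in path.split('.'):
        obj = getattr(obj, name, None)
        if obj is None:
            return None
    return obj


def present_towers(model: Any, tower_names: List[str]) -> List[str]:
    """Return only the tower paths that resolve to a real module, skipping None like the diff does."""
    found = []
    for name in tower_names:
        tower = deep_getattr_sketch(model, name)
        if tower is None:
            continue  # same guard as the added `if ... is None: continue`
        found.append(name)
    return found


# Dummy model: one tower present, one attribute explicitly None, one absent.
dummy = SimpleNamespace(model=SimpleNamespace(vision_tower=object(), audio_tower=None))
towers = present_towers(dummy, ['model.vision_tower', 'model.audio_tower', 'model.missing'])
```

Without the guard, the next statement in `dynamic_gradient_checkpointing` would try to set an attribute on `None` and raise `AttributeError`.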
