fix(core): update gemini insight reasoning handling

EAGzzyCSL · EAGzzyCSL · commit c33fec3e0361 · 2026-06-28T11:13:04.000+08:00
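
The Gemini reasoning mapping changed in this commit can be summarized with a minimal sketch. It is not the adapter code itself: `mapGeminiReasoning` and the simplified types below are hypothetical names introduced for illustration, and the real `buildGeminiChatCompletionParams` takes a richer `ChatCompletionCallContext` and also handles temperature and other overrides. The `'default'` case is outside the visible hunk, so the sketch assumes nothing is added there.

```typescript
// Hypothetical, simplified input: only the three reasoning fields from userConfig.
type ReasoningEnabled = boolean | 'default';

interface GeminiReasoningInput {
  reasoningEnabled: ReasoningEnabled;
  reasoningEffort?: string; // forwarded as thinking_level (Gemini 3.x)
  reasoningBudget?: number; // forwarded as thinking_budget (Gemini 2.5)
}

interface ThinkingConfig {
  include_thoughts: boolean;
  thinking_level?: string;
  thinking_budget?: number;
}

interface GeminiReasoningParams {
  reasoning_effort?: string;
  extra_body?: { google: { thinking_config: ThinkingConfig } };
}

// Hypothetical helper mirroring the branch structure in the gemini.ts hunk.
function mapGeminiReasoning(
  input: GeminiReasoningInput,
): GeminiReasoningParams {
  const { reasoningEnabled, reasoningEffort, reasoningBudget } = input;
  const params: GeminiReasoningParams = {};

  // Assumption: when the switch is left at 'default', this sketch adds nothing.
  if (reasoningEnabled === 'default') {
    return params;
  }

  // Disabled: fall back to the lowest OpenAI-compatible effort.
  if (!reasoningEnabled) {
    params.reasoning_effort = 'minimal';
    return params;
  }

  // Enabled with a Gemini-specific knob: send thinking_config only, and
  // request thought summaries. reasoning_effort is deliberately not set,
  // because the two are mutually exclusive.
  if (reasoningEffort || reasoningBudget !== undefined) {
    const thinkingConfig: ThinkingConfig = { include_thoughts: true };
    if (reasoningEffort) {
      thinkingConfig.thinking_level = reasoningEffort;
    }
    if (reasoningBudget !== undefined) {
      thinkingConfig.thinking_budget = reasoningBudget;
    }
    params.extra_body = { google: { thinking_config: thinkingConfig } };
    return params;
  }

  // Enabled without any Gemini-specific knob: OpenAI-compatible medium effort.
  params.reasoning_effort = 'medium';
  return params;
}
```

For example, `mapGeminiReasoning({ reasoningEnabled: true, reasoningBudget: 1024 })` yields only `extra_body.google.thinking_config`, while `mapGeminiReasoning({ reasoningEnabled: true })` yields only `reasoning_effort: 'medium'`. The budget check uses `!== undefined` so that a budget of `0` still selects the `thinking_config` path.
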
diff --git a/apps/site/docs/en/model-common-config.mdx b/apps/site/docs/en/model-common-config.mdx
@@ -212,7 +212,7 @@ MIDSCENE_MODEL_FAMILY="gpt-5"
 :::
 
 :::note Model-native thinking
-Midscene disables model-native thinking by default for the best execution speed and stability. To turn it on for any of the models above, set `MIDSCENE_MODEL_REASONING_ENABLED="true"`. Some families accept an extra knob — for example `MIDSCENE_MODEL_REASONING_BUDGET` (Qwen) or `MIDSCENE_MODEL_REASONING_EFFORT` (Doubao). See [Model-Native Thinking Mode](./model-strategy#model-native-thinking-mode).
+Midscene disables model-native thinking by default for the best execution speed and stability. To turn it on for any of the models above, set `MIDSCENE_MODEL_REASONING_ENABLED="true"`. Some families accept extra controls such as `MIDSCENE_MODEL_REASONING_BUDGET` and `MIDSCENE_MODEL_REASONING_EFFORT`. See [Model-Native Thinking Mode](./model-strategy#model-native-thinking-mode).
 :::
 
 ## Other Models
diff --git a/apps/site/docs/en/model-strategy.mdx b/apps/site/docs/en/model-strategy.mdx
@@ -139,7 +139,7 @@ Note: The implementation behind `deepThink` may change in the future as Midscene
 
 Midscene disables model-native thinking by default to get better execution speed and stability. If the model itself does not support disabling native thinking, Midscene reduces model-native thinking as much as possible by controlling thinking granularity or limiting the thinking budget.
 
-You can control model-native thinking with the following reasoning settings.
+You can control model-native thinking with the following reasoning settings. These are unified Midscene-level abstractions, and the parameters actually sent to the model are mapped by model family.
 
 - `MIDSCENE_MODEL_REASONING_ENABLED`: Explicitly controls whether model-native thinking is enabled.
   - `false`: Force-disable model-native thinking (Midscene's default behavior).
@@ -150,16 +150,21 @@ You can control model-native thinking with the following reasoning settings.
     - Doubao: corresponds to `thinking.type` in the Doubao docs.
     - Zhipu GLM: corresponds to `thinking.type` in the Zhipu docs.
     - GPT-5: corresponds to `reasoning_effort` in the OpenAI docs. Midscene uses `medium` when enabled and `none` when disabled.
+    - Gemini: when no Gemini-specific knob is configured, Midscene uses OpenAI-compatible `reasoning_effort`, with `medium` when enabled and `minimal` when disabled. When Gemini-specific knobs are configured, Midscene uses them first.
     - Kimi: corresponds to `thinking.type` in the Kimi docs.
     - Xiaomi Mimo: corresponds to `thinking.type` in the Xiaomi Mimo docs.
 - `MIDSCENE_MODEL_REASONING_BUDGET`: Controls the model's thinking budget. The following models currently support this setting:
   - Qwen: corresponds to `thinking_budget` in the Qwen docs.
+  - Gemini 2.5 series: corresponds to `thinking_config.thinking_budget` in the Gemini docs.
 - `MIDSCENE_MODEL_REASONING_EFFORT`: Controls the model's thinking effort. The following models currently support this setting:
   - Doubao: corresponds to `reasoning_effort` in the Doubao docs.
-  - Gemini: corresponds to `thinking_config.thinking_level` in the Gemini docs.
+  - Gemini 3.x series: corresponds to `thinking_config.thinking_level` in the Gemini docs.
   - GPT-5: corresponds to `reasoning_effort` in the OpenAI docs.
 
-Note: Different model providers use different parameters to control model-native thinking. For specific values and supported model versions, see the official docs from each model provider. If an explicit reasoning setting is not supported by the current model, Midscene ignores that setting instead of guessing provider-specific private parameters.
+Note:
+
+- Different model providers use different parameters to control model-native thinking. For specific values and supported model versions, see the official docs from each model provider. If an explicit reasoning setting is not supported by the current model, Midscene ignores that setting instead of guessing provider-specific private parameters.
+- For Gemini, the OpenAI-compatible `reasoning_effort` and the Gemini-specific `thinking_config` cannot be used together. Gemini only generates thinking summaries through `include_thoughts` inside `thinking_config`, so Midscene sends `include_thoughts` for Gemini only when `MIDSCENE_MODEL_REASONING_BUDGET` or `MIDSCENE_MODEL_REASONING_EFFORT` is explicitly configured.
 
 ### "MIDSCENE_MODEL_FAMILY is not set to a multimodal model" error
 
diff --git a/apps/site/docs/zh/model-common-config.mdx b/apps/site/docs/zh/model-common-config.mdx
@@ -217,7 +217,7 @@ MIDSCENE_MODEL_FAMILY="gpt-5"
 :::
 
 :::note 模型原生思考
-Midscene 默认关闭模型原生思考，以获得最佳的执行速度和稳定性。如需为上面任意模型开启，设置 `MIDSCENE_MODEL_REASONING_ENABLED="true"` 即可。部分模型系列还支持额外控制项，例如 `MIDSCENE_MODEL_REASONING_BUDGET`（Qwen）或 `MIDSCENE_MODEL_REASONING_EFFORT`（豆包）。详见[模型原生的思考模式](./model-strategy#模型原生的思考模式)。
+Midscene 默认关闭模型原生思考，以获得最佳的执行速度和稳定性。如需为上面任意模型开启，设置 `MIDSCENE_MODEL_REASONING_ENABLED="true"` 即可。部分模型系列还支持 `MIDSCENE_MODEL_REASONING_BUDGET` 和 `MIDSCENE_MODEL_REASONING_EFFORT` 等额外控制项。详见[模型原生的思考模式](./model-strategy#模型原生的思考模式)。
 :::
 
 ## 其他模型
diff --git a/apps/site/docs/zh/model-strategy.mdx b/apps/site/docs/zh/model-strategy.mdx
@@ -140,7 +140,7 @@ Midscene 提供了基于页面理解的数据处理接口，如 AI 断言（`aiA
 
 Midscene 会默认关闭模型原生思考，以获得更好的执行速度和稳定性。如果模型自身不支持关闭原生思考，Midscene 会通过控制思考粒度或限制思考额度的方式，尽可能减少模型的原生思考。
 
-你可以通过下面的 reasoning 配置控制模型的原生思考。
+你可以通过下面的 reasoning 配置控制模型的原生思考，这些配置是 Midscene 层的统一抽象，具体发送给模型的参数会根据 model family 做映射。
 
 - `MIDSCENE_MODEL_REASONING_ENABLED`：显式控制是否开启模型原生思考。
   - `false`：强制关闭模型原生思考（Midscene 默认行为）。
@@ -151,16 +151,21 @@ Midscene 会默认关闭模型原生思考，以获得更好的执行速度和
     - 豆包：对应豆包文档中的 `thinking.type`。
     - 智谱 GLM：对应智谱文档中的 `thinking.type`。
     - GPT-5：对应 OpenAI 文档中的 `reasoning_effort`，开启时默认使用 `medium`，关闭时使用 `none`。
+    - Gemini：当未配置 Gemini 专属控制项时，Midscene 会使用 OpenAI 兼容的 `reasoning_effort`，开启时使用 `medium`，关闭时使用 `minimal`。如果配置了 Gemini 专属控制项，Midscene 会优先使用 Gemini 专属控制项。
     - Kimi：对应 Kimi 文档中的 `thinking.type`。
     - 小米 Mimo：对应小米 Mimo 文档中的 `thinking.type`。
 - `MIDSCENE_MODEL_REASONING_BUDGET`：控制模型的思考限额，目前有以下模型支持该参数：
   - Qwen：对应 Qwen 文档中的 `thinking_budget`。
+  - Gemini 2.5 系列：对应 Gemini 文档中的 `thinking_config.thinking_budget`。
 - `MIDSCENE_MODEL_REASONING_EFFORT`：控制模型的思考力度，目前有以下模型支持该参数：
   - 豆包：对应豆包文档中的 `reasoning_effort`。
-  - Gemini：对应 Gemini 文档中的 `thinking_config.thinking_level`。
+  - Gemini 3.x 系列：对应 Gemini 文档中的 `thinking_config.thinking_level`。
   - GPT-5：对应 OpenAI 文档中的 `reasoning_effort`。
 
-注意：不同模型厂商控制模型原生思考方式的参数各不相同，具体取值和适用模型版本请参考各模型厂商的官方文档。如果某个显式的 reasoning 配置不被当前模型支持，Midscene 会忽略该配置，而不会猜测服务商的私有参数。
+注意：
+
+- 不同模型厂商控制模型原生思考方式的参数各不相同，具体取值和适用模型版本请参考各模型厂商的官方文档。如果某个显式的 reasoning 配置不被当前模型支持，Midscene 会忽略该配置，而不会猜测服务商的私有参数。
+- 对于 Gemini，OpenAI 兼容的 `reasoning_effort` 与 Gemini 私有的 `thinking_config` 不能同时使用；同时，Gemini 只有在 `thinking_config` 中配置 `include_thoughts` 才会生成思考摘要。因此，Midscene 只会在显式配置了 `MIDSCENE_MODEL_REASONING_BUDGET` 或 `MIDSCENE_MODEL_REASONING_EFFORT` 时，为 Gemini 发送 `include_thoughts`。
 
 ### "MIDSCENE_MODEL_FAMILY is not set to a multimodal model" 错误
 
diff --git a/packages/core/src/ai-model/models/gemini.ts b/packages/core/src/ai-model/models/gemini.ts
@@ -30,8 +30,8 @@ type GeminiContentSource =
 const buildGeminiChatCompletionParams = (
   input: ChatCompletionCallContext,
 ): ChatCompletionParamsResult => {
-  const { midsceneDefaults, userConfig, intent } = input;
-  const { reasoningEnabled, reasoningEffort } = userConfig;
+  const { midsceneDefaults, userConfig } = input;
+  const { reasoningEnabled, reasoningEffort, reasoningBudget } = userConfig;
   const commonOverrideConfig: Record<string, unknown> = {};
 
   if (userConfig.temperature !== undefined) {
@@ -48,28 +48,36 @@ const buildGeminiChatCompletionParams = (
   } = {};
 
   if (reasoningEnabled !== 'default') {
-    if (reasoningEffort) {
-      modelSpecificConfig.extra_body = {
-        google: {
-          thinking_config: {
-            thinking_level: reasoningEffort,
-            include_thoughts: true,
-          },
-        },
-      };
-    } else {
-      if (intent === 'insight') {
+    if (reasoningEnabled) {
+      // Gemini exposes different thinking knobs across model generations:
+      // 2.5 models support `thinking_budget`, while 3.x models support
+      // `thinking_level`. Pass through whichever user option is configured.
+      // `thinking_config` and OpenAI-compatible `reasoning_effort` are
+      // mutually exclusive. Request thought summaries only when we send
+      // `thinking_config`; otherwise fall back to `reasoning_effort: medium`.
+      if (reasoningEffort || reasoningBudget !== undefined) {
+        const thinkingConfig: {
+          include_thoughts: boolean;
+          thinking_level?: string;
+          thinking_budget?: number;
+        } = {
+          include_thoughts: true,
+        };
+        if (reasoningEffort) {
+          thinkingConfig.thinking_level = reasoningEffort;
+        }
+        if (reasoningBudget !== undefined) {
+          thinkingConfig.thinking_budget = reasoningBudget;
+        }
         modelSpecificConfig.extra_body = {
           google: {
-            thinking_config: {
-              // In real Gemini tests, insight calls need `include_thoughts` to get
-              // the model's thinking, and Gemini puts that thinking into `content`.
-              include_thoughts: true,
-            },
+            thinking_config: thinkingConfig,
           },
         };
+      } else {
+        modelSpecificConfig.reasoning_effort = 'medium';
       }
-
+    } else {
       // Gemini 3.x cannot fully disable native thinking, so use the lowest
       // supported effort unless the user explicitly requests another level.
       modelSpecificConfig.reasoning_effort = 'minimal';
@@ -196,7 +204,7 @@ export const extractGeminiContentAndReasoning = (
 export const geminiAdapters = {
   gemini: {
     chatCompletion: {
-      unsupportedUserConfig: ['reasoningEnabled', 'reasoningBudget'],
+      unsupportedUserConfig: [],
       buildChatCompletionParams: buildGeminiChatCompletionParams,
       extractContentAndReasoning: extractGeminiContentAndReasoning,
     },
diff --git a/packages/core/src/ai-model/prompt/extraction.ts b/packages/core/src/ai-model/prompt/extraction.ts
@@ -27,7 +27,10 @@ export function buildTypeQueryDemandValue(
 export function parseXMLExtractionResponse<T>(
   xmlString: string,
 ): AIDataExtractionResponse<T> {
-  const thought = extractXMLTag(xmlString, 'thought');
+  // Keep the internal field named `thought`, but ask models to emit
+  // <observation>. Gemini may only return <thought>-named content when
+  // thinking summaries are enabled.
+  const thought = extractXMLTag(xmlString, 'observation');
   const dataJsonStr = extractXMLTag(xmlString, 'data-json');
   const errorsStr = extractXMLTag(xmlString, 'errors');
 
@@ -107,7 +110,7 @@ When DATA_DEMAND is a JSON object, the keys in your response must exactly match
 
 
 Return in the following XML format:
-<thought>the thinking process of the extraction, less than 300 words. Use ${preferredLanguage} in this field.</thought>
+<observation>brief evidence observed for the extraction, less than 300 words. Use ${preferredLanguage} in this field.</observation>
 <data-json>the extracted data as JSON. Make sure both the value and scheme meet the DATA_DEMAND. If you want to write some description in this field, use the same language as the DATA_DEMAND.</data-json>
 <errors>optional error messages as JSON array, e.g., ["error1", "error2"]</errors>
 
@@ -124,7 +127,7 @@ For example, if the DATA_DEMAND is:
 
 By viewing the screenshot and page contents, you can extract the following data:
 
-<thought>According to the screenshot, i can see ...</thought>
+<observation>According to the screenshot, i can see ...</observation>
 <data-json>
 {
   "name": "John",
@@ -142,7 +145,7 @@ the todo items list, string[]
 
 By viewing the screenshot and page contents, you can extract the following data:
 
-<thought>According to the screenshot, i can see ...</thought>
+<observation>According to the screenshot, i can see ...</observation>
 <data-json>
 ["todo 1", "todo 2", "todo 3"]
 </data-json>
@@ -156,7 +159,7 @@ the page title, string
 
 By viewing the screenshot and page contents, you can extract the following data:
 
-<thought>According to the screenshot, i can see ...</thought>
+<observation>According to the screenshot, i can see ...</observation>
 <data-json>
 "todo list"
 </data-json>
@@ -172,7 +175,7 @@ If the DATA_DEMAND is:
 
 By viewing the screenshot and page contents, you can extract the following data:
 
-<thought>According to the screenshot, i can see ...</thought>
+<observation>According to the screenshot, i can see ...</observation>
 <data-json>
 { "StatementIsTruthy": true }
 </data-json>
diff --git a/packages/core/tests/unit-test/extraction.test.ts b/packages/core/tests/unit-test/extraction.test.ts
@@ -4,7 +4,7 @@ import { describe, expect, it } from 'vitest';
 describe('parseXMLExtractionResponse', () => {
   it('should parse complete XML response with all fields', () => {
     const xml = `
-<thought>According to the screenshot, I can see a user profile with name, age, and admin status</thought>
+<observation>According to the screenshot, I can see a user profile with name, age, and admin status</observation>
 <data-json>
 {
   "name": "John",
@@ -52,7 +52,7 @@ describe('parseXMLExtractionResponse', () => {
 
   it('should parse XML response with array data', () => {
     const xml = `
-<thought>I found three todo items in the list</thought>
+<observation>I found three todo items in the list</observation>
 <data-json>
 ["todo 1", "todo 2", "todo 3"]
 </data-json>
@@ -68,7 +68,7 @@ describe('parseXMLExtractionResponse', () => {
 
   it('should parse XML response with string data', () => {
     const xml = `
-<thought>The page title is "todo list"</thought>
+<observation>The page title is "todo list"</observation>
 <data-json>
 "todo list"
 </data-json>
@@ -84,7 +84,7 @@ describe('parseXMLExtractionResponse', () => {
 
   it('should parse XML response with boolean data', () => {
     const xml = `
-<thought>This is the SMS page</thought>
+<observation>This is the SMS page</observation>
 <data-json>
 { "result": true }
 </data-json>
@@ -100,7 +100,7 @@ describe('parseXMLExtractionResponse', () => {
 
   it('should parse XML response with errors', () => {
     const xml = `
-<thought>Failed to extract some data</thought>
+<observation>Failed to extract some data</observation>
 <data-json>
 {
   "name": "John"
@@ -138,10 +138,10 @@ describe('parseXMLExtractionResponse', () => {
 
   it('should handle multiline JSON in data-json', () => {
     const xml = `
-<thought>
+<observation>
   Extracting complex data structure
   from the screenshot
-</thought>
+</observation>
 <data-json>
 {
   "users": [
@@ -172,7 +172,7 @@ describe('parseXMLExtractionResponse', () => {
 
   it('should throw error when data-json is missing', () => {
     const xml = `
-<thought>Some thought</thought>
+<observation>Some thought</observation>
 <errors>[]</errors>
     `.trim();
 
@@ -183,7 +183,7 @@ describe('parseXMLExtractionResponse', () => {
 
   it('should throw error when data-json is invalid JSON', () => {
     const xml = `
-<thought>Some thought</thought>
+<observation>Some thought</observation>
 <data-json>
 {invalid json}
 </data-json>
@@ -196,7 +196,7 @@ describe('parseXMLExtractionResponse', () => {
 
   it('should throw error when data-json is markdown text instead of JSON', () => {
     const xml = `
-<thought>根据蓝色框选中的模块，提取到两个子页面：工具入口对比页、协商工具展示页，按照要求整理每个页面的信息如下。</thought>
+<observation>根据蓝色框选中的模块，提取到两个子页面：工具入口对比页、协商工具展示页，按照要求整理每个页面的信息如下。</observation>
 <data-json>
 # 页面名：工具入口（BEFORE/AFTER对比）
 # 页面描述：展示订单协商工具入口改造前后的界面对比，呈现不同阶段的订单列表页面样式，体现设计优化方向
@@ -239,7 +239,7 @@ invalid json array
 
   it('should handle case-insensitive tag matching', () => {
     const xml = `
-<THOUGHT>Case insensitive thought</THOUGHT>
+<OBSERVATION>Case insensitive thought</OBSERVATION>
 <DATA-JSON>
 {"result": "success"}
 </DATA-JSON>
@@ -253,7 +253,7 @@ invalid json array
 
   it('should parse nested objects correctly', () => {
     const xml = `
-<thought>Extracting nested data</thought>
+<observation>Extracting nested data</observation>
 <data-json>
 {
   "user": {
diff --git a/packages/core/tests/unit-test/inspect-extract-prompt.test.ts b/packages/core/tests/unit-test/inspect-extract-prompt.test.ts
@@ -45,7 +45,7 @@ describe('AiExtractElementInfo prompt assembly', () => {
     vi.clearAllMocks();
     vi.mocked(callAI).mockResolvedValue({
       content:
-        '<thought>Looks correct.</thought><data-json>{"result":true}</data-json>',
+        '<observation>Looks correct.</observation><data-json>{"result":true}</data-json>',
       usage: undefined,
       reasoning_content: undefined,
     } as any);
diff --git a/packages/core/tests/unit-test/model-adapter/gemini.test.ts b/packages/core/tests/unit-test/model-adapter/gemini.test.ts
diff --git a/packages/core/tests/unit-test/prompt/__snapshots__/prompt.test.ts.snap b/packages/core/tests/unit-test/prompt/__snapshots__/prompt.test.ts.snap