Commit c33fec3

fix(core): update gemini insight reasoning handling
1 parent 45ad814 commit c33fec3

10 files changed

Lines changed: 154 additions & 81 deletions

apps/site/docs/en/model-common-config.mdx

Lines changed: 1 addition & 1 deletion
@@ -212,7 +212,7 @@ MIDSCENE_MODEL_FAMILY="gpt-5"
 :::
 
 :::note Model-native thinking
-Midscene disables model-native thinking by default for the best execution speed and stability. To turn it on for any of the models above, set `MIDSCENE_MODEL_REASONING_ENABLED="true"`. Some families accept an extra knob — for example `MIDSCENE_MODEL_REASONING_BUDGET` (Qwen) or `MIDSCENE_MODEL_REASONING_EFFORT` (Doubao). See [Model-Native Thinking Mode](./model-strategy#model-native-thinking-mode).
+Midscene disables model-native thinking by default for the best execution speed and stability. To turn it on for any of the models above, set `MIDSCENE_MODEL_REASONING_ENABLED="true"`. Some families accept extra controls such as `MIDSCENE_MODEL_REASONING_BUDGET` and `MIDSCENE_MODEL_REASONING_EFFORT`. See [Model-Native Thinking Mode](./model-strategy#model-native-thinking-mode).
 :::
 
 ## Other Models

apps/site/docs/en/model-strategy.mdx

Lines changed: 8 additions & 3 deletions
@@ -139,7 +139,7 @@ Note: The implementation behind `deepThink` may change in the future as Midscene
 
 Midscene disables model-native thinking by default to get better execution speed and stability. If the model itself does not support disabling native thinking, Midscene reduces model-native thinking as much as possible by controlling thinking granularity or limiting the thinking budget.
 
-You can control model-native thinking with the following reasoning settings.
+You can control model-native thinking with the following reasoning settings. These are unified Midscene-level abstractions, and the parameters actually sent to the model are mapped by model family.
 
 - `MIDSCENE_MODEL_REASONING_ENABLED`: Explicitly controls whether model-native thinking is enabled.
   - `false`: Force-disable model-native thinking (Midscene's default behavior).
@@ -150,16 +150,21 @@ You can control model-native thinking with the following reasoning settings.
   - Doubao: corresponds to `thinking.type` in the Doubao docs.
   - Zhipu GLM: corresponds to `thinking.type` in the Zhipu docs.
   - GPT-5: corresponds to `reasoning_effort` in the OpenAI docs. Midscene uses `medium` when enabled and `none` when disabled.
+  - Gemini: when no Gemini-specific knob is configured, Midscene uses OpenAI-compatible `reasoning_effort`, with `medium` when enabled and `minimal` when disabled. When Gemini-specific knobs are configured, Midscene uses them first.
   - Kimi: corresponds to `thinking.type` in the Kimi docs.
   - Xiaomi Mimo: corresponds to `thinking.type` in the Xiaomi Mimo docs.
 - `MIDSCENE_MODEL_REASONING_BUDGET`: Controls the model's thinking budget. The following models currently support this setting:
   - Qwen: corresponds to `thinking_budget` in the Qwen docs.
+  - Gemini 2.5 series: corresponds to `thinking_config.thinking_budget` in the Gemini docs.
 - `MIDSCENE_MODEL_REASONING_EFFORT`: Controls the model's thinking effort. The following models currently support this setting:
   - Doubao: corresponds to `reasoning_effort` in the Doubao docs.
-  - Gemini: corresponds to `thinking_config.thinking_level` in the Gemini docs.
+  - Gemini 3.x series: corresponds to `thinking_config.thinking_level` in the Gemini docs.
   - GPT-5: corresponds to `reasoning_effort` in the OpenAI docs.
 
-Note: Different model providers use different parameters to control model-native thinking. For specific values and supported model versions, see the official docs from each model provider. If an explicit reasoning setting is not supported by the current model, Midscene ignores that setting instead of guessing provider-specific private parameters.
+Note:
+
+- Different model providers use different parameters to control model-native thinking. For specific values and supported model versions, see the official docs from each model provider. If an explicit reasoning setting is not supported by the current model, Midscene ignores that setting instead of guessing provider-specific private parameters.
+- For Gemini, the OpenAI-compatible `reasoning_effort` and the Gemini-specific `thinking_config` cannot be used together. Gemini only generates thinking summaries through `include_thoughts` inside `thinking_config`, so Midscene sends `include_thoughts` for Gemini only when `MIDSCENE_MODEL_REASONING_BUDGET` or `MIDSCENE_MODEL_REASONING_EFFORT` is explicitly configured.
 
 ### "MIDSCENE_MODEL_FAMILY is not set to a multimodal model" error
 
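
The mapping documented in this file can be summarized as a small lookup from each Midscene-level setting to the provider parameter the docs name. This is only a sketch of the documented behavior, not Midscene's actual data structure: the family keys and the `providerParam` helper are hypothetical names chosen for illustration, and entries on lines outside the visible hunks are left out.

```typescript
// Hypothetical summary table; Midscene's real adapters are structured differently.
type ReasoningSetting = 'enabled' | 'budget' | 'effort';
type ParamMap = Partial<Record<ReasoningSetting, string>>;

// Family keys are illustrative labels, not confirmed MIDSCENE_MODEL_FAMILY values.
const documentedParams: Record<string, ParamMap> = {
  doubao: { enabled: 'thinking.type', effort: 'reasoning_effort' },
  glm: { enabled: 'thinking.type' },
  'gpt-5': { enabled: 'reasoning_effort', effort: 'reasoning_effort' },
  gemini: {
    // Used only when no Gemini-specific knob is configured.
    enabled: 'reasoning_effort',
    // Documented for the Gemini 2.5 series.
    budget: 'thinking_config.thinking_budget',
    // Documented for the Gemini 3.x series.
    effort: 'thinking_config.thinking_level',
  },
  kimi: { enabled: 'thinking.type' },
  mimo: { enabled: 'thinking.type' },
  qwen: { budget: 'thinking_budget' },
};

// Returns the documented provider parameter, or undefined when the docs list
// no mapping (the docs say Midscene ignores unsupported settings).
function providerParam(
  family: string,
  setting: ReasoningSetting,
): string | undefined {
  return documentedParams[family]?.[setting];
}
```

A lookup that returns `undefined` mirrors the documented rule that an unsupported setting is ignored rather than translated into a guessed provider parameter.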

apps/site/docs/zh/model-common-config.mdx

Lines changed: 1 addition & 1 deletion
@@ -217,7 +217,7 @@ MIDSCENE_MODEL_FAMILY="gpt-5"
 :::
 
 :::note Model-native thinking
-Midscene disables model-native thinking by default for the best execution speed and stability. To turn it on for any of the models above, set `MIDSCENE_MODEL_REASONING_ENABLED="true"`. Some model families also support extra controls, for example `MIDSCENE_MODEL_REASONING_BUDGET` (Qwen) or `MIDSCENE_MODEL_REASONING_EFFORT` (Doubao). See [Model-Native Thinking Mode](./model-strategy#模型原生的思考模式)
+Midscene disables model-native thinking by default for the best execution speed and stability. To turn it on for any of the models above, set `MIDSCENE_MODEL_REASONING_ENABLED="true"`. Some model families also support extra controls such as `MIDSCENE_MODEL_REASONING_BUDGET` and `MIDSCENE_MODEL_REASONING_EFFORT`. See [Model-Native Thinking Mode](./model-strategy#模型原生的思考模式)
 :::
 
 ## Other Models

apps/site/docs/zh/model-strategy.mdx

Lines changed: 8 additions & 3 deletions
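@@ -140,7 +140,7 @@ Midscene provides data-processing interfaces based on page understanding, such as AI assertion (`aiA
 
 Midscene disables model-native thinking by default to get better execution speed and stability. If the model itself does not support disabling native thinking, Midscene reduces model-native thinking as much as possible by controlling thinking granularity or limiting the thinking budget.
 
-You can control model-native thinking with the reasoning settings below.
+You can control model-native thinking with the reasoning settings below. These settings are a unified abstraction at the Midscene layer, and the parameters actually sent to the model are mapped according to the model family.
 
 - `MIDSCENE_MODEL_REASONING_ENABLED`: Explicitly controls whether model-native thinking is enabled.
   - `false`: Force-disables model-native thinking (Midscene's default behavior).
@@ -151,16 +151,21 @@ Midscene disables model-native thinking by default to get better execution speed and
   - Doubao: corresponds to `thinking.type` in the Doubao docs.
   - Zhipu GLM: corresponds to `thinking.type` in the Zhipu docs.
   - GPT-5: corresponds to `reasoning_effort` in the OpenAI docs; uses `medium` by default when enabled and `none` when disabled.
+  - Gemini: when no Gemini-specific control is configured, Midscene uses the OpenAI-compatible `reasoning_effort`, with `medium` when enabled and `minimal` when disabled. If Gemini-specific controls are configured, Midscene uses them first.
   - Kimi: corresponds to `thinking.type` in the Kimi docs.
   - Xiaomi Mimo: corresponds to `thinking.type` in the Xiaomi Mimo docs.
 - `MIDSCENE_MODEL_REASONING_BUDGET`: Controls the model's thinking budget. The following models currently support this parameter:
   - Qwen: corresponds to `thinking_budget` in the Qwen docs.
+  - Gemini 2.5 series: corresponds to `thinking_config.thinking_budget` in the Gemini docs.
 - `MIDSCENE_MODEL_REASONING_EFFORT`: Controls the model's thinking effort. The following models currently support this parameter:
   - Doubao: corresponds to `reasoning_effort` in the Doubao docs.
-  - Gemini: corresponds to `thinking_config.thinking_level` in the Gemini docs.
+  - Gemini 3.x series: corresponds to `thinking_config.thinking_level` in the Gemini docs.
   - GPT-5: corresponds to `reasoning_effort` in the OpenAI docs.
 
-Note: The parameters that control model-native thinking differ across model providers. For specific values and applicable model versions, see each provider's official docs. If an explicit reasoning setting is not supported by the current model, Midscene ignores it rather than guessing the provider's private parameters.
+Note:
+
+- The parameters that control model-native thinking differ across model providers. For specific values and applicable model versions, see each provider's official docs. If an explicit reasoning setting is not supported by the current model, Midscene ignores it rather than guessing the provider's private parameters.
+- For Gemini, the OpenAI-compatible `reasoning_effort` and the Gemini-private `thinking_config` cannot be used together. In addition, Gemini generates thinking summaries only when `include_thoughts` is set inside `thinking_config`. Midscene therefore sends `include_thoughts` for Gemini only when `MIDSCENE_MODEL_REASONING_BUDGET` or `MIDSCENE_MODEL_REASONING_EFFORT` is explicitly configured.
 
 ### "MIDSCENE_MODEL_FAMILY is not set to a multimodal model" error
 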

packages/core/src/ai-model/models/gemini.ts

Lines changed: 28 additions & 20 deletions
@@ -30,8 +30,8 @@ type GeminiContentSource =
 const buildGeminiChatCompletionParams = (
   input: ChatCompletionCallContext,
 ): ChatCompletionParamsResult => {
-  const { midsceneDefaults, userConfig, intent } = input;
-  const { reasoningEnabled, reasoningEffort } = userConfig;
+  const { midsceneDefaults, userConfig } = input;
+  const { reasoningEnabled, reasoningEffort, reasoningBudget } = userConfig;
   const commonOverrideConfig: Record<string, unknown> = {};
 
   if (userConfig.temperature !== undefined) {
@@ -48,28 +48,36 @@ const buildGeminiChatCompletionParams = (
   } = {};
 
   if (reasoningEnabled !== 'default') {
-    if (reasoningEffort) {
-      modelSpecificConfig.extra_body = {
-        google: {
-          thinking_config: {
-            thinking_level: reasoningEffort,
-            include_thoughts: true,
-          },
-        },
-      };
-    } else {
-      if (intent === 'insight') {
+    if (reasoningEnabled) {
+      // Gemini exposes different thinking knobs across model generations:
+      // 2.5 models support `thinking_budget`, while 3.x models support
+      // `thinking_level`. Pass through whichever user option is configured.
+      // `thinking_config` and OpenAI-compatible `reasoning_effort` are
+      // mutually exclusive. Request thought summaries only when we send
+      // `thinking_config`; otherwise fall back to `reasoning_effort: medium`.
+      if (reasoningEffort || reasoningBudget !== undefined) {
+        const thinkingConfig: {
+          include_thoughts: boolean;
+          thinking_level?: string;
+          thinking_budget?: number;
+        } = {
+          include_thoughts: true,
+        };
+        if (reasoningEffort) {
+          thinkingConfig.thinking_level = reasoningEffort;
+        }
+        if (reasoningBudget !== undefined) {
+          thinkingConfig.thinking_budget = reasoningBudget;
+        }
         modelSpecificConfig.extra_body = {
           google: {
-            thinking_config: {
-              // In real Gemini tests, insight calls need `include_thoughts` to get
-              // the model's thinking, and Gemini puts that thinking into `content`.
-              include_thoughts: true,
-            },
+            thinking_config: thinkingConfig,
           },
         };
+      } else {
+        modelSpecificConfig.reasoning_effort = 'medium';
       }
-
+    } else {
       // Gemini 3.x cannot fully disable native thinking, so use the lowest
       // supported effort unless the user explicitly requests another level.
       modelSpecificConfig.reasoning_effort = 'minimal';
@@ -196,7 +204,7 @@ export const extractGeminiContentAndReasoning = (
 export const geminiAdapters = {
   gemini: {
     chatCompletion: {
-      unsupportedUserConfig: ['reasoningEnabled', 'reasoningBudget'],
+      unsupportedUserConfig: [],
       buildChatCompletionParams: buildGeminiChatCompletionParams,
       extractContentAndReasoning: extractGeminiContentAndReasoning,
     },
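
To make the new branching easier to follow outside the diff, here is a standalone sketch of the decision the added lines implement. It is not the actual adapter: `sketchGeminiReasoningParams` and its types are hypothetical, the `boolean | 'default'` type for `reasoningEnabled` is inferred from how the hunk compares it, and the `'default'` case returns an empty object only as a placeholder because that path is not visible in the hunk. Any handling after the visible lines is not modeled.

```typescript
// Assumed shape: the hunk compares against 'default' and then tests truthiness.
type ReasoningEnabled = boolean | 'default';

interface ThinkingConfig {
  include_thoughts: boolean;
  thinking_level?: string;
  thinking_budget?: number;
}

interface GeminiReasoningParams {
  reasoning_effort?: string;
  extra_body?: { google: { thinking_config: ThinkingConfig } };
}

// Hypothetical helper mirroring only the branches shown in the diff.
function sketchGeminiReasoningParams(
  reasoningEnabled: ReasoningEnabled,
  reasoningEffort?: string,
  reasoningBudget?: number,
): GeminiReasoningParams {
  if (reasoningEnabled === 'default') {
    // Placeholder: this path is outside the visible hunk.
    return {};
  }
  if (!reasoningEnabled) {
    // Disabled: lowest supported effort, sent as OpenAI-compatible reasoning_effort.
    return { reasoning_effort: 'minimal' };
  }
  if (reasoningEffort || reasoningBudget !== undefined) {
    // A Gemini-specific knob is set: send thinking_config and request summaries.
    const thinkingConfig: ThinkingConfig = { include_thoughts: true };
    if (reasoningEffort) {
      thinkingConfig.thinking_level = reasoningEffort;
    }
    if (reasoningBudget !== undefined) {
      thinkingConfig.thinking_budget = reasoningBudget;
    }
    return { extra_body: { google: { thinking_config: thinkingConfig } } };
  }
  // Enabled without Gemini-specific knobs: fall back to reasoning_effort.
  return { reasoning_effort: 'medium' };
}
```

Each branch returns either `reasoning_effort` or `thinking_config`, never both, which matches the mutual exclusivity stated in the code comment. A budget of `0` still selects the `thinking_config` path because the check is `!== undefined` rather than a truthiness test.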

packages/core/src/ai-model/prompt/extraction.ts

Lines changed: 9 additions & 6 deletions
@@ -27,7 +27,10 @@ export function buildTypeQueryDemandValue(
 export function parseXMLExtractionResponse<T>(
   xmlString: string,
 ): AIDataExtractionResponse<T> {
-  const thought = extractXMLTag(xmlString, 'thought');
+  // Keep the internal field named `thought`, but ask models to emit
+  // <observation>. Gemini may only return <thought>-named content when
+  // thinking summaries are enabled.
+  const thought = extractXMLTag(xmlString, 'observation');
   const dataJsonStr = extractXMLTag(xmlString, 'data-json');
   const errorsStr = extractXMLTag(xmlString, 'errors');
 
@@ -107,7 +110,7 @@ When DATA_DEMAND is a JSON object, the keys in your response must exactly match
 
 
 Return in the following XML format:
-<thought>the thinking process of the extraction, less than 300 words. Use ${preferredLanguage} in this field.</thought>
+<observation>brief evidence observed for the extraction, less than 300 words. Use ${preferredLanguage} in this field.</observation>
 <data-json>the extracted data as JSON. Make sure both the value and scheme meet the DATA_DEMAND. If you want to write some description in this field, use the same language as the DATA_DEMAND.</data-json>
 <errors>optional error messages as JSON array, e.g., ["error1", "error2"]</errors>
 
@@ -124,7 +127,7 @@ For example, if the DATA_DEMAND is:
 
 By viewing the screenshot and page contents, you can extract the following data:
 
-<thought>According to the screenshot, i can see ...</thought>
+<observation>According to the screenshot, i can see ...</observation>
 <data-json>
 {
   "name": "John",
@@ -142,7 +145,7 @@ the todo items list, string[]
 
 By viewing the screenshot and page contents, you can extract the following data:
 
-<thought>According to the screenshot, i can see ...</thought>
+<observation>According to the screenshot, i can see ...</observation>
 <data-json>
 ["todo 1", "todo 2", "todo 3"]
 </data-json>
@@ -156,7 +159,7 @@ the page title, string
 
 By viewing the screenshot and page contents, you can extract the following data:
 
-<thought>According to the screenshot, i can see ...</thought>
+<observation>According to the screenshot, i can see ...</observation>
 <data-json>
 "todo list"
 </data-json>
@@ -172,7 +175,7 @@ If the DATA_DEMAND is:
 
 By viewing the screenshot and page contents, you can extract the following data:
 
-<thought>According to the screenshot, i can see ...</thought>
+<observation>According to the screenshot, i can see ...</observation>
 <data-json>
 { "StatementIsTruthy": true }
 </data-json>
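
The effect of the tag rename on parsing can be illustrated with a self-contained sketch. The real `extractXMLTag` is not shown in this commit, so `extractTagSketch` and `parseExtractionSketch` below are hypothetical stand-ins; the case-insensitive matching and the error on a missing `<data-json>` are assumptions taken from the test names in this commit, not from the helper's source.

```typescript
// Hypothetical stand-in for extractXMLTag: first match, case-insensitive, trimmed.
function extractTagSketch(xml: string, tag: string): string | undefined {
  // Escape regex metacharacters so the tag name is matched literally.
  const escaped = tag.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const pattern = new RegExp(`<${escaped}>([\\s\\S]*?)</${escaped}>`, 'i');
  const match = pattern.exec(xml);
  return match?.[1]?.trim();
}

// Hypothetical reduced parser: the internal field keeps the name `thought`,
// but its value is read from the <observation> tag.
function parseExtractionSketch<T>(xml: string): { thought?: string; data: T } {
  const thought = extractTagSketch(xml, 'observation');
  const dataJson = extractTagSketch(xml, 'data-json');
  if (dataJson === undefined) {
    throw new Error('Missing <data-json> in extraction response');
  }
  return { thought, data: JSON.parse(dataJson) as T };
}
```

In this sketch, a response that only carries a `<thought>` tag leaves `thought` undefined instead of being picked up as the model's own output, which is the separation the rename appears to aim for.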

packages/core/tests/unit-test/extraction.test.ts

Lines changed: 12 additions & 12 deletions
@@ -4,7 +4,7 @@ import { describe, expect, it } from 'vitest';
 describe('parseXMLExtractionResponse', () => {
   it('should parse complete XML response with all fields', () => {
     const xml = `
-<thought>According to the screenshot, I can see a user profile with name, age, and admin status</thought>
+<observation>According to the screenshot, I can see a user profile with name, age, and admin status</observation>
 <data-json>
 {
   "name": "John",
@@ -52,7 +52,7 @@ describe('parseXMLExtractionResponse', () => {
 
   it('should parse XML response with array data', () => {
     const xml = `
-<thought>I found three todo items in the list</thought>
+<observation>I found three todo items in the list</observation>
 <data-json>
 ["todo 1", "todo 2", "todo 3"]
 </data-json>
@@ -68,7 +68,7 @@ describe('parseXMLExtractionResponse', () => {
 
   it('should parse XML response with string data', () => {
     const xml = `
-<thought>The page title is "todo list"</thought>
+<observation>The page title is "todo list"</observation>
 <data-json>
 "todo list"
 </data-json>
@@ -84,7 +84,7 @@ describe('parseXMLExtractionResponse', () => {
 
   it('should parse XML response with boolean data', () => {
     const xml = `
-<thought>This is the SMS page</thought>
+<observation>This is the SMS page</observation>
 <data-json>
 { "result": true }
 </data-json>
@@ -100,7 +100,7 @@ describe('parseXMLExtractionResponse', () => {
 
   it('should parse XML response with errors', () => {
     const xml = `
-<thought>Failed to extract some data</thought>
+<observation>Failed to extract some data</observation>
 <data-json>
 {
   "name": "John"
@@ -138,10 +138,10 @@ describe('parseXMLExtractionResponse', () => {
 
   it('should handle multiline JSON in data-json', () => {
     const xml = `
-<thought>
+<observation>
 Extracting complex data structure
 from the screenshot
-</thought>
+</observation>
 <data-json>
 {
   "users": [
@@ -172,7 +172,7 @@ describe('parseXMLExtractionResponse', () => {
 
   it('should throw error when data-json is missing', () => {
     const xml = `
-<thought>Some thought</thought>
+<observation>Some thought</observation>
 <errors>[]</errors>
 `.trim();
 
@@ -183,7 +183,7 @@ describe('parseXMLExtractionResponse', () => {
 
   it('should throw error when data-json is invalid JSON', () => {
     const xml = `
-<thought>Some thought</thought>
+<observation>Some thought</observation>
 <data-json>
 {invalid json}
 </data-json>
@@ -196,7 +196,7 @@ describe('parseXMLExtractionResponse', () => {
 
   it('should throw error when data-json is markdown text instead of JSON', () => {
     const xml = `
-<thought>根据蓝色框选中的模块,提取到两个子页面:工具入口对比页、协商工具展示页,按照要求整理每个页面的信息如下。</thought>
+<observation>根据蓝色框选中的模块,提取到两个子页面:工具入口对比页、协商工具展示页,按照要求整理每个页面的信息如下。</observation>
 <data-json>
 # 页面名:工具入口(BEFORE/AFTER对比)
 # 页面描述:展示订单协商工具入口改造前后的界面对比,呈现不同阶段的订单列表页面样式,体现设计优化方向
@@ -239,7 +239,7 @@ invalid json array
 
   it('should handle case-insensitive tag matching', () => {
     const xml = `
-<THOUGHT>Case insensitive thought</THOUGHT>
+<OBSERVATION>Case insensitive thought</OBSERVATION>
 <DATA-JSON>
 {"result": "success"}
 </DATA-JSON>
@@ -253,7 +253,7 @@ invalid json array
 
   it('should parse nested objects correctly', () => {
     const xml = `
-<thought>Extracting nested data</thought>
+<observation>Extracting nested data</observation>
 <data-json>
 {
   "user": {

packages/core/tests/unit-test/inspect-extract-prompt.test.ts

Lines changed: 1 addition & 1 deletion
@@ -45,7 +45,7 @@ describe('AiExtractElementInfo prompt assembly', () => {
     vi.clearAllMocks();
     vi.mocked(callAI).mockResolvedValue({
       content:
-        '<thought>Looks correct.</thought><data-json>{"result":true}</data-json>',
+        '<observation>Looks correct.</observation><data-json>{"result":true}</data-json>',
       usage: undefined,
       reasoning_content: undefined,
     } as any);
