Resolve the feedback case in 反馈/0512-1/ by first understanding how long web search results and long tool results enter model context, then choosing an implementation plan. This task is about long web/search/tool result payloads entering context, not generic long-context support.
- P1 Option B has been implemented locally and validated. The user confirmed the direction:
- Option B is the P1 mainline.
- Option D is the long-term enhancement.
- Option C is deferred to P2.
- Option D has now also been implemented locally and validated as a sidecar artifact store.
- P1.1 vendor-practice research is complete.
- P1.2 local code-path investigation is complete.
- P1.4/P1.5 implementation validation is complete for the shared
function_call_outputingress path and Gemini native duplicate path. - Claude project memories have been imported into Codex memory extension notes:
/Users/alysechen/.codex/memories/extensions/ad_hoc/notes/2026-05-13T01-24-19-codex-app-transfer-claude-memory-import.md. - P2 has been implemented locally and validated.
- Do not treat "lower the auto-compact threshold" as the solution. The feedback points to tool/search payload budgeting and compact input construction.
- P1. Why one large research task creates so much context
- P1.1 Collect current official vendor practices for long web/search/tool results entering context.
- P1.2 Map this project's current web search and tool-output paths against those practices.
- P1.3 Review implementation方案 with the user and choose one.
- P1.4 After confirmation, write tests for the chosen方案.
- P1.5 After confirmation, implement and validate the chosen方案.
- P1.6 User review before starting P2.
- P2. Why compact grows after threshold trigger
- P2.1 Start only after P1方案 is confirmed.
- P2.2 Add compact request input-budget regression tests.
- P2.3 Implement compact-only input budgeting and explicit omission metadata.
- P2.4 Validate compact, adapters, and registry tests.
- P3. Why Kimi displays about 244k/245k instead of 256k
- P3.1 Start only after P1/P2 sequencing is confirmed.
- P4. Option D artifact/source store
- P4.1 Add sidecar store for raw tool/search/page payloads.
- P4.2 Put artifact IDs and bounded evidence in model-visible tool content.
- P4.3 Validate normal Responses, compact, and Gemini native paths.
These notes are based on current official documentation.
- OpenAI Responses exposes web search as a hosted
web_searchtool. - Responses include a
web_search_callitem and URL citation annotations rather than requiring clients to paste raw page bodies into ordinary history. - OpenAI exposes consulted
sourcesseparately from inline citations. - OpenAI's web search context path is capped at 128k even when the selected model has a larger context window.
- File search exposes
max_num_results, and raw search results are not returned by default unless requested withinclude.
Sources:
- https://developers.openai.com/api/docs/guides/tools-web-search
- https://developers.openai.com/api/docs/guides/tools-file-search
- Claude's web search is a server tool.
- Anthropic documents that basic web search can be token-intensive because it can fetch full HTML.
- The newer dynamic filtering path filters retrieved search content before loading it into context.
max_useslimits search count.- Custom
search_resultblocks advise returning only the most relevant results, preserving source/title/content metadata, and splitting long content into logical blocks.
Sources:
- https://platform.claude.com/docs/en/agents-and-tools/tool-use/web-search-tool
- https://platform.claude.com/docs/en/build-with-claude/search-results
- Google frames web search as grounding.
GroundingMetadataseparates search queries, grounding chunks, grounding supports, and search entry points.- Web chunks carry URI/title/domain evidence instead of being treated as undifferentiated chat text.
Sources:
- https://docs.cloud.google.com/vertex-ai/generative-ai/docs/grounding/grounding-with-google-search
- https://docs.cloud.google.com/vertex-ai/generative-ai/docs/reference/rest/v1/GroundingMetadata
- xAI's current documented web search tool is on the Responses API path.
- It exposes citations/sources and source controls such as
allowed_domainsandexcluded_domains. - The documented model is server-side search plus citations, not client-side replay of every fetched page.
Source:
- Perplexity Search API is a concrete example of explicit result budgeting.
- It exposes
max_results,max_tokens_per_page, and totalmax_tokens. - It distinguishes raw ranked search results from LLM-generated summaries.
Source:
- Moonshot base chat does not access web/databases/code unless tool workflows are used.
- Kimi
$web_searchis implemented through builtin tool-call compatibility. - Kimi documents that search results count into
prompt_tokensand exposes usage/search-token information. - Kimi K2.5 has tool-use and thinking compatibility constraints.
Sources:
- https://platform.moonshot.cn/docs/introduction
- https://platform.kimi.com/docs/guide/use-web-search
- https://platform.kimi.com/docs/guide/kimi-k2-5-quickstart
Across these vendors, the common pattern is not "put every web page and every tool result into normal conversation history."
The common pattern is:
- Search/browse runs as a tool or grounding layer.
- Results carry source metadata and citations.
- Model-visible content is bounded by result count, domain filters, dynamic filtering, chunks, excerpts, or explicit token budgets.
- Raw trace/source lists are kept separate from normal conversation text.
- If retrieved content counts toward prompt tokens, the API exposes enough usage metadata to budget it.
Files in 反馈/0512-1/:
image.png: screenshot shows the first-turn research workflow auto-compacted, then failed withInvalid request: Your request exceeded model token limit: 262144 (requested: 336979). The same screenshot shows the UI context meter at244k / 245k tokens used.bundle-20260512-232148-61422-1778599308510.json: the relevant failing upstream bundle. It isPOST /responses/compacttohttps://api.kimi.com/coding/v1/chat/completions, with a transformed chat body of854476bytes. The stored diagnostic body is truncated to251194characters and reports592332omitted bytes. The upstream response body is the Kimi token-limit error above.- The stored diagnostic body contains many prior tool messages and includes one shell/search tool result with
Original token count: 924828. That result is a local grep over extracted Codex app assets, including minified JavaScript, and it is being replayed as model-visiblerole:"tool"content. bundle-20260512-014743-42908-1778521663471.jsonandbundle-20260512-014743-42908-1778521663481.json: unrelated Grok Web 401 bundles, useful only as folder inventory.codex-config.redacted.tomlandproxy-config.redacted.json: redacted config snapshots. The Kimi provider path matters because native$web_searchis not the root failure path here.[反馈] (无标题) · fb-da4d0cc8.pdf:pdftotextfound no extractable text; the actionable feedback text is represented by the screenshot.
Conclusion from the feedback bundle:
The compact request exceeded Kimi's real context because raw tool outputs had already entered the conversation history before compact. The observed failure is not solved by lowering the auto-compact trigger. The first priority is bounding and structuring long tool/search outputs before they become replayable chat history.
docs/web-search-implementation-tracker.mdrecords that Codex.app sends{type:"web_search"}in the inbound tools array by default, and this proxy converts it per provider.- In
crates/adapters/src/responses/request.rs, provider-native web search is only forwarded for supported providers and only when enabled by provider options. MiMo and Kimi/Moonshot have special handling; DeepSeek and MiniMax are dropped because their documented Chat APIs do not support native web search tools. - In
crates/adapters/src/gemini_native/request.rs,web_searchcan become GeminigoogleSearch, but it is later dropped when normal function declarations are also present because Gemini rejects built-in search plus function calling in the same request. - In
crates/adapters/src/grok_web/response.rs, Grok server-sidewebSearchResultsare converted into short reasoning summaries and URL citations. This path is closer to the vendor pattern.
This path is not the direct root cause when it is truly active and bounded. The important caveat is that it is not always active.
The riskier path is generic function_call_output.
responses_body_to_chat_body_for_provider_with_sessionbuilds Chat messages from Responsesinput, then clones those messages into session state.input_item_to_messagesmapsfunction_call_output.outputinto Chatrole:"tool"with fullcontent.- There is no current model-visible budget, source metadata split, URL/title/snippet normalization, or raw-payload separation at that point.
compact.rsreuses this same conversion path, so compact inherits whatever tool content is present in the input.
This is the direct mismatch with vendor practice: ordinary tool output is treated as replayable chat content.
- The active Kimi Code config in
反馈/0512-1/proxy-config.redacted.jsonhas emptyrequestOptions, so this project's Kimi/Moonshot native$web_searchpath was not enabled. - The screenshot shows a first-turn large research workflow with browser/search/source-code exploration actions.
- The saved compact diagnostics show many tool messages and large tool content before compaction.
Current P1 hypothesis:
The failing first-turn research case is mainly caused by local/browser/MCP research outputs entering the next upstream request as full Chat
tool.content, not by Kimi-native$web_searchitself.
User decision:
- P1 mainline: Option B.
- Long-term enhancement: Option D.
- P2 safety layer: Option C.
- Option A remains a compatibility improvement, but it is not sufficient for this feedback case.
Make native web_search the preferred path when the active provider supports it, and avoid encouraging fallback to local/browser search tools.
What it changes:
- enable or improve provider-native mapping for supported providers;
- surface unsupported-provider behavior clearly;
- keep citations/source metadata from provider responses.
Pros:
- closest to OpenAI/Anthropic/Google/xAI documented model;
- less local payload management;
- provider can do its own result filtering and citation handling.
Cons:
- provider support is inconsistent;
- Gemini cannot combine
googleSearchwith function calling in the current path; - Kimi
$web_searchstill counts search content toward prompt tokens; - does not solve arbitrary MCP/browser/local tool outputs.
Best use:
- as a compatibility improvement, not sufficient alone for the feedback failure.
Add a shared normalization/budget layer before function_call_output becomes Chat tool.content.
What it changes:
- classify tool outputs into search results, web pages, command/logs, code/file output, opaque JSON, and small output;
- keep tool call IDs intact;
- store raw output separately when possible;
- put only bounded, structured evidence into model-visible
tool.content; - enforce per-result and aggregate budgets before session cache and compact.
Pros:
- directly addresses the feedback failure path;
- applies to normal turns and compact because both use the shared converter;
- preserves provider-neutral behavior;
- aligns with vendor practice for bounded retrieved content.
Cons:
- needs careful tests to avoid breaking tool-call repair;
- risk of losing useful information if classification is too crude;
- raw artifact/storage design must be decided if we want more than simple truncation.
Best use:
- likely core P1 solution, but implementation details need confirmation before coding.
Scope:
- Add a shared bounded tool-output normalization layer before
function_call_output.outputis written into Chattool.content. - Primary hook:
crates/adapters/src/responses/request.rs::input_item_to_messages, because normal/responsesand/responses/compactboth pass through this conversion path. - The compact path currently builds a synthetic Responses body and reuses the same converter in
crates/adapters/src/responses/compact.rs::build_compact_chat_request, so fixing the shared ingress covers both normal turns and compact input. - Gemini native has related
function_call_outputhandling incrates/adapters/src/gemini_native/request.rs; P1 should verify whether it reuses the shared path for the affected route or needs the same normalizer called explicitly.
Behavior:
- Preserve
call_id/tool_call_idexactly so tool-call repair and provider ordering rules do not regress. - Leave small tool outputs unchanged.
- For large command/log outputs, replace raw content with a bounded evidence summary: command metadata when available, exit code, original byte/character count, first useful lines, last useful lines, matched file paths or headings, and an explicit omission marker.
- For search/web-like results, keep bounded structured evidence: title, URL/domain, snippet/excerpt, result count, and omitted-size metadata. Do not replay full HTML or minified assets as ordinary chat text.
- For large JSON/object output, keep a compact shape/key summary plus bounded excerpts instead of serializing the entire object.
- Do not introduce destructive silent drop. The model-visible message must say content was compressed and what was retained.
- Do not depend on the long-term artifact store in P1. Raw payload storage with source IDs belongs to Option D.
Tests to write before implementation:
- A regression fixture based on
反馈/0512-1/: afunction_call_outputcontainingOriginal token count: 924828and minified JavaScript must convert to boundedtool.content, while preservingtool_call_id. /responses/compactmust use the same bounded output path and must not re-embed the raw tool result before appending the compact prompt.- Small
function_call_outputstrings remain byte-for-byte visible. - Non-string JSON output remains valid and bounded.
- Tool-call repair tests around orphan
function_call_outputcontinue to pass. - Compaction summary items remain unchanged; this fix targets tool outputs, not existing
compactionitems.
Validation target:
- For the feedback-shaped fixture, transformed chat body size should be bounded by policy constants rather than proportional to raw tool output size.
- The transformed request must include explicit omission metadata so the model understands that full raw content was not replayed.
- Existing provider web-search tests for Kimi/MiMo/DeepSeek/MiniMax should remain unchanged because P1 does not change provider-native web-search mapping.
Leave normal tool output conversion unchanged, but add pruning inside /responses/compact.
What it changes:
- compact would prune old/large tool outputs before appending the summary prompt;
- compact would possibly drop irrelevant tools and reduce output budget.
Pros:
- smaller code surface;
- directly reduces the observed compact failure.
Cons:
- too late: normal requests can still become huge before compact;
- does not fix first-turn research context growth itself;
- can still leave session history full of raw tool content.
Best use:
- a P2 safety layer after P1, not a standalone P1 solution.
Root cause:
crates/adapters/src/responses/compact.rs::build_compact_chat_requestcopies original compactinputinto a synthetic Responses body.- It then appends
COMPACT_SUMMARIZATION_PROMPTas an additional user message. - The synthetic body is converted to Chat messages and sent upstream with
max_output_tokens = 20_000. - Before P2 there is no total budget for compact input, so a request can trigger compact near the visible context threshold and then grow further because summary instructions and output allowance are added on top.
Implementation target:
- Apply a compact-only budget after Responses-to-Chat conversion, because that is the actual upstream payload shape.
- Always preserve the final summarization prompt.
- Preserve the newest conversation blocks first.
- Keep assistant tool-call groups together with their following tool outputs so provider tool-order validation does not regress.
- If older content is pruned, insert a model-visible omission notice with counts and user-message excerpts, rather than silently dropping context.
- Keep this separate from P1/P4. P1/P4 prevents raw tool/search payload replay; P2 prevents compact itself from becoming oversized even when history is still large.
Implemented behavior:
build_compact_chat_requestnow applies a compact-only budget after Responses-to-Chat conversion.- The final Chat
messagesarray is bounded byCOMPACT_CHAT_MESSAGES_MAX_BYTES. - The final summarization prompt is always preserved as the last user message.
- The newest conversation blocks are preserved first.
- Assistant messages with
tool_callsare grouped with their followingtoolmessages so compact pruning does not create provider-invalid orphan tool messages. - Oversized retained messages and tool-call arguments are shortened with explicit markers.
- When the budget layer is applied, a user-visible compact notice is inserted with original message count, omitted message count, original JSON byte size, omitted byte size, and omitted user-message excerpts.
Move raw tool/search/page payloads into a sidecar store and pass only source IDs plus selected excerpts to the model.
What it changes:
- raw search/page/tool payloads are saved outside prompt context;
- model-visible content contains source IDs, URL/title/snippet/excerpt, byte counts, and omission metadata;
- follow-up tool calls can request more detail by source ID.
Pros:
- closest to mature retrieval architecture;
- best long-term answer for large research sessions;
- avoids irreversible information loss from simple truncation.
Cons:
- larger product/design surface;
- needs storage lifecycle, privacy, cleanup, and replay fixture decisions;
- may need UI/diagnostic changes.
Implemented scope:
- Added
crates/adapters/src/responses/artifact_store.rs. - Added a new registry path for
~/.codex-app-transfer/tool_artifacts.db. - Large
function_call_outputpayloads are written toToolArtifactStorebefore model-visible chat history is built. - Model-visible
tool.contentnow contains:Artifact ID;- preserved
Tool call ID; - artifact kind such as
command_output,web_or_search,file_or_code_output,json, oropaque_tool_output; - original character and line counts;
- original token count markers when present;
- path hints, URL hints, and bounded head/tail excerpts;
- explicit omission metadata.
- Small tool outputs remain inline unchanged.
- The OpenAI-compatible Responses path,
/responses/compact, and Gemini native normalized-chat path share the same long-output normalization behavior.
Storage lifecycle:
- Default persistent store is
~/.codex-app-transfer/tool_artifacts.db. - Store uses SQLite with a 30-day persisted TTL, matching the existing session-cache retention model.
- If SQLite initialization or writes fail, the store falls back to bounded in-memory storage and emits stable warning IDs.
- Unit tests use in-memory global artifact storage to avoid writing to the developer's real home directory.
Boundary:
- This solves raw payload replay into model context and session history.
- It does not add a new model-callable retrieval tool, because upstream-generated tool calls must correspond to tools known by the Codex client. Adding a hidden provider-only retrieval tool would create unsupported calls on the client side unless the proxy also owns that execution loop.
Best use:
- long-term robust version of Option B.
- P1: implement Option B with tests first.
- P2: add compact-only safety pruning if P1 still leaves edge cases.
- Long-term: design Option D artifact/source store with source IDs and retrieval.
- 2026-05-13: Created this task document after the user clarified that P1 is about long web/search/tool results entering context.
- 2026-05-13: Completed vendor-practice research.
- 2026-05-13: Completed local code-path investigation for provider-native web search and generic
function_call_output. - 2026-05-13: Corrected the document after an implementation was started without user confirmation. The code changes were withdrawn; this document now returns P1 to方案 review state.
- 2026-05-13: Imported this project's Claude memory index into Codex memory extension notes.
- 2026-05-13: User confirmed Option B as P1, Option D as long-term, and Option C as P2.
- 2026-05-13: Rechecked all files under
反馈/0512-1/and tied the P1 root cause to raw tool output entering/responses/compact. - 2026-05-13: Implemented Option B by bounding large
function_call_outputcontent before chat-history construction and compact prompt construction. - 2026-05-13: Added regression coverage for the OpenAI-compatible Responses conversion path, compact request construction, and Gemini native normalized-chat path.
- 2026-05-13: User asked to continue Option D and fully solve the long tool/search payload context issue before P2.
- 2026-05-13: Implemented Option D sidecar artifact storage and changed large tool output summaries from simple compression markers to artifact IDs plus bounded evidence.
- 2026-05-13: User asked to continue P2.
- 2026-05-13: Implemented compact-only input budgeting after Responses-to-Chat conversion, preserving the final prompt and newest valid message blocks while making budget pruning explicit.
- P1.1 validation is documentation-level: conclusions are tied to vendor docs listed above.
- P1.2 validation is code-reading level: findings are tied to local files listed in the code-path sections.
- P1.2 feedback validation is tied to
反馈/0512-1/bundle-20260512-232148-61422-1778599308510.jsonand反馈/0512-1/image.png. - P1.4/P1.5 validation commands:
cargo fmt --checkcargo test -p codex-app-transfer-adapters large_tool_outputcargo test -p codex-app-transfer-adapters large_function_call_output_is_boundedcargo test -p codex-app-transfer-adapters
- P4 validation commands:
cargo fmt --checkcargo test -p codex-app-transfer-adapters artifactcargo test -p codex-app-transfer-adapters large_function_call_outputcargo test -p codex-app-transfer-adapters large_tool_outputcargo test -p codex-app-transfer-adapterscargo test -p codex-app-transfer-registry
- P2 validation commands:
cargo fmt --checkcargo test -p codex-app-transfer-adapters build_compact_chat_request_cargo test -p codex-app-transfer-adapters large_tool_outputcargo test -p codex-app-transfer-adapters large_function_call_outputcargo test -p codex-app-transfer-adapters artifactcargo test -p codex-app-transfer-adapterscargo test -p codex-app-transfer-registry
Review the P1/P2/P4 implementation result with the user. P3 remains next if the user wants to continue.