fix(core): add heap-pressure auto-compaction safety net #4186
Conversation
📋 Review Summary

This PR introduces a temporary heap-pressure safety net for chat compression, allowing auto-compaction to trigger based on V8 heap usage (≥70% of `heap_size_limit`).

🔍 General Feedback

🎯 Specific Feedback

🟡 High

🟢 Medium

🔵 Low

✅ Highlights
Additional context on why this PR is intentionally small: Memory/OOM has become an active area of investigation in this repo this week. Several contributors are already working on related but different slices, and I want to make the relationship between those efforts and this PR explicit.

Related work already in progress

My own related PRs / prior work:
Other contributors' related work:
Why this PR exists separately

The larger memory-related efforts above are valuable, but they are either diagnostic-only, broader in scope, or touch many files and therefore need more review time. Meanwhile, recent reports show users are already hitting V8 heap OOMs in long sessions. Based on this week's investigation and my local reproduction of the heap OOM pattern, the immediate failure mode is that token/context-based compaction can be too late. On large context-window models, especially 1M-context models, the 70% token threshold may not be reached before the V8 heap is already near its limit. At that point, extra allocations from history cloning, compression setup, large tool outputs, UI/logging state, stream processing, or error handling can push the process into an out-of-memory crash.

This PR is therefore intentionally a temporary mitigation rather than the final memory architecture.

What this PR does

The narrow fix here is to trigger auto-compaction when V8 heap usage reaches ≥70% of `heap_size_limit`, even before the token threshold is hit, by skipping only the token gate via a new `bypassTokenThreshold` service option; a sketch of that guard follows the next section.
What this PR does not try to solve

This PR does not implement:
Those should continue in the existing follow-up PRs/issues. The intent of this PR is to provide a quick, low-conflict safety net that reduces current long-session OOM risk while the more complete memory work continues.
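To make the guard concrete, here is a minimal sketch of the shape described above. It is illustrative only: `tryCompress` is declared as a stand-in for the real call in `geminiChat.ts`, and the constant name mirrors the snippets quoted in the review comments below.

```typescript
import { getHeapStatistics } from 'node:v8';

// Ratio from the PR description: compact once V8 heap usage reaches 70% of
// heap_size_limit, even if the token/context threshold has not been hit yet.
const HEAP_PRESSURE_COMPRESSION_RATIO = 0.7;

function isHeapPressureHigh(): boolean {
  const heapLimit = getHeapStatistics().heap_size_limit;
  if (!Number.isFinite(heapLimit) || heapLimit <= 0) return false;
  return process.memoryUsage().heapUsed / heapLimit >= HEAP_PRESSURE_COMPRESSION_RATIO;
}

// Stand-in for the real service call; only the option shape matters here.
declare function tryCompress(options: {
  force: boolean;
  bypassTokenThreshold: boolean;
}): Promise<void>;

async function maybeCompactUnderHeapPressure(): Promise<void> {
  if (isHeapPressureHigh()) {
    // Never pass force=true: that carries manual /compress semantics.
    await tryCompress({ force: false, bypassTokenThreshold: true });
  }
}
```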
```diff
  // steady-state path on every send; we want to exit before paying for the
  // full `getHistory(true)` clone below.
- if (!force) {
+ if (!force && !bypassTokenThreshold) {
```
[Critical] bypassTokenThreshold correctly bypasses the token threshold check here, but the early gate at line 223 above, if (threshold <= 0 || (hasFailedCompressionAttempt && !force)), returns NOOP before bypassTokenThreshold is ever read.

Scenario: 1) heap pressure triggers compression with bypassTokenThreshold=true; 2) the compression fails (API rate limiting, etc.) and hasFailedCompressionAttempt is set to true; 3) from then on, every heap-pressure compression is silently intercepted as NOOP by the line 223 gate, while tryCompress still logs the misleading warning "Heap pressure is high; attempting auto-compaction".

This means the heap-pressure safety net is permanently disabled after a single failure. bypassTokenThreshold needs to be folded into the early gate:

```diff
- if (!force && !bypassTokenThreshold) {
+ if (threshold <= 0 || (hasFailedCompressionAttempt && !force && !bypassTokenThreshold)) {
```

Also consider logging a debug/warn message when returning NOOP because of hasFailedCompressionAttempt, to avoid failing silently.
— DeepSeek/deepseek-v4-pro via Qwen Code /review
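Putting the two halves of that comment together, the widened gate with a non-silent NOOP might look roughly like this. This is a sketch under stated assumptions: the names mirror the snippets above, and the return shape plus the warn callback are simplified stand-ins for the service's real result types and debugLogger.

```typescript
// Early-gate sketch: let the heap-pressure path (bypassTokenThreshold) through
// even after a prior failed attempt, and log instead of returning silently.
interface NoopInfo {
  originalTokenCount: number;
  newTokenCount: number;
  compressionStatus: 'NOOP';
}

function earlyGate(
  threshold: number,
  force: boolean,
  bypassTokenThreshold: boolean,
  hasFailedCompressionAttempt: boolean,
  originalTokenCount: number,
  warn: (message: string) => void,
): NoopInfo | null {
  if (threshold <= 0 || (hasFailedCompressionAttempt && !force && !bypassTokenThreshold)) {
    if (hasFailedCompressionAttempt) {
      // Surface the skip so a latched failure stays visible in logs.
      warn('Skipping auto-compaction: a previous compression attempt failed (NOOP).');
    }
    return { originalTokenCount, newTokenCount: originalTokenCount, compressionStatus: 'NOOP' };
  }
  return null; // fall through to the token-threshold gate
}
```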
```typescript
// Temporary safety net: token-based compaction can be too late for
// large-context sessions because JS heap pressure may hit first.
// Do not use force=true here because that carries manual /compress
// semantics in ChatCompressionService.
```
[Suggestion] The heap-pressure warning log omits a key diagnostic value: the actual heap usage ratio. When investigating an OOM after the fact, there is no way to tell whether the pressure was 0.71 or 0.99. Consider including the ratio:

```diff
  // semantics in ChatCompressionService.
+ debugLogger.warn(
+   `Heap pressure at ${(ratio * 100).toFixed(1)}%; attempting auto-compaction before token threshold.`,
+ );
```
— DeepSeek/deepseek-v4-pro via Qwen Code /review
```diff
@@ -544,6 +558,17 @@ export class GeminiChat {
    return info;
```
[Suggestion] isHeapPressureHigh() calls getHeapStatistics() on every invocation, but heap_size_limit is static for the lifetime of the process (fixed at V8 startup), so it can be cached. Also, heapUsed covers only the V8-managed heap and excludes external memory (ArrayBuffers, Buffer), so OOMs driven by large tool responses may go undetected. That is acceptable for a temporary safety net; consider folding in process.memoryUsage().external in a later iteration.
— DeepSeek/deepseek-v4-pro via Qwen Code /review
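One possible follow-up shape combining both points. This is a heuristic sketch, not the PR's code: external memory does not count against V8's heap_size_limit, so folding it into this ratio only approximates total pressure.

```typescript
import { getHeapStatistics } from 'node:v8';

// heap_size_limit is fixed when V8 starts, so read it once and cache it.
const HEAP_SIZE_LIMIT = getHeapStatistics().heap_size_limit;
const HEAP_PRESSURE_COMPRESSION_RATIO = 0.7;

function isHeapPressureHigh(): boolean {
  if (!Number.isFinite(HEAP_SIZE_LIMIT) || HEAP_SIZE_LIMIT <= 0) return false;
  // Fold in external allocations (Buffers, ArrayBuffers); they live outside
  // the V8-managed heap but still contribute to process memory pressure.
  const { heapUsed, external } = process.memoryUsage();
  return (heapUsed + external) / HEAP_SIZE_LIMIT >= HEAP_PRESSURE_COMPRESSION_RATIO;
}
```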
```typescript
// but as a safeguard:
if (comprehensiveHistory.length === 0) {
const lastComprehensiveMessage = chat.getLastHistoryEntry();
// If comprehensive history is empty, there is no last message to check.
```
[Suggestion] getLastHistoryEntry() reads comprehensive (raw) history, which is critical here — the function_response check below must see tool-result messages that curated history would strip. However, this dependency on raw vs. curated semantics is implicit.
The same function calls chat.getHistory(true) (curated) for the LLM prompt just below. The two data sources serve different purposes but the distinction is easy to miss during future refactors. If someone changes getLastHistoryEntry() to return curated entries, the isFunctionResponse short-circuit silently stops matching and the model falls through to an unnecessary LLM call on every turn.
```diff
  // If comprehensive history is empty, there is no last message to check.
+ // getLastHistoryEntry() reads comprehensive (raw) history by design —
+ // we must see function_response messages that curated history would strip.
  const lastComprehensiveMessage = chat.getLastHistoryEntry();
```
— glm-5.1 via Qwen Code /review
```diff
  // steady-state path on every send; we want to exit before paying for the
  // full `getHistory(true)` clone below.
- if (!force) {
+ if (!force && !bypassTokenThreshold) {
```
[Suggestion] After bypassTokenThreshold skips this token gate, the code still proceeds to chat.getHistory(true) (a full structuredClone of the entire history) and an LLM compression call. If the heap pressure originates from non-history sources (large tool outputs, file caches, external allocations), this clone can worsen memory pressure rather than relieve it.

Consider adding a cheap pre-check using chat.getHistoryLength() before paying for the full clone. If the history is small, compression won't shrink it meaningfully and the clone is wasteful:
```diff
  if (!force && !bypassTokenThreshold) {
+ if (bypassTokenThreshold && chat.getHistoryLength() < MIN_COMPRESSIBLE_HISTORY_LENGTH) {
+   return {
+     newHistory: null,
+     info: {
+       originalTokenCount,
+       newTokenCount: originalTokenCount,
+       compressionStatus: CompressionStatus.NOOP,
+     },
+   };
+ }
```
— glm-5.1 via Qwen Code /review
doudouOUC left a comment

Code Review
Overall assessment: the PR's design is sound. The semantic distinction between bypassTokenThreshold and force=true is clear, and the nextSpeakerChecker optimization is clean with no regressions. A few issues deserve attention:

🟠 High priority: the hasFailedCompressionAttempt latch turns the safety net into a one-shot mechanism

Location: the interaction between the early gate around line 216 of chatCompressionService.ts and the new bypassTokenThreshold.

When heap pressure triggers compression (bypassTokenThreshold=true) but the compression fails, hasFailedCompressionAttempt is set to true. On subsequent messages, even if heap pressure keeps rising, the service exits early at the front gate (NOOP) and never reaches the token-threshold gate:

```typescript
// Line 216: the hasFailedCompressionAttempt gate, unaffected by bypassTokenThreshold
if (threshold <= 0 || (hasFailedCompressionAttempt && !force)) {
  return { newHistory: null, info: { ... NOOP } };
}

// Line 234: bypassTokenThreshold only bypasses this
if (!force && !bypassTokenThreshold) { ... }
```

Impact: the safety net is effectively one-shot; after a single compression failure it is permanently disabled. If the heap keeps growing, the OOM still happens.

Suggested fixes:

- Fold bypassTokenThreshold into the line 216 gate as well: (hasFailedCompressionAttempt && !force && !bypassTokenThreshold)
- Or do not set hasFailedCompressionAttempt = true when compression was triggered by heap pressure
🟡 Medium priority: isHeapPressureHigh() makes two native calls

getHeapStatistics() already provides used_heap_size (the same V8 metric as process.memoryUsage().heapUsed). A single call yields both numerator and denominator, avoiding a redundant synchronous native call on the hot path and a small TOCTOU inconsistency:

```typescript
private isHeapPressureHigh(): boolean {
  const stats = getHeapStatistics();
  const heapLimit = stats.heap_size_limit;
  if (!Number.isFinite(heapLimit) || heapLimit <= 0) return false;
  return stats.used_heap_size / heapLimit >= HEAP_PRESSURE_COMPRESSION_RATIO;
}
```

🟡 Medium priority: missing test for a key interaction scenario

The current tests cover the happy path where bypassTokenThreshold=true bypasses the token gate, but not the hasFailedCompressionAttempt=true && bypassTokenThreshold=true scenario. Consider adding a test for this boundary case to confirm the behavior matches expectations, along the lines of the sketch below.
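A sketch of that boundary test, assuming the repo's vitest setup. The harness, the service shape, and the compress method name are hypothetical placeholders; a real test should reuse the existing chatCompressionService test utilities.

```typescript
import { describe, expect, it } from 'vitest';

// Hypothetical helper: builds a service whose hasFailedCompressionAttempt
// latch is already set, plus a chat with compressible history.
declare function makeServiceWithFailedAttempt(): {
  service: {
    compress(
      chat: unknown,
      options: { force: boolean; bypassTokenThreshold: boolean },
    ): Promise<{ info: { compressionStatus: string } }>;
  };
  chat: unknown;
};

describe('failed-attempt latch vs. heap-pressure bypass', () => {
  it('still attempts compression when bypassTokenThreshold is set', async () => {
    const { service, chat } = makeServiceWithFailedAttempt();
    const result = await service.compress(chat, {
      force: false,
      bypassTokenThreshold: true,
    });
    // With the proposed gate fix, heap-pressure compaction must not be
    // silently swallowed by the failed-attempt latch.
    expect(result.info.compressionStatus).not.toBe('NOOP');
  });
});
```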
🔵 Low priority: the structuredClone in getLastHistoryEntry()

This is a marked improvement over the previous full clone. In nextSpeakerChecker the result is only inspected for role and whether it is a function response (purely read-only operations), so if profiling ever identifies this as a bottleneck, a copy-free read-only variant could be provided. The current implementation is correct and safe; no urgent action needed.
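If such a copy-free variant ever becomes worthwhile, one possible shape is sketched below. The names and structural types are hypothetical stand-ins for the real history entry type; this is not the PR's code.

```typescript
// Copy-free view exposing only the fields nextSpeakerChecker actually reads,
// so callers cannot mutate live history through the returned value.
interface HistoryEntry {
  role?: string;
  parts?: Array<{ functionResponse?: unknown }>;
}

interface LastEntryView {
  readonly role: string | undefined;
  readonly isFunctionResponse: boolean;
}

function peekLastHistoryEntry(history: readonly HistoryEntry[]): LastEntryView | null {
  const last = history.at(-1);
  if (!last) return null;
  return {
    role: last.role,
    isFunctionResponse: !!last.parts?.some((p) => p.functionResponse !== undefined),
  };
}
```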
🤖 Generated with Qwen Code
Summary
- `GeminiChat.tryCompress()`: when V8 `heapUsed / heap_size_limit >= 70%`, the chat attempts auto-compaction even if the token/context threshold has not been reached yet.
- Add `bypassTokenThreshold` to `ChatCompressionService` so heap-pressure compaction can skip only the token gate while preserving automatic compaction semantics. It deliberately does not use `force=true`, because `force=true` carries manual `/compress` behavior.
- Add `GeminiChat.getLastHistoryEntry()` and use it in `nextSpeakerChecker` to avoid cloning the full comprehensive history just to inspect the last turn.
- `packages/core/src/core/geminiChat.ts`: heap-pressure guard and the decision to keep `force=false`.
- `packages/core/src/services/chatCompressionService.ts`: `bypassTokenThreshold` skips only the token threshold gate and still respects the existing failed-compression latch.
- `packages/core/src/utils/nextSpeakerChecker.ts`: O(1) last-history inspection instead of a full comprehensive-history clone.

Validation
- Manual `/compress` still uses `force=true`.
- The heap-pressure path never forces compression; only `bypassTokenThreshold` is set.
- `nextSpeakerChecker` calls `getHistory(true)` only once and uses `getLastHistoryEntry()` for comprehensive-history last-turn checks.
- `cd packages/core && npm run typecheck` passed.
- `npm run build && npm run typecheck` passed. The build emitted existing non-blocking warnings for stale Browserslist data and a `packages/vscode-ide-companion/src/utils/editorGroupUtils.ts` curly lint warning, with 0 errors.
- Tests cover `bypassTokenThreshold`, the service-level threshold bypass, defensive last-history clone behavior, and avoiding the second full-history clone in `nextSpeakerChecker`.

Scope / Risk
- The heap-pressure path avoids `force=true` specifically to sidestep manual `/compress` semantics and to keep the change narrow.
- No changes to `/doctor memory` diagnostics.
- `bypassTokenThreshold` is an internal optional service option.

Coordination with related memory work
This PR is intentionally small and transitional because several broader memory efforts are already open or being discussed:
- `/doctor memory` diagnostics PRs.
- `/resume` memory behavior.

The goal here is to reduce default long-session OOM risk without taking ownership of the full memory-diagnostics/tool-offload redesign.
Testing Matrix
Testing matrix notes:
Linked Issues / Bugs