fix(core): add heap-pressure auto-compaction safety net#4186

Open
yiliang114 wants to merge 1 commit into QwenLM:main from yiliang114:codex/4185-heap-pressure-safety

Conversation

@yiliang114
Collaborator

Summary

  • What changed:
    • Add a temporary heap-pressure safety net in GeminiChat.tryCompress(): when V8 heapUsed / heap_size_limit >= 70%, the chat attempts auto-compaction even if the token/context threshold has not been reached yet.
    • Add bypassTokenThreshold to ChatCompressionService so heap-pressure compaction can skip only the token gate while preserving automatic compaction semantics. It deliberately does not use force=true, because force=true carries manual /compress behavior.
    • Add GeminiChat.getLastHistoryEntry() and use it in nextSpeakerChecker to avoid cloning the full comprehensive history just to inspect the last turn.
  • Why it changed:
    • Recent OOM reports show V8 heap can reach its limit before token-based compaction runs, especially in long sessions, large-context models, and large tool-output/diff workflows.
    • This is intended as a small transitional mitigation while broader memory work is in progress.
  • Reviewer focus:
    • packages/core/src/core/geminiChat.ts: heap-pressure guard and the decision to keep force=false.
    • packages/core/src/services/chatCompressionService.ts: bypassTokenThreshold skips only the token threshold gate and still respects the existing failed-compression latch.
    • packages/core/src/utils/nextSpeakerChecker.ts: O(1) last-history inspection instead of a full comprehensive-history clone.
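The heap-pressure guard described above can be sketched as follows. This is a minimal sketch based on the PR description; the constant and function names follow the summary, but the exact implementation in geminiChat.ts may differ.

```typescript
import { getHeapStatistics } from 'node:v8';

// Threshold from the PR description: trigger when V8 heap usage reaches
// 70% of the heap limit.
const HEAP_PRESSURE_COMPRESSION_RATIO = 0.7;

function isHeapPressureHigh(): boolean {
  const heapLimit = getHeapStatistics().heap_size_limit;
  // Guard against environments reporting a zero or invalid limit.
  if (!Number.isFinite(heapLimit) || heapLimit <= 0) return false;
  const { heapUsed } = process.memoryUsage();
  return heapUsed / heapLimit >= HEAP_PRESSURE_COMPRESSION_RATIO;
}
```

When this returns true, tryCompress() passes bypassTokenThreshold to the compression service instead of force, keeping automatic-compaction semantics.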

Validation

  • Commands run:
    cd packages/core && npx vitest run src/core/geminiChat.test.ts src/services/chatCompressionService.test.ts src/utils/nextSpeakerChecker.test.ts
    cd packages/core && npm run typecheck
    npm run build && npm run typecheck
  • Prompts / inputs used:
    • No live prompt was required. Unit tests mock high V8 heap pressure and compression-service behavior.
  • Expected result:
    • Heap pressure bypasses only the token gate and does not set force=true.
    • Under-threshold compression remains a NOOP unless bypassTokenThreshold is set.
    • nextSpeakerChecker calls getHistory(true) only once and uses getLastHistoryEntry() for comprehensive-history last-turn checks.
    • Workspace build and typecheck complete successfully.
  • Observed result:
    ✓ src/services/chatCompressionService.test.ts (54 tests)
    ✓ src/utils/nextSpeakerChecker.test.ts (11 tests)
    ✓ src/core/geminiChat.test.ts (81 tests)
    
    Test Files  3 passed (3)
         Tests  146 passed (146)
    
    cd packages/core && npm run typecheck passed.
npm run build && npm run typecheck passed. The build emitted existing non-blocking warnings (stale Browserslist data and a curly lint warning in packages/vscode-ide-companion/src/utils/editorGroupUtils.ts), with 0 errors.
  • Quickest reviewer verification path:
    cd packages/core && npx vitest run src/core/geminiChat.test.ts src/services/chatCompressionService.test.ts src/utils/nextSpeakerChecker.test.ts
  • Evidence:
    • Added tests cover heap-pressure bypassTokenThreshold, service-level threshold bypass, defensive last-history clone behavior, and avoiding the second full-history clone in nextSpeakerChecker.

Scope / Risk

  • Main risk or tradeoff:
    • This can start auto-compaction earlier when heap pressure is high. That is intentional, but compression itself still has peak-memory cost. This PR avoids force=true specifically to avoid manual /compress semantics and to keep the change narrow.
  • Not covered / not validated:
    • This does not claim to fix all OOM classes.
    • It does not implement large tool-result offload, stream/logging retention reduction, heap snapshots, or /doctor memory diagnostics.
    • It does not reproduce a multi-hour live OOM session locally; validation is targeted unit coverage plus build/typecheck.
  • Breaking changes / migration notes:
    • None. bypassTokenThreshold is an internal optional service option.

Coordination with related memory work

This PR is intentionally small and transitional because several broader memory efforts are already open or being discussed:

The goal here is to reduce default long-session OOM risk without taking ownership of the full memory-diagnostics/tool-offload redesign.

Testing Matrix

|          | 🍏 | 🪟 | 🐧 |
|----------|----|----|----|
| npm run  | ⚠️ | ⚠️ |    |
| npx      | ⚠️ | ⚠️ |    |
| Docker   | ⚠️ | ⚠️ | ⚠️ |
| Podman   | ⚠️ | N/A | N/A |
| Seatbelt | ⚠️ | N/A | N/A |

Testing matrix notes:

  • Tested locally on macOS arm64 with focused core vitest, core typecheck, and root build/typecheck.
  • Windows/Linux/container execution was not run locally; the changed logic is Node/V8 process-memory logic plus TypeScript unit coverage.

Linked Issues / Bugs

@github-actions
Contributor

📋 Review Summary

This PR introduces a temporary heap-pressure safety net for chat compression, allowing auto-compaction to trigger based on V8 heap usage (≥70% of heap_size_limit) even when the token threshold hasn't been reached. It also adds a getLastHistoryEntry() method for O(1) last-message inspection, which is used in nextSpeakerChecker to avoid cloning the full comprehensive history. The implementation is well-tested, narrowly scoped, and addresses a real OOM mitigation need while broader memory work is in progress.

🔍 General Feedback

  • Positive aspects:

    • The heap-pressure bypass is narrowly scoped and intentionally avoids force=true semantics, keeping manual /compress behavior distinct from automatic compaction.
    • Good test coverage across all three modified areas (heap-pressure bypass, bypassTokenThreshold service option, and getLastHistoryEntry() optimization).
    • The O(1) last-history inspection is a sensible performance optimization that avoids unnecessary structuredClone costs.
    • Code follows existing patterns and conventions; the changes integrate naturally with the existing compression flow.
  • Architectural notes:

    • The separation between force (manual semantics) and bypassTokenThreshold (automatic heap-pressure relief) is well-justified and correctly implemented.
    • The failed-compression latch (hasFailedCompressionAttempt) is correctly respected even when bypassing the token gate.

🎯 Specific Feedback

🟡 High

  • File: packages/core/src/core/geminiChat.ts:507-518 — The isHeapPressureHigh() check uses process.memoryUsage().heapUsed / heapLimit >= 0.7. This is a reasonable default, but the 0.7 threshold is hardcoded and may not suit all environments (e.g., containers with tight memory limits). Consider making this configurable via ChatCompression settings or an environment variable, especially since the PR description acknowledges this is a "transitional mitigation."

  • File: packages/core/src/core/geminiChat.ts:507 — The bypassTokenThreshold flag is set based on heap pressure at the start of tryCompress(), but there's no logging of the actual heap percentage that triggered it. Adding the actual ratio to the debug log (e.g., 'Heap pressure is high (X%); attempting auto-compaction...') would aid debugging and incident investigation.

🟢 Medium

  • File: packages/core/src/services/chatCompressionService.ts:237 — The condition if (!force && !bypassTokenThreshold) correctly gates the token-threshold check, but the comment above it ("steady-state path on every send") is now slightly outdated since bypassTokenThreshold introduces a non-steady-state path. Consider updating the comment to reflect that this gate is skipped when heap pressure is high.

  • File: packages/core/src/utils/nextSpeakerChecker.ts:63-73 — The variable naming is slightly confusing: comprehensiveHistory is assigned but then the comment says "If comprehensive history is empty" (lowercase). Also, the early-return check at lines 66-70 is now redundant with the getLastHistoryEntry() check at line 71. Consider consolidating to just use getLastHistoryEntry() and remove the explicit length check.

  • File: packages/core/src/core/geminiChat.ts:1188-1191 — The getLastHistoryEntry() method returns a defensive clone of only the last entry, which is correct. However, the JSDoc comment says "O(1) inspections" but doesn't explicitly note that this is still O(n) in the size of the last Content.parts array due to structuredClone(). This is a minor documentation clarification.

🔵 Low

  • File: packages/core/src/core/geminiChat.ts:58 — The constant HEAP_PRESSURE_COMPRESSION_RATIO = 0.7 is well-named, but consider adding a brief comment explaining why 70% was chosen (e.g., "Leaves 30% headroom for compression overhead and subsequent operations").

  • File: packages/core/src/services/chatCompressionService.ts:186-190 — The bypassTokenThreshold JSDoc is clear, but it might benefit from a cross-reference to the heap-pressure check in geminiChat.ts that sets this flag, helping future maintainers trace the flow.

  • File: packages/core/src/utils/nextSpeakerChecker.ts — The test file mocks both getHistory and getLastHistoryEntry, which is correct for testing. However, the test at line 278-295 ("should avoid cloning comprehensive history...") asserts getHistory is called once and getLastHistoryEntry once, but the mock setup at line 94-101 mocks both methods. Consider adding a test that verifies getHistory(/* curated */ true) is still called (for the curated history used in the LLM request) while getLastHistoryEntry() is used for the function-response check.

✅ Highlights

  • Excellent test coverage: All three test files have been updated with targeted tests for the new heap-pressure bypass, bypassTokenThreshold service option, and getLastHistoryEntry() optimization. The 146 passing tests provide strong confidence in the changes.

  • Thoughtful design: The decision to use bypassTokenThreshold instead of force=true demonstrates careful consideration of the different semantics between manual and automatic compaction. This keeps the change narrowly scoped as intended.

  • Performance-conscious: The getLastHistoryEntry() optimization avoids cloning the entire comprehensive history just to inspect the last message, which is a meaningful improvement for long conversation histories.

  • Build and typecheck pass: The PR author confirmed that npm run build && npm run typecheck completed successfully with zero errors, and all relevant test suites pass.

@yiliang114
Collaborator Author

Additional context on why this PR is intentionally small:

Memory/OOM has become an active area of investigation in this repo this week. Several contributors are already working on related but different slices, and I want to make the relationship between those efforts and this PR explicit.

Related work already in progress

My own related PRs / prior work:

Other contributors' related work:

Why this PR exists separately

The larger memory-related efforts above are valuable, but they are either diagnostic-only, broader in scope, or touch many files and therefore need more review time. Meanwhile, recent reports show users are already hitting V8 heap OOMs in long sessions.

Based on this week's investigation and my local reproduction of the heap OOM pattern, the immediate failure mode is that token/context-based compaction can be too late. On large context-window models, especially 1M-context models, the 70% token threshold may not be reached before V8 heap is already near its limit. At that point, extra allocations from history cloning, compression setup, large tool outputs, UI/logging state, stream processing, or error handling can push the process into:

```
FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
```

This PR is therefore intentionally a temporary mitigation rather than the final memory architecture.

What this PR does

The narrow fix here is:

  1. Add a V8 heap-pressure safety net based on heapUsed / heap_size_limit.
  2. When heap pressure is high, bypass only the token threshold gate for auto-compaction.
  3. Keep force=false, so this does not take the broader manual /compress path.
  4. Avoid one known full-history clone in nextSpeakerChecker by adding an O(1) last-history accessor.
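Steps 1–3 can be sketched roughly as follows. This is an illustrative sketch: the option names follow the PR description, but the real ChatCompressionService also applies a failed-compression latch and derives the threshold from model context limits.

```typescript
// Two distinct flags with different semantics, per the PR description.
interface CompressOptions {
  force?: boolean;                // manual /compress semantics
  bypassTokenThreshold?: boolean; // heap-pressure relief only
}

// Returns true when compression should be skipped at the token gate.
function shouldSkipTokenGate(
  tokenCount: number,
  threshold: number,
  opts: CompressOptions,
): boolean {
  // Both flags proceed past the token gate, but only `force` carries
  // manual-compress behavior further downstream.
  if (opts.force || opts.bypassTokenThreshold) return false;
  return tokenCount < threshold;
}
```

The key point is that heap pressure only relaxes the token gate; everything downstream still behaves like automatic compaction.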

What this PR does not try to solve

This PR does not implement:

  • /doctor memory diagnostics
  • heap snapshots
  • large tool-result offload
  • stream/logging retention reduction
  • a full auto-compaction threshold redesign
  • session resume storage redesign

Those should continue in the existing follow-up PRs/issues.

The intent of this PR is to provide a quick, low-conflict safety net that reduces current long-session OOM risk while the more complete memory work continues.

```
  // steady-state path on every send; we want to exit before paying for the
  // full `getHistory(true)` clone below.
- if (!force) {
+ if (!force && !bypassTokenThreshold) {
```

[Critical] bypassTokenThreshold correctly bypasses the token-threshold check here, but the early gate at line 223 above, if (threshold <= 0 || (hasFailedCompressionAttempt && !force)), returns NOOP before bypassTokenThreshold is ever read.

Scenario: 1) heap pressure triggers compression with bypassTokenThreshold=true; 2) the compression fails (API rate limiting, etc.) and hasFailedCompressionAttempt is set to true; 3) from then on, every heap-pressure compression is silently short-circuited to NOOP by the gate at line 223, while tryCompress still logs the misleading "Heap pressure is high; attempting auto-compaction" warning.

This means the heap-pressure safety net is permanently disabled after a single failure. bypassTokenThreshold needs to be folded into the early gate:

Suggested change

```
- if (!force && !bypassTokenThreshold) {
+ if (threshold <= 0 || (hasFailedCompressionAttempt && !force && !bypassTokenThreshold)) {
```

Also consider logging a debug/warn message when returning NOOP due to hasFailedCompressionAttempt, to avoid silent failure.

— DeepSeek/deepseek-v4-pro via Qwen Code /review

```
// Temporary safety net: token-based compaction can be too late for
// large-context sessions because JS heap pressure may hit first.
// Do not use force=true here because that carries manual /compress
// semantics in ChatCompressionService.
```

[Suggestion] The heap-pressure warning log is missing a diagnostic value: the actual heap usage ratio. When investigating an OOM after the fact, there is no way to tell whether the pressure was 0.71 or 0.99. Suggest including the ratio:

Suggested change

```
// semantics in ChatCompressionService.
debugLogger.warn(
  `Heap pressure at ${(ratio * 100).toFixed(1)}%; attempting auto-compaction before token threshold.`,
);
```

— DeepSeek/deepseek-v4-pro via Qwen Code /review

```
@@ -544,6 +558,17 @@ export class GeminiChat {
    return info;
```

[Suggestion] isHeapPressureHigh() calls getHeapStatistics() on every invocation, but heap_size_limit is static for the lifetime of the process (fixed at V8 startup), so the value could be cached. Also, heapUsed only covers the V8-managed heap and excludes external memory (ArrayBuffers, Buffer), so OOMs driven by large tool responses may go undetected. Acceptable as a temporary safety net; consider incorporating process.memoryUsage().external in a later iteration.

— DeepSeek/deepseek-v4-pro via Qwen Code /review

```
// but as a safeguard:
if (comprehensiveHistory.length === 0) {
const lastComprehensiveMessage = chat.getLastHistoryEntry();
// If comprehensive history is empty, there is no last message to check.
```

[Suggestion] getLastHistoryEntry() reads comprehensive (raw) history, which is critical here — the function_response check below must see tool-result messages that curated history would strip. However, this dependency on raw vs. curated semantics is implicit.

The same function calls chat.getHistory(true) (curated) for the LLM prompt just below. The two data sources serve different purposes but the distinction is easy to miss during future refactors. If someone changes getLastHistoryEntry() to return curated entries, the isFunctionResponse short-circuit silently stops matching and the model falls through to an unnecessary LLM call on every turn.

Suggested change

```
// If comprehensive history is empty, there is no last message to check.
// getLastHistoryEntry() reads comprehensive (raw) history by design —
// we must see function_response messages that curated history would strip.
const lastComprehensiveMessage = chat.getLastHistoryEntry();
```

— glm-5.1 via Qwen Code /review

```
  // steady-state path on every send; we want to exit before paying for the
  // full `getHistory(true)` clone below.
- if (!force) {
+ if (!force && !bypassTokenThreshold) {
```

[Suggestion] After bypassTokenThreshold skips this token gate, the code still proceeds to chat.getHistory(true) (full structuredClone of entire history) and an LLM compression call. If heap pressure originates from non-history sources (large tool outputs, file caches, external allocations), this clone can worsen memory pressure rather than relieve it.

Consider adding a cheap pre-check using chat.getHistoryLength() before paying for the full clone. If the history is small, compression won't shrink it meaningfully and the clone is wasteful:

Suggested change

```
if (!force && !bypassTokenThreshold) {
if (bypassTokenThreshold && chat.getHistoryLength() < MIN_COMPRESSIBLE_HISTORY_LENGTH) {
  return {
    newHistory: null,
    info: {
      originalTokenCount,
      newTokenCount: originalTokenCount,
      compressionStatus: CompressionStatus.NOOP,
    },
  };
}
```

— glm-5.1 via Qwen Code /review


@doudouOUC doudouOUC left a comment


Code Review

Overall assessment: the design direction is right. The semantic distinction between bypassTokenThreshold and force=true is clear, and the nextSpeakerChecker optimization is clean and regression-free. A few points deserve attention:


🟠 High priority: the hasFailedCompressionAttempt latch turns the safety net into a one-shot mechanism

Location: the interaction between the early gate around line 216 of chatCompressionService.ts and the new bypassTokenThreshold

When heap pressure triggers compression (bypassTokenThreshold=true) but that compression fails, hasFailedCompressionAttempt is set to true. On subsequent messages, even if heap pressure keeps rising, the service exits early at the front gate (NOOP) and never reaches the token-threshold gate:

```
// Around line 216: the hasFailedCompressionAttempt gate, unaffected by bypassTokenThreshold
if (threshold <= 0 || (hasFailedCompressionAttempt && !force)) {
  return { newHistory: null, info: { ... NOOP } };
}

// Around line 234: bypassTokenThreshold only bypasses this
if (!force && !bypassTokenThreshold) { ... }
```

Impact: the safety net is effectively one-shot: a single compression failure permanently disables it. If the heap keeps growing, the OOM still occurs.

Suggested fixes:

  • Fold bypassTokenThreshold into the line-216 gate as well: (hasFailedCompressionAttempt && !force && !bypassTokenThreshold)
  • Alternatively, do not set hasFailedCompressionAttempt = true when the compression attempt was heap-pressure triggered

🟡 Medium priority: isHeapPressureHigh() makes two native calls

getHeapStatistics() already provides used_heap_size (the same V8 metric as process.memoryUsage().heapUsed). A single call yields both numerator and denominator, avoiding a redundant synchronous native call on the hot path and a small TOCTOU inconsistency:

```typescript
private isHeapPressureHigh(): boolean {
  const stats = getHeapStatistics();
  const heapLimit = stats.heap_size_limit;
  if (!Number.isFinite(heapLimit) || heapLimit <= 0) return false;
  return stats.used_heap_size / heapLimit >= HEAP_PRESSURE_COMPRESSION_RATIO;
}
```

🟡 Medium priority: missing test for a key interaction

Current tests cover the happy path where bypassTokenThreshold=true bypasses the token gate, but there is no coverage for the hasFailedCompressionAttempt=true && bypassTokenThreshold=true scenario. Suggest adding a test for this edge case to confirm the behavior matches expectations.


🔵 Low priority: getLastHistoryEntry() still pays for structuredClone

This is a clear improvement over the previous full-history clone. In nextSpeakerChecker, however, only the role and whether the entry is a function response are inspected (purely read-only operations). If future profiling shows a bottleneck, a copy-free read-only variant could be provided. The current implementation is correct and safe; no urgency.
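Such a copy-free read-only variant could be sketched as follows. This is a hypothetical illustration: peekLastHistoryEntry and the History class are not part of the PR.

```typescript
interface Content {
  role: string;
  parts: Array<Record<string, unknown>>;
}

// Hypothetical copy-free alternative: expose a compile-time readonly view
// of the last entry for pure inspections (role / function-response checks),
// keeping the cloning getLastHistoryEntry() for callers that may mutate.
class History {
  constructor(private readonly entries: Content[]) {}

  peekLastHistoryEntry(): Readonly<Content> | undefined {
    // No structuredClone: the caller only gets a readonly reference.
    return this.entries[this.entries.length - 1];
  }
}
```

Note that Readonly<> protects only at compile time; callers that cast away readonly could still mutate shared history, which is why the cloning accessor remains the safer default.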


🤖 Generated with Qwen Code

