Skip to content

Add OOM pause mechanism to pause query threads before killing on critical heap#18163

Merged
yashmayya merged 6 commits intoapache:masterfrom
yashmayya:oom-pause-queries
Apr 17, 2026
Merged

Add OOM pause mechanism to pause query threads before killing on critical heap#18163
yashmayya merged 6 commits intoapache:masterfrom
yashmayya:oom-pause-queries

Conversation

@yashmayya
Copy link
Copy Markdown
Contributor

Summary

  • When heap transitions from below-critical to critical, pauses all query execution threads for a configurable timeout and calls System.gc() (only once, on first transition to critical), giving the JVM breathing room to reclaim memory before resorting to query kills.
  • If heap recovers below critical during the pause window, threads resume with no query killed.
  • If heap stays critical after timeout, the existing kill-most-expensive-query logic proceeds.
  • Off by default, dynamically toggleable via cluster config (no restart required).

Overhead

  • Happy path (no OOM pressure): waitIfPaused() is called at every sampling checkpoint and costs a single volatile read of _pauseActive. No lock, no contention.
  • Triggering path (critical heap): Query threads block on a ReentrantLock + Condition. Lock contention is not a concern because:
    • Threads acquire the lock only to enter Condition.await(), releasing it immediately - hold time is near-zero.
    • activatePause() (watcher thread) is lock-free - it writes two volatiles with correct ordering.
    • This only activates when the JVM is already under critical heap pressure and queries are about to be killed anyway.

Configuration

┌────────────────────────────────────────────────┬─────────┬─────────┬─────────────────────────────────────────────────────────┐
│                   Config key                   │  Type   │ Default │                       Description                       │
├────────────────────────────────────────────────┼─────────┼─────────┼─────────────────────────────────────────────────────────┤
│ accounting.oom.critical.query.pause.enabled    │ boolean │ false   │ Enable pausing query threads before killing on critical │
│                                                │         │         │  heap                                                   │
├────────────────────────────────────────────────┼─────────┼─────────┼─────────────────────────────────────────────────────────┤
│ accounting.oom.critical.query.pause.timeout.ms │ long    │ 1000    │ How long (ms) to pause before proceeding with kill      │
└────────────────────────────────────────────────┴─────────┴─────────┴─────────────────────────────────────────────────────────┘

Both are dynamically updatable via Helix cluster config.


State transitions

  • Below critical → Critical (enabled): Activate pause, call System.gc()
  • Critical → Critical (within timeout): No-op, let GC work
  • Critical → Critical (timeout expired): Clear pause, kill most expensive query
  • Critical → Below critical (pause active): Clear pause, resume threads, no kill
  • Any → Panic: Kill all queries first, then clear pause
  • Below critical → Critical (disabled): Existing behavior (immediate kill of most expensive query)

@yashmayya yashmayya requested a review from Jackie-Jiang April 10, 2026 18:53
@yashmayya yashmayya added feature New functionality oom-protection Related to out-of-memory protection mechanisms labels Apr 10, 2026
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 10, 2026

Codecov Report

❌ Patch coverage is 63.33333% with 33 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.49%. Comparing base (3cc1902) to head (0f63522).
⚠️ Report is 8 commits behind head on master.

Files with missing lines Patch % Lines
...pinot/core/accounting/QueryResourceAggregator.java 74.50% 9 Missing and 4 partials ⚠️
...org/apache/pinot/spi/query/QueryThreadContext.java 27.77% 11 Missing and 2 partials ⚠️
...ache/pinot/core/accounting/QueryMonitorConfig.java 76.47% 2 Missing and 2 partials ⚠️
...ore/accounting/ResourceUsageAccountantFactory.java 0.00% 3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18163      +/-   ##
============================================
+ Coverage     63.32%   63.49%   +0.16%     
  Complexity     1627     1627              
============================================
  Files          3238     3243       +5     
  Lines        197011   197248     +237     
  Branches      30466    30510      +44     
============================================
+ Hits         124762   125247     +485     
+ Misses        62244    61958     -286     
- Partials      10005    10043      +38     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.46% <63.33%> (+0.15%) ⬆️
java-21 63.39% <63.33%> (+8.13%) ⬆️
temurin 63.49% <63.33%> (+0.16%) ⬆️
unittests 63.49% <63.33%> (+0.16%) ⬆️
unittests1 55.39% <63.33%> (+0.11%) ⬆️
unittests2 35.04% <3.33%> (+0.09%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

@xiangfu0 xiangfu0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High-signal issue inline.

Comment thread pinot-spi/src/main/java/org/apache/pinot/spi/utils/CommonConstants.java Outdated
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread pinot-spi/src/main/java/org/apache/pinot/spi/utils/CommonConstants.java Outdated
// NOTE: In production code, threadContext should never be null. It might be null in tests when QueryThreadContext
// is not set up.
if (threadContext != null) {
threadContext.waitIfPausedInternal();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline. Let's still check termination first, then when entering the pause, check termination again before exiting the pause. Pause should be done after the sampling

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just realized that the second termination check should only be done if we're actually pausing and exiting the pause, and not on the regular fast path. Making this change now.

private final ReentrantLock _pauseLock = new ReentrantLock();
private final Condition _pauseCondition = _pauseLock.newCondition();
private volatile boolean _pauseActive = false;
private volatile long _pauseDeadlineMs = 0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) Doesn't need to be volatile as it is maintained as a local state for the watcher.
Consider adding some comments describing the contract for them:

_pauseActive is shared by both watcher and query threads
_pauseDeadlineMs is local state for the watcher (similar to _triggeringLevel)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes good point, this is left over from the previous implementation where the other threads were also checking the deadline to unpause themselves.

clearPause();
}
// else: still in grace period, skip kill this cycle
} else if (config.isOomPreQueryKillPauseEnabled()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only pauses when the previous cycle was below critical. If the config is enabled while the heap is already critical/panic, the watcher still takes the immediate-kill path until usage drops below critical and rises again, so the advertised dynamic toggle does not take effect during an active incident. Please make the pause arm based on the current config/state, not only on the transition edge.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this extreme edge case is worth introducing additional complexity for IMO, it's easier to reason about this way.

@yashmayya yashmayya merged commit 46cc680 into apache:master Apr 17, 2026
31 of 32 checks passed
@xiangfu0
Copy link
Copy Markdown
Contributor

Docs PR opened: pinot-contrib/pinot-docs#750

xiangfu0 added a commit to pinot-contrib/pinot-docs that referenced this pull request Apr 21, 2026
Adds documentation for the OOM pause mechanism feature from
apache/pinot#18163.

This new feature pauses query execution threads when heap transitions to
critical level, giving the JVM a chance to reclaim memory through
garbage collection before resorting to query kills.

- Explains the pause behavior (on critical heap transition, pause for
timeout, call gc, resume if recovered, else proceed with kill)
- Documents the two new cluster configs:
accounting.oom.critical.query.pause.enabled and
accounting.oom.critical.query.pause.timeout.ms
- Notes that it's off by default and dynamically configurable
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature New functionality oom-protection Related to out-of-memory protection mechanisms

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants