Commit e2e869d
Native allocator follow-up: dynamic settings, query/datafusion pools, plugin nodeStats — review-feedback edition
Builds on #21703 to add the query pool, the DataFusion memory-pool setting,
dynamic tuning, stats, and pool wiring. This commit is the consolidated version
after addressing review feedback on PR #21732.
## Architecture: three native-memory trackers, three knobs
Each tracker owns the bytes it can actually see, with a process-level cap above all
of them. On a 64 GB / 16 GB-heap node the defaults compose to:
RAM                                                        64 GB
  JVM heap (operator-configured -Xmx)                      16 GB
  Off-heap (RAM - heap)                                    48 GB
    node.native_memory.limit (79% of off-heap)             37.92 GB  ← AC throttle threshold
      native.allocator.root.limit (20% of NM)               7.58 GB  ← Arrow framework cap
        pool.flight.max (5% of NM)                          1.90 GB  ← Flight transport pool
        pool.ingest.max (8% of NM)                          3.03 GB  ← Parquet VSR pool
        pool.query.max (5% of NM)                           1.90 GB  ← analytics-engine pool
      datafusion.memory_pool_limit_bytes (75% of NM)       28.44 GB  ← Rust runtime pool (sibling to Arrow)
    Unmanaged                                              10.08 GB  ← Lucene mmap, OS page cache, etc.
Independent (disk budget):
  datafusion.spill_memory_limit_bytes (50% of physical RAM) 32 GB
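The arithmetic behind these defaults can be reproduced with a few lines of integer math. The sketch below is illustrative only: the class, method, and map layout are hypothetical and are not taken from the commit; only the percentages and setting names come from the breakdown above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: recomputes the default budget breakdown from RAM and heap size. */
final class NativeMemoryBudgetSketch {
    static final long GB = 1L << 30;

    /** Returns pct percent of base, rounded down; throws ArithmeticException on overflow. */
    static long percentOf(long base, int pct) {
        return Math.multiplyExact(base, (long) pct) / 100;
    }

    /** Percentages are the ones quoted in this commit message; everything else is hypothetical. */
    static Map<String, Long> defaults(long ramBytes, long heapBytes) {
        long offHeap = ramBytes - heapBytes;
        long nm = percentOf(offHeap, 79); // node.native_memory.limit
        Map<String, Long> out = new LinkedHashMap<>();
        out.put("node.native_memory.limit", nm);
        out.put("native.allocator.root.limit", percentOf(nm, 20));
        out.put("parquet.native.pool.flight.max", percentOf(nm, 5));
        out.put("parquet.native.pool.ingest.max", percentOf(nm, 8));
        out.put("parquet.native.pool.query.max", percentOf(nm, 5));
        out.put("datafusion.memory_pool_limit_bytes", percentOf(nm, 75));
        out.put("unmanaged", offHeap - nm);
        // Disk-staging budget: derived from physical RAM, not from the native-memory limit.
        out.put("datafusion.spill_memory_limit_bytes", percentOf(ramBytes, 50));
        return out;
    }

    public static void main(String[] args) {
        // The 64 GB RAM / 16 GB heap node used as the running example.
        defaults(64 * GB, 16 * GB).forEach(
            (name, bytes) -> System.out.printf("%-40s %6.2f GB%n", name, bytes / (double) GB));
    }
}
```

Because the three pool maxes sum to 18% of the native-memory limit, they always fit under the 20% root cap regardless of node size.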
## What this commit does
### 1. Dynamic settings + cluster-update consumers
All native-allocator settings are Setting.Property.Dynamic. Cluster-settings update
consumers in ArrowBasePlugin#registerSettingsUpdateConsumers wire PUTs to live
ArrowNativeAllocator state. The grouped validator (validateUpdate) rejects
cross-setting violations (sum of pool mins > root, per-pool min > max) at PUT time
with HTTP 400.
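A minimal sketch of the two cross-setting checks named above, assuming a hypothetical PoolBounds holder and validate method (the real validateUpdate signature is not shown in this commit message). It also assumes the rejection surfaces as an IllegalArgumentException that the REST layer maps to HTTP 400.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: grouped validation of pool min/max against the root limit. */
final class PoolLimitValidatorSketch {

    /** Hypothetical holder for one pool's min and max, in bytes. */
    static final class PoolBounds {
        final long min;
        final long max;

        PoolBounds(long min, long max) {
            this.min = min;
            this.max = max;
        }
    }

    /** Throws IllegalArgumentException if any pool has min > max or the mins do not fit under root. */
    static void validate(long rootLimit, Map<String, PoolBounds> pools) {
        long sumOfMins = 0L;
        for (Map.Entry<String, PoolBounds> e : pools.entrySet()) {
            PoolBounds b = e.getValue();
            if (b.min > b.max) {
                throw new IllegalArgumentException(
                    "pool [" + e.getKey() + "] min " + b.min + " exceeds max " + b.max);
            }
            sumOfMins = Math.addExact(sumOfMins, b.min);
        }
        if (sumOfMins > rootLimit) {
            throw new IllegalArgumentException(
                "sum of pool mins " + sumOfMins + " exceeds root limit " + rootLimit);
        }
    }

    public static void main(String[] args) {
        Map<String, PoolBounds> pools = new LinkedHashMap<>();
        pools.put("flight", new PoolBounds(1, 4));
        pools.put("query", new PoolBounds(2, 4));
        validate(16, pools); // accepted: each min <= max and 1 + 2 <= 16
        try {
            validate(2, pools); // rejected: 1 + 2 > 2
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Validating the whole group at once matters because each setting can be individually valid while the combination is not.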
### 2. Query and DataFusion pools
POOL_QUERY added for analytics-engine per-query allocators (children of
nativeAllocator.getPoolAllocator(POOL_QUERY)). DataFusion gets its own setting
(datafusion.memory_pool_limit_bytes) since DataFusion's working memory lives
entirely on the Rust side and is reported only through DataFusion's MemoryPool
— a Java-side pool that pretended to track it would either need a per-allocation
FFM round-trip (performance disaster) or be a config-only mirror (HTTP 200 with
no observable effect).
POOL_DATAFUSION is intentionally absent; the Rust-side
datafusion.memory_pool_limit_bytes is the honest knob for that layer.
### 3. Plugin#nodeStats SPI
New SPI method Plugin#nodeStats() returning List<PluginNodeStats>. NodeStats
serializes plugin-contributed payloads via NamedWriteable, surfacing them under
_nodes/stats. Annotated @ExperimentalApi to match PluginNodeStats.
### 4. Pool wiring through Guice (no static singleton)
ArrowNativeAllocator is constructed in ArrowBasePlugin#createComponents and
returned to Node.java which auto-binds it via Guice. Consumers
(AnalyticsSearchService, FlightTransport, DefaultPlanExecutor) receive it through
constructor @Inject — no static instance() / INSTANCE / ensureForTesting.
### 5. Drop NativeAllocatorListener SPI [@bowenlan-amzn #3269655712, #3269660170,
@Bukhtawar #3267186318]
The original design used NativeAllocatorListener to push pool-limit changes into
consumer-side child allocators that captured the pool's limit at construction
time. With children created at Long.MAX_VALUE, Arrow's Accountant.allocate (lines
191-203 of Accountant.java in arrow-java v18.3.0) walks the parent chain on every
allocation and checks the parent's allocationLimit — so dynamic resizes of
parquet.native.pool.{flight,query}.max take effect immediately on subsequent
allocations through any descendant. The listener was emulating this Arrow-native
behavior and is no longer needed.
- AnalyticsSearchService: drop poolListener; child uses Long.MAX_VALUE
- DefaultPlanExecutor: drop poolListener; coordinator uses Long.MAX_VALUE
- FlightTransport: drop intermediate flightAllocator + poolListener; pool is
the parent, server/client are Long.MAX_VALUE children
- libs/arrow-spi/.../NativeAllocatorListener.java: deleted
- NativeAllocator interface: addListener / removeListener removed
- ArrowNativeAllocator: listeners field, fireListeners method, and listener
calls in setPoolLimit/setPoolMin/rebalance removed
The rebalancer continues to function as before: it changes pool limits via
setLimit, which children read via Arrow's parent-cap check at allocateBytes.
Rebalancer is off by default.
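The reason the listener is redundant can be seen in a toy model of hierarchical accounting. The class below is not Arrow's Accountant and does not reproduce its code; it is a simplified, single-threaded stand-in with hypothetical names that only demonstrates the property relied on here: a child created with a Long.MAX_VALUE limit is still bounded by whatever limit its parent has at allocation time.

```java
/** Toy, single-threaded model of parent-chain accounting; not Arrow's implementation. */
final class ToyAccountant {
    private final String name;
    private final ToyAccountant parent;
    private long limit;
    private long allocated;

    ToyAccountant(String name, ToyAccountant parent, long limit) {
        this.name = name;
        this.parent = parent;
        this.limit = limit;
    }

    /** Models a dynamic resize of a pool's max. */
    void setLimit(long newLimit) {
        this.limit = newLimit;
    }

    long getAllocated() {
        return allocated;
    }

    /** Reserves bytes in this node and every ancestor, or throws without changing any counter. */
    void allocate(long bytes) {
        // First pass: check each level against its current limit.
        for (ToyAccountant a = this; a != null; a = a.parent) {
            if (bytes > a.limit - a.allocated) {
                throw new IllegalStateException("allocation of " + bytes + " exceeds limit of " + a.name);
            }
        }
        // Second pass: commit the reservation at every level.
        for (ToyAccountant a = this; a != null; a = a.parent) {
            a.allocated += bytes;
        }
    }

    /** Returns bytes to this node and every ancestor. */
    void release(long bytes) {
        for (ToyAccountant a = this; a != null; a = a.parent) {
            a.allocated -= bytes;
        }
    }

    public static void main(String[] args) {
        long mib = 1L << 20;
        ToyAccountant root = new ToyAccountant("root", null, 16 * mib);
        ToyAccountant pool = new ToyAccountant("POOL_QUERY", root, 16 * mib);
        ToyAccountant child = new ToyAccountant("frag-0", pool, Long.MAX_VALUE);

        child.allocate(2 * mib); // fits under the pool's 16 MiB
        child.release(2 * mib);
        pool.setLimit(1 * mib);  // resize the pool; the child is not notified
        try {
            child.allocate(2 * mib);
        } catch (IllegalStateException e) {
            System.out.println("rejected after resize: " + e.getMessage());
        }
    }
}
```

Since the limit is read from the parent on each allocation, there is no copy of it in the child that could go stale, which is what the listener existed to refresh.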
### 6. Fix Guice duplicate-binding regression on ArrowNativeAllocator
Removed the redundant ArrowBasePlugin#createGuiceModules override. Node.java
line 1748 already binds every component returned from createComponents:
pluginComponents.stream()
.forEach(p -> b.bind((Class) p.getClass()).toInstance(p));
ArrowBasePlugin#createComponents already returns the allocator, so the explicit
bind(ArrowNativeAllocator.class).toInstance(allocator) in createGuiceModules
registered the same binding a second time and caused Guice CreationException at
cluster startup. This was the root cause of the gradle-check Jenkins CI failures
on commits 9188043, 498b502, d646dd4, e1f3c5b, 041dc0e — every integration test
failed during MockNode construction.
### 7. Memory defaults aligned to the partitioning model
- node.native_memory.limit defaults to 79% × (RAM - heap) via OsProbe (cgroup-
aware: container memory limit on K8s/Docker, total physical RAM on bare
metal). Falls back to ZERO if probe fails or heap >= RAM, preserving the
pre-default opt-in semantics.
- native.allocator.root.limit defaults to 20% × node.native_memory.limit.
- parquet.native.pool.{flight,ingest,query}.max default to 5%/8%/5% of
node.native_memory.limit (sum 18% < root's 20%, leaving 2 pp headroom inside
the framework cap).
- datafusion.memory_pool_limit_bytes defaults to 75% × node.native_memory.limit
(called out by @bharath-techie #3271093086: prior default
Runtime.maxMemory() / 4 was JVM-heap derived, wrong baseline for an off-heap
runtime).
- datafusion.spill_memory_limit_bytes defaults to 50% × physical RAM,
independent of node.native_memory.limit (spill is a disk-staging budget,
not a memory budget).
Behavior change to call out in upgrade notes: admission control is now active
by default. Operators who want the previous behavior, where admission control
stays off until a limit is configured, can set node.native_memory.limit: 0b;
all framework caps then fall back to Long.MAX_VALUE (unbounded mode).
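The fallback rules above can be summarized in a short sketch. The class and method names are hypothetical, and a failed OsProbe read is modeled here as a non-positive RAM value, which is an assumption rather than something stated in the commit.

```java
/** Illustrative only: default derivation with the zero / unbounded fallbacks described above. */
final class NativeMemoryFallbackSketch {

    /** 79% of (RAM - heap), or 0 when the probe failed (modeled as ram <= 0) or heap >= RAM. */
    static long defaultNativeMemoryLimit(long ramBytes, long heapBytes) {
        if (ramBytes <= 0 || heapBytes >= ramBytes) {
            return 0L; // limit disabled, matching the earlier opt-in semantics
        }
        return Math.multiplyExact(ramBytes - heapBytes, 79L) / 100;
    }

    /** A framework cap as pct percent of the limit, or Long.MAX_VALUE when the limit is 0. */
    static long frameworkCap(long nativeMemoryLimit, int pct) {
        if (nativeMemoryLimit == 0L) {
            return Long.MAX_VALUE; // unbounded mode
        }
        return Math.multiplyExact(nativeMemoryLimit, (long) pct) / 100;
    }

    public static void main(String[] args) {
        long gb = 1L << 30;
        long limit = defaultNativeMemoryLimit(64 * gb, 16 * gb);
        System.out.println("root cap bytes: " + frameworkCap(limit, 20));
        System.out.println("root cap with limit 0: " + frameworkCap(0L, 20));
    }
}
```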
### 8. PLUGIN_STATS metric is opt-in
Plugin-contributed nodeStats are gated behind a new NodesStatsRequest.Metric
(plugin_stats) so that operators who do not opt in don't pay the serialization
cost on every _nodes/stats poll. For cluster-stats the metric defaults to off
(false).
### 9. @ExperimentalApi on Plugin#nodeStats() [@bowenlan-amzn #3269563833]
Annotate the new SPI hook to match the @ExperimentalApi annotation already
present on PluginNodeStats.
### 10. FQN -> import nit in ParquetIndexingEngine [@bowenlan-amzn #3269589758]
Replace 3 fully-qualified org.opensearch.arrow.allocator.ArrowNativeAllocator
references in constructor parameters with a single import.
### 11. Integration tests for cap-enforcement boundaries
Added NativeAllocatorBoundaryIT under
plugins/arrow-flight-rpc/src/internalClusterTest. Three tests boot a single-node
cluster with tight memory settings (root=16 MiB, pool maxes=16 MiB) and exercise
real Arrow allocations via the Guice-injected ArrowNativeAllocator:
- testPoolMaxRejectsAllocationsBeyondCap: PUT pool.query.max=4 MiB, verify an
8 MiB request through the pool throws OutOfMemoryException.
- testRootLimitRejectsAllocationsBeyondCap: hold 8 MiB in FLIGHT, verify a
16 MiB QUERY request fails at the root level (8+16 > 16 root cap) even
though QUERY's own max would individually allow it.
- testDynamicPoolResizeAffectsInFlightAllocations: create child at
Long.MAX_VALUE (the AnalyticsSearchService / DefaultPlanExecutor pattern
post-listener-removal), verify a 2 MiB allocation succeeds, PUT
query.max=1 MiB via cluster settings, verify a 2 MiB request now throws
OOM via Arrow's parent-cap check at allocateBytes.
These cover the boundaries the framework is supposed to enforce. The third test
specifically verifies the listener-replacement contract: dynamic pool resize
takes effect on in-flight Long.MAX_VALUE children without any listener
machinery, exactly as Arrow's parent-cap check provides natively.
## How a query flows through these layers
Take a concrete example: a user issues a PPL query that goes through
analytics-engine and dispatches to DataFusion.
1. Coordinator receives the request.
→ Admission control checks node.native_memory.limit budget.
→ If OK, proceeds; otherwise rejects with 429.
2. AnalyticsSearchService creates a per-fragment Arrow allocator:
allocator = getPoolAllocator(POOL_QUERY).newChildAllocator("frag-N", 0, Long.MAX_VALUE)
→ Future BufferAllocator.buffer(...) calls on this allocator increment
POOL_QUERY's counter and the root counter via Arrow's parent-cap chain.
→ Bounded by parquet.native.pool.query.max.
3. AnalyticsSearchService dispatches to DataFusion via NativeBridge.
→ NativeBridge.executeQueryAsync(...) marshals plan bytes via FFM.
4. DataFusion runs the query in Rust:
→ HashAggregate builds a hash table.
→ reservation.try_grow(50MB) → DataFusion MemoryPool counter += 50MB.
→ Bounded by datafusion.memory_pool_limit_bytes.
→ If exceeded, HashAggregate spills to disk (bounded by
datafusion.spill_memory_limit_bytes).
5. DataFusion produces a result batch in Rust.
→ Java imports it via Arrow C Data Interface.
→ The import allocates Java ArrowBufs under the per-fragment allocator
from step 2. POOL_QUERY counter += result_size.
6. Result returns to coordinator.
→ Per-fragment allocator closes; POOL_QUERY counter decrements.
→ DataFusion query completes; MemoryPool counter decrements.
Each layer accounts for what it owns. No double-counting. Each operator-tunable
knob bounds a real, observable thing.
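The accounting in steps 4 to 6 can be sketched with two independent counters. This is a simplified model with hypothetical names, not the engine's code: one Tracker stands in for POOL_QUERY and one for DataFusion's MemoryPool, and the byte sizes are made up for illustration.

```java
/** Illustrative only: two independent trackers, each counting only the bytes it owns. */
final class QueryFlowAccountingSketch {

    /** Hypothetical bounded counter standing in for one tracker. */
    static final class Tracker {
        final String name;
        final long limit;
        long used;

        Tracker(String name, long limit) {
            this.name = name;
            this.limit = limit;
        }

        /** Reserves bytes if they fit under the limit; returns false (no change) otherwise. */
        boolean tryGrow(long bytes) {
            if (bytes > limit - used) {
                return false;
            }
            used += bytes;
            return true;
        }

        void shrink(long bytes) {
            used -= bytes;
        }
    }

    public static void main(String[] args) {
        long mb = 1L << 20;
        Tracker poolQuery = new Tracker("POOL_QUERY", 1900 * mb);
        Tracker dataFusion = new Tracker("DataFusion MemoryPool", 28000 * mb);

        // Step 4: the Rust-side hash table is charged to DataFusion's pool only.
        boolean grew = dataFusion.tryGrow(50 * mb);
        // Step 5: the imported result batch is charged to the Java-side query pool only.
        boolean imported = poolQuery.tryGrow(8 * mb);
        System.out.println(dataFusion.name + " used=" + dataFusion.used + " ok=" + grew);
        System.out.println(poolQuery.name + " used=" + poolQuery.used + " ok=" + imported);

        // Step 6: both sides release what they reserved.
        poolQuery.shrink(8 * mb);
        dataFusion.shrink(50 * mb);
    }
}
```

In this model a failed tryGrow on the DataFusion side corresponds to the point where the operator would spill to disk rather than fail.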
## Verification
./gradlew -Dsandbox.enabled=true \
:plugins:arrow-base:test :plugins:arrow-flight-rpc:test \
:sandbox:plugins:{analytics-engine,analytics-backend-datafusion,parquet-data-format}:test \
:server:test --tests "org.opensearch.node.resource.tracker.*" \
:plugins:arrow-flight-rpc:internalClusterTest --tests "*NativeAllocatorBoundaryIT*"
→ BUILD SUCCESSFUL
By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license.
Signed-off-by: Gaurav Singh <snghsvn@amazon.com>
1 parent 40688ad commit e2e869d
26 files changed
Lines changed: 726 additions & 376 deletions
File tree
- libs/arrow-spi/src/main/java/org/opensearch/arrow/spi
- plugins
- arrow-base/src
- main/java/org/opensearch/arrow/allocator
- test/java/org/opensearch/arrow/allocator
- arrow-flight-rpc/src
- internalClusterTest/java/org/opensearch/arrow/flight
- main/java/org/opensearch/arrow/flight/transport
- sandbox/plugins
- analytics-backend-datafusion/src
- main/java/org/opensearch/be/datafusion
- nativelib
- test/java/org/opensearch/be/datafusion
- analytics-engine/src/main/java/org/opensearch/analytics/exec
- parquet-data-format/src/main/java/org/opensearch/parquet
- engine
- server/src
- main/java/org/opensearch
- action/admin/cluster
- node/stats
- stats
- node
- resource/tracker
- plugins
- test/java/org/opensearch
- action/admin/cluster/node/stats
- node/resource/tracker
- test/framework/src/main/java/org/opensearch/test
Lines changed: 3 additions & 19 deletions
Lines changed: 0 additions & 39 deletions
This file was deleted.
Lines changed: 106 additions & 43 deletions