Native allocator: dynamic settings, query/datafusion pools, plugin nodeStats#21732
❌ Gradle check result for 9188043: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
PR Code Analyzer ❗AI-powered 'Code-Diff-Analyzer' found issues on commit d646dd4.
Bukhtawar
left a comment
We might need to reject the request and throw OpenSearchRejectedException rather than OOM-ing. Also, let's see how we can wire up a circuit breaker for this. Circuit-breaker stats may be something we can leverage for tracking memory used.
|
This PR seems to be a good place to get some alignment on memory assignment across the different components in the system. I think the Java Arrow memory is mostly used for intermediate data transfer, while the Rust memory is used for query execution. That's one reason to give more to the Rust side.
bowenlan-amzn
left a comment
Overall looks good! Thanks for refactoring the allocator-related changes in arrow-base.
Left some comments that should be easy to accommodate. Approving now.
DefaultPlanExecutor created its coordinator allocator from POOL_QUERY in its ctor, but as a Guice-bound singleton it has no close hook. On node shutdown ArrowBasePlugin closes the root allocator chain, which walks descendants and finds POOL_QUERY → coordinator still attached. The pool allocator's close-with-outstanding-children check then throws:

java.lang.IllegalStateException: Allocator[ROOT] closed with outstanding child allocators.
Allocator(ROOT) → Allocator(query) → Allocator(coordinator)

This was visible as 14 failures across CoordinatorTransportStressIT (5), MemoryGuardIT (8), and WindowSqlIT.classMethod (1), all in :sandbox:qa:analytics-engine-coordinator:internalClusterTest. The CoordinatorTransportStressIT subset reproduced on pre-merge HEAD too, confirming the leak pre-dates the latest origin/main merge.

Fix: introduce a tiny CoordinatorAllocatorHandle (AutoCloseable) and move the allocator's creation into AnalyticsPlugin.createComponents. The plugin now owns the allocator's lifetime and closes it from Plugin.close(), the same shape AnalyticsSearchService already follows for its own POOL_QUERY child. DefaultPlanExecutor consumes the handle via Guice injection and stops worrying about lifecycle.

Plugins are closed in reverse iteration order in Node.close() (server/src/main/.../Node.java:2228), so AnalyticsPlugin.close() runs before ArrowBasePlugin.close(): the coordinator child is released before the pool/root teardown begins.

Verified: :sandbox:qa:analytics-engine-coordinator:internalClusterTest goes from 14 failures → 0 failures (82 tests, 61 skipped due to pre-existing @AwaitsFix/@Ignore).

Signed-off-by: Gaurav Singh <snghsvn@amazon.com>
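The shutdown-ordering problem in this commit message can be illustrated with a small self-contained model. This is only a sketch: ToyAllocator and Handle are hypothetical stand-ins written for this example (they are not Arrow's BufferAllocator or the PR's CoordinatorAllocatorHandle), and the only behavior modeled is "a parent refuses to close while a child is still attached" plus "owners are closed in reverse creation order".

```java
import java.util.ArrayList;
import java.util.List;

final class AllocatorLifecycleSketch {

    /** Toy stand-in for an allocator: closing with live children is an error. */
    static final class ToyAllocator implements AutoCloseable {
        private final String name;
        private final ToyAllocator parent;
        private final List<ToyAllocator> children = new ArrayList<>();

        ToyAllocator(String name, ToyAllocator parent) {
            this.name = name;
            this.parent = parent;
            if (parent != null) {
                parent.children.add(this);
            }
        }

        ToyAllocator newChild(String childName) {
            return new ToyAllocator(childName, this);
        }

        @Override
        public void close() {
            if (!children.isEmpty()) {
                throw new IllegalStateException(
                    "Allocator[" + name + "] closed with outstanding child allocators");
            }
            if (parent != null) {
                parent.children.remove(this);
            }
        }
    }

    /** Hypothetical handle: gives the child allocator an explicit owner and close hook. */
    static final class Handle implements AutoCloseable {
        private final ToyAllocator allocator;

        Handle(ToyAllocator queryPool) {
            this.allocator = queryPool.newChild("coordinator");
        }

        @Override
        public void close() {
            allocator.close();
        }
    }

    /** Closes resources last-created-first, mirroring the reverse plugin close order. */
    static void closeInReverse(List<AutoCloseable> resources) throws Exception {
        for (int i = resources.size() - 1; i >= 0; i--) {
            resources.get(i).close();
        }
    }

    /** Returns true if shutdown completes without the outstanding-children error. */
    static boolean shutdownSucceeds(boolean ownerClosesCoordinator) {
        ToyAllocator root = new ToyAllocator("ROOT", null);
        ToyAllocator query = root.newChild("query");
        Handle coordinator = new Handle(query);

        List<AutoCloseable> closeOrder = new ArrayList<>();
        closeOrder.add(root);   // created first, closed last
        closeOrder.add(query);
        if (ownerClosesCoordinator) {
            closeOrder.add(coordinator); // created last, closed first
        }
        try {
            closeInReverse(closeOrder);
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("with owned handle:    " + shutdownSucceeds(true));
        System.out.println("without a close hook: " + shutdownSucceeds(false));
    }
}
```

When the coordinator child has an owner that is closed before the pool, teardown completes; when nothing closes it, the pool's close check fails, which is the shape of the failure reported above.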
Closing and reopening to trigger the build again. |
VSRRotationBenchmark references ArrowNativeAllocator (the new unified
allocator API), but the benchmarks subproject only declared a transitive
api dep on parquet-data-format. parquet-data-format declares arrow-base
as compileOnly (not exported), so benchmarks couldn't see it at compile.
CI failure:
VSRRotationBenchmark.java:82: error: package org.opensearch.arrow.allocator does not exist
private org.opensearch.arrow.allocator.ArrowNativeAllocator nativeAllocator;
Mirror parquet-data-format's pattern: declare arrow-base as compileOnly.
Runtime continues to work via the plugin's classloader.
Signed-off-by: Gaurav Singh <snghsvn@amazon.com>
Follow up PR changes.
|
Description
Builds on #21703 to add the query pool, DataFusion integration with the allocator, dynamic tuning, stats, and pool wiring.
Architecture: three native-memory trackers, three knobs
Each tracker owns the bytes it can actually see, with a process-level cap above all of them.
Three layers, three responsibilities. Each is necessary; none can be replaced by another.
- Arrow allocator (ArrowNativeAllocator) accounts for every Arrow BufferAllocator allocation in the JVM. It is partitioned into FLIGHT / INGEST / QUERY pools so one plugin can't starve another. All Arrow buffers descend from the same RootAllocator, preserving Arrow's same-root invariant for cross-plugin zero-copy handoff.
- DataFusion memory pool (MemoryPool) accounts for DataFusion's own working memory and triggers spill / fail-fast when a query exceeds budget. It lives entirely on the Rust side; updates flow in via FFM through NativeBridge.setMemoryPoolLimit.
- Admission control (AC): node.native_memory.limit is the operator-declared off-heap budget AC throttles against; framework-derived defaults (root, pools, DataFusion pool) all scale from this single number.

POOL_DATAFUSION is intentionally not present. DataFusion's working memory is allocated by Rust operators directly and reported only to DataFusion's own MemoryPool; it never flows through Arrow's BufferAllocator API. Adding a Java-side pool that pretended to track it would have required either a per-allocation FFM round-trip (a performance disaster) or a config-only mirror (a setting that returns HTTP 200 and silently does nothing). The Rust-side datafusion.memory_pool_limit_bytes is the honest knob for that layer.

Worked example: 64 GB / 16 GB-heap node
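With operator-declared -Xmx16g on a 64 GB host (bare metal or cgroup-limited container), defaults compose to:

- node.native_memory.limit: 79% × (64 GB − 16 GB) = 37.92 GB
- root.limit: 20% of node.native_memory.limit = 7.58 GB
- FLIGHT pool max: 5% of node.native_memory.limit = 1.90 GB
- INGEST pool max: 8% of node.native_memory.limit = 3.03 GB
- QUERY pool max: 5% of node.native_memory.limit = 1.90 GB
- datafusion.memory_pool_limit_bytes: 75% of node.native_memory.limit = 28.44 GB

Independent budget (disk staging, not memory):

- datafusion.spill_memory_limit_bytes: 50% of physical RAM = 32 GB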
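The percentages stated in this description (79% of RAM minus heap for node.native_memory.limit, then 20% for the root, 5%/8%/5% for the FLIGHT/INGEST/QUERY pool maxes, 75% for the DataFusion pool, and 50% of physical RAM for spill) can be checked with a few lines of arithmetic. The sketch below is a hypothetical helper written for this example, not code from the PR; it assumes binary gigabytes and integer rounding down, which may differ from the real implementation.

```java
import java.util.Locale;

final class NativeMemoryDefaults {
    static final long GB = 1L << 30;

    /** node.native_memory.limit default: 79% of (physical RAM - JVM heap). */
    static long nativeMemoryLimit(long ramBytes, long heapBytes) {
        return percentOf(ramBytes - heapBytes, 79);
    }

    /** Integer percentage of a byte count, rounded down (fine for realistic node sizes). */
    static long percentOf(long bytes, int percent) {
        return bytes * percent / 100;
    }

    /** Formats a byte count as gigabytes with two decimals, e.g. "7.58GB". */
    static String gb(long bytes) {
        return String.format(Locale.ROOT, "%.2fGB", bytes / (double) GB);
    }

    public static void main(String[] args) {
        long ram = 64 * GB;
        long heap = 16 * GB;
        long nm = nativeMemoryLimit(ram, heap);

        System.out.println("node.native_memory.limit           = " + gb(nm));
        System.out.println("root limit (20%)                   = " + gb(percentOf(nm, 20)));
        System.out.println("flight pool max (5%)               = " + gb(percentOf(nm, 5)));
        System.out.println("ingest pool max (8%)               = " + gb(percentOf(nm, 8)));
        System.out.println("query pool max (5%)                = " + gb(percentOf(nm, 5)));
        System.out.println("datafusion.memory_pool_limit_bytes = " + gb(percentOf(nm, 75)));
        System.out.println("datafusion.spill_memory_limit_bytes= " + gb(percentOf(ram, 50)));
    }
}
```

For the 64 GB / 16 GB-heap node, these values line up with the limits shown in the node stats example under "Operator surface" (7.58GB root, 1.90GB / 3.03GB / 1.90GB pools, 28.44GB DataFusion pool, 32GB spill).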
How a query flows through these layers
Take a concrete example: a user issues a PPL query that goes through analytics-engine and dispatches to DataFusion.
Each layer accounts for what it owns. No double-counting (the result bytes are counted in DataFusion's pool only while they exist as Rust buffers; after Java imports them, ownership transfers and the Java side counts them; when Java closes the import, the bytes are freed). Each operator-tunable knob bounds a real, observable thing.
Operator surface
After this PR, an operator inspecting a node running the query above sees:
{
  "native_allocator": {
    "root": {"allocated": "150MB", "limit": "7.58GB"},
    "pools": {
      "flight": {"allocated": "20MB", "limit": "1.90GB"},
      "ingest": {"allocated": "30MB", "limit": "3.03GB"},
      "query": {"allocated": "100MB", "limit": "1.90GB"}
    }
  },
  "datafusion": {
    "memory_pool": {"usage": "2.4GB", "limit": "28.44GB"},
    "spill": {"usage": "0", "limit": "32GB"}
  }
}

Three numbers, three sources, three knobs. The operator looks at each in isolation when tuning that layer:

- datafusion.memory_pool_limit_bytes.
- parquet.native.pool.flight.max or lower parquet.native.pool.ingest.max.
- node.native_memory.limit.

All pool min/max settings are Setting.Property.Dynamic. Pool limit changes propagate to consumer-side child allocators automatically via Arrow's parent-cap check at allocateBytes: child allocators are created with a Long.MAX_VALUE limit and inherit the live parent cap on every allocation, so dynamic resizes reach in-flight workloads without a restart and without an explicit notification SPI. The grouped validator rejects cross-setting violations (sum of pool mins > root, per-pool min > max) at PUT time with HTTP 400 rather than at the next allocation.

Behavior change to call out
Admission control is now active by default.
node.native_memory.limit defaults to 79% × (RAM − JVM heap) instead of 0 (unconfigured). On upgrade, AC will start tracking native memory utilization, and the framework's pool caps will derive sensible values from the operator's declared off-heap budget without any explicit configuration.

Operators who want the pre-existing opt-out behavior (AC unconfigured, all framework caps unbounded) can set:

This restores the prior "explicit-opt-in" semantics, which is useful for nodes where Lucene mmap or non-Arrow native consumers dominate and the operator does not want admission control throttling search/index traffic.
Changes after initial review
- Removed the NativeAllocatorListener SPI. The SPI was emulating Arrow's native parent-cap check at allocateBytes. With consumer-side child allocators created at Long.MAX_VALUE, dynamic pool resizes propagate automatically through Arrow's Accountant.allocate parent walk, so no listener machinery is needed. Resolves @bowenlan-amzn (#21732 (comment), #21732 (comment)) and @Bukhtawar (#21732 (comment)).
- Removed the ArrowBasePlugin#createGuiceModules override; Node.java already auto-binds every component returned from createComponents. This was the root cause of the gradle-check Jenkins CI failures across multiple SHAs.
- Added @ExperimentalApi on Plugin#nodeStats() to match the annotation already on PluginNodeStats. Resolves @bowenlan-amzn (#21732 (comment)).
- ParquetIndexingEngine (3 constructor parameters). Resolves @bowenlan-amzn (#21732 (comment)).
- node.native_memory.limit (NM) derives from RAM − heap (79%); root.limit is 20% of NM (Arrow gets a small fraction; DataFusion gets the larger 75% as a sibling); pool maxes anchor to NM at 5%/8%/5%; datafusion.memory_pool_limit_bytes is 75% of NM (replaces the prior JVM-heap-derived default flagged by @bharath-techie at #21732 (comment)); datafusion.spill_memory_limit_bytes is 50% of physical RAM (independent disk-staging budget).
- Integration tests in NativeAllocatorBoundaryIT (plugins/arrow-flight-rpc/src/internalClusterTest) boot a real cluster with tight memory settings and verify cap enforcement end-to-end through actual Arrow allocations: per-pool max rejection, root-level rejection across pools, and dynamic pool-resize propagation to in-flight Long.MAX_VALUE children.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
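The parent-cap propagation that this description relies on (children created at Long.MAX_VALUE, with every allocation re-checked against the live pool limit) can be modeled in a few lines. This is a simplified, single-threaded sketch: ToyAccountant is a hypothetical class written for this example and is not Arrow's Accountant, which handles concurrency and rollback differently.

```java
final class ParentCapSketch {

    /** Toy accounting node: a reservation must fit here and in every ancestor. */
    static final class ToyAccountant {
        private final ToyAccountant parent;
        private long limit;
        private long allocated;

        ToyAccountant(ToyAccountant parent, long limit) {
            this.parent = parent;
            this.limit = limit;
        }

        /** Models a dynamic settings update on a pool. */
        void setLimit(long newLimit) {
            this.limit = newLimit;
        }

        long allocated() {
            return allocated;
        }

        /**
         * Reserves bytes at this level and at all ancestors.
         * Nothing is committed at this level unless every ancestor accepted.
         */
        boolean allocate(long bytes) {
            // Written as a subtraction so a Long.MAX_VALUE limit cannot overflow.
            if (bytes > limit - allocated) {
                return false;
            }
            if (parent != null && !parent.allocate(bytes)) {
                return false;
            }
            allocated += bytes;
            return true;
        }

        /** Returns bytes at this level and at all ancestors. */
        void release(long bytes) {
            allocated -= bytes;
            if (parent != null) {
                parent.release(bytes);
            }
        }
    }

    public static void main(String[] args) {
        ToyAccountant root = new ToyAccountant(null, 1000);
        ToyAccountant queryPool = new ToyAccountant(root, 100);
        // Consumer-side child: effectively unbounded, so the pool cap is what binds.
        ToyAccountant child = new ToyAccountant(queryPool, Long.MAX_VALUE);

        System.out.println("allocate 80 under cap 100: " + child.allocate(80));
        System.out.println("allocate 40 under cap 100: " + child.allocate(40));

        queryPool.setLimit(200); // dynamic resize; the child is never notified
        System.out.println("allocate 40 under cap 200: " + child.allocate(40));

        queryPool.setLimit(50);  // shrink below current usage
        System.out.println("allocate 1 under cap 50:   " + child.allocate(1));
    }
}
```

Because the child holds no limit of its own, raising or lowering the pool limit takes effect on the child's next allocation, which is why no listener is needed in this model. Shrinking below current usage only blocks new allocations; it does not reclaim bytes already handed out.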