Improve native memory admission control precision and fix multiple bugs #21749
- Auto-derive effective native memory budget as totalRAM - jvmMaxHeap when node.native_memory.limit is not explicitly configured. This eliminates the requirement to set the setting for admission control to function.
- Subtract JVM non-heap committed memory (metaspace + code cache) from the native usage calculation so the metric reflects actual native allocations (Arrow, jemalloc/DataFusion) rather than fixed JVM overhead.
- Downgrade log from WARN to DEBUG when the limit is simply unconfigured to prevent log spam on every poll cycle.

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
Tests cover:
- Auto-derived limit when node.native_memory.limit is not configured
- Enforced mode rejecting requests when utilization exceeds threshold
- Monitor mode allowing requests through while counting breaches
- Requests succeeding when utilization is below threshold
- Dynamic threshold update taking effect at runtime

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
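The enforced and monitor behaviours listed in the tests can be illustrated with a minimal sketch. This is not the PR's implementation: the class name `AdmissionSketch`, the `Mode` enum, and the `admit` method are hypothetical, and the sketch assumes a request is treated as a breach only when utilization is strictly above the threshold.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of threshold-based admission; names are not taken from the PR. */
final class AdmissionSketch {
    enum Mode { MONITOR, ENFORCED }

    private final Mode mode;
    // volatile so a runtime (dynamic) threshold update is visible to request threads
    private volatile int thresholdPercent;
    private final AtomicLong breaches = new AtomicLong();

    AdmissionSketch(Mode mode, int thresholdPercent) {
        this.mode = mode;
        this.thresholdPercent = thresholdPercent;
    }

    /** Simulates a dynamic setting update. */
    void setThresholdPercent(int newThresholdPercent) {
        this.thresholdPercent = newThresholdPercent;
    }

    long breachCount() {
        return breaches.get();
    }

    /**
     * Returns true if the request may proceed. A breach is always counted;
     * only ENFORCED mode turns a breach into a rejection.
     */
    boolean admit(int utilizationPercent) {
        if (utilizationPercent <= thresholdPercent) {
            return true;
        }
        breaches.incrementAndGet();
        return mode == Mode.MONITOR;
    }
}
```

In this sketch, monitor mode never rejects but still increments the breach counter, which mirrors the "allowing requests through while counting breaches" case above.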
Use cluster-level stats injection and verify utilization tracking rather than attempting to trigger transport-level rejection (which requires the full shard allocation + SearchTransportService path).

Tests cover: auto-derived limit, explicit limit tracking, monitor mode pass-through, below-threshold success, and dynamic limit update.

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
❌ Gradle check result for 3360ee6: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Reduce the polling frequency of the native memory tracker by raising the poll interval from 500ms to 2s. Reading /proc/self/status every 500ms is unnecessary given that native memory changes gradually. Keep the RssAnon-unavailable log at WARN level since that indicates a platform issue.

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
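For readers unfamiliar with the RssAnon field, here is a minimal sketch of reading it from /proc/self/status. The class `RssAnonReader` and its methods are hypothetical and not the tracker from this PR; the sketch assumes the usual Linux line format `RssAnon:   <value> kB` and returns an empty result when the field is missing, which is the case where the commit above keeps a WARN log.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.OptionalLong;

/** Hypothetical sketch; not the tracker implementation from the PR. */
final class RssAnonReader {

    /** Parses the RssAnon line (reported in kB) and returns the value in bytes, if present. */
    static OptionalLong parseRssAnonBytes(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("RssAnon:")) {
                String[] parts = line.substring("RssAnon:".length()).trim().split("\\s+");
                try {
                    return OptionalLong.of(Long.parseLong(parts[0]) * 1024L);
                } catch (NumberFormatException e) {
                    return OptionalLong.empty(); // malformed value: treat as unavailable
                }
            }
        }
        return OptionalLong.empty(); // field absent, e.g. non-Linux platform or old kernel
    }

    /** Reads the current process status file; empty on any I/O problem. */
    static OptionalLong readRssAnonBytes() {
        try {
            return parseRssAnonBytes(Files.readAllLines(Path.of("/proc/self/status")));
        } catch (IOException | RuntimeException e) {
            return OptionalLong.empty();
        }
    }
}
```

Keeping the parsing separate from the file read makes the unavailable case easy to exercise in a unit test without depending on the host platform.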
Force-pushed from 3360ee6 to 2859e30.
❌ Gradle check result for 2859e30: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for e76084e: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Before closing the root allocator, force-close any remaining child allocators that were created via ArrowAllocatorService but not closed by their owners (e.g., DefaultPlanExecutor's coordinator allocator). This prevents IllegalStateException on node shutdown in tests and production graceful restarts.

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
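The shutdown ordering described in this commit can be sketched with a toy allocator hierarchy. `ToyAllocator` is a hypothetical stand-in written for illustration and is not the Arrow API or the ArrowAllocatorService code; it only assumes, as the commit message implies, that closing a parent with live children throws IllegalStateException.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for a hierarchical allocator; hypothetical, not the Arrow API. */
final class ToyAllocator implements AutoCloseable {
    private final String name;
    private final ToyAllocator parent;
    private final List<ToyAllocator> children = new ArrayList<>();
    private boolean closed;

    ToyAllocator(String name) {
        this(name, null);
    }

    private ToyAllocator(String name, ToyAllocator parent) {
        this.name = name;
        this.parent = parent;
    }

    ToyAllocator newChild(String childName) {
        ToyAllocator child = new ToyAllocator(childName, this);
        children.add(child);
        return child;
    }

    boolean isClosed() {
        return closed;
    }

    /** Plain close: fails if any child is still open. */
    @Override
    public void close() {
        if (closed) {
            return;
        }
        if (!children.isEmpty()) {
            throw new IllegalStateException(name + " closed with outstanding child allocators");
        }
        closed = true;
        if (parent != null) {
            parent.children.remove(this);
        }
    }

    /** Force-closes leftover children depth-first, then closes this allocator. */
    void closeWithChildren() {
        // iterate over a snapshot because close() removes the child from the live list
        for (ToyAllocator child : new ArrayList<>(children)) {
            child.closeWithChildren();
        }
        close();
    }
}
```

Calling `closeWithChildren()` on the root succeeds even when an owner forgot to close its child, whereas a plain `close()` on the root would throw.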
Force-pushed from e76084e to 8eeb786.
❌ Gradle check result for 8eeb786: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Change the fragment execution streaming handler from ThreadPool.Names.SAME to ThreadPool.Names.SEARCH. Running on SAME blocks the transport I/O thread for the entire query duration (observed at 115s in production), starving heartbeats, cluster state updates, and other transport actions.

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
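The difference between the two executor choices can be shown with plain java.util.concurrent, without any OpenSearch classes. In this sketch, `HandlerDispatchSketch`, `SAME`, and `namedPool` are hypothetical names: `SAME` models a direct executor that runs the handler on the calling thread, and the named pool models a dedicated search pool.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch of direct vs. pooled handler dispatch; not OpenSearch code. */
final class HandlerDispatchSketch {

    /** Direct executor: the task runs on whichever thread calls execute(). */
    static final Executor SAME = Runnable::run;

    /** Single-threaded pool whose worker thread carries the given name. */
    static ExecutorService namedPool(String threadName) {
        return Executors.newSingleThreadExecutor(r -> new Thread(r, threadName));
    }

    /** Runs a trivial handler on the executor and reports which thread executed it. */
    static String runOn(Executor executor) {
        CompletableFuture<String> ranOn = new CompletableFuture<>();
        executor.execute(() -> ranOn.complete(Thread.currentThread().getName()));
        return ranOn.join();
    }
}
```

With the direct executor the handler occupies the caller (here standing in for the transport I/O thread) until it returns; with the pool the caller only pays for the hand-off.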
Force-pushed from a3e8dec to 53f8dc8.
Force-pushed from b60f1b5 to 1c25c76.
Codecov Report
❌ Patch coverage is

Coverage Diff
               main    #21749      +/-
Coverage     73.43%    73.38%   -0.06%
Complexity    75074     74990      -84
Files          6012      6012
Lines        340934    340940       +6
Branches      49076     49076
Hits         250352    250182     -170
Misses        70640     70807     +167
Partials      19942     19951       +9
Force-pushed from 1c25c76 to 606f361.
Force-pushed from 606f361 to 8273c98.
Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
Force-pushed from 8273c98 to e10f542.
Fixes ReplicationTracker assertion race condition (opensearch-project#3923) by adding MockTransportService interceptor to gracefully handle checkpoint updates during relocation handoff. Removes @AwaitsFix as the root cause is now addressed.

Signed-off-by: Kavya Aggarwal <kavyaagg@amazon.com>
Add interceptCheckpointUpdates to TieringStatusIT and TierCancelIT
❌ Gradle check result for 93a8c26: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
❌ Gradle check result for 3bf2ede: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Description
Summary
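- Auto-derive the effective native memory budget as totalPhysicalMemory - jvmMaxHeap when node.native_memory.limit is not explicitly configured, eliminating the hard requirement to set this setting for admission control to function
- Subtract JVM non-heap committed memory (metaspace + code cache) from the native usage calculation so the metric reflects actual native allocations (Arrow buffers, jemalloc/DataFusion, direct byte buffers) rather than fixed JVM overhead
- Downgrade the log from WARN to DEBUG when the limit is simply unconfigured, to prevent log spam on every poll cycle (default interval now 2s, up from 500ms)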
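The arithmetic in the summary can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: the class `NativeMemoryBudget` and its method names are hypothetical, a configured limit of 0 or less is assumed to mean "not configured", results are clamped at zero as a defensive choice, and total physical memory and the raw usage reading are passed in by the caller.

```java
import java.lang.management.ManagementFactory;

/** Hypothetical sketch of the budget and usage arithmetic; names are not from the PR. */
final class NativeMemoryBudget {

    /** Explicit limit if configured (> 0, an assumption), else totalPhysicalMemory - jvmMaxHeap. */
    static long effectiveLimit(long configuredLimitBytes, long totalPhysicalBytes, long jvmMaxHeapBytes) {
        if (configuredLimitBytes > 0) {
            return configuredLimitBytes;
        }
        return Math.max(0L, totalPhysicalBytes - jvmMaxHeapBytes);
    }

    /** Removes committed JVM non-heap memory from the raw reading; clamped at zero. */
    static long nativeUsage(long rawBytes, long nonHeapCommittedBytes) {
        return Math.max(0L, rawBytes - nonHeapCommittedBytes);
    }

    /** Utilization as an integer percentage of the limit, capped at 100. */
    static int utilizationPercent(long usedBytes, long limitBytes) {
        if (limitBytes <= 0) {
            return 0;
        }
        return (int) Math.min(100L, usedBytes * 100L / limitBytes);
    }

    /** Max heap as seen by the running JVM. */
    static long jvmMaxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** Committed non-heap memory (includes metaspace and code cache) of the running JVM. */
    static long jvmNonHeapCommittedBytes() {
        return ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage().getCommitted();
    }
}
```

For example, on a hypothetical 64 GiB host with a 16 GiB max heap and no explicit limit, the derived budget in this sketch is 48 GiB.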
Related Issues
Resolves #[Issue number to be closed when this PR is merged]
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.