8385697: Add a Pooled Confined Arena#31365
Conversation
|
👋 Welcome back pminborg! A progress list of the required criteria for merging this PR into |
|
❗ This change is not yet ready to be integrated. |
| try { | ||
| confinedArenaAllocator.close(); | ||
| } catch (Exception e) { | ||
| e.printStackTrace(); |
There was a problem hiding this comment.
I think we should remove this
There was a problem hiding this comment.
Might be better to assign confinedArenaAllocator to a local first, then null out the confinedArenaAllocator then null-check the local and closing the local.
| ((ConfinedSession) backingArena.scope()).closeFromThreadCleanup(); | ||
| } | ||
|
|
||
| final class CachedArena implements Arena, NoInitAllocator { |
There was a problem hiding this comment.
My general feeling here is that the implementation is arranged the wrong way. E.g. in my mind, we have ArenaImpl, which is the type of the builtin arena we return. And, if an ArenaImpl is confined, it can allocate memory more cheaply, with the help of some kind of thread-backed allocator.
I feel the right arrangement is to have a SegmentAllocator (not an Arena) that returns usable regions of memory from a given thread. Maybe the allocator is very low level, it computes the next pointer, and does a MemorySegment.ofAddress(ptr) for the region. Then the ArenaImpl::allocate takes that, and does a reinterpret with the correct arena and size. When the confined arena closes, the memory is returned to the underlying pool.
Since this is the builtin confined arena we're talking about, I'm not sure about CachedArena -- as that looks like any other 3rd party Arena. I think we can achieve tighter integration?
There was a problem hiding this comment.
More specifically, I think that in both the confined and shared/auto case, what you want is a setup like this:
- when an arena is created, we try to acquire some memory from some pool
- if we fail, then the arena behaves like before
- otherwise, the first N allocations of the arena will be served by the pool
- after that, we either try to acquire another pool, or fallback to default impl
The only difference between confined and shared is where the pool lives, and how it is acquired.
There was a problem hiding this comment.
My feeling is the same here. Looking at the patch it feels like the control is inverted. I'd expect either ArenaImpl to be changed directly to do what Maurizio lists here, or to have a ConfinedArenaImpl sub class that does these things (but I think from internal discussion we wanted to do something that works for shared/auto as well)
In my mind we don't want a different arena impl, we want the existing arena impls do to something different, if that makes sense.
There was a problem hiding this comment.
More specifically, I think that in both the confined and shared/auto case, what you want is a setup like this:
- when an arena is created, we try to acquire some memory from some pool
- if we fail, then the arena behaves like before
- otherwise, the first N allocations of the arena will be served by the pool
- after that, we either try to acquire another pool, or fallback to default impl
The only difference between confined and shared is where the pool lives, and how it is acquired.
There are some issues with the proposal:
- If there are no allocations or if there are allocations that are larger than the pool size, we cannot defer allocation from the pool. I think we can adhere to the principle outlined above, but do lazy allocation instead.
- We then need to be able to handle N slots per arena. That is doable but more complex. I think that just increasing the pool size would achieve the same thing (or would be even better), but with lower complexity. I.e., setting the pool size to 128 bytes rather than having two separate 64-byte pools.
The work with other arenas is future work, but if we can have a better structure that allows them to be more easily added in the future, that is good.
There was a problem hiding this comment.
My general feeling here is that the implementation is arranged the wrong way. E.g. in my mind, we have ArenaImpl, which is the type of the builtin arena we return. And, if an ArenaImpl is confined, it can allocate memory more cheaply, with the help of some kind of thread-backed allocator.
I feel the right arrangement is to have a SegmentAllocator (not an Arena) that returns usable regions of memory from a given thread. Maybe the allocator is very low level, it computes the next pointer, and does a
MemorySegment.ofAddress(ptr)for the region. Then the ArenaImpl::allocate takes that, and does a reinterpret with the correct arena and size. When the confined arena closes, the memory is returned to the underlying pool.Since this is the builtin confined arena we're talking about, I'm not sure about CachedArena -- as that looks like any other 3rd party Arena. I think we can achieve tighter integration?
There are some good suggestions here worth exploring.
In one of the prototypes, I extended ArenaImpl, but eventually I opted for a more decoupled approach. But I can take a new look at centralizing logic in ArenaImpl instead.
|
One possible concern here is with clients that expect We have seen evidence of that here: |
| // Only call this method during thread cleanup! | ||
| void closeFromThreadCleanup() { | ||
| if (ownerThread().isVirtual()) { | ||
| // Allow thread cleanup to close a confined arena on behalf of the | ||
| // owning virtual thread after it has terminated. | ||
| if (state < OPEN) { | ||
| throw ALREADY_CLOSED.newRuntimeException(); | ||
| } | ||
| } else { | ||
| checkValidState(); | ||
| } | ||
| state = CLOSED; | ||
| resourceList.cleanup(); | ||
| } | ||
|
|
There was a problem hiding this comment.
Why is this bypass needed? Is clearReferences called from another thread?
There was a problem hiding this comment.
Yes. A virtual thread is cleaned up by another thread. This is unfortunate.
Yes. There will be observable behavior changes in how segments are handled. But in no way, we guarantee that there will be a malloc/free invocations when using arenas. |
| /* | ||
| * Lazily initialized allocator used to speed up small allocations made by | ||
| * confined arenas owned by this thread. | ||
| */ | ||
| @Stable | ||
| private AutoCloseable confinedArenaAllocator; |
There was a problem hiding this comment.
I'm on the fence about this—adding a new field into Thread (which also means for all Virtual Threads that get allocated) for a very specific use-case—what other options exist, and do VirtualThreads need their own caches or can they rely on the caches of their carriers?
There was a problem hiding this comment.
Experiments with carrier-local caches reveal that it is more memory efficient, but very complex and slow. A virtual thread can be remounted on another carrier, for example, which creates problems.
For platform threads, I do not believe an extra field is a problem. Creating such a thread is resource-intensive anyhow, and an extra field is well into the noise.
For virtual threads, I am a bit uncertain how such a thread is "parked". If one creates a million virtual threads, then we typically would need an extra 4 MiB if no pools are ever created (just field overhead). There are already a large number of fields in Thread/LocalThread.
Using a ThreadLocal is much slower.
There was a problem hiding this comment.
The challenge is that it is a tragedy of the commons-kind of situation, if every possible use cases reasons the same then it is 4MB * use-cases for adding fields to Thread. Given that this use-case is rather niche in terms of "how many threads will over the course of their lifetime make use of this field".
I may be that my calibration is off here, so I'm inviting @AlanBateman to the chat to do a sanity-check on me :)
There was a problem hiding this comment.
Whether this field increases the size of Thread objects also depends on the alignment and size of all existing Thread fields (both normal Java and VM‑injected)
| /* | ||
| * Lazily initialized allocator used to speed up small allocations made by | ||
| * confined arenas owned by this thread. | ||
| */ | ||
| @Stable | ||
| private AutoCloseable confinedArenaAllocator; |
There was a problem hiding this comment.
Whether this field increases the size of Thread objects also depends on the alignment and size of all existing Thread fields (both normal Java and VM‑injected)
|
|
||
| package jdk.internal.foreign; | ||
|
|
||
| public interface NoInitAllocator { |
There was a problem hiding this comment.
This interface should probably extend SegmentAllocator:
| public interface NoInitAllocator { | |
| import java.lang.foreign.SegmentAllocator; | |
| public interface NoInitAllocator extends SegmentAllocator { |
Summary
This PR proposes to introduce a pooled confined arena as an optimization for
Arena.ofConfined(), where small native allocations can be served from a reusable per-thread memory pool instead of calling the regular native allocator for every short-lived arena. The arena remains confined to its owner thread and is still closed normally, but its backing storage can be reset and reused when the arena closes. The feature requires no API changes.Pooled memory is zeroed out upon closing an Arena to minimize data visibility between reuse. This means the data is visible only within a TWR block, and never outside it.
By default, a confined arena has access to 64 bytes of pooled data, and up to two nested confined arenas can gain access to pooling, covering the case for potential upcall allocations. These figures are configurable via system properties. Pooling can also be turned off completely by setting either the pool power-of-two size to zero or the number of nested pools to zero.
Static Analysis
An extensive static corpus analysis of third-party libraries and the JDK itself has been conducted with respect to
Area.ofConfined()usage, revealing that confined arenas were used only in TWR blocks and never in an unstructured way. The static analysis further revealed that in most cases, only a small amount of native memory was ever allocated, usually less than 32 bytes, and in many cases, 8 bytes or less. This usage pattern lends itself well to pooling.Dynamic Analysis
A dynamic statistical analysis of actual runs was also made, where various properties of confined arenas were recorded and summarized during a complete tier1 test run. While a tier1 run is not necessarily representative of a typical application workload, it provided some interesting results:
The run produced 93 per-process histogram blocks and 788,773,092 closed confined arenas. The result is dominated by arenas with no native allocation at all: 375,934,768 arenas (47.661%) are in the zero-byte bucket. Counting arenas up to 63 bytes covers 99.997% of all arena closures.
The largest count bucket is 8-15 bytes per arena with 400,951,293 arenas (50.832% of all arenas). The largest byte bucket is 8-15 bytes per arena with 3,207,623,039 B (3,059.03 MiB) (46.794% of all bytes). Buckets below 64 KiB preserve very close to 100% of arena count and 53.583% of bytes.
Capturing total allocation per arena shows a surprisingly large population of arenas that close without native allocation. This changes the count-weighted view substantially: zero-byte arenas alone represent 47.661% of closures, while arenas at or below 63 bytes represent 99.997%.
The byte-weighted view is led by the 8-15 byte bucket, with additional weight from the 32-63 byte bucket and the large-allocation tail. The 8-15 byte bucket contributes 46.794% of all bytes, the 32-63 byte bucket contributes 6.320%, and arenas at or above 64 KiB are only 0.000061% of closures but contribute 46.417% of bytes. This suggests separating zero/tiny arenas from larger total-allocation arenas when evaluating confined-arena cache policy.
Looking at a more granular histogram where we can observe all the occurrences of 0-63 byte buckets individually (and where the 63-byte bucket accounts for 63 bytes and above), we can see that:
The above data supports that picking a 64-byte-sized pool would enable most allocations to be served from the pool.
Performance
Pooling improves tiny confined allocations consistently across all platforms. For 5-byte allocations, speedups range from 3.1x on Linux aarch64 to 8.2x on Windows x64. For 20-byte allocations, speedups range from 2.8x to 7.5x.
For 100 bytes and larger, results are broadly neutral (as the default pool size is 64 bytes), with small wins/losses around parity. That is consistent with the feature targeting small allocations that fit in the cached pool slot; larger allocations mostly follow the previous, regular path.
A performance test with three consecutive allocations (rather than just one) shows an even more pronounced gain. On a MacOSX aarch64 (M1), the performance increase was 12x for 3 * 5 bytes and 13x for 3 * 20 bytes.
It is also believed that pooling would reduce the load on the operating system itself, as there are fewer malloc/free calls.
Workloads
Workloads most likely to benefit are those that use the FFM API with many short-lived
Arena.ofConfined()allocations, especially tiny native segments.Most likely winners:
try (Arena arena = Arena.ofConfined()) { ... }around each native call.allocateFrom(...)calls, for example, short byte arrays or short strings passed to native code.Less likely to benefit:
In short, this PR helps FFM code that treats confined arenas as cheap, scoped scratch space for small native call arguments. The benchmark shape supports that: large gains for 0-64 byte allocations, near-neutral behavior for larger sizes.
Memory Allocation Concerns
Care has been taken to mitigate memory allocation by lazily allocating pool slots on a per-need basis. Threads that do not use confined arenas will never allocate memory pools. Only a single reference field overhead is then imposed.
Threads that do not use nested confined arenas will only allocate a single pool. Confined arenas that allocate larger segments will not acquire a pool until an allocation can fit in the pool.
When a thread stops, the pools are deterministically freed and returned to the system.
Notes
Depending on the workload, the hot path can become bi-morphic on
ArenaImplandThreadConfinedSegmentPool.CachedArena. Before, it could be monomorphic.Discussion Points
Future Work
CaptureStateUtilclass.Testing
This PR passes the following tiers on multiple platforms:
Progress
Issue
Reviewing
Using
gitCheckout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/31365/head:pull/31365$ git checkout pull/31365Update a local copy of the PR:
$ git checkout pull/31365$ git pull https://git.openjdk.org/jdk.git pull/31365/headUsing Skara CLI tools
Checkout this PR locally:
$ git pr checkout 31365View PR using the GUI difftool:
$ git pr show -t 31365Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/31365.diff