8385697: Add a Pooled Confined Arena by minborg · Pull Request #31365 · openjdk/jdk

minborg · 2026-06-03T08:50:01Z

Summary

This PR proposes to introduce a pooled confined arena as an optimization for Arena.ofConfined(), where small native allocations can be served from a reusable per-thread memory pool instead of calling the regular native allocator for every short-lived arena. The arena remains confined to its owner thread and is still closed normally, but its backing storage can be reset and reused when the arena closes. The feature requires no API changes.

Pooled memory is zeroed out upon closing an Arena to minimize data visibility between reuse. This means the data is visible only within a TWR block, and never outside it.

By default, a confined arena has access to 64 bytes of pooled data, and up to two nested confined arenas can gain access to pooling, covering the case for potential upcall allocations. These figures are configurable via system properties. Pooling can also be turned off completely by setting either the pool power-of-two size to zero or the number of nested pools to zero.

Static Analysis

An extensive static corpus analysis of third-party libraries and the JDK itself has been conducted with respect to Area.ofConfined() usage, revealing that confined arenas were used only in TWR blocks and never in an unstructured way. The static analysis further revealed that in most cases, only a small amount of native memory was ever allocated, usually less than 32 bytes, and in many cases, 8 bytes or less. This usage pattern lends itself well to pooling.

Dynamic Analysis

A dynamic statistical analysis of actual runs was also made, where various properties of confined arenas were recorded and summarized during a complete tier1 test run. While a tier1 run is not necessarily representative of a typical application workload, it provided some interesting results:

The run produced 93 per-process histogram blocks and 788,773,092 closed confined arenas. The result is dominated by arenas with no native allocation at all: 375,934,768 arenas (47.661%) are in the zero-byte bucket. Counting arenas up to 63 bytes covers 99.997% of all arena closures.

The largest count bucket is 8-15 bytes per arena with 400,951,293 arenas (50.832% of all arenas). The largest byte bucket is 8-15 bytes per arena with 3,207,623,039 B (3,059.03 MiB) (46.794% of all bytes). Buckets below 64 KiB preserve very close to 100% of arena count and 53.583% of bytes.

Metric	Value
Histogram blocks / JVMs	93 / 93
Total confined arenas closed	788,773,092
Total allocated bytes recorded	6,854,736,704 B (6,537.19 MiB)
Average bytes per arena	8.69 B
Arenas with zero allocated bytes	375,934,768 (47.661%)
Arenas <= 15 bytes	776,891,372 (98.494%)
Arenas <= 31 bytes	778,210,621 (98.661%)
Bytes from arenas <= 31 bytes	3,231,340,337 B (3,081.65 MiB) (47.140%)
Arenas <= 63 bytes	788,747,109 (99.997%)
Bytes from arenas <= 63 bytes	3,664,546,412 B (3,494.78 MiB) (53.460%)
Arenas < 64 KiB	788,772,608 (100.000%)
Bytes from arenas < 64 KiB	3,672,973,605 B (3,502.82 MiB) (53.583%)
Arenas >= 64 KiB	484 (0.000061%)
Bytes from arenas >= 64 KiB	3,181,763,099 B (3,034.37 MiB) (46.417%)
Arenas >= 512 KiB	473 (0.000060%)
Bytes from arenas >= 512 KiB	3,178,724,123 B (3,031.47 MiB) (46.373%)

Capturing total allocation per arena shows a surprisingly large population of arenas that close without native allocation. This changes the count-weighted view substantially: zero-byte arenas alone represent 47.661% of closures, while arenas at or below 63 bytes represent 99.997%.

The byte-weighted view is led by the 8-15 byte bucket, with additional weight from the 32-63 byte bucket and the large-allocation tail. The 8-15 byte bucket contributes 46.794% of all bytes, the 32-63 byte bucket contributes 6.320%, and arenas at or above 64 KiB are only 0.000061% of closures but contribute 46.417% of bytes. This suggests separating zero/tiny arenas from larger total-allocation arenas when evaluating confined-arena cache policy.

Looking at a more granular histogram where we can observe all the occurrences of 0-63 byte buckets individually (and where the 63-byte bucket accounts for 63 bytes and above), we can see that:

The workload is dominated by empty arenas and 8-byte arenas: together they account for about 98.34% of all arenas.
The 8-byte bucket alone accounts for 52.92% of arenas and 47.00% of total allocated bytes.
The 63+ bucket is tiny by arena count, but contributes 46.07% of total bytes, so large arenas are rare but byte-heavy.
This distribution strongly supports optimizing small confined arenas, especially the 8-byte case.

The above data supports that picking a 64-byte-sized pool would enable most allocations to be served from the pool.

Performance

Pooling improves tiny confined allocations consistently across all platforms. For 5-byte allocations, speedups range from 3.1x on Linux aarch64 to 8.2x on Windows x64. For 20-byte allocations, speedups range from 2.8x to 7.5x.

For 100 bytes and larger, results are broadly neutral (as the default pool size is 64 bytes), with small wins/losses around parity. That is consistent with the feature targeting small allocations that fit in the cached pool slot; larger allocations mostly follow the previous, regular path.

Platform	5-byte speedup	20-byte speedup	100+ byte behavior
Linux aarch64	3.13x	2.79x	Near parity
Linux x64	3.51x	3.26x	Near parity
MacOSX aarch64	4.21x	4.04x	Near parity
Windows x64	8.16x	7.45x	Near parity

A performance test with three consecutive allocations (rather than just one) shows an even more pronounced gain. On a MacOSX aarch64 (M1), the performance increase was 12x for 3 * 5 bytes and 13x for 3 * 20 bytes.

It is also believed that pooling would reduce the load on the operating system itself, as there are fewer malloc/free calls.

Workloads

Workloads most likely to benefit are those that use the FFM API with many short-lived Arena.ofConfined() allocations, especially tiny native segments.

Most likely winners:

Native library bindings using FFM in hot paths, especially jextract-style wrappers.
Code that does try (Arena arena = Arena.ofConfined()) { ... } around each native call.
Allocations of small C values or structs: int*, long*, handles, pointers, small out-parameters, small option structs.
Small allocateFrom(...) calls, for example, short byte arrays or short strings passed to native code.
Latency-sensitive loops where the fixed cost of creating/closing confined arenas dominates the actual native allocation size.
Workloads with per-thread, non-overlapping confined arenas, or only a small number of concurrently live confined arenas.

Less likely to benefit:

Large native allocations, since they fall back to the regular path.
Long-lived arenas with many allocations, where arena creation/close overhead is already amortized.
Arena.ofShared(), Arena.ofAuto(), or Arena.global() users.
Code already using custom slicing/pooling allocators.
Workloads with many concurrently live confined arenas per thread beyond the pool slot count.

In short, this PR helps FFM code that treats confined arenas as cheap, scoped scratch space for small native call arguments. The benchmark shape supports that: large gains for 0-64 byte allocations, near-neutral behavior for larger sizes.

Memory Allocation Concerns

Care has been taken to mitigate memory allocation by lazily allocating pool slots on a per-need basis. Threads that do not use confined arenas will never allocate memory pools. Only a single reference field overhead is then imposed.

Threads that do not use nested confined arenas will only allocate a single pool. Confined arenas that allocate larger segments will not acquire a pool until an allocation can fit in the pool.

When a thread stops, the pools are deterministically freed and returned to the system.

Notes

Depending on the workload, the hot path can become bi-morphic on ArenaImpl and ThreadConfinedSegmentPool.CachedArena. Before, it could be monomorphic.

Discussion Points

Should pooling be enabled by default or only enabled by setting a system property?
Should the default pool size be 64 bytes or something else?
Should the default nesting depth for pools be two or something else?

Future Work

If this PR is integrated, we can consider removing the internal CaptureStateUtil class.
Other arena types may also be pooled.

Testing

This PR passes the following tiers on multiple platforms:

tier1
tier2
tier3

I confirm that I make this contribution in accordance with the OpenJDK Interim AI Policy.

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Issue

JDK-8385697: Add a Pooled Confined Arena (Enhancement - P3)(⚠️ The fixVersion in this issue is [28] but the fixVersion in .jcheck/conf is 27, a new backport will be created when this pr is integrated.)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/31365/head:pull/31365
$ git checkout pull/31365

Update a local copy of the PR:
$ git checkout pull/31365
$ git pull https://git.openjdk.org/jdk.git pull/31365/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 31365

View PR using the GUI difftool:
$ git pr show -t 31365

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/31365.diff

bridgekeeper · 2026-06-03T08:52:06Z

👋 Welcome back pminborg! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2026-06-03T08:52:58Z

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

openjdk · 2026-06-03T08:54:16Z

@minborg The following label will be automatically applied to this pull request:

core-libs

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

minborg · 2026-06-03T09:30:59Z

+            try {
+                confinedArenaAllocator.close();
+            } catch (Exception e) {
+                e.printStackTrace();


I think we should remove this

Might be better to assign confinedArenaAllocator to a local first, then null out the confinedArenaAllocator then null-check the local and closing the local.

mcimadamore · 2026-06-03T11:23:05Z

+        ((ConfinedSession) backingArena.scope()).closeFromThreadCleanup();
+    }
+
+    final class CachedArena implements Arena, NoInitAllocator {


My general feeling here is that the implementation is arranged the wrong way. E.g. in my mind, we have ArenaImpl, which is the type of the builtin arena we return. And, if an ArenaImpl is confined, it can allocate memory more cheaply, with the help of some kind of thread-backed allocator.

I feel the right arrangement is to have a SegmentAllocator (not an Arena) that returns usable regions of memory from a given thread. Maybe the allocator is very low level, it computes the next pointer, and does a MemorySegment.ofAddress(ptr) for the region. Then the ArenaImpl::allocate takes that, and does a reinterpret with the correct arena and size. When the confined arena closes, the memory is returned to the underlying pool.

Since this is the builtin confined arena we're talking about, I'm not sure about CachedArena -- as that looks like any other 3rd party Arena. I think we can achieve tighter integration?

More specifically, I think that in both the confined and shared/auto case, what you want is a setup like this:

when an arena is created, we try to acquire some memory from some pool

if we fail, then the arena behaves like before

otherwise, the first N allocations of the arena will be served by the pool

after that, we either try to acquire another pool, or fallback to default impl

The only difference between confined and shared is where the pool lives, and how it is acquired.

My feeling is the same here. Looking at the patch it feels like the control is inverted. I'd expect either ArenaImpl to be changed directly to do what Maurizio lists here, or to have a ConfinedArenaImpl sub class that does these things (but I think from internal discussion we wanted to do something that works for shared/auto as well)

In my mind we don't want a different arena impl, we want the existing arena impls do to something different, if that makes sense.

More specifically, I think that in both the confined and shared/auto case, what you want is a setup like this:

when an arena is created, we try to acquire some memory from some pool

if we fail, then the arena behaves like before

otherwise, the first N allocations of the arena will be served by the pool

after that, we either try to acquire another pool, or fallback to default impl

The only difference between confined and shared is where the pool lives, and how it is acquired.

There are some issues with the proposal:

If there are no allocations or if there are allocations that are larger than the pool size, we cannot defer allocation from the pool. I think we can adhere to the principle outlined above, but do lazy allocation instead.

We then need to be able to handle N slots per arena. That is doable but more complex. I think that just increasing the pool size would achieve the same thing (or would be even better), but with lower complexity. I.e., setting the pool size to 128 bytes rather than having two separate 64-byte pools.

The work with other arenas is future work, but if we can have a better structure that allows them to be more easily added in the future, that is good.

My general feeling here is that the implementation is arranged the wrong way. E.g. in my mind, we have ArenaImpl, which is the type of the builtin arena we return. And, if an ArenaImpl is confined, it can allocate memory more cheaply, with the help of some kind of thread-backed allocator.

I feel the right arrangement is to have a SegmentAllocator (not an Arena) that returns usable regions of memory from a given thread. Maybe the allocator is very low level, it computes the next pointer, and does a MemorySegment.ofAddress(ptr) for the region. Then the ArenaImpl::allocate takes that, and does a reinterpret with the correct arena and size. When the confined arena closes, the memory is returned to the underlying pool.

Since this is the builtin confined arena we're talking about, I'm not sure about CachedArena -- as that looks like any other 3rd party Arena. I think we can achieve tighter integration?

There are some good suggestions here worth exploring.

In one of the prototypes, I extended ArenaImpl, but eventually I opted for a more decoupled approach. But I can take a new look at centralizing logic in ArenaImpl instead.

mcimadamore · 2026-06-03T11:26:26Z

One possible concern here is with clients that expect Arena::allocate to result in a call to malloc. Some of these clients might expect to be able to override the system allocator -- e.g with jemalloc, to maybe take advantage of additional features such as use after free protection.

We have seen evidence of that here:
#28235

JornVernee · 2026-06-03T11:56:20Z

+    // Only call this method during thread cleanup!
+    void closeFromThreadCleanup() {
+        if (ownerThread().isVirtual()) {
+            // Allow thread cleanup to close a confined arena on behalf of the
+            // owning virtual thread after it has terminated.
+            if (state < OPEN) {
+                throw ALREADY_CLOSED.newRuntimeException();
+            }
+        } else {
+            checkValidState();
+        }
+        state = CLOSED;
+        resourceList.cleanup();
+    }
+


Why is this bypass needed? Is clearReferences called from another thread?

Yes. A virtual thread is cleaned up by another thread. This is unfortunate.

minborg · 2026-06-03T13:06:35Z

One possible concern here is with clients that expect Arena::allocate to result in a call to malloc. Some of these clients might expect to be able to override the system allocator -- e.g with jemalloc, to maybe take advantage of additional features such as use after free protection.

We have seen evidence of that here: #28235

Yes. There will be observable behavior changes in how segments are handled. But in no way, we guarantee that there will be a malloc/free invocations when using arenas.

viktorklang-ora · 2026-06-03T13:42:55Z

+    /*
+     * Lazily initialized allocator used to speed up small allocations made by
+     * confined arenas owned by this thread.
+     */
+    @Stable
+    private AutoCloseable confinedArenaAllocator;


I'm on the fence about this—adding a new field into Thread (which also means for all Virtual Threads that get allocated) for a very specific use-case—what other options exist, and do VirtualThreads need their own caches or can they rely on the caches of their carriers?

Experiments with carrier-local caches reveal that it is more memory efficient, but very complex and slow. A virtual thread can be remounted on another carrier, for example, which creates problems.

For platform threads, I do not believe an extra field is a problem. Creating such a thread is resource-intensive anyhow, and an extra field is well into the noise.

For virtual threads, I am a bit uncertain how such a thread is "parked". If one creates a million virtual threads, then we typically would need an extra 4 MiB if no pools are ever created (just field overhead). There are already a large number of fields in Thread/LocalThread.

Using a ThreadLocal is much slower.

The challenge is that it is a tragedy of the commons-kind of situation, if every possible use cases reasons the same then it is 4MB * use-cases for adding fields to Thread. Given that this use-case is rather niche in terms of "how many threads will over the course of their lifetime make use of this field".

I may be that my calibration is off here, so I'm inviting @AlanBateman to the chat to do a sanity-check on me :)

Whether this field increases the size of Thread objects also depends on the alignment and size of all existing Thread fields (both normal Java and VM‑injected)

ExE-Boss · 2026-06-04T07:18:38Z

+    /*
+     * Lazily initialized allocator used to speed up small allocations made by
+     * confined arenas owned by this thread.
+     */
+    @Stable
+    private AutoCloseable confinedArenaAllocator;


Whether this field increases the size of Thread objects also depends on the alignment and size of all existing Thread fields (both normal Java and VM‑injected)

ExE-Boss · 2026-06-04T07:21:53Z

+
+package jdk.internal.foreign;
+
+public interface NoInitAllocator {


This interface should probably extend SegmentAllocator:

Suggested change

public interface NoInitAllocator {

import java.lang.foreign.SegmentAllocator;

public interface NoInitAllocator extends SegmentAllocator {

minborg added 8 commits June 1, 2026 18:39

Add a thread confined segment pool

17bfe6f

Add tests

17804a7

Improve testing

67feab3

Zero out after use rather than before

f751ba0

Improve testing

c584e5c

Fix initial zeroing

117eb29

Move zeroing section

63dbffa

Clean up

159ae33

openjdk Bot added the core-libs core-libs-dev@openjdk.org label Jun 3, 2026

Revoke change in benchmarks

04f2481

minborg commented Jun 3, 2026

View reviewed changes

mcimadamore reviewed Jun 3, 2026

View reviewed changes

JornVernee reviewed Jun 3, 2026

View reviewed changes

viktorklang-ora reviewed Jun 3, 2026

View reviewed changes

ExE-Boss reviewed Jun 4, 2026

View reviewed changes


		package jdk.internal.foreign;

		public interface NoInitAllocator {

-public interface NoInitAllocator {
+import java.lang.foreign.SegmentAllocator;
+public interface NoInitAllocator extends SegmentAllocator {

Conversation

minborg commented Jun 3, 2026 • edited by openjdk Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Static Analysis

Dynamic Analysis

Performance

Workloads

Most likely winners:

Less likely to benefit:

Memory Allocation Concerns

Notes

Discussion Points

Future Work

Testing

Progress

Issue

Reviewing

Uh oh!

bridgekeeper Bot commented Jun 3, 2026

Uh oh!

openjdk Bot commented Jun 3, 2026

Uh oh!

openjdk Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JornVernee Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mcimadamore commented Jun 3, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

minborg commented Jun 3, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

5 participants

minborg commented Jun 3, 2026 •

edited by openjdk Bot

Loading

openjdk Bot commented Jun 3, 2026 •

edited

Loading

JornVernee Jun 3, 2026 •

edited

Loading