Skip to content

8385697: Add a Pooled Confined Arena#31365

Draft
minborg wants to merge 9 commits into
openjdk:masterfrom
minborg:rfe-cached-arena-ofconfined-clean
Draft

8385697: Add a Pooled Confined Arena#31365
minborg wants to merge 9 commits into
openjdk:masterfrom
minborg:rfe-cached-arena-ofconfined-clean

Conversation

@minborg
Copy link
Copy Markdown
Contributor

@minborg minborg commented Jun 3, 2026

Summary

This PR proposes to introduce a pooled confined arena as an optimization for Arena.ofConfined(), where small native allocations can be served from a reusable per-thread memory pool instead of calling the regular native allocator for every short-lived arena. The arena remains confined to its owner thread and is still closed normally, but its backing storage can be reset and reused when the arena closes. The feature requires no API changes.

Pooled memory is zeroed out upon closing an Arena to minimize data visibility between reuse. This means the data is visible only within a TWR block, and never outside it.

By default, a confined arena has access to 64 bytes of pooled data, and up to two nested confined arenas can gain access to pooling, covering the case for potential upcall allocations. These figures are configurable via system properties. Pooling can also be turned off completely by setting either the pool power-of-two size to zero or the number of nested pools to zero.

Static Analysis

An extensive static corpus analysis of third-party libraries and the JDK itself has been conducted with respect to Area.ofConfined() usage, revealing that confined arenas were used only in TWR blocks and never in an unstructured way. The static analysis further revealed that in most cases, only a small amount of native memory was ever allocated, usually less than 32 bytes, and in many cases, 8 bytes or less. This usage pattern lends itself well to pooling.

Dynamic Analysis

A dynamic statistical analysis of actual runs was also made, where various properties of confined arenas were recorded and summarized during a complete tier1 test run. While a tier1 run is not necessarily representative of a typical application workload, it provided some interesting results:

The run produced 93 per-process histogram blocks and 788,773,092 closed confined arenas. The result is dominated by arenas with no native allocation at all: 375,934,768 arenas (47.661%) are in the zero-byte bucket. Counting arenas up to 63 bytes covers 99.997% of all arena closures.

The largest count bucket is 8-15 bytes per arena with 400,951,293 arenas (50.832% of all arenas). The largest byte bucket is 8-15 bytes per arena with 3,207,623,039 B (3,059.03 MiB) (46.794% of all bytes). Buckets below 64 KiB preserve very close to 100% of arena count and 53.583% of bytes.

Metric Value
Histogram blocks / JVMs 93 / 93
Total confined arenas closed 788,773,092
Total allocated bytes recorded 6,854,736,704 B (6,537.19 MiB)
Average bytes per arena 8.69 B
Arenas with zero allocated bytes 375,934,768 (47.661%)
Arenas <= 15 bytes 776,891,372 (98.494%)
Arenas <= 31 bytes 778,210,621 (98.661%)
Bytes from arenas <= 31 bytes 3,231,340,337 B (3,081.65 MiB) (47.140%)
Arenas <= 63 bytes 788,747,109 (99.997%)
Bytes from arenas <= 63 bytes 3,664,546,412 B (3,494.78 MiB) (53.460%)
Arenas < 64 KiB 788,772,608 (100.000%)
Bytes from arenas < 64 KiB 3,672,973,605 B (3,502.82 MiB) (53.583%)
Arenas >= 64 KiB 484 (0.000061%)
Bytes from arenas >= 64 KiB 3,181,763,099 B (3,034.37 MiB) (46.417%)
Arenas >= 512 KiB 473 (0.000060%)
Bytes from arenas >= 512 KiB 3,178,724,123 B (3,031.47 MiB) (46.373%)
image image image

Capturing total allocation per arena shows a surprisingly large population of arenas that close without native allocation. This changes the count-weighted view substantially: zero-byte arenas alone represent 47.661% of closures, while arenas at or below 63 bytes represent 99.997%.

The byte-weighted view is led by the 8-15 byte bucket, with additional weight from the 32-63 byte bucket and the large-allocation tail. The 8-15 byte bucket contributes 46.794% of all bytes, the 32-63 byte bucket contributes 6.320%, and arenas at or above 64 KiB are only 0.000061% of closures but contribute 46.417% of bytes. This suggests separating zero/tiny arenas from larger total-allocation arenas when evaluating confined-arena cache policy.

Looking at a more granular histogram where we can observe all the occurrences of 0-63 byte buckets individually (and where the 63-byte bucket accounts for 63 bytes and above), we can see that:

  • The workload is dominated by empty arenas and 8-byte arenas: together they account for about 98.34% of all arenas.
  • The 8-byte bucket alone accounts for 52.92% of arenas and 47.00% of total allocated bytes.
  • The 63+ bucket is tiny by arena count, but contributes 46.07% of total bytes, so large arenas are rare but byte-heavy.
  • This distribution strongly supports optimizing small confined arenas, especially the 8-byte case.
image

The above data supports that picking a 64-byte-sized pool would enable most allocations to be served from the pool.

Performance

Pooling improves tiny confined allocations consistently across all platforms. For 5-byte allocations, speedups range from 3.1x on Linux aarch64 to 8.2x on Windows x64. For 20-byte allocations, speedups range from 2.8x to 7.5x.

For 100 bytes and larger, results are broadly neutral (as the default pool size is 64 bytes), with small wins/losses around parity. That is consistent with the feature targeting small allocations that fit in the cached pool slot; larger allocations mostly follow the previous, regular path.

Platform 5-byte speedup 20-byte speedup 100+ byte behavior
Linux aarch64 3.13x 2.79x Near parity
Linux x64 3.51x 3.26x Near parity
MacOSX aarch64 4.21x 4.04x Near parity
Windows x64 8.16x 7.45x Near parity
image image

A performance test with three consecutive allocations (rather than just one) shows an even more pronounced gain. On a MacOSX aarch64 (M1), the performance increase was 12x for 3 * 5 bytes and 13x for 3 * 20 bytes.

It is also believed that pooling would reduce the load on the operating system itself, as there are fewer malloc/free calls.

Workloads

Workloads most likely to benefit are those that use the FFM API with many short-lived Arena.ofConfined() allocations, especially tiny native segments.

Most likely winners:

  • Native library bindings using FFM in hot paths, especially jextract-style wrappers.
  • Code that does try (Arena arena = Arena.ofConfined()) { ... } around each native call.
  • Allocations of small C values or structs: int*, long*, handles, pointers, small out-parameters, small option structs.
  • Small allocateFrom(...) calls, for example, short byte arrays or short strings passed to native code.
  • Latency-sensitive loops where the fixed cost of creating/closing confined arenas dominates the actual native allocation size.
  • Workloads with per-thread, non-overlapping confined arenas, or only a small number of concurrently live confined arenas.

Less likely to benefit:

  • Large native allocations, since they fall back to the regular path.
  • Long-lived arenas with many allocations, where arena creation/close overhead is already amortized.
  • Arena.ofShared(), Arena.ofAuto(), or Arena.global() users.
  • Code already using custom slicing/pooling allocators.
  • Workloads with many concurrently live confined arenas per thread beyond the pool slot count.

In short, this PR helps FFM code that treats confined arenas as cheap, scoped scratch space for small native call arguments. The benchmark shape supports that: large gains for 0-64 byte allocations, near-neutral behavior for larger sizes.

Memory Allocation Concerns

Care has been taken to mitigate memory allocation by lazily allocating pool slots on a per-need basis. Threads that do not use confined arenas will never allocate memory pools. Only a single reference field overhead is then imposed.

Threads that do not use nested confined arenas will only allocate a single pool. Confined arenas that allocate larger segments will not acquire a pool until an allocation can fit in the pool.

When a thread stops, the pools are deterministically freed and returned to the system.

Notes

Depending on the workload, the hot path can become bi-morphic on ArenaImpl and ThreadConfinedSegmentPool.CachedArena. Before, it could be monomorphic.

Discussion Points

  • Should pooling be enabled by default or only enabled by setting a system property?
  • Should the default pool size be 64 bytes or something else?
  • Should the default nesting depth for pools be two or something else?

Future Work

  • If this PR is integrated, we can consider removing the internal CaptureStateUtil class.
  • Other arena types may also be pooled.

Testing

This PR passes the following tiers on multiple platforms:

  • tier1
  • tier2
  • tier3


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8385697: Add a Pooled Confined Arena (Enhancement - P3)(⚠️ The fixVersion in this issue is [28] but the fixVersion in .jcheck/conf is 27, a new backport will be created when this pr is integrated.)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/31365/head:pull/31365
$ git checkout pull/31365

Update a local copy of the PR:
$ git checkout pull/31365
$ git pull https://git.openjdk.org/jdk.git pull/31365/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 31365

View PR using the GUI difftool:
$ git pr show -t 31365

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/31365.diff

@bridgekeeper
Copy link
Copy Markdown

bridgekeeper Bot commented Jun 3, 2026

👋 Welcome back pminborg! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link
Copy Markdown

openjdk Bot commented Jun 3, 2026

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk openjdk Bot added the core-libs core-libs-dev@openjdk.org label Jun 3, 2026
@openjdk
Copy link
Copy Markdown

openjdk Bot commented Jun 3, 2026

@minborg The following label will be automatically applied to this pull request:

  • core-libs

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

try {
confinedArenaAllocator.close();
} catch (Exception e) {
e.printStackTrace();
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should remove this

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be better to assign confinedArenaAllocator to a local first, then null out the confinedArenaAllocator then null-check the local and closing the local.

((ConfinedSession) backingArena.scope()).closeFromThreadCleanup();
}

final class CachedArena implements Arena, NoInitAllocator {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My general feeling here is that the implementation is arranged the wrong way. E.g. in my mind, we have ArenaImpl, which is the type of the builtin arena we return. And, if an ArenaImpl is confined, it can allocate memory more cheaply, with the help of some kind of thread-backed allocator.

I feel the right arrangement is to have a SegmentAllocator (not an Arena) that returns usable regions of memory from a given thread. Maybe the allocator is very low level, it computes the next pointer, and does a MemorySegment.ofAddress(ptr) for the region. Then the ArenaImpl::allocate takes that, and does a reinterpret with the correct arena and size. When the confined arena closes, the memory is returned to the underlying pool.

Since this is the builtin confined arena we're talking about, I'm not sure about CachedArena -- as that looks like any other 3rd party Arena. I think we can achieve tighter integration?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More specifically, I think that in both the confined and shared/auto case, what you want is a setup like this:

  1. when an arena is created, we try to acquire some memory from some pool
  2. if we fail, then the arena behaves like before
  3. otherwise, the first N allocations of the arena will be served by the pool
  4. after that, we either try to acquire another pool, or fallback to default impl

The only difference between confined and shared is where the pool lives, and how it is acquired.

Copy link
Copy Markdown
Member

@JornVernee JornVernee Jun 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My feeling is the same here. Looking at the patch it feels like the control is inverted. I'd expect either ArenaImpl to be changed directly to do what Maurizio lists here, or to have a ConfinedArenaImpl sub class that does these things (but I think from internal discussion we wanted to do something that works for shared/auto as well)

In my mind we don't want a different arena impl, we want the existing arena impls do to something different, if that makes sense.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More specifically, I think that in both the confined and shared/auto case, what you want is a setup like this:

  1. when an arena is created, we try to acquire some memory from some pool
  2. if we fail, then the arena behaves like before
  3. otherwise, the first N allocations of the arena will be served by the pool
  4. after that, we either try to acquire another pool, or fallback to default impl

The only difference between confined and shared is where the pool lives, and how it is acquired.

There are some issues with the proposal:

  1. If there are no allocations or if there are allocations that are larger than the pool size, we cannot defer allocation from the pool. I think we can adhere to the principle outlined above, but do lazy allocation instead.
  2. We then need to be able to handle N slots per arena. That is doable but more complex. I think that just increasing the pool size would achieve the same thing (or would be even better), but with lower complexity. I.e., setting the pool size to 128 bytes rather than having two separate 64-byte pools.

The work with other arenas is future work, but if we can have a better structure that allows them to be more easily added in the future, that is good.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My general feeling here is that the implementation is arranged the wrong way. E.g. in my mind, we have ArenaImpl, which is the type of the builtin arena we return. And, if an ArenaImpl is confined, it can allocate memory more cheaply, with the help of some kind of thread-backed allocator.

I feel the right arrangement is to have a SegmentAllocator (not an Arena) that returns usable regions of memory from a given thread. Maybe the allocator is very low level, it computes the next pointer, and does a MemorySegment.ofAddress(ptr) for the region. Then the ArenaImpl::allocate takes that, and does a reinterpret with the correct arena and size. When the confined arena closes, the memory is returned to the underlying pool.

Since this is the builtin confined arena we're talking about, I'm not sure about CachedArena -- as that looks like any other 3rd party Arena. I think we can achieve tighter integration?

There are some good suggestions here worth exploring.

In one of the prototypes, I extended ArenaImpl, but eventually I opted for a more decoupled approach. But I can take a new look at centralizing logic in ArenaImpl instead.

@mcimadamore
Copy link
Copy Markdown
Contributor

One possible concern here is with clients that expect Arena::allocate to result in a call to malloc. Some of these clients might expect to be able to override the system allocator -- e.g with jemalloc, to maybe take advantage of additional features such as use after free protection.

We have seen evidence of that here:
#28235

Comment on lines +85 to +99
// Only call this method during thread cleanup!
void closeFromThreadCleanup() {
if (ownerThread().isVirtual()) {
// Allow thread cleanup to close a confined arena on behalf of the
// owning virtual thread after it has terminated.
if (state < OPEN) {
throw ALREADY_CLOSED.newRuntimeException();
}
} else {
checkValidState();
}
state = CLOSED;
resourceList.cleanup();
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this bypass needed? Is clearReferences called from another thread?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. A virtual thread is cleaned up by another thread. This is unfortunate.

@minborg
Copy link
Copy Markdown
Contributor Author

minborg commented Jun 3, 2026

One possible concern here is with clients that expect Arena::allocate to result in a call to malloc. Some of these clients might expect to be able to override the system allocator -- e.g with jemalloc, to maybe take advantage of additional features such as use after free protection.

We have seen evidence of that here: #28235

Yes. There will be observable behavior changes in how segments are handled. But in no way, we guarantee that there will be a malloc/free invocations when using arenas.

Comment on lines +396 to +401
/*
* Lazily initialized allocator used to speed up small allocations made by
* confined arenas owned by this thread.
*/
@Stable
private AutoCloseable confinedArenaAllocator;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm on the fence about this—adding a new field into Thread (which also means for all Virtual Threads that get allocated) for a very specific use-case—what other options exist, and do VirtualThreads need their own caches or can they rely on the caches of their carriers?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Experiments with carrier-local caches reveal that it is more memory efficient, but very complex and slow. A virtual thread can be remounted on another carrier, for example, which creates problems.

For platform threads, I do not believe an extra field is a problem. Creating such a thread is resource-intensive anyhow, and an extra field is well into the noise.

For virtual threads, I am a bit uncertain how such a thread is "parked". If one creates a million virtual threads, then we typically would need an extra 4 MiB if no pools are ever created (just field overhead). There are already a large number of fields in Thread/LocalThread.

Using a ThreadLocal is much slower.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The challenge is that it is a tragedy of the commons-kind of situation, if every possible use cases reasons the same then it is 4MB * use-cases for adding fields to Thread. Given that this use-case is rather niche in terms of "how many threads will over the course of their lifetime make use of this field".

I may be that my calibration is off here, so I'm inviting @AlanBateman to the chat to do a sanity-check on me :)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whether this field increases the size of Thread objects also depends on the alignment and size of all existing Thread fields (both normal Java and VM‑injected)

Comment on lines +396 to +401
/*
* Lazily initialized allocator used to speed up small allocations made by
* confined arenas owned by this thread.
*/
@Stable
private AutoCloseable confinedArenaAllocator;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whether this field increases the size of Thread objects also depends on the alignment and size of all existing Thread fields (both normal Java and VM‑injected)


package jdk.internal.foreign;

public interface NoInitAllocator {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This interface should probably extend SegmentAllocator:

Suggested change
public interface NoInitAllocator {
import java.lang.foreign.SegmentAllocator;
public interface NoInitAllocator extends SegmentAllocator {

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core-libs core-libs-dev@openjdk.org

Development

Successfully merging this pull request may close these issues.

5 participants