This document is the performance-engineering log for the framework: what has been applied, what has been tried and rejected, and what is on the short list to investigate. Its primary purpose is to stop contributors — human or AI — from re-proposing changes that have already been merged or already considered and discarded.
The canonical log of behavioral changes is docs/CHANGELOG.md;
this file is the performance-engineering log. Both should be updated
when a perf change ships, but the depth of detail belongs here.
Maintainership: append at the end of the relevant subsection, newest last. Cite the merging PR number. Keep entries to a paragraph; if more is needed, link to the PR.
The April–May 2026 optimization campaign that produced most of the
content below is summarized in the
3.49.5-eisop1 release notes: compilation of a
4000+-file project with complex qualifiers went from ~30 minutes to
under 7 minutes; allNullnessTests from ~3:00 to ~2:30; checkNullness
from ~5:15 to under 4:00.
These compound across every visitor pattern and dataflow map operation, so small per-call wins paid back substantially.
- PR #1638 — Several smaller performance optimizations. The opening salvo, touching `AnnotatedTypeFactory`, `HashcodeAtmVisitor`, `EqualityAtmComparer`, `DefaultAnnotatedTypeFormatter`, `AnnotatedTypeReplacer`, `AsSuperVisitor`, and `AnnotationMirrorSet`. Includes the `AnnotationMirrorSet`-side groundwork for later PRs.
- PR #1641 — Further optimizations. `AnnotatedTypeMirror`, `HashcodeAtmVisitor`, `AnnotatedTypeScanner`, more `AnnotationMirrorSet`.
- PR #1644 — Test reference equality before structural equality. Added `if (this == other) return true` short-circuits to ~25 `equals` methods across dataflow, framework, and javacutil (a sweep across 92 files). Co-authored with the Copilot SWE agent.
- PR #1663 — `AnnotatedTypeScanner#reduce` ordering and `HashcodeAtmVisitor` reduce improvement. Reordered the reduce combiner so the cheap branch runs first.
- PR #1667 — Further optimize `ATM.hashCode` by simpler handling of primitive types. Primitive-type ATMs are interned by `TypeKind` and have no qualifiers worth visiting; short-circuit directly to a fixed hash.
- PR #1672 — Avoid `Integer` boxing and lambda for ATM hashCode. Rewrote `HashcodeAtmVisitor` from `SimpleAnnotatedTypeScanner<Integer, Void>` with a polynomial-hash reduce lambda to `SimpleAnnotatedTypeScanner<Void, Void>` with a mutable `int hash` accumulator. Removes per-node `Integer.valueOf` allocation and reduce-dispatch overhead.
- PR #1675 — Small optimizations. Touched `AnnotatedTypeMirror`, `DefaultQualifierForUseTypeAnnotator`, `AnnotatedTypeScanner`, `QualifierDefaults`, `AnnotationUtils`, `ElementUtils`, `TypeKindUtils`, and two `typeinference8` files.
- PR #1763 — Constant `getKind()` overrides in ATM subclasses. Added `@Override getKind()` returning the fixed `TypeKind` to every fixed-kind subclass: `AnnotatedDeclaredType`, `AnnotatedArrayType`, `AnnotatedExecutableType`, `AnnotatedTypeVariable`, `AnnotatedNullType`, `AnnotatedWildcardType`, `AnnotatedIntersectionType`, `AnnotatedUnionType`. Eliminates the heap hop through `underlyingType.getKind()`; on declared types that call includes `Symbol#apiComplete`. Only `AnnotatedPrimitiveType` and `AnnotatedNoType` fall through to the base, where the underlying `getKind()` is cheap.
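
The PR #1672 rewrite can be illustrated with a minimal sketch. `Node` and `HashAccumulator` are hypothetical stand-ins, not the framework's `AnnotatedTypeMirror` or `HashcodeAtmVisitor`, and the multiplier 31 is an assumption. The only point is that a mutable `int` field replaces a boxed `Integer` result that would otherwise be combined per node.

```java
import java.util.List;

/** Hypothetical stand-in for a type tree; not the framework's AnnotatedTypeMirror. */
final class Node {
    final int label;
    final List<Node> children;

    Node(int label, List<Node> children) {
        this.label = label;
        this.children = children;
    }
}

/** Hypothetical visitor that folds a tree into one int without boxing. */
final class HashAccumulator {
    private int hash; // mutated in place: no Integer per node, no reduce lambda

    private void visit(Node n) {
        hash = 31 * hash + n.label;
        // Index-based loop, so no iterator is allocated either.
        for (int i = 0; i < n.children.size(); i++) {
            visit(n.children.get(i));
        }
    }

    static int hashOf(Node root) {
        HashAccumulator acc = new HashAccumulator();
        acc.visit(root);
        return acc.hash;
    }

    public static void main(String[] args) {
        Node tree = new Node(1, List.of(new Node(2, List.of()), new Node(3, List.of())));
        System.out.println(hashOf(tree)); // prints 1026
    }
}
```

Structurally equal trees visit the same labels in the same order, so they produce the same value, which is all a `hashCode` contract requires.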
- PR #1798 — `freeze()` mechanism + `AnnotatedTypeCopier` vararg-aliasing fix + frozen cache masters. The foundation of the value-semantics program (full narrative under "AnnotatedTypeMirror value-semantics program" in Short list). Adds a `frozen` bit to `AnnotatedTypeMirror`, a `checkMutable()` guard on the three primary-annotation sinks (with `AnnotationMirrorSet.makeUnmodifiable()` as a backstop), and a cycle-safe deep `freeze()` that freezes only already-initialized components (lazy getters freeze what they create later). Freezes the master stored at all eight `AnnotatedTypeFactory` caches, so a latent in-place mutation of a cached type now fails fast with `BugInCF` instead of silently corrupting a shared value. Freezing flushed out, and the PR fixes, a real `AnnotatedTypeCopier.visitExecutable` bug: it aliased the original's vararg type into the copy instead of copying it, so `deepCopy()` of an executable type was not fully independent. Caches still `deepCopy()` on every hit, so this is behavior-neutral and measured perf-neutral (deterministic allocation ±0.1% incl. a vararg-heavy workload; `freeze()` below the on-CPU sampling threshold). Shipped for the enforced invariant and the bug fix, not a perf number.
- PR #1798 (cont.) — `classAndMethodTreeCache` boundary flip (kept); `elementTypeCache` flip (REVERTED). The cross-cutting enabler was making `StructuralEqualityComparer.arePrimaryAnnosEqual` non-mutating (the Value Checker's override used to normalize its operands in place, which both prevents comparing a shared immutable type and is a side-effecting equality). Both post-pipeline caches were then flipped to return the shared frozen master instead of `deepCopy()`ing on every hit, with the minority of mutating callers copy-on-frozen at the mutation site. Measured win was modest: deterministic `ThreadAllocationStatistics` (median of 3) −0.75% (Big300) / −0.97% (Big600) on generic-call code, within noise elsewhere — ~1%, not the −5.3% an earlier estimate suggested (never reproduced against this baseline; the copier was already cheap, see the post-mortem below).

  The `elementTypeCache` flip was then reverted: a full Guava nullness build (`test-guava.sh`, not covered by `alltests`) crashed with `BugInCF` "Attempted to mutate a frozen AnnotatedTypeMirror with underlying type java.lang.Object". Root cause: a consumer lifts a sub-component of the shared frozen master — an unbounded wildcard's implicit `Object` upper bound, derived from a JDK generic's cached type-parameter bound (`Function<?, K>` with `K extends Comparable`) — into a fresh, non-frozen result type; `addComputedTypeAnnotations` then mutates the frozen child. The nine copy-on-frozen guards all copy at the root (`if (type.isFrozen()) deepCopy()`), so a non-frozen root holding a frozen child slips through — a hazard the root-level guard cannot catch and that escaped both `alltests` and the nine fixes. This is structural to returning a shared frozen value: any path that reparents a child of the shared master is a latent crash, unenumerable short of running every downstream project. For a ~1% win that is not worth it; reverted (commit message references the Guava crash). Regression test: `checker/tests/nullness/ElementTypeCacheWildcardBound.java` (minimized from `com.google.common.collect.SortedLists.binarySearch`).

  The `classAndMethodTreeCache` flip is kept — it is much lower-traffic (class/method declaration trees) and survived the full Guava build and `alltests` — but it carries the same residual embedded-frozen-component risk in principle; re-run downstream builds (Guava et al.), not just `alltests`, before extending the shared-frozen-return pattern to any further cache. The lesson: a frozen-master tripwire that still returns `deepCopy()` is safe (the master is never handed out); returning the shared frozen value is what creates the reparenting hazard.

  Can the flip be salvaged? Three options, with a cost ladder (the answer is "not without copy-on-write"):

  1. Fix each reparenting site (the nine copy-on-frozen guards, plus a tenth for this wildcard/type-var bound). Cheap per site, but the obvious construction sites (`BoundsInitializer`, the wildcard visitor) already build fresh, so the frozen child enters through a subtler path; the guarantee needed ("nothing ever reparents a child of a shared frozen master") is a convention, not enforced — Guava found what `alltests` + nine fixes missed, and the next codebase could find an eleventh. Not shippable for ~1%.
  2. Deep guard at the choke point — `deepCopy()` if any node is frozen, not just the root. Complete for the choke-point mutator, but the scan cost scales with the type and frozen children appear whenever a type embeds a cached generic bound (common), so it copies about as often as today and likely erases the ~1% — net-neutral-to-negative; measurement-gated, unpromising.
  3. Copy-on-write ATMs — a frozen node's mutators return a fresh shallow node instead of throwing, so sharing is safe regardless of who reparents what; the whole bug class disappears. This is the only complete fix, and it would make all eight caches flippable, not just `elementType` — so its payoff is the combined copy elimination, not ~1%. It is a separate, measured architectural project (see "the recommended next direction" in the Short list and the copy-on-write notes below), not a patch to this PR.

  Verdict: keep the flip reverted; pursue the allocation win, if at all, via copy-on-write as its own effort.
- Post-mortem: why the immutability allocation win came in at ~1%, not the projected large payoff. The program was motivated by an earlier profile attributing `AnnotatedTypeCopier.visit` ~2% on-CPU self-time and the dominant share of `Object[]` TLAB allocation (~22%). A fresh full-`checknullness` trace taken after this PR (11.8k `ExecutionSample`s, 156 s span) shows that figure is stale: `AnnotatedTypeCopier` is now ~0.76% self-time and ~1-1.5% of allocation. The intervening work — the PR #1777 `methodAsMemberOf`/`directSupertypes`/`elementType` caches, the thread-local copier `originalToCopy` map, and lazy `AnnotatedTypeScanner.visitedNodes` — had already harvested most of the copier allocation the immutability program was meant to remove. So by the time the boundary flips landed there was little copier cost left to delete, and the flip removes only the per-hit copy for the read-only majority of consumers (≈1%). Lesson: re-trace the current baseline before committing to an architectural plan built on an older profile — an allocation hotspot named in this log may already have been shrunk by later commits. The current `checknullness` CPU profile is flat (hottest leaf `IdentityHashMap.get` at 2.98%, spread across ~10 callers); the largest remaining addressable allocation slices are `AnnotatedTypeScanner.markVisited`'s per-scan `IdentityHashMap` (~5% of allocation) and `AnnotationMirrorSet` construction+iteration (~6-10%), each a careful per-item job with low-single-digit wall-clock upside, not a large lever.
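
The reparenting hazard behind the `elementTypeCache` revert can be reproduced in miniature. This is a sketch under stated assumptions: `Ty`, `annotateAll`, and the `IllegalStateException` are hypothetical stand-ins for `AnnotatedTypeMirror`, the mutating consumer, and `BugInCF`. Only the shape of the root-level guard is taken from the entry above.

```java
/** Hypothetical miniature of a freezable type node; not the framework's AnnotatedTypeMirror. */
final class Ty {
    private boolean frozen;
    private String qualifier;
    private final Ty bound; // child component, or null

    Ty(String qualifier, Ty bound) {
        this.qualifier = qualifier;
        this.bound = bound;
    }

    /** Deep freeze: the node and every component reachable from it. */
    void freeze() {
        frozen = true;
        if (bound != null) {
            bound.freeze();
        }
    }

    boolean isFrozen() { return frozen; }
    Ty bound() { return bound; }
    String qualifier() { return qualifier; }

    void setQualifier(String q) {
        if (frozen) {
            throw new IllegalStateException("Attempted to mutate a frozen type");
        }
        qualifier = q;
    }

    /** The copy is fully independent and starts out unfrozen. */
    Ty deepCopy() {
        return new Ty(qualifier, bound == null ? null : bound.deepCopy());
    }
}

final class ReparentingHazard {
    /** Root-level copy-on-frozen guard, then a mutation that walks into the children. */
    static Ty annotateAll(Ty type, String q) {
        Ty target = type.isFrozen() ? type.deepCopy() : type;
        for (Ty cur = target; cur != null; cur = cur.bound()) {
            cur.setQualifier(q);
        }
        return target;
    }

    /** Returns true if a non-frozen root holding a frozen child slips past the guard. */
    static boolean guardMissesFrozenChild() {
        Ty master = new Ty("@A", new Ty("@A", null));
        master.freeze();
        Ty fresh = new Ty("@A", master.bound()); // reparents a child of the shared master
        try {
            annotateAll(fresh, "@B");
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(guardMissesFrozenChild()); // prints true
    }
}
```

Passing the frozen master itself is handled correctly, because the guard copies at the root; only the reparented child escapes it.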
- PR #1649 — Reimplement `AnnotationMirrorSet` using an `ArrayList`. Sets are small in practice; `TreeSet`'s `compareTo` (which decodes `Name` to `String` per comparison) was strictly more expensive than linear `areSame` on the observed sizes. Removed `NavigableSet` from the public interface — see CHANGELOG note. The patch initially shipped with a regression in `addAll` semantics; the fix preserves the non-standard fast-path return-true-if-any-new contract.
- PR #1776 — Index-based iteration of `AnnotationMirrorSet` on hot paths. JFR `allNullnessTests -PmaxParallelForks=1` attributed 80% of all `ArrayList$Itr` TLAB allocations (6,523 of 8,143 events, 4.24% of total TLAB traffic) to `AnnotationMirrorSet.iterator()`, with the iterator-allocating callers concentrated in `AnnotatedTypeMirror.addAnnotations` (33%), `AnnotationMirrorSet.addAll` (14%), `ElementQualifierHierarchy`/`NoElementQualifierHierarchy.findAnnotationInSameHierarchy` (17.5% combined), and `AnnotatedTypeFactory.getDeclAnnotation` (10%). Added a public `AnnotationMirrorSet.get(int)` accessor (the backing store is already an `ArrayList`, so iteration order is stable) and routed those sites through index-based loops, following the overload-resolution pattern of PR #1775: `addAnnotations`/`addMissingAnnotations`/`replaceAnnotations` gained `AnnotationMirrorSet`-typed overloads that the ~69 call sites passing `getAnnotationsField()`/`getAnnotations()` bind to automatically; `addAll` got an `instanceof AnnotationMirrorSet` fast path; the two qualifier-hierarchy methods got an `instanceof` fast path; and `getDeclAnnotation`'s two loops (over an already-`AnnotationMirrorSet`-typed local) became index loops. Re-measured on the same workload: `ArrayList$Itr` dropped to 3,172 events (1.81%), `AnnotationMirrorSet.iterator()` calls dropped 6,523 → 1,530 (−77%), and `AnnotationMirrorSet$ReadOnlyIter` (751 events) left the profile entirely. The `Object[]`/`IdentityHashMap` allocation path and CPU self-time were unchanged. `CFAbstractValue.validateSet` was deliberately left alone: it runs only under `assert`, so its iterator allocation never occurs in production (`-da`) runs.
- PR #1790 — index-based `AnnotationMirrorSet` iterator (June 2026). PR #1776 converted the heaviest direct callers to `get(int)` loops, but the remaining `for`-each / `forEach` callers (`AnnotationUtils.areSame`/`getSame`, dependent-types and inference lambdas, ...) still went through `iterator()`, which was the single largest surviving `ArrayList$Itr` source (57% of it on the all-systems corpus). `iterator()` allocated the backing `ArrayList`'s own iterator and, for an unmodifiable set, wrapped it in a `ReadOnlyIter` — two allocations per traversal. `ReadOnlyIter` now walks the backing list by index (`get(i)`/`size()`) instead of wrapping an iterator, so the unmodifiable case (the common one) no longer allocates the `ArrayList$Itr` at all — the wrapper was already being allocated, so this is pure waste removed, and it fixes every caller (including lambdas) in one place with no caller churn. A mutable set still returns the backing iterator, on purpose: that preserves `remove()` and concurrent-modification detection, which a bare index walk cannot. Measured: `ArrayList$Itr` 5.39% → 3.68% of TLAB events (the eliminated 58 samples are the unmodifiable iterations; the residual 38%-of-`iterator()` is mutable sets, kept for safety); deterministic all-systems allocation 5951 → 5884 MB (−1.1%), no normal-code regression; passes `alltests`. The other remaining iterator allocations are not cheaply/safely index-convertible: javac's cons-`List` (`get(i)` is O(i)), map iterators (`LinkedHashMap`), and `CollectionsPlume.mapList` over an unknown `List` impl.
- PR #1669 — Improve equality and comparisons of annotation names. Introduced `AnnotationUtils.annotationNameAsName`, which returns the underlying `Name` without ever allocating a `String`. Hot callers that only need identity comparison or hashing now go through it. `Name` instances from the same `Elements` are guaranteed comparable by `==` within one javac invocation.
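
The iterator split described for PR #1790 can be sketched as follows. `SmallSet` is a hypothetical miniature, not `AnnotationMirrorSet`: its membership test uses `equals` rather than `areSame`, and the anonymous iterator stands in for `ReadOnlyIter`.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical list-backed set; not the framework's AnnotationMirrorSet. */
final class SmallSet<E> implements Iterable<E> {
    private final ArrayList<E> items = new ArrayList<>();
    private boolean unmodifiable;

    boolean add(E e) {
        if (unmodifiable) {
            throw new UnsupportedOperationException("unmodifiable set");
        }
        return !items.contains(e) && items.add(e);
    }

    void makeUnmodifiable() {
        unmodifiable = true;
    }

    @Override
    public Iterator<E> iterator() {
        if (!unmodifiable) {
            // Mutable case: keep the backing iterator for remove() and fail-fast checks.
            return items.iterator();
        }
        // Unmodifiable case: one allocation, walking the list by index.
        return new Iterator<E>() {
            private int index;

            @Override
            public boolean hasNext() {
                return index < items.size();
            }

            @Override
            public E next() {
                if (index >= items.size()) {
                    throw new NoSuchElementException();
                }
                return items.get(index++);
            }
            // remove() is inherited from Iterator and throws UnsupportedOperationException.
        };
    }
}
```

The index walk is only safe because nothing can modify the list once it is unmodifiable, which is why the mutable branch is left on the backing iterator.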
- PR #1775 — `scan`/`scanAndReduce` `List` overloads and `AnnotatedTypeCopier` index-based iteration. Added `protected R scan(List<? extends ATM>, P)` and `protected R scanAndReduce(List<? extends ATM>, P, R)` overloads to `AnnotatedTypeScanner`. Java overload resolution prefers these over the `Iterable` versions, so all existing call sites in `visitDeclared`, `visitExecutable`, `visitIntersection`, and `visitUnion` automatically use index-based `list.get(i)` instead of an enhanced-for loop. Also changed `AnnotatedTypeCopier.visitDeclared` to iterate the raw `typeArgs` field (package-private, same package) by index instead of calling `getTypeArguments()` (which wraps in an unmodifiable list) and iterating with for-each. Combined JFR impact on `allNullnessTests -PmaxParallelForks=1`: `Collections$UnmodifiableCollection$1` TLAB events dropped from 3,113 (1.8%) to zero; `ArrayList$Itr` TLAB events dropped from 11,471 (6.7%) to 5,332 (3.2%); total TLAB event count dropped 3.1% (171,829 → 166,464). Also added an `isEmpty()` short-circuit to `AnnotatedTypeMirror.getAnnotations()` that returns the shared `emptySet()` sentinel when `primaryAnnotations` is empty, avoiding a fresh `AnnotationMirrorSet` allocation for unannotated types.
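
The overload-resolution behavior that PR #1775 relies on can be checked in isolation. In this sketch the class and method bodies are hypothetical; only the `Iterable` versus `List` overload pair mirrors the entry above.

```java
import java.util.List;

final class OverloadSketch {
    /** General overload: enhanced-for, which allocates an Iterator. */
    static String scan(Iterable<? extends CharSequence> items) {
        int total = 0;
        for (CharSequence s : items) {
            total += s.length();
        }
        return "iterable:" + total;
    }

    /** More specific overload: chosen at compile time when the static type is a List. */
    static String scan(List<? extends CharSequence> items) {
        int total = 0;
        for (int i = 0; i < items.size(); i++) {
            total += items.get(i).length();
        }
        return "list:" + total;
    }

    public static void main(String[] args) {
        List<String> names = List.of("ab", "cde");
        Iterable<String> view = names;   // same object, less specific static type
        System.out.println(scan(names)); // prints list:5
        System.out.println(scan(view));  // prints iterable:5
    }
}
```

Selection depends on the static type of the argument, not its runtime class, so a call site only benefits if its expression is already typed as `List`.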
- PR #1646 — Only reset the visitedNodes if they are not empty. Cheap guard before `reset()` — the common case is an already-empty map after the previous walk completed.
- PR #1671 — Increase the `AnnotatedTypeScanner#visitedNodes` map size. Pre-sizes the `IdentityHashMap` to 64 to eliminate the early-resize storms previously visible in allocation profiles.
- Re-measured June 2026 — `reset()` uses `new IdentityHashMap<>(VISITED_NODES_INITIAL_CAPACITY)` rather than `clear()` (the constant was then named `VISITED_NODES_EXPECTED_MAX_SIZE` and equal to 64). Leaf-frame self-time on `allNullnessTests -PmaxParallelForks=1`: `IdentityHashMap.clear` = 3.42% (668/19479 samples); `IdentityHashMap.init` net after background subtraction ≈ 1.27% (456 total − 180 background = 276 samples, /20809). `clear()` wins on object allocation (1.09% vs 1.48% of TLAB events) but loses on CPU: `IdentityHashMap.clear()` walks all 256 slots of the backing array explicitly in Java; TLAB allocation uses JVM bulk zeroing. The pre-sizing in PR #1671 is what makes `clear()` more expensive — pre-sizing enlarged the array that must be explicitly zeroed.
- PR #1785 — reduced the pre-size from 64 to 8 (June 2026); renamed the constant to `VISITED_NODES_INITIAL_CAPACITY`. Resolves the open candidate that used to sit in the short list below. The realistic worker (`checknullness`, all subprojects) showed `Object[]` at ~61% of TLAB allocations, ~91% of them these `IdentityHashMap` backing arrays from `AnnotatedTypeScanner.reset`/`<init>` and `AnnotatedTypeCopier.visit`; PR #1671's pre-size of 64 backs a 256-slot `Object[]` per scan while most scans visit only 1–3 nodes. The constructor argument is `IdentityHashMap`'s `expectedMaxSize`, not a table size: it allocates a power-of-two backing array large enough to hold that many entries without resizing — so 4/8/16/32 back 16/32/64/128-slot `Object[]`s that first resize at 6/11/22/43 entries, and 16 is byte-for-byte the no-arg default. A 4/8/16/32 JFR sweep (one full-build capture each) measured (on-CPU samples / wall span / GC collections / `Object[]` near CF / `reset`-site `Object[]`):
  - 4: 13,246 / 174 s / 667 / 21,490 / 2,803 — worst: resize-storm rehash on deeper types.
  - 8: 12,218 / 159 s / 458 / 36,767 / 8,379 — best GC and CPU; ~26% less map allocation than the default.
  - 16: 12,448 / 162 s / 504 / 43,091 / 11,370 — the JDK default size.
  - 32: 12,366 / 163 s / 554 / 86,226 / 31,019 — double the default allocation, no CPU gain.

  8 is the chosen value: resizing at 11 instead of 6 clears the 6–10-node tail that made 4 resize, so it matches the default on CPU/GC while still allocating less. Among 8/16/32 the CPU/wall numbers sit inside the ~3% run-to-run noise (a separate warm-daemon wall-clock A/B of 16 vs the shipped 64 — `shadowJar` rebuilt per side, first rep per block discarded, two interleaved passes — was a wash, both median 136 s over a 130–138 s spread), so the win is GC/footprint, not wall clock, exactly as the CPU-bound (~96% on-CPU, ~4% GC) profile predicts. An audit of the other ~60 `IdentityHashMap`s found just one more transient small map worth pre-sizing — `ElementAnnotationUtil.annotateViaTypeAnnoPosition`'s `wildcardToAnnos` (≤2 entries, pre-sized to 4 with its own rationale, unrelated to the visitor maps); the long-lived caches and per-analysis dataflow stores hold many entries and must keep the default, since pre-sizing them small would reintroduce resize storms.
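
The mapping from `expectedMaxSize` to backing-array slots quoted for PR #1785 can be reproduced with a few lines of arithmetic. This is a sketch, not JDK code: `tableSlots` is a hypothetical helper that mirrors the sizing rule as it is understood to work in recent OpenJDK releases (round three times the expected size down to a power of two, then double it because keys and values share one array), and it ignores the maximum-capacity clamp.

```java
/** Hypothetical helper reproducing IdentityHashMap's sizing arithmetic; an assumption, not JDK API. */
final class IdentityMapSizing {
    /** Slots in the backing Object[] for a given expectedMaxSize (keys and values interleaved). */
    static int tableSlots(int expectedMaxSize) {
        // Assumed rule: a minimum capacity of 4 for tiny sizes, otherwise the
        // highest power of two not exceeding 3 * expectedMaxSize.
        int capacity = expectedMaxSize <= 2
                ? 4
                : Integer.highestOneBit(expectedMaxSize + (expectedMaxSize << 1));
        return 2 * capacity; // one slot for the key, one for the value
    }

    public static void main(String[] args) {
        for (int size : new int[] {4, 8, 16, 32, 64}) {
            System.out.println(size + " -> " + tableSlots(size) + " slots");
        }
    }
}
```

Under that rule the five sizes map to 16, 32, 64, 128, and 256 slots, matching the figures in the entry.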
- PR #1763 — Pre-sized `ArrayList` copies in `AnnotatedTypeCopier`. Replaced `CollectionsPlume.mapList` lambda calls with direct pre-sized `new ArrayList<>(size)` loops. Removes lambda-dispatch overhead and allocates the destination list at the correct capacity immediately, avoiding internal growth copies.
- PR #1776 — Reuse the `QualifierDefaults` defaulting scanner instead of constructing one per application. `QualifierDefaults.applyDefaultsElement` created a fresh `DefaultApplierElement`, whose constructor created a fresh `DefaultApplierElementImpl` — an `AnnotatedTypeScanner` whose `visitedNodes` `IdentityHashMap` is pre-sized to 64 (a 256-slot `Object[]`) — for every type defaulted. On a realistic single compilation (`:checker:checkNullness`, isolated forked-javac worker, 3,337 samples) this was the largest single allocation source after the copier: `Object[]` was 61% of all TLAB events, and 17% of those `Object[]`s came from `AnnotatedTypeScanner.<init>`, of which 76% (3,360 events ≈ 8% of all TLAB events) were `DefaultApplierElementImpl` construction. The `AnnotatedTypeScanner` Javadoc explicitly says not to construct a scanner per use but to store and reuse one.

  Fix: `DefaultApplierElementImpl.outer` became non-final, and a single scanner is parked in a `QualifierDefaults.pooledApplierImpl` field and borrowed/returned around each `applyDefault` (`borrowApplierImpl`/`returnApplierImpl`). `AnnotatedTypeScanner.visit` already resets all scan state, so reuse is transparent. Safety: defaulting is not re-entrant into `applyDefault` (the scan only reads caches — `elementToBoundType`, `getPath` — and adds annotations; verified `getBoundType` and the per-location branches do not call back into `getAnnotatedType`/defaulting), and the pool is a size-1 slot that is `null` exactly while borrowed, so any hypothetical re-entrant borrow allocates a fresh scanner rather than corrupting the parked one — correctness never depends on non-re-entrancy. Confined to the javac main thread like the other caches.

  Re-measured on the same worker: `AnnotatedTypeScanner.<init>` `Object[]` allocations dropped 4,415 → 1,090 (−75%); `DefaultApplierElementImpl` (both the object and its map) left the allocation profile entirely; total TLAB events −5.7% (42,738 → 40,314) at an unchanged sample count; `DefaultApplierElementImpl.scan` self-time unchanged (no CPU regression). The 1,090 residual scanner constructions are now dominated (77%) by `ElementAnnotationApplier$TypeVarAnnotator`, addressed next.
- PR #1776 — Reuse two more per-use scanners: `TypeVarAnnotator` and the `isValidStructurally` structural scanner. Same anti-pattern as the `QualifierDefaults` entry above, found by re-running the `:checker:checkNullness` allocation analysis after it. (1) `ElementAnnotationApplier.apply` constructed `new TypeVarAnnotator()` (a stateless `AnnotatedTypeScanner`) per call — 839 `Object[]` events, ~2% of TLAB, the largest remaining scanner construction. Pooled in a `static AtomicReference<TypeVarAnnotator>`: `getAndSet(null)` borrows, `set` returns. An `AtomicReference` (not a plain field, unlike the `QualifierDefaults` case) because `apply` is `static` and shared across factories/threads in the Gradle daemon and language server, and because `TypeVarAnnotator.visitTypeVariable` calls back into `applyInternal` (possible re-entrancy); a concurrent or re-entrant borrow sees `null` and allocates its own, so correctness never depends on single-threaded or non-re-entrant use. (2) `BaseTypeValidator.isValidStructurally` built a `SimpleAnnotatedTypeScanner` per call (234 events). The validator is per-checker and main-thread-confined and the structural scan is not re-entrant (it is called once per top-level type from `isValid`, and its action only reads annotations), so the scanner is now a lazily-initialized field (lazy, not a field initializer, to avoid `this` escaping during construction; the captured `isTopLevelValidType` still dispatches to subclass overrides).

  Combined effect, measured across the four `checkNullness` worker traces: total `AnnotatedTypeScanner.<init>` `Object[]` allocations 4,415 → 1,090 (#1) → 290 (TypeVar) → 48 (`isValidStructurally`), i.e. −99% overall. The 48 residuals are `TypesIntoElements$TCConvert` (30, ~0.08% of TLAB) and `typeinference8` `InvocationType` (18) — both negligible; per-use scanner construction is no longer a meaningful allocation source. Caveat (measured): none of these moved single-compile wall-clock — `checkNullness` is not GC-bound at `-Xmx512m`, so the with/without delta was inside ±10% run-to-run noise (see the timing note). The value is GC/memory-pressure reduction (tight heaps, concurrent collectors, long-lived daemon/LSP JVMs), not single-compile latency.

  General pattern for any future scanner found constructed per use: a main-thread-confined scanner can reuse a plain size-1 pool field (like `QualifierDefaults`); a `static`/shared one needs an `AtomicReference.getAndSet` pool (like `TypeVarAnnotator`) to stay correct under daemon/LSP concurrency and re-entrancy — the `null`-while-borrowed state doubles as the re-entrancy guard.
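
The size-1 `AtomicReference` pool described above can be sketched generically. `Worker`, `WorkerPool`, and `withWorker` are hypothetical names; only the `getAndSet(null)` borrow and `set` return are taken from the entry, and the construction counter exists purely so the reuse can be observed.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.ToIntFunction;

/** Hypothetical reusable worker standing in for a stateless scanner. */
final class Worker {
    static int constructed; // sketch-only counter, not thread-safe

    Worker() {
        constructed++;
    }

    int run(int input) {
        return input + 1;
    }
}

final class WorkerPool {
    /** Size-1 pool: holds the parked worker, or null while it is borrowed. */
    private static final AtomicReference<Worker> POOLED = new AtomicReference<>();

    static int withWorker(ToIntFunction<Worker> action) {
        Worker w = POOLED.getAndSet(null); // borrow atomically
        if (w == null) {
            // Empty slot, or a concurrent or re-entrant borrow: fall back to a fresh instance.
            w = new Worker();
        }
        try {
            return action.applyAsInt(w);
        } finally {
            POOLED.set(w); // return; a surplus instance simply replaces the parked one
        }
    }

    public static void main(String[] args) {
        withWorker(w -> w.run(1));
        withWorker(w -> w.run(2));
        int nested = withWorker(outer -> outer.run(withWorker(inner -> inner.run(1))));
        System.out.println(nested);             // prints 3
        System.out.println(Worker.constructed); // prints 2
    }
}
```

The nested call shows the guard at work: the inner borrow finds the slot empty and allocates its own worker instead of sharing the one already in use.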
- PR #1794 — lazy `visitedNodes` (June 2026); encapsulation, measured perf-neutral. Made `AnnotatedTypeScanner.visitedNodes` `private` and lazily allocated: the field starts `null` and the `IdentityHashMap` is created on the first stored node (`markVisited`) instead of in a field initializer, and `reset()` nulls the field instead of conditionally reallocating. The seven classes that touched the field directly (`BaseTypeValidator`, `TypesIntoElements`, `PropagationTypeAnnotator`, `QualifierDefaults`, `DependentTypesHelper`, `DoubleAnnotatedTypeScanner`, and the base class) now go through three `protected final` accessors `hasVisited`/`getVisited`/`markVisited`, centralizing the lazy-null invariant.

  A/B (PR #1794 vs. its merge-base): deterministic `ThreadAllocationStatistics` on a 300-method file, 6 interleaved rounds, was −0.18% total allocation (median 880.2 → 878.6 MB) — inside the ~0.5% run-to-run band; wall clock flat. A full-build `--no-daemon checknullness` JFR capture per side confirmed the mechanism — the eager `Object[]` allocation at `AnnotatedTypeScanner.reset` (the #1 CF `Object[]` site, ~29% of CF `Object[]` events on the base side) and at `<init>` disappears, reappearing only at `markVisited` (the lazy first-put) — but on-CPU samples, GC collections, and post-GC retained heap were all within single-run noise.

  Why it is a wash: PR #1646 already deferred the empty-case realloc, and the per-use scanner pooling above already removed ~99% of per-construction map allocations, so the base path was already mostly lazy; the lazy field only additionally skips allocation for the narrow "reset a non-empty map, then visit no recursive type" case. Distinct from the rejected `clear()` idea (see Tried and rejected): there is no explicit Java-loop zeroing — the win, where it exists, is skipping the allocation entirely, not making the reuse cheaper. Shipped for the encapsulation (one enforced place for the null invariant; `private` field; storage strategy decoupled from subclasses), not for a perf number. One source-compat note: `visitedNodes` going `protected` → `private` is incompatible for any third-party `AnnotatedTypeScanner` subclass that referenced the field directly.
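
The lazy-null invariant behind the three accessors can be sketched as below. The accessor names come from the entry above, but their signatures, the generic parameters, the initial capacity of 8, and the `isMapAllocated` probe are assumptions made for illustration; this is not the framework's `AnnotatedTypeScanner`.

```java
import java.util.IdentityHashMap;

/** Hypothetical scanner base; accessor signatures are assumptions. */
class LazyVisitedScanner<N, R> {
    private IdentityHashMap<N, R> visited; // null until the first node is stored

    protected final boolean hasVisited(N node) {
        return visited != null && visited.containsKey(node);
    }

    protected final R getVisited(N node) {
        return visited == null ? null : visited.get(node);
    }

    protected final void markVisited(N node, R result) {
        if (visited == null) {
            visited = new IdentityHashMap<>(8); // lazy first put
        }
        visited.put(node, result);
    }

    /** Dropping the reference replaces both clear() and eager reallocation. */
    public void reset() {
        visited = null;
    }

    /** Sketch-only probe so the laziness can be observed. */
    final boolean isMapAllocated() {
        return visited != null;
    }
}

/** Hypothetical subclass that never touches the map directly. */
final class LazyVisitedDemo extends LazyVisitedScanner<Object, String> {
    String visit(Object node) {
        if (hasVisited(node)) {
            return getVisited(node);
        }
        String result = "visited#" + System.identityHashCode(node);
        markVisited(node, result);
        return result;
    }
}
```

Because subclasses only see the accessors, the storage strategy can change again later without another source-incompatible edit.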
- PR #1815 — re-instantiate `IdentityHashMap`s instead of `clear()` (June 2026). Applies the same principle as PR #1794 (fresh TLAB allocation is cheaper than an explicit `Arrays.fill` over the existing backing array) to four additional sites:
  1. `AnnotatedTypeCopier.visit`: removes the PR #1791 thread-local map pool entirely. The pool's `finally { map.clear() }` was profiled at ~2.6% of `checknullness` self-time (see Tried and rejected: `AnnotatedTypeCopier.visit` pooled-map clear ratchet). The pool also required a re-entrancy fallback that allocated a new map anyway. Now `visit` always allocates a fresh `new IdentityHashMap<>(VISITED_NODES_INITIAL_CAPACITY)` and discards it on return — no pool, no clear, no re-entrancy guard.
  2. `AbstractQualifierPolymorphism.AnnotationMirrorMap.reset()`: `visitedTypes.clear()` → re-instantiate `Collections.newSetFromMap(new IdentityHashMap<>())`.
  3. `EquivalentAtmComboScanner.Visited.clear()`: `visits.clear()` → re-instantiate `new IdentityHashMap<>()`.
  4. `AtmLubVisitor.visit()`: `visited.clear()` → re-instantiate.

  Quick A/B (cold-JVM wall clock, 3 reps/side, `gen-sized-program.py --shape generic`): master 300-method median ~5.9 s vs. branch ~6.2 s; 600-method ~8.3 s vs. ~8.2 s — within cold-JVM noise (±0.7 s), no measurable wall-clock difference. The win, if any, is GC pressure / allocation throughput rather than wall clock on heap-generous single-file runs; the `checknullness` JFR self-time attribution for `IdentityHashMap.clear` (2.6%) suggests the benefit would be clearest on a multi-CU warm-daemon workload.
- PR #1827 — re-instantiate the per-CU / per-CFG
IdentityHashMaps instead ofclear()(June 2026). Extends the PR #1815 principle to the long-lived, per-compilation-unit and per-CFG caches that PR #1791 had left onclear(): (1)AnnotatedTypeFactory.setRoot— the five tree caches (classAndMethodTreeCache,fromExpressionTreeCache,fromMemberTreeCache,fromTypeTreeCache,elementToTreeCache) plus thescannedEnclosingTreesidentity set, now reassigned to fresh maps rather than cleared. (2)GenericAnnotatedTypeFactory—scannedClasses,regularExitStores,exceptionalExitStores,returnStatementStores(insetRootandperformFlowAnalysis). (3) DataflowinitFields—AbstractAnalysis(inputs,nodeValues,finalLocalValues),ForwardAnalysisImpl(thenStores,elseStores,blockCount,storesAtReturnStatements),BackwardAnalysisImpl(outStores,exceptionStores),Worklist.depthFirstOrder(re-assigned inprocess). (4)DefaultQualifierForUseTypeAnnotator.clearCacheandTreePathCacher.clear. All affected fields were de-finaled with a doc note naming the sole reassigning method. Why re-instantiate and notclear()for these: the deciding factor is not per-CFG vs per-CU but whether the map's high-water mark can be inflated by a single large input. All of the above are uncapped — one giant method body (for the per-CFG maps) or one large compilation unit (for the per-CU caches) can blow them up.clear()retains that peak backing array for the rest of the build and re-zeroes it (Arrays.fill-style, O(capacity)) on every later, possibly tiny, reset — a potential super-linear cost (peak-capacity x number-of-later-resets) and a permanent memory high-water mark. Re-instantiation lets the big array be collected and starts each reset small; the per-reset allocation is negligible (and below the noise floor on the full-build A/B). The map content is new each reset anyway (keys are fresh per-CFG/per-CU identity objects), soclear()would preserve only capacity, never useful entries.clear()would only be the better choice for a genuinely size-capped reused map — none of these qualify. 
Special case: GenericAnnotatedTypeFactory.subcheckerSharedCFG stays on the null-guarded clear() (clearSharedCFG). It is not exempt for size reasons (it, too, grows with the CU); rather, unconditional re-instantiation broke two invariants: it must stay null for any factory that never shares CFGs (an ultimate parent with no subcheckers would otherwise allocate a forever-empty map every CU), and the lazy init in addSharedCFGForTree pre-sizes it to getCacheSize() (the field doc promises that initial capacity), which a default-capacity re-instantiation would defeat. A null-guarded new IdentityHashMap<>(getCacheSize()) would also satisfy both and additionally shed the peak array, but since this map is reset only per-CU the retain-vs-shed difference is minor; the null-guarded clear() is kept for simplicity.

Correctness fix bundled in (the real motivation): AbstractAnalysis.setNodeValues went from nodeValues.clear(); nodeValues.putAll(in) to nodeValues = new IdentityHashMap<>(in). The old in-place clear() had an aliasing bug: AnalysisResult wraps the analysis's nodeValues in an UnmodifiableIdentityHashMap (read-through, no copy), and AnalysisResult.getStoreBefore/getStoreAfter pass that wrapper straight back into runAnalysisFor → setNodeValues(in). The nodeValues == in guard does not fire (the argument is the wrapper, not the raw map), so nodeValues.clear() emptied the very map the wrapper reads, and putAll(in) then copied back nothing — wiping all node values. Copy-before-mutate (new IdentityHashMap<>(in)) reads the still-intact wrapper into a fresh map and leaves the original untouched. Observable in dataflow/tests/constant-propagation/Expected.txt: node abstract values (a … > 0, b = a … > 0, 4 … > 4, b … > T) that the old code suppressed during visualization now appear; confirmed by running the test on both sides.

A/B (full ./gradlew --no-daemon checknullness, JFR, one run/side): on-CPU ExecutionSamples 10530 (master) vs. 10655 (branch) — CPU-neutral, within run noise. IdentityHashMap.clear left the leaderboard entirely (0.50% / 53 samples → 0). The trade is more allocation churn: total TLAB events 81566 → 92713 (+13.7%, but IdentityHashMap per-class count flat at 2773 → 2798, so mostly sampling noise), GC 210 → 248 collections / 3258 → 4013 ms. Post-GC live heap measured higher on the branch (max 475 → 755 MB, median 243 → 315 MB), but on follow-up analysis this is almost certainly GC-timing noise across two separately-launched JVMs, not a real footprint regression: the genuinely retained per-CU dataflow memory is GenericAnnotatedTypeFactory.flowResult (the accumulator that combine() copies every CFG's nodeValues entries into — flowResult is reset per-CU in setRoot and is untouched by this patch). The theoretical divergence — after getResult() the AnalysisResult wrapper retains the per-CFG nodeValues (MAP_A) while a later setNodeValues reassigns the field to a copy (MAP_B) — is bounded to one CFG (the analysis's getResultCache and nodeValues are both reset in initFields before the next body) and to one map, i.e. kilobytes, not the hundreds of MB measured. A controlled warm-daemon heap-mode A/B/A/B would confirm, but no code change is warranted by it.

Net: like PR #1815, wall/CPU-neutral and allocation-neutral (per-class IdentityHashMap flat); ship for the setNodeValues correctness fix and consistency with the established re-instantiate pattern, not for a throughput number.

Follow-ons applied in the same PR:

(a) The stale setNodeValues comment (which justified the nodeValues == in guard by the now-removed clear()-empties-in bug) was rewritten — correctness now rests on copy-before-mutate, and the == / syncedFrom guards are pure optimizations.

(b) Workaround removed: AnalysisResult.runAnalysisFor(node, preOrPost) dropped its copyNodeValuesIfNeeded() defensive copy (and the comment/TODO explaining it). That copy existed only because the old in-place setNodeValues could mutate the map an AnalysisResult wraps; with copy-before-mutate the wrapped map is read but never mutated, so the copy is dead. On the hot store-query path it was already a no-op (flowResult's nodeValues is private after the first combine). With that gone, the lazy copy-on-write also simplifies: the only remaining mutator is combine(), which copies all three wrapped maps together, so the separate nodeValuesCopied/otherMapsCopied flags and the copyNodeValuesIfNeeded() helper (which existed only to copy nodeValues independently for the spot-query path) collapse into a single mapsCopied flag and a single copyMapsIfNeeded(). The copy mechanism itself stays — the maps are read-only UnmodifiableIdentityHashMap views that combine must replace before mutating. Verified by :framework:test, :checker:NullnessTest/ResourceLeakTest/MustCallTest, and all dataflow tests.

(c) Regression test: the constant-propagation dataflow test happened to cover this bug, but no Nullness test did — the bug only surfaces with verbose CFG visualization (-Acfgviz=...StringCFGVisualizer,verbose, which is the only caller of the unprotected getStoreAfter(Block)/getStoreBefore(Block)) and only on blocks rendered after the first node-bearing one (the wipe happens once, after the first block's contents are already emitted). checker/tests/nullness-extra/cfgviz-nodevalues/ is a nullness-extra make test (a method with an if/else, so >1 node-bearing block) whose Expected.out carries the abstract values on the later-block nodes; it fails on the old clear()-based setNodeValues (values wiped) and passes on the fix.
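The aliasing bug behind the setNodeValues fix can be reproduced in isolation. The following is a minimal sketch, not the framework's code: NodeValuesHolder is a hypothetical class, and Collections.unmodifiableMap stands in for UnmodifiableIdentityHashMap as the read-through wrapper.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Map;

/** Hypothetical stand-in for an analysis that owns a per-node value map. */
class NodeValuesHolder {
    private Map<Object, String> nodeValues = new IdentityHashMap<>();

    void put(Object node, String value) {
        nodeValues.put(node, value);
    }

    /** Read-through view of the live map; no copy is made (assumed wrapper behavior). */
    Map<Object, String> view() {
        return Collections.unmodifiableMap(nodeValues);
    }

    /** Old shape: mutates the backing map before reading `in`, which may be a view of it. */
    void setNodeValuesInPlace(Map<Object, String> in) {
        if (nodeValues == in) {
            return; // does not fire when `in` is the wrapper rather than the raw map
        }
        nodeValues.clear();    // also empties what the wrapper reads
        nodeValues.putAll(in); // copies nothing back
    }

    /** New shape: read `in` into a fresh map, then swap the field. */
    void setNodeValuesCopy(Map<Object, String> in) {
        nodeValues = new IdentityHashMap<>(in);
    }

    int size() {
        return nodeValues.size();
    }

    public static void main(String[] args) {
        NodeValuesHolder inPlace = new NodeValuesHolder();
        inPlace.put(new Object(), "a > 0");
        inPlace.setNodeValuesInPlace(inPlace.view());
        System.out.println("in-place variant keeps " + inPlace.size() + " value(s)");

        NodeValuesHolder copying = new NodeValuesHolder();
        copying.put(new Object(), "a > 0");
        copying.setNodeValuesCopy(copying.view());
        System.out.println("copying variant keeps " + copying.size() + " value(s)");
    }
}
```

The in-place variant ends up empty because the wrapper and the field share one backing map; the copying variant never mutates the map the wrapper reads.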
- PR #1645 — Cache the methods in an element. Adds a per-element method cache in AnnotationFileParser to avoid repeated getEnclosedElements filtering.
- PR #1648 — Optimize determining boxed primitive types. Adds a TypeKindUtils fast path that avoids a Types.boxedClass call followed by getQualifiedName when the kind alone is sufficient.
- PR #1673 — Cache qualified, interned names for all elements. Adds an IdentityHashMap-backed cache in ElementUtils keyed on the QualifiedNameable. Routes the six hottest Name.toString() call sites (AnnotationUtils, AnnotationBuilder, TypesUtils, stub parser, etc.) through it. Also removes the now-redundant AnnotationUtils#annotationNameInterned — annotationName itself now returns an interned name. See CHANGELOG note.
- PR #1763 — ElementUtils.parentPackage fast path. When the PackageElement is a javac Symbol.PackageSymbol, reads the enclosing package directly from the owner field instead of calling Elements#getPackageElement(String). Falls back to the original string-based lookup for non-javac implementations.
- PR #1796 — Interned-Name identity comparison for fixed name literals. New InternalUtils helpers (isInitName, isThisName, isSuperName, isValueName, isJavaLangObjectName, isJavaLangEnumName) compare a javac Name against its own table's pre-interned name (n == n.table.names.init etc. — uses the name's own table, so no cross-context identity assumption) with a contentEquals fallback for non-javac Names. Converted the utf2chars-profiled sites — TreeUtils.isConstructor/isEnumSuperCall (the latter also reordered to check <init> before the class name), the this/super identifier checks in TreeUtils, TypeFromExpressionVisitor.visitIdentifier/visitMemberSelect, ParamApplier.isReceiver, ElementUtils.isObject — and then the remaining ~25 fixed-literal contentEquals sites across dataflow (CFGTranslationPhaseOne, JavaExpression, SuperNode, ExplicitThisNode), framework, and the checkers (initialization, interning, nullness, lock, units). Full-build checknullness JFR: every converted site left the Convert.utf2chars/utf2string attribution (was ~30 of 275 utf2* samples ≈ 0.23% of all samples), so the end-to-end effect is real but sub-0.5% — not resolvable in wall clock. Microbenchmark on a byte-backed name table: 12x faster and ~66 B/op allocation removed vs contentEquals; neutral on StringNameTable. Only names with pre-interned Names fields available on JDK 11+ are used (init, _this, _super, value, java_lang_Object, java_lang_Enum); names.yield (JDK 13+) was deliberately not used.

  General sameName(Name, CharSequence) with a table-validated static cache (also PR #1796). For arbitrary (non-pre-interned) target strings, the naive per-call form n == n.table.fromString(literal) is a dud — measured (8.2M mixed hit/miss ops): ~12% faster but 28% more allocation than contentEquals on a byte-backed table (it re-encodes the literal per call), and 1.8x slower on StringNameTable. A naive static Map<String, Name> is unsound in multi-compilation JVMs (the test suite, a language server): a cached Name from a previous compilation's table compares ==-false against content-equal names from the new table. The applied design closes both holes: a single volatile holder pinning (Name.Table, ConcurrentHashMap<String, Name>) that sameName discards whenever it sees a name from a different table — stale answers are impossible, the worst case is a cache rebuild on table switch. Measured: 5.8 ns/op, zero allocation vs contentEquals's 36.8 ns/op + ~66 B/op on byte-backed tables (6.4x); on StringNameTable 5.7 vs 4.7 ns/op (~neutral). Converted the dynamic-but-bounded-target sites: AnnotationUtils.getElementValue's element-name loops (the #1 utf2* consumer for the Resource Leak Checker — 66 of 201 utf2* samples on checkResourceLeak, invisible on checknullness; hot-site profiles are checker-specific), AnnotationBuilder.findElement, ElementUtils/TreeUtils method/field-name lookups, JavaExpressionParseUtil identifier resolution, the stub parser's findElement family, and SetOfTypes.anyOfTheseNames (via ElementUtils.getQualifiedName's interned cache). Cardinality caveat: each distinct probe string is interned into the compiler's name table and cached for the compilation, so sameName is only for bounded target sets (annotation element names, configured method names, source identifiers) — not arbitrary unbounded input.

  Key environmental facts (verified June 2026): (1) which Name.Table javac uses decides whether Name.toString()/contentEquals decode UTF-8 per call: byte-backed SharedNameTable is the default before JDK 23, StringNameTable (decode-free, cached toString) since JDK 23. (2) Gradle passes -XDuseUnsharedTable to every forked javac (verified in a --debug compile log), forcing the byte-backed table on all JDK versions — so under Gradle (this project's own build, most users' builds) the decode cost is alive on JDK 25/26 too, while plain-javac/Maven runs on JDK 23+ don't have it. Measure name-decode changes with -XDuseUnsharedTable, or the A/B silently tests the wrong table. (3) Do NOT compare a Name char-by-char (charAt loop): base Name.length()/charAt() call toString() per invocation, so that is N+1 decodes instead of contentEquals's one (measured 2.8x slower interpreted, 545 MB extra allocation per 8M ops on JDK 21); the raw-byte APIs (getUtf8Length/getUtf8Bytes/map) are version-specific and re-encode on StringNameTable (measured 5x slower) — identity against an interned Name is the only variant that wins on every table.
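The table-validated cache behind sameName can be sketched without javac internals. This is a minimal illustration under stated assumptions: Table and Name below are hypothetical stand-ins for javac's Name.Table and Name, the Holder class is invented for the sketch, and the literal is a String rather than a CharSequence.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a name table: one Name object per distinct string. */
class Table {
    private final Map<String, Name> interned = new HashMap<>();

    synchronized Name fromString(String s) {
        return interned.computeIfAbsent(s, k -> new Name(this, k));
    }
}

/** Hypothetical stand-in for an interned name; identity is meaningful only within one Table. */
class Name {
    final Table table;
    final String text;

    Name(Table table, String text) {
        this.table = table;
        this.text = text;
    }
}

class SameName {
    /** Pins one table together with the literals already interned into it. */
    private static final class Holder {
        final Table table;
        final ConcurrentHashMap<String, Name> names = new ConcurrentHashMap<>();

        Holder(Table table) {
            this.table = table;
        }
    }

    private static volatile Holder holder;

    /** Decides content equality by identity against a Name interned in n's own table. */
    static boolean sameName(Name n, String literal) {
        Holder h = holder;
        if (h == null || h.table != n.table) {
            // A name from a different table: discard the cache so a stale Name is never compared.
            h = new Holder(n.table);
            holder = h;
        }
        final Table table = h.table;
        Name target = h.names.computeIfAbsent(literal, table::fromString);
        return n == target;
    }
}
```

Because the local holder always matches the table of the name being tested, a table switch costs a cache rebuild but cannot produce a wrong answer.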
- PR #1776 — Share the annotated-JDK stub AST across compilations. Inclusive-time analysis of allNullnessTests -PmaxParallelForks=1 (the run is many small per-directory compilations in one worker JVM) showed AnnotationFileParser.parseStubFile at ~32% and the JavaParser parse itself (com.github.javaparser.*) at 14.4% of execution samples — the annotated JDK is re-read and re-parsed from scratch by every compilation, because stubTypes is a per-AnnotatedTypeFactory field. The JDK stub text is fixed for a given JVM and its JavaParser AST does not depend on the javac context (only the later process* resolution does), and JDK-stub processing is read-only on the AST (verified: the only AST mutation, concatenateAddedStringLiterals, is ajava-only). So AnnotationFileParser.parseStubUnit now memoizes the StubUnit for JDK_STUB files in a static ConcurrentHashMap keyed by jar-entry name; each compilation still re-runs process* against its own model. Re-measured on the same workload: parseStubUnitForJdk inclusive dropped 10.0% → 2.1%, com.github.javaparser.* 14.4% → 7.2%, the JavaParser allocation classes (Token 1,643, Position 1,612, JavaToken 1,023, Range 784 TLAB events) left the top-35 entirely, and total TLAB events fell 3.4% (175,677 → 169,675). A single user compilation parses each JDK class once either way, so the win is for multi-compilation JVMs: the test suite (a tracked metric), the Gradle daemon, and the language server. The cache is bounded by the number of distinct JDK stub classes (a few hundred) and is shared, so it is a fixed cost, not per-compilation garbage. Correctness re-verified with allNullnessTests, IndexTest, SignatureTest, NullnessTest, InterningTest, ValueTest, and the :checker:test, :framework:test, :javacutil:test, :dataflow:test suites.
- PR #1797 — IdentityHashMap<Name, TypeElement> for annotation name maps (June 2026). The annotation-name lookup maps in AnnotationFileParser, InsertAjavaAnnotations, and TypeAnnotationMover previously used HashMap<String, TypeElement>, requiring Name.toString() (a UTF-8 decode on byte-backed tables) at every map-build site. Changed to IdentityHashMap<Name, TypeElement>: keys are the Name objects returned by getSimpleName()/getQualifiedName() directly, and lookups use elements.getName(s) to intern a JavaParser String into the same table, guaranteeing identity equality within one compilation. Also removed a redundant elements.getName(annoElt.getSimpleName()) call in AnnotationFileParser.getImportedAnnotations — getSimpleName() already returns an interned Name from the same table, so the round-trip was a decode-and-re-intern no-op. Safety: all maps are built and consumed within a single compilation's Elements instance, so same-table identity holds; the getAnnotation fallback (elements.getTypeElement(fqn) + createNameToAnnotationMap) handles first-encounter FQN annotations and populates both simple-name and FQN entries for future hits.
- PR #1776 — Avoid the defensive deep copy in read-only fromElement consumers. AnnotatedTypeFactory.fromElement returns cached.deepCopy() on every cache hit so callers may mutate the result; this is the second-largest Object[] allocation source (AnnotatedTypeCopier.visit, the per-copy IdentityHashMap). Added getElementAnnotations(Element), which returns the cached type's primary annotations directly (getAnnotations() already returns an unmodifiable set and cached types are never mutated, so this is safe), and routed DefaultQualifierForUseTypeAnnotator.getExplicitAnnos — a read-only caller that only needs the element's primary annotations — through it. Honest impact note: on the profiled workloads the measured delta is within noise, because getExplicitAnnos runs ~95% of the time during stub parsing, where the element cache is cold and fromElement takes the compute (no-copy) path anyway. The change is correct and removes the copy on the warm-cache path (repeated default-for-use queries on already-cached elements, as in large multi-round projects); it is kept on that basis, not on a measured win here.
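The stub-AST sharing described for parseStubUnit above boils down to a JVM-wide memo keyed by jar-entry name. A minimal sketch, assuming a hypothetical ParsedStub type in place of JavaParser's StubUnit and a counter added only to make the effect observable:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class StubAstCache {
    /** Hypothetical parsed-AST type; it must be treated as read-only once shared. */
    static final class ParsedStub {
        final String entryName;

        ParsedStub(String entryName) {
            this.entryName = entryName;
        }
    }

    /** Shared across compilations in the same JVM; bounded by the number of distinct entries. */
    private static final ConcurrentHashMap<String, ParsedStub> CACHE = new ConcurrentHashMap<>();

    /** Counts real parses; present only for demonstration. */
    static final AtomicInteger parseCount = new AtomicInteger();

    /** Stands in for the expensive, context-independent parse step. */
    private static ParsedStub parse(String entryName) {
        parseCount.incrementAndGet();
        return new ParsedStub(entryName);
    }

    /** Parses each entry at most once per JVM; later compilations reuse the same AST. */
    static ParsedStub parseJdkStub(String entryName) {
        return CACHE.computeIfAbsent(entryName, StubAstCache::parse);
    }
}
```

Sharing is only sound because the per-compilation work that depends on the javac context runs after this step and does not mutate the cached AST.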
- PR #1665 — Increase default cache size from 300 to 2000. Profiling showed the LRU eviction rate dominated some workloads; the win from cache-hit rate exceeds the memory cost at modern heap sizes.
- The Collections.synchronizedMap wrapper around AnnotatedTypeFactory.annotationClassNames is gone. The wrapper was carried over from a 2020 refactor of a previously static AnnotationUtils cache; per-factory instance fields don't need it, matching every other LRU cache on the same object. AT factories are confined to the javac main thread.
- PR #1670 — Add extra maps to qualifier hierarchies. Added identity-keyed caches in ElementQualifierHierarchy and NoElementQualifierHierarchy to avoid repeated annotationName lookups in findAnnotationInSameHierarchy and adjacent hot paths. Made the elements field protected so subclass hierarchies can extend the same caching pattern.
- PR #1763 — Empty-collection early-out in ElementQualifierHierarchy. Added an annos.isEmpty() guard at the top of findAnnotationInSameHierarchy to return immediately without entering the qualifier-kind lookup loop.
- PR #1643 — Cache the hashCode for dataflow expressions. Added a cached hashCode field across ArrayAccess, ArrayCreation, BinaryOperation, FieldAccess, FormalParameter, LocalVariable, MethodCall, UnaryOperation, and ValueLiteral. Per-object cost varies: LocalVariable pays zero because of the existing alignment gap; FieldAccess and similar pay +8 bytes. Peak overhead measured at ~128 bytes for a large method, well worth the savings on store-comparison hot paths.
- PR #1797 — LocalVariableNode.hashCode/equals avoid getName() and Objects.hash() (June 2026). Both methods previously called getName(), which calls Name.toString() (a UTF-8 decode on byte-backed tables). Changed to read the Name directly from the tree (IdentifierTree.getName() / VariableTree.getName()) for both operations: equals uses InternalUtils.sameName; hashCode calls name.hashCode() directly — on SharedNameTable this returns the byte-table index, which is content-stable via interning, no decode needed. Also removes the Objects.hash(name) varargs call, which allocated an Object[] per invocation (the varargs antipattern flagged in Applied optimizations → Generic map/lookup patterns).
- PR #1765 — BinaryOperation.hashCode symmetry fix. For commutative operations, equals() ignores operand order; the hash code must match. Replaced the order-dependent Objects.hash(kind, left, right) with Objects.hash(kind, left.hashCode() + right.hashCode()) so that a OP b and b OP a hash identically. This is a correctness fix for the equals/hashCode contract that also improves cache hit rates for commutative expressions.
- PR #1812 — eliminate Objects.hash boxing and fix zero-hash caching. Three related changes applied across the remaining JavaExpression subclasses (ArrayAccess, BinaryOperation, FormalParameter, LocalVariable, MethodCall, UnaryOperation, ClassName) and several framework/javacutil classes (CFAbstractValue, DiagMessage, AnnotationMirrorSet):

  (1) Objects.hash removal. Each hashCode() that used Objects.hash(...) was rewritten to the equivalent h = 31 * h + field.hashCode() polynomial, eliminating the varargs Object[] allocation and autoboxing per call (see the "Gotcha" entry in Generic map/lookup patterns for the general rule).

  (2) Zero-hash sentinel fix. The lazy hashCodeCache == 0 guard used by PR #1643 treats 0 as "not yet computed". A hash that genuinely computes to 0 would be recomputed on every call, defeating the cache. Fixed throughout with hashCodeCache = h == 0 ? 1 : h, remapping the all-zero case to 1.

  (3) QualifierVar gains a cached hash code. QualifierVar.hashCode() was previously uncached despite calling Objects.hash(id, invocation, polyQualifier) (where invocation.hashCode() can itself be expensive). Added a cachedHashCode field with the same lazy + zero-sentinel pattern.
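The two hashCode rules above (order-independent hashing for commutative operations, and the zero-sentinel remap for lazily cached hashes) fit in one small class. This is a sketch only: BinaryOp is a hypothetical type, not the framework's BinaryOperation, and it uses a hand-written polynomial instead of Objects.hash.

```java
/** Hypothetical expression node with an order-insensitive equals for commutative kinds. */
class BinaryOp {
    final String kind;
    final Object left;
    final Object right;
    final boolean commutative;

    /** Lazily computed hash; 0 means "not yet computed". */
    private int hashCodeCache;

    BinaryOp(String kind, Object left, Object right, boolean commutative) {
        this.kind = kind;
        this.left = left;
        this.right = right;
        this.commutative = commutative;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof BinaryOp)) {
            return false;
        }
        BinaryOp other = (BinaryOp) o;
        if (!kind.equals(other.kind) || commutative != other.commutative) {
            return false;
        }
        if (left.equals(other.left) && right.equals(other.right)) {
            return true;
        }
        // Commutative operations also match with the operands swapped.
        return commutative && left.equals(other.right) && right.equals(other.left);
    }

    @Override
    public int hashCode() {
        int h = hashCodeCache;
        if (h == 0) {
            // Summing the operand hashes makes the result independent of operand order.
            h = 31 * kind.hashCode() + (left.hashCode() + right.hashCode());
            // Remap a genuine 0 to 1 so the "not yet computed" check stays valid.
            if (h == 0) {
                h = 1;
            }
            hashCodeCache = h;
        }
        return h;
    }
}
```

Without the remap, an object whose hash is really 0 would recompute it on every call; with it, the only cost is one extra collision between the values 0 and 1.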
- PR #1664 — Improve hashCode implementation for CFAbstractStore. Cleaner accumulation; avoids redundant work on empty sub-stores.
- PR #1686 — Small optimizations/clarifications in dataflow/analysis. Touched AbstractAnalysis, AnalysisResult, BackwardAnalysisImpl, ForwardAnalysisImpl.
- PR #1688 — Use identity for dataflow worklists. IdentityHashSet semantics for the worklist — block identity is what matters, not block equality.
- PR #1691 — Small BackwardAnalysisImpl and ForwardAnalysisImpl improvements. Authored by Copilot; reduces redundant work in the two analysis impls.
- PR #1696 — CFAbstractTransfer fixes and optimisations. Includes the fix for the IndexOutOfBoundsException for lambdas in varargs with Aliasing-Checker subcheckers, and switches CFAbstractTransfer to return RegularTransferResult for non-boolean returns instead of always wrapping in ConditionalTransferResult. Downstream effect: checkers that need a ConditionalTransferResult for non-boolean methods must update their transfer functions. See NonEmptyTransfer for the pattern.
- PR #1707 — Review of dataflow package. Touched ControlFlowGraph, ConstantPropagationStore, JavaExpression, MethodCall, PurityUtils, and the live-variable + reaching-def stores. Both a perf and clean-up pass.
- PR #1817 — CopyOnWriteMap for CFAbstractStore maps + store-merge optimizations (June 2026). Four independent changes to reduce store-copy overhead during forward analysis:

  (1) CopyOnWriteMap<K,V> (new class). A Map<K,V> wrapper that defers copying its delegate HashMap until the first mutation after a copy() call. Store copy operations — which occur at every block join — previously created five new HashMap<>(other.map) allocations; with CopyOnWriteMap they share the delegate reference and copy only on the first put/remove/clear. The equals method adds a delegate-identity fast path (this.delegate == other.delegate), so unmodified copies compare equal in O(1). hashCode is cached with the zero-sentinel pattern and invalidated on mutation. Caution: keySet(), entrySet(), and values() return the raw delegate's views; mutations through those views bypass ensureUnshared(). All in-package callers use put/remove/clear, but any future iterator-based removal must go through the map's own remove, not the view's Iterator.remove().

  (2) leastUpperBound/widenedUpperBound early-exit when equal. CFAbstractStore, InitializationStore, NullnessNoInitStore, and LockStore each gain a this.equals(other) guard at the top of their LUB methods, returning this.copy() immediately when the stores are identical. With CopyOnWriteMap, this check hits the delegate-identity path (O(1)) for stores that have not diverged since their last shared copy — the common case in a fixpoint that has nearly converged.

  (3) Smaller-map iteration in upperBound. When the two stores differ, the merge iterates the smaller map and looks up in the larger, reducing the number of get calls by half when one store has a subset of the other's keys (common when one branch adds more refinements). A per-value thisVal.equals(otherVal) short-circuit in upperBoundOfValues avoids calling leastUpperBound or widenUpperBound on already-equal values.

  (4) ForwardAnalysisImpl copy-once per block. Previously each uncached node in a regular block called callTransferFunction(n, store.copy()), allocating a new TransferInput per node. The copy's sole purpose was to protect blockTransferInput (the block's cached entry state, stored in inputs) from in-place mutation by the transfer function. Since store is replaced by new TransferInput<>(n, this, transferResult) after the first node regardless, only the initial copy (before the first uncached node) is necessary — subsequent nodes receive a fresh TransferInput wrapping the previous node's result, not blockTransferInput. One copy per block replaces one copy per node.

  Also fixes Objects.hash / varargs boxing in TransferInput, Constant, ArrayCreation, FieldAccess, ValueLiteral, AnnotatedTypeParameterBounds, Contract, Default, DependentTypesError, and Pair (consistent with PR #1812's sweep); adds equals/hashCode to NullnessNoInitValue; and reduces jtreg timeout budgets for Issue1438{,b,c}.java (90/120/50 s → 20 s each), which are the regression tests for the quadratic-store fixpoint issue these changes fix.

  Quick A/B (3-rep cold-JVM, gen-sized-program.py --shape generic, assembleForJavac rebuilt per side):

  | methods/file | master alloc | branch alloc | Δ alloc | master wall | branch wall | Δ wall |
  |---|---|---|---|---|---|---|
  | 300 | 850.70 MB | 835.60 MB | −1.8% | 6.13 s | 6.31 s | +0.18 s |
  | 600 | 1432.60 MB | 1399.70 MB | −2.3% | 8.32 s | 7.93 s | −0.39 s |

  Allocation is a consistent −2% across both sizes; wall clock differences are within the ±0.7 s cold-JVM noise floor. JFR self-time: CFAbstractTransfer.addFinalLocalValues (1.40% of samples on master) drops to absent on branch; total ExecutionSamples 571 → 493 on a single Big600 run (within single-run noise). CopyOnWriteMap does not appear as a CPU leaf — all hot paths go directly to the delegate HashMap. For the actual fixpoint-convergence speedup, measure on a real project with ./gradlew --no-daemon checknullness (warm-daemon reps); the generic-shape tiny-file corpus stresses per-file overhead, not the fixpoint loop.
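The copy-on-write idea in change (1) can be shown with a deliberately reduced wrapper. This is a sketch under assumptions, not the CopyOnWriteMap class from the PR: CowMap is a hypothetical name, it does not implement the full Map interface, and it omits the cached hashCode and the view methods that the caution above is about.

```java
import java.util.HashMap;

/** Reduced copy-on-write wrapper: copy() is O(1), the real copy happens on first mutation. */
class CowMap<K, V> {
    private HashMap<K, V> delegate;

    /** True while another CowMap may still reference the same delegate. */
    private boolean shared;

    CowMap() {
        this.delegate = new HashMap<>();
    }

    private CowMap(HashMap<K, V> delegate) {
        this.delegate = delegate;
        this.shared = true;
    }

    /** Shares the delegate; neither side copies until one of them mutates. */
    CowMap<K, V> copy() {
        shared = true;
        return new CowMap<>(delegate);
    }

    /** Copies the delegate if it may be shared. Conservative: may copy once more than needed. */
    private void ensureUnshared() {
        if (shared) {
            delegate = new HashMap<>(delegate);
            shared = false;
        }
    }

    V get(Object key) {
        return delegate.get(key);
    }

    int size() {
        return delegate.size();
    }

    V put(K key, V value) {
        ensureUnshared();
        return delegate.put(key, value);
    }

    V remove(Object key) {
        ensureUnshared();
        return delegate.remove(key);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof CowMap)) {
            return false;
        }
        CowMap<?, ?> other = (CowMap<?, ?>) o;
        // Fast path: copies that have not diverged share one delegate.
        return delegate == other.delegate || delegate.equals(other.delegate);
    }

    @Override
    public int hashCode() {
        return delegate.hashCode();
    }
}
```

The delegate-identity check in equals is what makes the early-exit in change (2) cheap for stores that have not been written to since their last copy.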
- PR #1692 — Avoid contains/get for maps that contain no null values. Replaces the if (m.containsKey(k)) m.get(k) antipattern with a single get plus null check at multiple call sites in AnnotatedTypeFactory and GenericAnnotatedTypeFactory. The precondition (no null values) is documented inline at each site.
- PR #1693 — Avoid duplicate checks. Cross-references between type factories had redundant precondition checks; consolidated.
- PR #1694 — Guard log calls by debug flag. String.format arguments were being evaluated for log messages that would be discarded. Wrapped in if (debug) guards where present.
- PR #1695 — Optimize TreePathCacher usage. Avoid recomputing TreePath when the cache already has the answer.
- PR #1763 — TreePathCacher control-flow exception optimization. The Result exception used for non-local exit is constructed with super(null, null, false, false) to suppress stack-trace generation; the exception is caught two frames up and never logged or rethrown.
- PR #1765 — entrySet() iteration over keySet() + get(). Applied the pattern across UBQualifier, LockAnnotatedTypeFactory, MustCallInference, and AnnotationConverter: iterate map.entrySet() instead of map.keySet() followed by map.get(key), eliminating a redundant second hash lookup per iteration.
- PR #1781 — IdentityHashMap for caches keyed by Element/Tree. javac Symbols and JCTrees use identity equals/hashCode (they do not override Object's), so a HashMap keyed by them was already an identity map — switching to IdentityHashMap does not change behavior, it just drops the per-entry Node allocation (open addressing) and replaces the virtual hashCode()/equals() dispatch with System.identityHashCode/==. Converted six long-lived, identity-keyed maps that had been left as HashMap:

  - AnnotatedTypeFactory.cacheDeclAnnos (Element → AnnotationMirrorSet; populated by the hot getDeclAnnotations)
  - GenericAnnotatedTypeFactory.subcheckerSharedCFG (Tree → ControlFlowGraph; pre-sized to getCacheSize())
  - GenericAnnotatedTypeFactory.scannedClasses (ClassTree → ScanState)
  - TreePathCacher.foundPaths (Tree → TreePath)
  - CFGTranslationPhaseOne.parenMapping (Tree → ParenthesizedTree; built per-method during CFG construction)
  - AbstractAnalysis.finalLocalValues (VariableElement → abstract value) — this one was the lone HashMap among siblings (nodeValues, inputs, treeLookup, postfixLookup) that were already IdentityHashMap.

  Safety rule for this conversion. IdentityHashMap.equals/hashCode compare values by reference too (they intentionally violate the Map.equals contract), so only convert a map that is never Map.equals-compared. The dataflow fixpoint compares CFAbstractStores, not these maps, and none of the six is passed to Map.equals — verify this before converting any further map. Audited but left as HashMap: method-local short-lived maps (InferenceResult, DependentTypesHelper.checkTypesForErrorKey, BaseTypeVisitor javaparser pairing) where there is no entry-allocation pressure to remove, the stub-parser AnnotationFileParser.atypes (cold for real workloads — stub parsing only dominates in test-harness amplification), and the 4-entry LombokSupport.defaultedElements.

  Measured impact (full-build A/B, June 2026). This is an allocation/dispatch reduction, not a measurable speedup. Wall clock on ./gradlew checknullness (all ~10 subprojects, warm daemon, processor shadowJar rebuilt each side, median of ≥3 warm reps) was ~2m34 s with vs. ~2m32 s without — within run-to-run noise; the build-level gain is below the wall-clock floor. The mechanism is real but small in the JFR profile (full-build --no-daemon traces): HashMap$Node dropped from 3.21% → 2.76% of TLAB allocation events (6,856 vs. 7,315 absolute — fewer even though that branch trace sampled ~10% more total allocation), and with the [Ljava.util.HashMap$Node; backing arrays the HashMap internals fell 4.66% → 4.02%. Leaf self-time in HashMap.getNode fell 3.38% → 2.27%, partly offset by IdentityHashMap.get rising 1.28% → 1.50% (cheaper per call: identityHashCode/== + flat-array probe vs. virtual hashCode/equals + Node chase). Retained memory was unchanged: post-GC live heap maxed at 512 MB on both sides with p90/median within noise, and GC count/summed-pause were flat (647/7.86 s vs. 660/7.93 s) — the flat Object[] of IdentityHashMap is roughly memory-neutral against HashMap's Node[]-plus-Nodes for these small, long-lived maps. The takeaway: file such identity-map conversions for their cumulative GC-pressure relief, not for a standalone wall-clock win.
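The safety rule for identity-map conversions is easy to check in isolation. The sketch below uses a hypothetical helper class and plain String keys (which, unlike javac Symbols, do override equals) to show how Map.equals changes meaning when both maps are IdentityHashMaps.

```java
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

class IdentityMapEquality {
    /**
     * Builds two single-entry maps whose keys are equal but distinct objects,
     * then compares the maps with Map.equals.
     */
    static boolean compare(boolean identity) {
        String k1 = new String("key");
        String k2 = new String("key"); // k1.equals(k2), but k1 != k2
        Map<String, String> a = identity ? new IdentityHashMap<>() : new HashMap<>();
        Map<String, String> b = identity ? new IdentityHashMap<>() : new HashMap<>();
        a.put(k1, "v");
        b.put(k2, "v");
        return a.equals(b);
    }

    public static void main(String[] args) {
        System.out.println("HashMap equality:         " + compare(false));
        System.out.println("IdentityHashMap equality: " + compare(true));
    }
}
```

For keys that never override equals the two map types agree on lookups, which is why the conversion is behavior-preserving for the six maps listed; the difference only matters once whole maps are compared or keys have value-based equality.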
- Gotcha — avoid Objects.hash(...) / Arrays.hashCode(...) on hot paths. Objects.hash(a, b, ...) is varargs, so each call allocates an Object[] and autoboxes every primitive argument to its wrapper — e.g. Objects.hash(int, int) is three allocations (the array + two Integers) per call. On a per-node / per-invocation path (cache-key constructors, hashCode() overrides) write the polynomial by hand instead: int h = 31 * a + b; (or h = 31 * h + next; for more fields). This mirrors the HashcodeAtmVisitor boxing/lambda removal (PR #1672) and was applied to the methodFromUse cache-key constructor (MethodAsMemberOfCacheKey), which builds a key on every cached method invocation. The two-arg Objects.equals is fine (no array/boxing); only the varargs hash/Arrays.hashCode family allocates. Precompute the result into a final int hash field when the key is immutable, as both new cache keys do. A follow-up sweep (PR #1812) applied the same rewrite to the remaining JavaExpression subclasses and framework classes still using Objects.hash in their cached hashCode() implementations, and also fixed the zero-hash sentinel bug (see Dataflow expressions above).
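A hand-written polynomial for an immutable cache key might look like the sketch below. CacheKey and its fields are hypothetical, not MethodAsMemberOfCacheKey; the seed of 1 is used here only so that the result can be checked against Objects.hash, and the simpler 31 * a + b form from the entry above works just as well as a hash.

```java
import java.util.Objects;

/** Hypothetical immutable cache key with a precomputed, allocation-free hash. */
class CacheKey {
    private final int first;
    private final int second;
    private final int hash;

    CacheKey(int first, int second) {
        this.first = first;
        this.second = second;
        // Same arithmetic Objects.hash(first, second) performs, minus the Object[] and boxing.
        int h = 1;
        h = 31 * h + first;
        h = 31 * h + second;
        this.hash = h;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof CacheKey)) {
            return false;
        }
        CacheKey other = (CacheKey) o;
        return first == other.first && second == other.second;
    }

    @Override
    public int hashCode() {
        return hash;
    }

    /** Reference implementation, kept only to compare results in this sketch. */
    static int viaObjectsHash(int first, int second) {
        return Objects.hash(first, second);
    }
}
```

Storing the hash in a final field moves the cost to construction time, which pays off when the key is looked up more than once or compared against many bucket entries.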
- PR #1786 — cache the per-body TreePath lookup in CFCFGBuilder (June 2026). CFGTranslationPhaseOne.process(CompilationUnitTree, UnderlyingAST) (line ~527) computed the body's path with an uncached trees.getPath(root, code) — a JDK Trees.PathFinder TreeScanner that allocates a new TreePath per node visited while searching from the compilation-unit root down to the body, once per method / lambda / initializer body. For the k-th body in a file it re-scans the preceding k−1 bodies, so cost is quadratic in bodies-per-compilation-unit. This was the single largest TreePath allocator: on a full checknullness --no-daemon trace, CFGTranslationPhaseOne.process was the nearest-CF frame for 70% of com.sun.source.util.TreePath allocation samples (2146 of 3057), vs. the per-tree path extension at CFGTranslationPhaseOne.scan (line ~562) at only 0.56% (17 samples) — see the rejected "lazy path stack" note below.

  Fix. CFCFGBuilder.build already holds checker and atypeFactory, so it owns the checker's shared TreePathCacher (the same instance the AnnotatedTypeFactory populates during visiting). Replace the uncached search with checker.getTreePathCacher().getPath(root, underlyingAST.getCode()) and feed the result into the existing process(TreePath, UnderlyingAST) overload. The cacher serves the enclosing class/method prefix from cache (warmed by visiting) and caches each node's path once, collapsing the per-body re-scan from O(bodies × file) toward O(file). No class move is needed — the framework-side caller already has the cache; the dataflow CFGBuilder.build standalone-tool path (used by CFGProcessor, which has no checker) is left on trees.getPath. This is a ~6-line change at one call site, no dataflow API change. It only does getPath(root, realBodyTree) lookups against the real AST keyed by tree identity, so none of the artificial-tree / bulk-population hazards of caching from the per-tree CFG traversal apply.

  Measured allocation (deterministic ThreadMXBean.getThreadAllocatedBytes via JFR ThreadAllocationStatistics, single forked javac, one class of N trivial generic-call methods). The win scales with methods-per-file, exactly as the quadratic predicts:

  | methods / file | master | cached | reduction |
  |---|---|---|---|
  | 100 | 524 MB | 506 MB | −3.5% |
  | 300 | 1,453 MB | 1,251 MB | −13.9% |
  | 600 | 3,525 MB | 2,737 MB | −22.4% |
  | 1500 | 15,192 MB | 10,193 MB | −32.9% (wall −6.5%, 46.3 → 43.3 s) |

  On the all-systems corpus (267 tiny files, 1–3 bodies each) the effect is ~0% (noise) — there is no per-file body reuse to exploit. On the realistic CF build the site is ~1.4% of total allocation (70% of TreePath, which is ~2% of TLAB events), so a normal mixed codebase sees a low-single-digit allocation reduction; the large numbers are worst-case protection for machine-generated or very large single-class files, where it removes a genuine quadratic. Wall clock tracks allocation (only measurable where allocation is large). The shared cache now retains the body-prefix paths it builds (bounded by the compilation unit — the same eager-scan caching the cacher already does on AnnotatedTypeFactory.getPath fallbacks). Validated with framework/dataflow/NullnessTest and alltests.
- PR #1788 — make TreePathCacher lazy and route AnnotatedTypeFactory.getPath through it (June 2026). Builds on PR #1786 (already in master). PR #1786 routes one TreePathCacher.getPath lookup per method body; but the eager getPath caches a TreePath for every node it DFS-traverses to reach the target, so on a large class each body lookup also caches the preceding bodies' internal nodes — O(file-nodes) of needless cached allocation, even though only body trees are ever queried. This change (a) makes the cacher lazy: scan only pushes/pops a currentStack and allocates nothing, and buildPathForStack materializes only the root-to-target path once the target is reached; and (b) routes AnnotatedTypeFactory.getPath's two non-heuristic call sites through the cacher so they share that lazy cache. Measured (deterministic harness): −4.0% allocation on all-systems and −51.6% on a 1500-method class; passes full alltests. Three findings each cost a wasted attempt:

  - The two halves must ship together. The lazy cacher without the getPath call-site rerouting is worse than the eager cacher (1500-method class: 12.3 GB vs. 10.2 GB), because the eager cacher's broad node-caching is exactly what warms the cache for the per-body lookups; going lazy removes that warming unless the getPath sites repopulate it. Hence both halves are in one change.
  - The getPath(TreePath, Tree) overload's locality is load-bearing on large classes. It scans currentPath's subtree first and expands outward (it relies on TreeScanner.scan(Iterable, P) visiting the path leaf-first), so a target near the visitor path is found without rescanning the whole unit. A "simplified" variant that just delegates to getPath(root, target) is byte-identical on normal code (all-systems 5979 vs. 5982 MB) but reintroduces a residual O(members) rescan at that site, and the gap to the full version widens with class size:

    | methods/class | full (locality scan) | simplified (delegate to root) | delegate overhead |
    |---|---|---|---|
    | 1500 | 4936 MB | 5407 MB | +9.5% |
    | 3000 | 11805 MB | 13686 MB | +15.9% |
    | 6000 | 32076 MB | 39444 MB | +23.0% (+7.4 GB) |

    So the extra overload earns its keep: invisible to users (contained in TreePathCacher) and decisive only for very large or machine-generated single-class files.
  - Do not narrow that overload to a subtree-only scan. Replacing super.scan(rootPath, target) with super.scan(rootPath.getLeaf(), target) (and seeding/put(null) tweaks) looks cleaner but is wrong: it returns null for out-of-subtree targets and caches that null as "absent from the unit", crashing type-argument inference on all-systems/TypeVarVarArgs.java. It passes NullnessTest, so it is not caught there; framework/util/TreePathCacherTest guards it directly (the secondOverloadFindsOut... case fails on the subtree-only variant). The original outward-expanding scan is correct and is now documented in the code.
- PR #1789 — linear (instead of quadratic) getPath searches (June 2026). Even after #1786 + #1788, a single class with very many methods allocated super-linearly (6000-method class: 32 GB, ~2.5–2.7×/doubling). A nearest-CF-frame allocation capture on a gen-sized-program.py size sweep traced 57% of allocation at 6000 methods to com.sun.tools.javac.util.List$2 iterators that TreeScanner allocates while TreePathCacher.scan traverses the tree — i.e. getPath searches rescanning the whole class (268M node visits at 1500 methods). Instrumenting getPath showed the targets were almost always local; they were just searched from too broad a start. Three causes, all fixed by starting each search from the tightest known path:

  - AnnotatedTypeFactory.getPath's final fallback searched from visitorTreePath climbed up a fixed two levels; for a method-body path that overshoots to the class, forcing a whole-class rescan. It now searches from the original (tightest) visitorTreePath (the second overload still expands outward for non-local targets). Alone this cut traversal 268M → 38.5M (−86%).
  - GenericAnnotatedTypeFactory.performFlowAnalysis pinned visitorTreePath to the enclosing class; flow-analysis-time inference lookups now run against the body being analyzed. (A no-op by itself — the climb above negated it — but needed together with the first fix.)
  - CFCFGBuilder's per-body getPath(root, code) scanned from the compilation-unit root (O(members) per body). analyze now primes that body's path in O(1) from the class path (class → method → body, an unambiguous extension; methods only — lambdas/initializers fall through), so the lookup is a cache hit.

  Result: per-method allocation went from rising (3.3 → 5.4 MB/method, quadratic) to flat (2.6 → 2.5 MB/method, linear); 6000-method class 32.1 GB → 14.8 GB (−54%), growing with size. No effect on normal code (all-systems unchanged) and no correctness risk: all three are search hints — getPath always returns the correct path (guarded by framework/util/TreePathCacherTest's JDK-equivalence check). Validated with alltests.
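The lazy path search from PR #1788 (keep only a stack during traversal, materialize a path once the target is found) can be illustrated on a toy tree. This is a sketch, not TreePathCacher: Node is a hypothetical tree type, the returned List stands in for a TreePath, and no caching of results is shown.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

class LazyPathFinder {
    /** Hypothetical tree node; the real code walks javac trees. */
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();

        Node(String label, Node... kids) {
            this.label = label;
            for (Node kid : kids) {
                children.add(kid);
            }
        }
    }

    /** Nodes on the way from the search start to the node currently being visited. */
    private final Deque<Node> currentStack = new ArrayDeque<>();

    /** Returns the start-to-target path, or null if target is not below start. */
    List<Node> getPath(Node start, Node target) {
        currentStack.clear();
        return scan(start, target);
    }

    private List<Node> scan(Node node, Node target) {
        currentStack.push(node);
        try {
            if (node == target) {
                return buildPathForStack();
            }
            for (Node child : node.children) {
                List<Node> found = scan(child, target);
                if (found != null) {
                    return found;
                }
            }
            return null;
        } finally {
            currentStack.pop();
        }
    }

    /** The only place a path object is created: once, when the target has been reached. */
    private List<Node> buildPathForStack() {
        List<Node> path = new ArrayList<>(currentStack.size());
        Iterator<Node> it = currentStack.descendingIterator(); // oldest entry (the start) first
        while (it.hasNext()) {
            path.add(it.next());
        }
        return path;
    }
}
```

Passing a tighter start node to getPath, as the PR #1789 fixes do, shrinks the traversal without changing the result for targets inside that subtree; the real cacher additionally expands outward when the target lies elsewhere, which this sketch does not attempt.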
- PR #1647 — Cache a frequent conversion in the Value Checker.
ValueAnnotatedTypeFactory.convertSpecialIntRangeToStandardIntRange is cached per AnnotationMirror; the unbounded-call profile flattened.
- PR #1703 — Small visitor performance tweaks. Touched
BaseTypeVisitor heavily (195 lines), NullnessNoInitVisitor, InitializationVisitor, and AnnotatedTypeFactory. Includes the hoisting of getReceiverType into a local, lint-option caching in NullnessNoInitVisitor, and removal of a duplicate null check in checkMethodInvocability.
The per-package "Review of" PRs are systematic audits, each typically a mix of perf, clarification, and small correctness fixes:
- PR #1705 — Initialization Checker (InitializationATF, InitializationFieldAccessTreeAnnotator, InitializationParentATF, InitializationTransfer).
- PR #1706 — Nullness Checker (CollectionToArrayHeuristics, NullnessNoInitATF, NullnessNoInitTransfer).
- PR #1708 — javacutil (Resolver, TreeUtils, TreeUtilsAfterJava11, TypeAnnotationUtils, UserError, trees/TreeBuilder, trees/TreeParser).
- PR #1711 — framework (DependentTypesHelper, ElementAnnotationUtil, TargetedElementAnnotationApplier, AbstractTypeInformationPresenter, others).
- PR #1716 — typeinference8 (UseOfVariable, Variable, VariableBounds, Java8InferenceContext, Resolution).
- PR #1718 — framework/stub (AnnotationFileElementTypes, AnnotationFileParser, AnnotationFileUtil, RemoveAnnotationsForInference, StubGenerator).
- PR #1719 — framework/type. Includes the IPair-sharing optimization across SubtypeVisitHistory and StructuralEqualityVisitHistory: a new package-private putKey/removeKey/containsKey API lets StructuralEqualityVisitHistory build one IPair per public call and pass it to both inner histories, halving the per-call IPair allocations on this hot path.
- PR #1720 — framework/util/element (ElementAnnotationUtil, IndexedElementAnnotationApplier, ParamApplier, TypeParamElementAnnotationApplier, TypeVarUseApplier).
- PR #1721 — common/basetype (BaseTypeValidator, BaseTypeVisitor). Includes the null-pointer guard in checkAccessAllowed for static fields with @Unused, the Long→long autoboxing fix in checkSlowTypechecking, the static EnumSet rewrite of validateWildCardTargetLocation, the anyQualHasTargetLocations short-circuit, and the empty-list early-out in maybeReportAnnoOnIrrelevant.
- PR #1723 — common/value (JavaExpressionOptimizer, ReflectiveEvaluator, ValueQualifierHierarchy, util/Range).
- PR #1724 — framework-test (TestUtilities, TypecheckExecutor, TypecheckResult, TestDiagnostic, TestDiagnosticUtils).
- PR #1725 — type annotators (LiteralTreeAnnotator, PropagationTreeAnnotator, DefaultForTypeAnnotator, DefaultQualifierForUseTypeAnnotator, ListTypeAnnotator).
- PR #1727 — framework/util/defaults (Default, QualifierDefaults).
- PR #1763 — Mixed performance tweaks. AnnotationFileParser: skips JavaToken retention for JDK stubs via a new parseStubUnitForJdk() path (user stubs still use the full diagnostic-quality parser). DefaultQualifierForUseTypeAnnotator: added an empty-set early-out before addMissingAnnotations and canonicalized empty results to the shared AnnotationMirrorSet.emptySet() sentinel, avoiding a retained backing ArrayList per cached element. QualifierDefaults.shouldBeAnnotated: hoisted repeated getKind() calls into a local.
- PR #1797 — FoundRequired lazy type formatting (June 2026). BaseTypeVisitor.FoundRequired previously computed ATM.toString()/toString(true) eagerly in the constructor, paying full ATM-formatting cost even when the reported error would be suppressed by @SuppressWarnings or -AsuppressWarnings. Changed the found/required fields from String to Object with anonymous inner classes whose toString() evaluates the ATM format lazily; the shouldPrintVerbose result is memoized in a verboseComputed/verbose pair shared across both objects, so it is called at most once regardless of which field is stringified first. Also lazified the concatenated parameter-name prefix string in checkMethodInvocabilityError. CF's own sources suppress thousands of warnings, so the deferred cost is often zero. Wall-clock impact is proportional to suppression rate on the profiled workload; on the checkNullness build of CF itself the delta is within the noise floor (see A/B note in the name-decoding narrative below).
- PR #1797 — SourceChecker.shouldSkipUses cache (June 2026). Previously called typeElement.toString() (a Symbol.toString() → Name.toString() UTF-8 decode) and matched against a compiled regex on every invocation. Added an IdentityHashMap<Name, Boolean> cache keyed on typeElement.getQualifiedName() (identity-stable within one compilation's name table): first-visit still decodes and matches, but repeat visits for the same enclosing class are O(1). Reduces the utf2string attribution to cache-miss-only (4 samples in the post-fix profile).
- PR #1797 — Variable.computeHashCode and ProperType.computeHashCode avoid toString() (June 2026). Variable.computeHashCode hashed elt.getSimpleName().toString(), decoding the byte-backed Name on every hash computation. Changed to elt.getSimpleName().hashCode(), which returns the byte-table index (content-stable via interning, no decode). ProperType.computeHashCode hashed properType.toString(), an ATM.toString() call on every type-inference cache lookup. Replaced with TypeKind.hashCode() + 31 * elt.getSimpleName().hashCode() (element extracted for DeclaredType and TypeVariable; other kinds hash by kind alone). The new ProperType hash is weaker (no package component), but hash collisions only affect map distribution, not correctness (equals is unchanged).
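The lazy-formatting idea behind the FoundRequired entry can be sketched in isolation. This is a minimal illustration, not the framework's code: the class LazyMessageSketch, the Type holder, the VerboseFlag helper, and the two counters are all hypothetical names, and the "verbose" rule (use the long form when the short forms collide) is an assumption standing in for shouldPrintVerbose.

```java
import java.util.function.BooleanSupplier;

class LazyMessageSketch {
    /** Counters that make it visible when the expensive work actually runs. */
    static int formatCalls = 0;
    static int verboseCalls = 0;

    /** Hypothetical stand-in for an annotated type with a short and a long rendering. */
    static final class Type {
        final String simple;
        final String verbose;
        Type(String simple, String verbose) {
            this.simple = simple;
            this.verbose = verbose;
        }
    }

    /** A "verbose?" decision shared by both message arguments and computed at most once. */
    static final class VerboseFlag {
        private final BooleanSupplier compute;
        private boolean computed;
        private boolean value;
        VerboseFlag(BooleanSupplier compute) {
            this.compute = compute;
        }
        boolean get() {
            if (!computed) {
                value = compute.getAsBoolean();
                computed = true;
            }
            return value;
        }
    }

    /** Returns an Object whose string form is built only if something calls toString(). */
    static Object lazy(final Type type, final VerboseFlag flag) {
        return new Object() {
            @Override
            public String toString() {
                formatCalls++;
                return flag.get() ? type.verbose : type.simple;
            }
        };
    }

    /** Builds the (found, required) pair without formatting either side. */
    static Object[] foundRequired(final Type found, final Type required) {
        VerboseFlag flag = new VerboseFlag(() -> {
            verboseCalls++;
            // Assumed rule: fall back to the long form when the short forms are identical.
            return found.simple.equals(required.simple);
        });
        return new Object[] {lazy(found, flag), lazy(required, flag)};
    }

    public static void main(String[] args) {
        Object[] fr = foundRequired(
                new Type("String", "@Nullable java.lang.String"),
                new Type("String", "@NonNull java.lang.String"));
        System.out.println("format calls before reporting: " + formatCalls);
        System.out.println("found: " + fr[0] + ", required: " + fr[1]);
        System.out.println("format calls: " + formatCalls + ", verbose checks: " + verboseCalls);
    }
}
```

If the message is suppressed and never stringified, neither counter moves, which is the case the entry above says is common in CF's own sources.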
These belong with the campaign because they were uncovered while auditing the same files.
- PR #1689 — Preserve invariant !isRunning => currentNode == null even on exception. Restores the invariant in the AbstractAnalysis finally block.
- PR #1690 — Change catch Throwable to catch Exception in several framework call sites. Catching Throwable accidentally suppressed things like OutOfMemoryError and ThreadDeath.
- PR #1765 — ElementUtils.hasParameters name-form fix. Replaced Class.getName() (JVM binary form: "java.util.Map$Entry") with getCanonicalName() (source form: "java.util.Map.Entry") when matching against TypeMirror.toString(). Previously, nested classes and array types could be silently mismatched. Surfaced during the performance review sweep.
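The two name forms involved in the PR #1765 fix are easy to see with plain reflection. The snippet below uses only java.lang.Class methods; the class name ClassNameFormsSketch is hypothetical, and the comparison against TypeMirror.toString() in the real fix is not reproduced here.

```java
import java.util.Map;

class ClassNameFormsSketch {
    public static void main(String[] args) {
        // Nested class: binary name uses '$', canonical name uses '.'.
        System.out.println(Map.Entry.class.getName());          // java.util.Map$Entry
        System.out.println(Map.Entry.class.getCanonicalName()); // java.util.Map.Entry

        // Array type: binary name is the JVM descriptor form, canonical name is source form.
        System.out.println(String[].class.getName());           // [Ljava.lang.String;
        System.out.println(String[].class.getCanonicalName());  // java.lang.String[]
    }
}
```

For top-level, non-array classes the two forms coincide, which is why a mismatch of this kind can go unnoticed until a nested or array parameter type shows up.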
The methodology, full A/B numbers, and the rejected variants are in the value-semantics narrative under "Short list"; this is the canonical applied summary.
- PR #1777 — methodAsMemberOf, directSupertypes, and elementType caches. Three caches on the getAnnotatedType hot paths, each storing/returning deep copies (ATMs are mutable): (1) methodAsMemberOfCache memoizes the (method, receiver-type)-determined substitution base inside methodFromUse (skips the declared-@Poly* guard; Value/MethodVal opt out via shouldCacheMethodAsMemberOf); (2) directSupertypesCache memoizes directSupertypes(type) (a pure function of the type, so no poly guard or opt-out is needed); (3) elementTypeCache memoizes the fully-computed (post-defaults) getAnnotatedType(Element) result (cheap element-identity key, shouldCacheElementType opt-out). Structural keys use a cache-local Types.isSameType comparison (IsSameTypeAtmComparer), not the global ATM.equals. Full-build warm-daemon A/B (./gradlew checknullness): the methodAsMemberOf + directSupertypes caches ≈ −9% cold / −13% warm; elementType (Phase 1) ≈ −10% on its own. Validate ATM-producing caches with alltests diagnostics, never a recompute cross-check (substitution mints fresh captures, so identical results compare isSameType-unequal; see the narrative). PR #1778 forks the Java-8 check* tasks so these caches' retained heap no longer piles into the shared Gradle daemon.
- declarationFromElement: scan the enclosing method subtree, not the whole compilation unit (applied, PR #1780). TreeInfo.declarationFor(sym, root) scanned the whole CU per local/parameter to find its declaration tree, JFR-attributed at ~13% of a checknullness compile. Replaced with TreeInfo.declarationFor(sym, trees.getTree(elt.getEnclosingElement())) (scan only the enclosing method/class), with a fallback to the full-CU scan, plus a short-circuit returning null for TYPE_PARAMETER (it scanned the whole CU only to return null). Same-session traced A/B: declarationFromElement −33%, total on-CPU −5.1%. Key distinction from the rejected trees.getTree(localVar) variant, as understood at the time: trees.getTree on the enclosing method is cheap (position-based), whereas on the local itself it internally scans. (PR #1803, below, later found that getTree on the enclosing method also scans.)
- PR #1791 — Per-CU IdentityHashMap tree caches, one-pass declaration scan, pooled copier map. Three changes that remove cache thrash and redundant recomputation on large compilation units: (1) LRU → IdentityHashMap. classAndMethodTreeCache, fromExpressionTreeCache, fromMemberTreeCache, fromTypeTreeCache, and elementToTreeCache were bounded CollectionsPlume.createLruCache(2048) maps. On a large CU the live tree set overflows 2048, so the LRU evicts still-needed entries and re-getAnnotatedTypes them; each miss recomputes and deepCopy()s the ATM (classAndMethodTreeCache.put(tree, type.deepCopy())). Plain IdentityHashMaps (Tree/Element keys are identity-compared anyway) remove the eviction thrash; they stay bounded in practice because setRoot already clears all five per compilation unit, so peak size is one CU's tree count, not the whole build. This swap alone is the bulk of the win. (2) DeclarationScanner (extends PR #1780). Rather than TreeInfo.declarationFor(sym, enclosingTree) per local/parameter, the first lookup into a given enclosing method/class scans that subtree once and records every variable/method/class declaration into elementToTreeCache; a scannedEnclosingTrees identity set (also cleared in setRoot) makes the scan once-per-subtree so sibling lookups hit the cache. Falls back to the full-CU TreeInfo.declarationFor if the scan misses. (3) AnnotatedTypeCopier map pool. visit allocated a fresh IdentityHashMap per copy; it now borrows a thread-local pooled map (cleared after use, fresh-map fallback if re-entrant). Deterministic allocation A/B (single forked javac, jdk.ThreadAllocationStatistics; see Reproducing measurements): total bytes allocated −14.5% / −17.1% / −19.2% on 300- / 600- / 1500-method single-class files and −6.6% on an 80-file (15-method) corpus. The win grows with per-CU size, since that is when the 2048 LRU thrashes. The LRU→IdentityHashMap swap on its own is −10% to −13.5%; the scanner and copier pool add the rest. Wall clock is roughly neutral (≈ −3% at 1500 methods, within noise) on heap-generous single-file compiles: these are not GC-bound, so the reduced allocation does not shorten them; the payoff is GC pressure / memory headroom, most relevant under default heap on a many-CU warm-daemon build. Build/measure caveat: flipping sides with git stash does not reliably recompile (:framework:compileJava reports UP-TO-DATE and serves the other side's classes), so force --rerun-tasks and gate each run by decompiling the shipped checker/dist/checker.jar (e.g. count createLruCache call sites in AnnotatedTypeFactory: 9 on master, 4 with this change). An un-gated early A/B read this change as ~0%, a false negative from two stale shadowJars. Dropped from the original proposal as separately risky and not part of the allocation win: a never-cleared IdentityHashMap<AnnotationMirror, QualifierKind> in the qualifier hierarchies (unbounded over a whole build) and disabling the per-CU defaultQualifierForUseTypeAnnotator.clearCache() (cross-CU staleness: the cache reads element annotated types that stubs/ajava refine).
- declarationFromElement fallback: scan the enclosing subtree, not the whole CU; walk the enclosing chain (PR #1793, June 2026). Two refinements to the variable/parameter path's fallback (the case PR #1791's DeclarationScanner left on the full-CU scan, when the scan missed or shouldCache is off). (1) Subtree fallback. When elementToTreeCache has no entry, try TreeInfo.declarationFor(sym, enclosingTree) (scan only the enclosing method/class subtree) before the full-CU TreeInfo.declarationFor(sym, root). (2) Enclosing-chain walk. Master took trees.getTree(elt.getEnclosingElement()) once and, if it was null, fell straight to the whole-CU scan; the change walks getEnclosingElement() upward until a non-null tree is found, so the subtree scan applies (and the DeclarationScanner gets primed) even when the immediate enclosing element has no tree. Both declarationFor(sym, enclosingTree) and declarationFor(sym, root) match by symbol identity, so the returned tree is unchanged; the only difference is how many tree nodes the scan visits. Deterministic A/B (single forked javac; size sweep of gen-sized-program.py, drift-controlled interleave; see Reproducing measurements): wall-clock −11% / −15% / −26% on 300- / 600- / 1500-method single-class files, neutral at ≤100 methods (3.2–4.5 s, JVM-startup-dominated) and on a 30-method file. Allocation is flat (jdk.ThreadAllocationStatistics within the ~0.3% run-to-run band at every size): this is a scan-node / CPU change, not an allocation change. The win is super-linear in CU size and zero on typical small files: it is worst-case protection for large or machine-generated single-class compilation units (where master's full-CU fallback scan is super-linear in CU size), never a regression. Same flattening signature as the PR #1786 body-path quadratic; like that one it is invisible on the tiny-file all-systems corpus and only a size sweep exposes it.
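The eviction thrash described for PR #1791 can be reproduced in miniature. The sketch below is an illustration, not the framework's code: it uses an access-ordered java.util.LinkedHashMap as a stand-in for CollectionsPlume.createLruCache (an assumption about that helper's eviction policy), and LruThrashSketch, lru, countMisses, and freshKeys are hypothetical names. Each miss stands for one recompute plus deepCopy().

```java
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.Map;

class LruThrashSketch {
    /** Access-ordered LRU map; a stand-in for the bounded caches, not the plume-lib implementation. */
    static <K, V> Map<K, V> lru(final int capacity) {
        return new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Sweeps over all keys in the same order several times and counts cache misses. */
    static int countMisses(Map<Object, String> cache, Object[] keys, int sweeps) {
        int misses = 0;
        for (int s = 0; s < sweeps; s++) {
            for (Object key : keys) {
                if (cache.get(key) == null) {
                    misses++;
                    cache.put(key, "recomputed");
                }
            }
        }
        return misses;
    }

    /** Distinct identity keys, playing the role of Tree/Element objects in one compilation unit. */
    static Object[] freshKeys(int n) {
        Object[] keys = new Object[n];
        for (int i = 0; i < n; i++) {
            keys[i] = new Object();
        }
        return keys;
    }

    public static void main(String[] args) {
        Object[] keys = freshKeys(2049); // one more live key than the LRU can hold
        int lruMisses = countMisses(LruThrashSketch.<Object, String>lru(2048), keys, 3);
        int identityMisses = countMisses(new IdentityHashMap<Object, String>(), keys, 3);
        System.out.println("LRU(2048) misses:       " + lruMisses);
        System.out.println("IdentityHashMap misses: " + identityMisses);
    }
}
```

With a cyclic access pattern that is only one key larger than the capacity, the LRU evicts exactly the entry that is needed next, so every access misses, while the unbounded identity map misses once per key. Real access patterns are less adversarial, but this is the mechanism behind the size-dependent win reported above.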
Tree-search quadratics: declarationFromElement, varargs arrays, and warning paths (PR #1803, June 2026)
Three independent O(n²)-in-compilation-unit-size scans, each a per-element or per-message
Trees.getTree / getPath / TreePathCacher search that rescanned the whole enclosing class
or compilation unit. All three are worst-case protection — super-linear only on large or
machine-generated single-class files, or on message-dense code — found by a shape × size ×
checker sweep (gen-sized-program.py {generic,vararg,deep-nesting,many-fields} ×
N=300/1000/3000 × {nullness,interning,value}) and each confirmed by instrumenting the
scanner's node-visit count: nodes-per-getPath growing with N is the signature (e.g. the
varargs case scanned ~6,400 nodes/call at N=300 and ~31,600 at N=1500 — the whole unit each time).
- declarationFromElement member/variable lookup via the visitor path. Even after PR #1791/#1793's DeclarationScanner and subtree fallback, declarationFromElement still called trees.getTree(elt) (member) and trees.getTree(enclosing) (a variable's enclosing method), and javac implements Trees.getTree(Element) as a TreeInfo.declarationFor scan of the enclosing class: O(class) per call, O(class²) across a class's members. (The in-code comment claiming getTree on a method is "position-based cheap" is wrong; it scans.) Fix: obtain the enclosing method/class tree from the factory's visitorTreePath (already set to the method body during flow analysis, the path through which these lookups arrive) instead of trees.getTree; the path is only a search-start hint, so the result is unchanged. Also fixed DeclarationScanner to cache by the raw JCTree.sym, not TreeInfo.symbolFor's baseSymbol(); under baseSymbol() a generic method or a parameter is stored at a key no lookup ever uses. Full-build checknullness JFR: declarationFromElement 8.4% → 1.5% inclusive, DeclScanner.scan 200 → 21 samples; warm-daemon wall ~1m51s → ~1m43s (~7%, median of ≥3 reps/side, shadowJar rebuilt per side); a 1500-method single class 21.95s → 11.56s (−47%). alltests passes.
- Varargs synthetic-array path. checkVarargs → getAnnotatedTypeVarargsArray computes the type of the synthetic NewArrayTree the CFG builder wraps a varargs call's arguments in. That tree is not in the compilation unit, so defaulting it (QualifierDefaults.nearestEnclosingExceptLocal → getPath) made TreePathCacher scan the whole unit to fail to find it: O(unit) per varargs call. (CFGTranslationPhaseOne's handleArtificialTree registers the same tree's path, but that registration does not reach this consumer.) Fix: register the array's path under the call site, setPathForArtificialTree(arrayTree, new TreePath(getPath(callTree), arrayTree)), before typing it; the call tree's own path is cheap because the call is being visited. Nullness vararg-shape on-CPU samples 12,306 → 2,597 at N=3000 (growth ~20× → ~4× over a 10× size increase); TreeScanner.scan under getPath 1,016 → 10 at N=1500. Only heavily-defaulting checkers (nullness) were affected; interning and the Value Checker default little and were already flat.
- @SuppressWarnings / precise-position path lookup. Reporting a message calls SourceChecker.shouldSuppressWarnings(tree) (and getSourceWithPrecisePosition), which looked up the tree's path with getTreePathCacher().getPath(currentRoot, tree), a scan from the compilation-unit root, for the suppression walk. O(unit) per reported message, so a message-dense checker is quadratic in file size. The Interning Checker reports on every ==, so a 3000-comparison file spent ~46% of on-CPU samples scanning for paths and grew ~6.6× over a 10× size increase (nullness, which does not report on those expressions, was unaffected). Fix: new SourceChecker.pathToTree, which starts the search from visitor.getCurrentPath() (public on TreePathScanner; an ancestor of the reported tree, which is being visited) and falls back to the root scan otherwise: same path result, local search. Interning generic-shape on-CPU samples 1,567 → 715 at N=3000, growth ~6.6× → ~2.9× (linear). This helps any checker that emits many messages, not just interning.
A unifying lesson: a per-element or per-message Trees.getTree / getPath that re-derives a
tree's position scans the enclosing class or whole unit, which is super-linear when the caller
iterates members or messages. The fix is to reuse position context the program already has —
the visitor path, or a registered path for a synthetic tree — so the lookup localizes, rather
than to add another cache. Diagnostic caution: the JDK TreePath.getPath(path, target) "is it
under this path?" check is unreliable here — it searches the whole compilation unit of path and
returns non-null if target is found anywhere, not only under the leaf, so it does not confirm
locality. Instrument the cacher's node-visit count instead.
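The lesson above (same answer, smaller search) can be illustrated with a toy tree and a node-visit counter, which is also the kind of instrumentation the paragraph recommends. This is a sketch under simplifying assumptions: Node, find, findWithHint, and measure are hypothetical names, and the tree is a plain class → method → body → statement shape rather than a javac AST.

```java
import java.util.ArrayList;
import java.util.List;

class SearchStartSketch {
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        Node(String name) {
            this.name = name;
        }
        Node add(Node child) {
            children.add(child);
            return child;
        }
    }

    /** Number of nodes visited since the counter was last reset. */
    static long visits;

    /** Depth-first identity search for target under start. */
    static boolean find(Node start, Node target) {
        visits++;
        if (start == target) {
            return true;
        }
        for (Node child : start.children) {
            if (find(child, target)) {
                return true;
            }
        }
        return false;
    }

    /** Tries the hint first and falls back to the root, so the hint can never change the answer. */
    static boolean findWithHint(Node root, Node hint, Node target) {
        visits = 0;
        if (hint != null && find(hint, target)) {
            return true;
        }
        return find(root, target);
    }

    /** Returns {visits from the class root, visits from the enclosing body} for the same target. */
    static long[] measure(int methods) {
        Node clazz = new Node("class");
        Node lastBody = null;
        Node lastStmt = null;
        for (int m = 0; m < methods; m++) {
            Node method = clazz.add(new Node("method" + m));
            lastBody = method.add(new Node("body"));
            for (int s = 0; s < 10; s++) {
                lastStmt = lastBody.add(new Node("stmt" + s));
            }
        }
        findWithHint(clazz, null, lastStmt);
        long fromClass = visits;
        findWithHint(clazz, lastBody, lastStmt);
        long fromBody = visits;
        return new long[] {fromClass, fromBody};
    }

    public static void main(String[] args) {
        for (int n : new int[] {300, 1000, 3000}) {
            long[] r = measure(n);
            System.out.println(n + " methods: from class " + r[0] + " visits, from body " + r[1] + " visits");
        }
    }
}
```

The root-started count grows with the number of methods while the hinted count stays constant; summed over every lookup in a class, the first is the quadratic signature and the second is linear.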
A post-fix verification sweep (the same shape × size × checker matrix) confirmed every
shape is linear/sublinear across all three checkers, with the top leaves at N=3000 being
irreducible framework work (HashMap/IdentityHashMap lookups, AnnotatedTypeScanner) rather
than tree scans. The sole remaining super-linear shape was deep-nesting (typeinference8),
addressed by PR #1805 below.
The deep-nesting shape's super-linearity is Java-8 type-argument inference. Incorporating
bounds to a fixed point (BoundSet.incorporateToFixedPoint →
VariableBounds.applyInstantiationsToBounds) re-applies instantiations to every inference
variable on every iteration: O(iterations × variables × bounds) ≈ O(depth³) for a depth-D
nested-id chain. A depth-80 chain in one method did not finish in 25 minutes. PR #1805 has three parts:
- Work budget. (PR #1829 later lowered this default to 10000 from a measurement of real code, made it configurable with -AinferenceWorkBudget, and reshaped/moved the regression test; see the PR #1829 section.) A per-invocation counter (Java8InferenceContext.MAX_INCORPORATION_WORK, originally 100k bound-visits) is charged in applyInstantiationsToBounds. When exceeded, recordIncorporationWork throws InferenceBudgetExceededError (an Error, so it unwinds past the catch (Exception)/catch (FalseBoundException) blocks in the incorporation/resolution machinery), caught in DefaultTypeArgumentInference.inferTypeArgs. Inference is abandoned soundly: a new type.argument.inference.budget error is reported pointing the user to supply explicit type arguments, and the return type is defaulted (as for an inference crash, via InferenceResult.needsDefaultedReturnType()) so checking continues. The error is distinct from type.argument.inference.crashed because exceeding the budget is a deliberate give-up, not a crash. Reaching the budget takes ~0.15 s warm per inference problem. Regression test: checker/tests/interning/InferenceWorkBudget.java (depth-60). NB: do not use @SuppressWarnings("interning") in such a test; that is the checker name and suppresses all of the interning checker's output, including the framework error under test.
- Skip fully-resolved variables (allBoundsProper). Once every bound of a variable is a proper type the bounds cannot change (ProperType.applyInstantiations() is the identity, and the changed-gated instantiation detection cannot fire), so re-scanning them is wasted. A boolean maintained at the only three sites that mutate the bound map (addBound, restore, the rebuild) drops such variables from the per-iteration scan. Provably equivalent; validated with a temporary flag that recomputed the invariant from scratch on every call and threw on mismatch: 0 violations across all-systems and the full alltests suite. The gain grows with nesting depth (~9% wall / 8% CPU at depth-20), the signature of cutting the cubic fixpoint toward quadratic.
- Two micro-opts. BoundSet.hasInstantiatedVariable replaces a getInstantiatedVariables() call that built a LinkedHashSet every iteration only to test !isEmpty(); and ProperType.getErased() caches a proper type's erasure (immutable) instead of recomputing it on every subtyping check.
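The reason the budget signal is an Error rather than an Exception can be shown with a small stand-alone sketch. Nothing here is the framework's code: WorkBudgetSketch, BudgetExceededError, Context, incorporate, and infer are hypothetical names, and the returned strings only mark which path was taken.

```java
class WorkBudgetSketch {
    /** An Error, so intermediate catch (Exception) blocks cannot swallow it. */
    static final class BudgetExceededError extends Error {
        BudgetExceededError(String message) {
            super(message);
        }
    }

    /** Per-invocation work counter with a fixed budget. */
    static final class Context {
        private final long budget;
        private long work;
        Context(long budget) {
            this.budget = budget;
        }
        void recordWork(long units) {
            work += units;
            if (work > budget) {
                throw new BudgetExceededError("work " + work + " exceeds budget " + budget);
            }
        }
    }

    /** Stand-in for inner machinery that catches Exception for its own reasons. */
    static void incorporate(Context ctx, int rounds, int variables) {
        for (int r = 0; r < rounds; r++) {
            try {
                ctx.recordWork(variables); // one "bound visit" per variable per round
            } catch (Exception e) {
                // An Exception-based budget signal would be lost here; an Error passes through.
            }
        }
    }

    /** Top-level entry: gives up deliberately and falls back when the budget is exceeded. */
    static String infer(long budget, int rounds, int variables) {
        Context ctx = new Context(budget);
        try {
            incorporate(ctx, rounds, variables);
            return "inferred";
        } catch (BudgetExceededError e) {
            return "budget exceeded: defaulted return type";
        }
    }

    public static void main(String[] args) {
        System.out.println(infer(10_000, 10, 50));    // 500 units of work, under budget
        System.out.println(infer(10_000, 1_000, 50)); // would need 50,000 units, over budget
    }
}
```

Keeping the counter on a per-invocation context object means one pathological call cannot consume the budget of the next one.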
Java 8 type-argument inference: constraint set, dependency traversal, and hashCode caching (PR #1813, June 2026)
Four independent optimizations to the typeinference8 inference engine, all confined to the
typeinference8 package. Measured on single-run cold-JVM wall clock (gen-sized-program.py --shape deep-nesting, N methods each with 20 nested id() calls):
| N (methods) | master | PR #1813 | reduction |
|---|---|---|---|
| 30 | 25.0 s | 20.0 s | −20% |
| 80 | 55.8 s | 47.9 s | −14% |
| 100 | 72.0 s | 62.0 s | −14% |
(Single runs; these are cold-JVM wall-clock, not deterministic allocation A/Bs. The win is real but the exact percentages carry ~±5% single-run noise.)
- ConstraintSet: ArrayDeque + HashSet for O(1) deduplication. The backing ArrayList made add, push, and contains O(n) (linear scan). Replaced with an ArrayDeque (preserves LIFO/FIFO order and addFirst/addLast/descendingIterator) plus a parallel HashSet for O(1) membership tests. All mutation sites (add, push, pushAll, pop, remove) now keep both structures in sync. addAll was also fixed: the old list.addAll(constraintSet) skipped the duplicate check (inconsistent with add and push); the new version deduplicates. The remove(self) case reallocates fresh structures instead of clear() to match the invariant that fastLookup and list are never independently partial.
- Dependencies.calculateTransitiveDependencies: BFS replaces fixpoint. The old implementation was an outer while (changed) loop that, on each pass, iterated all entries and addAll'd the transitive neighbours, effectively O(V²) or worse for dense dependency graphs. Replaced with a per-source BFS (ArrayDeque queue, LinkedHashSet as visited/reachable set): each variable is enqueued at most once, so the total work is O(V + E). On deep-nesting code where many inference variables have mutual dependencies this was a meaningful hotspot.
- hashCode caching in constraint and type objects. TypeConstraint, Typing, Expression, InferenceType, UseOfVariable, and QualifierVar each gain a cachedHashCode field (lazy, zero-sentinel pattern from PR #1812). These objects are effectively immutable post-construction and are placed in HashSets / used as map keys during inference, so their hashCode was being recomputed on every lookup. TypeConstraint.hashCode delegates to T.hashCode() (an ATM hash), and UseOfVariable.hashCode chains five field hashes; both are non-trivial calls.
- VariableBounds.addQualifierBound pre-filter. Before calling addConstraintsFromComplementaryQualifierBounds and addConstraintsFromComplementaryBounds, the new code filters out qualifiers already present in qualifierBounds.get(kind). If none are new, it returns immediately without entering the (potentially recursive) constraint-generation paths. This avoids redundant constraint proliferation when the same qualifier bound is added more than once (which happens during fixpoint iteration).
- Javadoc guards on Qualifier, QualifierTyping, and AbstractQualifier. Documents why these classes must not override equals/hashCode: the constraint solver relies on identity equality for Qualifier wrappers (value-based dedup would merge distinct constraints that happen to wrap the same annotation) and for QualifierTyping instances (multiple identically-shaped qualifier constraints must coexist). These are correctness comments, not perf changes; recorded here because they interact directly with the ConstraintSet HashSet deduplication above.
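The per-source BFS from the calculateTransitiveDependencies entry above can be sketched generically. This is an illustration of the technique, not the typeinference8 code: TransitiveDepsSketch and transitive are hypothetical names, and the graph is a plain map from a node to its direct dependencies.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class TransitiveDepsSketch {
    /** For each source, computes everything reachable through one or more dependency edges. */
    static <V> Map<V, Set<V>> transitive(Map<V, Set<V>> direct) {
        Map<V, Set<V>> result = new LinkedHashMap<>();
        for (V source : direct.keySet()) {
            Set<V> reachable = new LinkedHashSet<>(); // doubles as the visited set
            Deque<V> queue = new ArrayDeque<>();
            for (V dep : direct.get(source)) {
                if (reachable.add(dep)) {
                    queue.add(dep);
                }
            }
            while (!queue.isEmpty()) {
                V current = queue.poll();
                for (V next : direct.getOrDefault(current, Collections.<V>emptySet())) {
                    // add() returns false for already-seen nodes, so each node is enqueued once.
                    if (reachable.add(next)) {
                        queue.add(next);
                    }
                }
            }
            result.put(source, reachable);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> direct = new LinkedHashMap<>();
        direct.put("a", new LinkedHashSet<>(Arrays.asList("b")));
        direct.put("b", new LinkedHashSet<>(Arrays.asList("c")));
        direct.put("c", new LinkedHashSet<>(Arrays.asList("a"))); // cycle back to a
        direct.put("d", new LinkedHashSet<>(Arrays.asList("a")));
        System.out.println(transitive(direct));
    }
}
```

Because the reachable set is also the visited set, a cycle terminates naturally, and a node that lies on a cycle ends up in its own reachable set.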
These were implemented and measured but kept out of #1805; recorded so they are not re-derived.
- Incorporation worklist (the dependency-based variant of the allBoundsProper skip). SHIPPED in PR #1829; see the PR #1829 section below, which beat the ~3% recorded here by combining it with the constraint gating and measuring on deeper nesting, and replaced the verify harness with an always-on self-correcting rescan. The original deferral rationale follows. Replaced allBoundsProper with a per-VariableBounds dirty flag plus an append-only, over-approximate reverse-dependency list (dependents): addBound records edges β→α for each variable β mentioned in a new bound of α; when α is instantiated it marks its dependents dirty; applyInstantiationsToBounds skips a clean variable. Over-approximation makes it safe by construction: a stale edge causes only a harmless extra re-scan; only a missing edge is a bug, caught by the verify harness (after the worklist reports convergence, a full scan must find no change). Correct (0 verify violations across all-systems and the full alltests suite) and it eliminated the fixpoint's #1 self-time leaf (applyInstantiationsToBounds 17.8% → out of the top under incorporateToFixedPoint). But the net wall-clock gain was only ~3% on inference-heavy synthetic code and <1% on realistic code, because once the redundant rescan is gone the fixpoint's remaining cost is essential JLS-18 work (ConstraintSet.applyInstantiations/reduceOneStep ~9%) and javac Symbol.apiComplete (~8%); the per-variable scan was the dominant leaf but not a large enough share of total time. A second, separate benefit: because the worklist does less work per depth, the budget triggers ~2× deeper (depth-60 completes; depth-100 aborts). That is a real correctness/usability win, not a downside: the budget abandons inference (a false positive on valid code), so raising the depth at which real generic code completes before the budget fires removes false positives. The depth-60 budget regression test then "fails", but that test only encodes the old limitation; the right response is to deepen it to a depth that still exceeds the budget, which is routine maintenance recording the higher capability, not a regression. Deferred for now (the user chose to ship the simpler allBoundsProper skip in #1805 first), but the case is stronger than the ~3% wall-clock alone: weigh the false-positive reduction too. Reconsider together with the constraint-reduction cost (below); gating those by the same dirty flag is what could push the wall-clock gain past ~3%.
- InferenceType.applyInstantiations list lazy-allocation. It allocates three ArrayLists unconditionally and discards them when nothing is instantiated (the common case). A no-allocation pre-check was measured neutral (many allocations by count, negligible by bytes, since the lists are empty/tiny; same lesson as AnnotatedTypeScanner.markVisited).
- getTargetType, asSuper, merge caching. getTargetType is called once per inference problem (no repeat to cache); asSuper/merge are intrinsic set-union/supertype work.
Ships the incorporation worklist that #1805 deferred (see "Explorations that did not ship" above), this time with the constraint gating that makes it pay off, plus a self-correcting safety net and a work budget re-tuned from a measurement of real code. Two parts.
1. Self-correcting incorporation worklist. The JLS-18 incorporation fixed point
(BoundSet.incorporateToFixedPoint) re-scanned every inference variable, and re-applied every
variable's constraints, on every round. The worklist re-scans only variables that can have changed:
each VariableBounds carries a dirty flag and an over-approximate, append-only reverse-dependency
set dependents (built in addBound from AbstractType.getInferenceVariables()); instantiating a
variable marks its dependents dirty, and applyInstantiationsToBounds skips a clean variable, which
also skips that variable's constraints.applyInstantiations() — so the constraint gating #1805's
note asked for falls out of the same skip, no separate flag.
The correctness guarantee is not a flag. When the worklist reports a fixed point,
hasReachedFixedPoint confirms it with one full un-gated rescan of every variable. At a true fixed
point that is a no-op; if the worklist ever skipped a variable (a missing reverse-dependency edge),
the rescan's doApplyInstantiationsToBounds has already applied the change, and the method marks all
variables dirty and runs another round. So the result is identical to scanning every variable every
round, by construction — the worklist is a pure optimization with no soundness knob to get wrong.
In production this self-heals silently; this project's tests run in strict mode
(-Dcf.typeinference.worklist.strict, set on all test tasks in build.gradle; TypecheckExecutor
runs the checker in-process so the property reaches it), where the same situation throws, turning a
worklist regression into a loud CI failure. 0 strict/self-heal events across alltests,
NullnessTest, InterningTest, and a manual sweep of 425 checker/tests/nullness + 279
checker/tests/index files (all byte-identical to the pre-worklist baseline).
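The self-correcting scheme described above (gated rounds, then one un-gated confirming rescan) can be shown on a toy fixed-point problem. This is a sketch under strong simplifications, not the BoundSet code: each "variable" is an int that takes the maximum of the variables it reads, and SelfHealingWorklistSketch, update, solve, and chain are hypothetical names. The chain factory can deliberately drop one reverse-dependency edge to exercise the self-heal and strict paths.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SelfHealingWorklistSketch {
    final int[] value;                   // toy state: each variable only ever grows
    final List<List<Integer>> inputs;    // inputs.get(i): variables that i reads
    final List<Set<Integer>> dependents; // reverse edges; extras are harmless, missing ones are bugs
    final boolean[] dirty;
    final boolean strict;                // test mode: a missed edge throws instead of healing
    int selfHeals = 0;

    SelfHealingWorklistSketch(int[] value, List<List<Integer>> inputs,
                              List<Set<Integer>> dependents, boolean strict) {
        this.value = value;
        this.inputs = inputs;
        this.dependents = dependents;
        this.strict = strict;
        this.dirty = new boolean[value.length];
        Arrays.fill(dirty, true);
    }

    /** Re-evaluates variable i; on change, marks its recorded dependents dirty. */
    boolean update(int i) {
        int v = value[i];
        for (int in : inputs.get(i)) {
            v = Math.max(v, value[in]);
        }
        if (v == value[i]) {
            return false;
        }
        value[i] = v;
        for (int d : dependents.get(i)) {
            dirty[d] = true;
        }
        return true;
    }

    void solve() {
        while (true) {
            boolean changed = false;
            for (int i = 0; i < value.length; i++) { // gated round: dirty variables only
                if (dirty[i]) {
                    dirty[i] = false;
                    changed |= update(i);
                }
            }
            if (changed) {
                continue;
            }
            boolean missed = false;                  // worklist claims a fixed point: confirm it
            for (int i = 0; i < value.length; i++) {
                missed |= update(i);                 // un-gated; applies any skipped change
            }
            if (!missed) {
                return;                              // true fixed point
            }
            if (strict) {
                throw new IllegalStateException("worklist skipped a variable that could change");
            }
            selfHeals++;
            Arrays.fill(dirty, true);                // heal: rescan everything next round
        }
    }

    /** Variable 0 reads 1, variable 1 reads 2; optionally forgets that 0 depends on 1. */
    static SelfHealingWorklistSketch chain(boolean dropEdge, boolean strict) {
        List<List<Integer>> inputs = new ArrayList<>();
        inputs.add(Arrays.asList(1));
        inputs.add(Arrays.asList(2));
        inputs.add(new ArrayList<Integer>());
        Set<Integer> readersOf1 = new HashSet<>();
        if (!dropEdge) {
            readersOf1.add(0);
        }
        List<Set<Integer>> dependents = new ArrayList<>();
        dependents.add(new HashSet<Integer>());
        dependents.add(readersOf1);
        dependents.add(new HashSet<>(Arrays.asList(1)));
        return new SelfHealingWorklistSketch(new int[] {0, 0, 5}, inputs, dependents, strict);
    }

    public static void main(String[] args) {
        SelfHealingWorklistSketch ok = chain(false, false);
        ok.solve();
        System.out.println(Arrays.toString(ok.value) + " selfHeals=" + ok.selfHeals);
        SelfHealingWorklistSketch healed = chain(true, false);
        healed.solve();
        System.out.println(Arrays.toString(healed.value) + " selfHeals=" + healed.selfHeals);
    }
}
```

Both runs end in the same state; the only observable difference is the self-heal counter, which is what a strict test mode turns into a failure.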
Why it now beats #1805's ~3%. #1805 measured the worklist alone, on shallow inference, at ~3%.
The gain is real on deeply nested generics, where the redundant rescan is cubic: wall clock
(gen-sized-program --shape deep-nesting, single-build -D toggle, interleaved, n=8) −8.6% /
−11.4% / −19.0% at nesting depth 8 / 12 / 20 — the win grows with depth. JFR (depth-8 × 800):
doApplyInstantiationsToBounds self-time 7.5% → 1.9% (3.8×), ConstraintSet.applyInstantiations off
the leaderboard, total on-CPU −6.5%; reduceOneStep (~54%, the essential JLS-18 constraint
reduction) is correctly untouched.
The self-correcting rescan is free — but measure it in isolation. On shallow generics and on a
realistic full checknullness the change is neutral. The first A/B (worklist branch vs. master, warm
daemon) read +0.5% — within noise but the wrong comparison: it conflates the worklist's savings
with the rescan's cost, which cancel on ordinary code, so a real rescan cost could hide behind a
worklist win. The right isolation is same-build, self-heal on vs. off (a temporary
-Dcf.typeinference.worklist.noselfheal plumbed to the checknullness fork; toggle verified with a
one-shot marker before trusting the result): 105.0 s vs. 107.0 s warm median — i.e. within noise (a
rescan cannot be negative-cost). It is free because each trivial inference problem's rescan is one
pass over a 1–2-variable bound set, and inference is a small slice of a build.
2. Configurable, lowered work budget. Java8InferenceContext.MAX_INCORPORATION_WORK was a
hardcoded 100000. Add -AinferenceWorkBudget=N to override it. The option is read once and cached on
the AnnotatedTypeFactory (getInferenceWorkBudget): a Java8InferenceContext is created per
generic invocation (DefaultTypeArgumentInference.inferTypeArgs), so reading getOption there would
repeat the lookup on a hot path for a value that is constant per compilation.
The default is lowered 100000 → 10000, from a measurement (instrument recordIncorporationWork
to track the peak per problem; run with an effectively-unlimited budget so nothing aborts): the
heaviest hand-written generics reach only 994 work units (Guava with the Nullness Checker; the
framework's own source 363; a deliberately tricky stress test 296). So 100000 was ~100× more headroom
than the ~1000 the #1805 comment cited (994 is post-worklist, so the worklist did not meaningfully
shrink Guava's peak), and let a single pathological invocation grind ~4–5 s before bailing; 10000
keeps ~10× headroom and bails in ~3 s. There is an irreducible ~2.4 s floor (parsing/attributing the
deep expression, which the budget cannot cap). False-positive-clean at 10000: zero budget errors
on Guava under the Nullness, Interning, and Index Checkers, on the framework's own checknullness,
and on the all-systems corpus.
The budget is checker-independent per problem. The threshold is dominated by Java-type bound incorporation (~cubic in nesting depth, checker-agnostic); the qualifier bounds each checker adds are a smaller, comparable factor. The Nullness Checker fires the budget three times (its three subcheckers each run their own inference), but at the same per-problem nesting depth as the single-system Interning Checker (the three identical diagnostics dedupe to one in the test harness, which compares actuals as a set). A single invocation never blows up regardless of type-parameter count — a method with 400 chained-bound type parameters called once does not fire — because the cubic cost lives in the chained dependency that nesting creates, so the budget cannot be triggered without deep nesting.
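The read-once caching of the work budget described in item 2 can be sketched as follows. This is a minimal illustration, not the framework's code: `BudgetFactorySketch`, `InferenceContextSketch`, the option map, and the lookup counter are hypothetical stand-ins; only the names `getInferenceWorkBudget`, `recordIncorporationWork`, and the 10000 default come from the text above.

```java
import java.util.Map;

/** Hypothetical stand-in for the factory; only the caching shape matters. */
final class BudgetFactorySketch {
    static final int DEFAULT_BUDGET = 10_000;

    private final Map<String, String> options;
    private boolean budgetRead = false;
    private int cachedBudget;
    private int optionLookups = 0; // instrumentation for this sketch only

    BudgetFactorySketch(Map<String, String> options) {
        this.options = options;
    }

    /** Parses the option on first use, then returns the cached value. */
    int getInferenceWorkBudget() {
        if (!budgetRead) {
            optionLookups++;
            String raw = options.get("inferenceWorkBudget");
            cachedBudget = (raw == null) ? DEFAULT_BUDGET : Integer.parseInt(raw);
            budgetRead = true;
        }
        return cachedBudget;
    }

    int getOptionLookups() {
        return optionLookups;
    }
}

/** Hypothetical per-invocation context: one is created per generic invocation. */
final class InferenceContextSketch {
    private final int budget;
    private int work = 0;

    InferenceContextSketch(BudgetFactorySketch factory) {
        // Hot path: reads the cached int, never the option map.
        this.budget = factory.getInferenceWorkBudget();
    }

    /** Returns false once the accumulated work exceeds the budget. */
    boolean recordIncorporationWork(int units) {
        work += units;
        return work <= budget;
    }
}
```

However many contexts are created, the option map is consulted once per factory, which is the property the text asks for on a value that is constant per compilation.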
Notes for future sessions.
- Re-measure deferred/rejected items against the current baseline and a maximal workload. The worklist was "~3%, deferred" in #1805; combining it with the constraint gating and measuring on deep nesting (not shallow synthetic) turned it into −19%.
- A throwaway inference test that does not compile silently measures nothing. An early "Interning inference is harder than Nullness" reading was an artifact: the throwaway file's public class name did not match its filename, so javac errored before inference ran. Use a non-public class (or a matching filename) for throwaway inference probes, and confirm the budget actually fires.
- jtreg nullness/Issue1438 timeouts under alltests are environmental, not a regression: the file compiles in ~7.5 s standalone (well under the 20 s jtreg limit) and is marginally faster with the worklist; the timeouts are parallel-agent contention during the full run.
- Shaping a budget regression test: google-java-format escalates one nested call per line, so a test's line count tracks the number of nested calls. "More type parameters" backfires (each filler argument gets its own line; a 5-type-parameter chain formatted to 84 lines). The compact form keeps the expression under 100 columns so the formatter leaves it on one line: a method heavy enough to fire at shallow depth (return type mentions its type variable three times, parameters are wildcards) with short names.

checker/tests/nullness/InferenceWorkBudget.java (default budget) and checker/tests/inference-budget/InferenceWorkBudget.java (small -AinferenceWorkBudget) are the two regression tests; checker/tests/nullness/Java8InferenceWorklistStress.java exercises the worklist's dependency tracking across interacting inference features under strict mode.
Bring new evidence before revisiting any of these — a JFR trace on a workload not previously considered, or a measurement that contradicts the prior finding. A fresh hypothesis is not new evidence.
-
AbstractAnalysis.getValue(Node)subnode gate — identity rewrite, reorder, and check-removal all measured-and-rejected (June 2026). The subnode gate ingetValueusesCollection#contains(n)(structuralNode#equals), even though thenodeValuesmap it guards is anIdentityHashMapand "two Nodes can be .equals but represent different CFG nodes" (seeNode). Three angles, all rejected:- (a) Structural→identity for correctness — not a demonstrated bug, and regresses allocation.
Precise statement of the issue:
nodeValues.get(n)is identity-keyed, so the structural match never returns a different node's value — it only changes the gate decision (returnget(n)vs.null). The only possible defect is returningn's own (possibly stale) value whennis not truly a subnode. (My earlier "returns the wrong node's value" framing was imprecise.) Instrumenting both gates, the structural one diverges on real code (132leak=truecases on the all-systems corpus) only for constant-valuedClassNameNodes (static-call qualifiers such asCollectors), whose value is identical at every occurrence — a full diagnostic A/B (-Dcf.structuralGatetoggling both gates in one build) showed zero diagnostic difference across all 269 all-systems files, and no targeted attempt (repeated derefs, loops, nested instance callsx.f(x.g()),thisduring construction) could make a refinable node take the colliding path. The identity gate also regresses allocation (it refuses values the structural gate was eliding, forcing recomputation): on thecondshape allocation rose 28 GB → 46 GB at depth 160. - (b) Reorder (
get(n)first, skip the check when null) — no win. Behavior-preserving, but the value is non-null at the gate 99.3% of calls on all-systems and 100% oncond(callers query subnodes that already have values), so the null-skip almost never applies; allocation unchanged. - (c) The subnode check's allocation cost is realistically negligible.
getOperands()allocates per visited node (Arrays.asListfor ternaries,new ArrayListfor method invocations). Removing the check entirely (incorrect; measures its total cost) saved only ~1.2% on all-systems (5703 → 5636 MB) — but ~35% on the deeply-nested-ternarycondshape (9803 → 6333 MB at D=120), making it a second super-linear allocator on that shape alongside the #602 rebuild. So the check is cheap on real code and not worth optimizing; the deep-nesting cost only matters if nested-expression worst-case protection is wanted (then: a non-allocatinggetOperands, or an ancestry check via a parent pointer, would remove it without changing correctness). - (d) The immediate-operands fast-path buys nothing. The
immediate.contains(n)fast-path (skip theArrayDequeBFS whennis a direct operand) hits only 13.3% on all-systems, 0% on the method-call-heavyrepeatshape, and 53% oncond— and an A/B against a unified BFS (always allocate the queue, check every node) is flat within noise on all three (all-systems 5710 vs 5718; cond 9783 vs 9746; repeat 1921 vs 1935 MB). The single avoidedArrayDequeis negligible against the sharedgetOperands()allocations. So the two-phase structure is a simplification candidate (a unified BFS is equivalent and shorter), not a perf lever.
A // TODO documenting all of the above is in AbstractAnalysis.getValue. Net: leave as-is; pursue identity only as deliberate defensive hardening (allocation cost accepted), and the allocation only via a non-allocating subnode test, never by dropping the guard.
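The distinction drawn in (a), that the structural gate changes only the gate decision and never whose value is returned, can be illustrated with a small sketch. `NameNode` and `SubnodeGateSketch` are hypothetical simplifications, not the framework's `Node` or `AbstractAnalysis`; the only assumptions carried over from the text are an identity-keyed value map and a `Collection#contains` gate.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical minimal node: structurally equal when the names match. */
final class NameNode {
    final String name;

    NameNode(String name) {
        this.name = name;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof NameNode && ((NameNode) o).name.equals(name);
    }

    @Override
    public int hashCode() {
        return name.hashCode();
    }
}

final class SubnodeGateSketch {
    /** Structural gate: List#contains compares with equals. */
    static boolean structuralGate(List<NameNode> operands, NameNode n) {
        return operands.contains(n);
    }

    /** Identity gate: only the very same object counts as a subnode. */
    static boolean identityGate(List<NameNode> operands, NameNode n) {
        for (NameNode op : operands) {
            if (op == n) {
                return true;
            }
        }
        return false;
    }

    /**
     * The gate decides whether a value is returned at all; the lookup itself
     * is on the caller-supplied map, assumed here to be an IdentityHashMap.
     */
    static String getValue(
            Map<NameNode, String> nodeValues,
            List<NameNode> operands,
            NameNode n,
            boolean structural) {
        boolean isSubnode =
                structural ? structuralGate(operands, n) : identityGate(operands, n);
        return isSubnode ? nodeValues.get(n) : null;
    }
}
```

With two distinct but equal nodes in an identity map, the structural gate lets the second node through and returns the second node's own value; the identity gate returns null instead. In neither case is the first node's value returned for the second.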
-
AnnotatedTypeCopier.visitpooled-map clear ratchet (June 2026).IdentityHashMap.clearis ~2.6% ofchecknullnessself-time, ~74% of it fromAnnotatedTypeCopier.visit'sfinally { map.clear() }. The pooled map never shrinks, so one large copy (observed max 879 entries; avg ~4 over 1.7M copies on all-systems) ratchets the table to ~2048 slots and every later small copy thenArrays.fills that whole table. The mechanism is real, but discarding the map when it grows past its initial table (so later clears stay O(32)) was measured neutral on two realistic workloads (all-systems and 60 framework/type files), by both wall and user CPU. The reason:IdentityHashMap.clearis a cache-residentArrays.fill, fast per call — it samples high by count (called 1.7M+ times) but the actual time saved by a smaller table is negligible (same lesson as theAnnotatedTypeScanner.markVisitedarray sizing). Don't pursue without a workload where a genuinely large table is filled enough times that the byte-volume, not the call count, dominates. Follow-up (PR #1815): the pool was subsequently removed entirely (always fresh allocation) — also measured neutral on cold-JVM wall clock, but removes the re-entrancy fallback complexity. See Applied optimizations →AnnotatedTypeScannerand visitor state. -
Per-CU tree-defaults memoization in
QualifierDefaults(June 2026). A full prototype of short-list item #4 — a per-compilation-unit cache from(scope element, pre-defaulting type structure)to the defaulted type, keyed by a soundAppliedDefaultsKey(identity scope + structural type hash) with anEqualityAtmComparersubclass addingisDeclaration(), write-back on hit via a pooledAnnotatedTypeReplacer,deepCopysnapshots of key and value on miss, cleared per-CU inGenericAnnotatedTypeFactory.setRoot, with-Dcf.defaults.noTreeCacheand-Dcf.defaults.cacheStatsruntime flags. The premise was thatapplyDefaultsElementruns ~9 full-tree scans per call (one perDefaultplus checked/unchecked), replaceable by 3 cheaper walks on a hit. The hit rate hypothesis held — 79.6% on nullness, 85.5% on interning over all-systems, well above the predicted ~24% break-even and ~60% "worth it" thresholds — but the cache was net neutral-to-worse on every axis in a same-jar kill-switch A/B (all-systems / 269 files, median of 3–5): allocation +2.06% (5407 vs 5298 MB, deterministicThreadAllocationStatistics), wall +0.9% (25.19 vs 24.96 s), user CPU +1.3% (85.4 vs 84.3 s). Why the high hit rate doesn't pay: the per-Defaultscans are cheap in-place annotation mutations (DefaultApplierElementwalks the type and sets annotations, allocating little and short-circuiting), not expensive operations — so the "9 scans → 3 walks" trade is between things of similar cost. Meanwhile the cache machinery is pure overhead: anAppliedDefaultsKey+ a structural hash walk on every lookup, a structural equality walk + a replacer walk on every hit, and twodeepCopys per miss (key snapshot + stored value). The ~20% miss tail's deep copies alone account for the +2% allocation. The earlier "value-key lookup ≈ one scan, so neutral" intuition was right empirically; the patch's reframing to "9 scans" overcounted the cost being saved. 
Don't revisit without first showing — by instrumentation, not reasoning — that theDefaultApplierElementscan itself (not its call count) is an allocation/CPU hotspot; today it is not. -
Caching getAnnotatedType for expression/variable trees. getAnnotatedType(Tree) caches only class and method trees (classAndMethodTreeCache); expressions recompute fromExpression + addComputedTypeAnnotations every call, and CFAbstractTransfer.getValueFromFactory (~19% inclusive) hits that path per node during flow analysis. This non-caching is intentional and load-bearing: an expression's annotated type depends on context (assignment context, capture, the in-progress flow store), so a cache would return stale/unsound types. The flow-stable subset (fromExpression's structural result) is in principle cacheable but the cached value must be frozen and deepCopy'd on use, which offsets the saving; not attempted, and risky against the AnnotatedTypeMirror cache invariants CLAUDE.md flags.
-
Large method bodies / dataflow size. A size sweep of one method with N local-variable declarations-and-uses (N=200/400/800) is linear in N (marginal cost per statement flat once the fixed ~8 s JVM+javac+nullness init is subtracted). No quadratic in method-body size; the CFG/dataflow fixpoint scales as expected.
-
Cache-boundary flips after freezing the masters — PARTIALLY SUPERSEDED (PR #1798). The first cut rejected boundary flips wholesale: "the cache-return copy is load-bearing; the dominant consumers mutate the returned type, so a flip only moves the copy." That was survivorship bias from the
BugInCFflush — the flush only enumerates the mutating consumers, not the read-only majority, so reasoning from it overcounts the cost. A direct measurement of the read-only fraction (65–88% forgetAnnotatedType(Element)) and a deterministic allocation A/B then showed theelementTypeCacheandclassAndMethodTreeCacheflips DO pay (small): ~−1% on generic-call-heavy code, noise elsewhere (NOT the −5.3% an earlier estimate suggested — never reproduced against the post-flip baseline). Both shipped in PR #1798 (see the foundation section). The lesson stands for three other flips that were tried and genuinely do not pay, because their consumer is always mutating: (a) the raw Element boundary feeding the tree pipeline —TypeFromExpressionVisitor→addComputedTypeAnnotations(defaulting + flow refinement + annotators) rewrites the whole type every time; (b) themethodFromUseon-hit copy-elision — type-argument inference (typeinference8.Resolution.resolveWithLowerBounds) mutates the method type in place; and (c) thedirectSupertypesCacheflip — its dominant consumer isAsSuperVisitor(the cache exists becauseasSuper/allSupertypesrecompute supertypes constantly), which mutates each returned supertype in place (copyPrimaryAnnos,setUpperBound,fixupBoundAnnotationsviavisitDeclared_{Typevar,Wildcard,Intersection}); flipping it (return the frozen masters, skip the per-hitdeepCopySupertypes) flushed 60BugInCFs, all inAsSuperVisitor, and fixing them would relocate the per-hit copy intoAsSuperVisitor— fired once per asSuper walk-step, the same frequency — a provable wash. Reverted. Rule of thumb confirmed: a post-pipeline cache whose hits are mostly read-only pays from a flip; a cache whose hot consumer rewrites the result in place does not — and you can tell which from whether the hot consumer (tree pipeline / inference / asSuper) is in the flush. -
Caching AnnotatedTypeMirror.hashCode() on frozen types (PR #1798 session). The standing idea (the hash can't be cached because ATMs are mutable) is unblocked for frozen types, but instrumentation showed 0.0% of hashCode() calls land on frozen types (every hot hash target is a mutable working copy, since the caches hand out copies). Worthless in the current architecture, and it would only become useful after a boundary flip that itself does not pay.
-
Shallow-location defaulting shortcut (PR #1798 session). Skipping QualifierDefaults's recursive descent for top-level-only locations (FIELD/PARAMETER/RETURN/RECEIVER/RESOURCE_VARIABLE/EXCEPTION_PARAMETER/CONSTRUCTOR_RESULT) is sound and cut scan calls 10.2%, but those saved scans are over cheap shallow types and allocation was flat: negligible. The cost is the deep OTHERWISE/bound traversals over generic types; merging those into a single pass is a high-risk refactor with a ~2% ceiling (defaulting is single-digit-% of CPU). Distinct from the deferred Defaulting Phase 2 (caching the result); this was the cache-free variant.
-
AnnotationMirror → QualifierKind second-level cache in qualifier hierarchies (June 2026). NoElementQualifierHierarchy.getQualifierKind(AnnotationMirror) and the matching method in ElementQualifierHierarchy already use an elementToQualifierKind: IdentityHashMap<TypeElement, QualifierKind> (PR #1670) that resolves the kind in O(1) via a single identity probe on a tiny map (~3–5 entries for Nullness, ~15–20 for Value). Proposed: add a second-level annoToQualifierKindMap: IdentityHashMap<AnnotationMirror, QualifierKind> so that repeat queries for the same AnnotationMirror instance bypass the TypeElement extraction entirely.

The miss path that is being "saved" is already free. For an Attribute.Compound annotation (the common case), getAnnotationType().asElement() reduces to two direct field reads (anno.type.tsym), the same cost as an IdentityHashMap probe after JIT devirtualization. There is nothing to save: the second-level cache adds overhead on misses (an instanceof check plus a put) and is neutral on hits (trades one two-field-read path for one identity lookup), netting to zero or slightly negative.

A/B (deterministic jdk.ThreadAllocationStatistics, single forked javac):

| corpus | master | branch | delta |
|---|---|---|---|
| Nullness checker, 600-method class (NoElementQualifierHierarchy) | 1396.8 MB | 1397.9 MB | +0.08% (noise) |
| Value checker, 300-method @IntRange class (ElementQualifierHierarchy) | 194.7 MB | 195.0 MB | +0.15% (noise) |

Both within the ~0.5% run-to-run band. Rejected on measurement: flat allocation, and for ElementQualifierHierarchy the map grows unboundedly. The Value checker processes many distinct @IntRange(from=X, to=Y) instances whose AnnotationMirror identity is unique per value combination, so annoToQualifierKindMap accumulates one entry per distinct annotation instance seen over the whole build with no bound or clear. General lesson: a second-level cache in front of an already-O(1) tiny-map lookup provides no benefit; the cost being "saved" is on the order of two field reads, below any measurable threshold.
-
Keeping the
DefaultQualifierForUseTypeAnnotatorcache warm across compilation units (June 2026).GenericAnnotatedTypeFactory.setRootclearsdefaultQualifierForUseTypeAnnotator'selementToDefaults(Element → default-for-use qualifiers) per compilation unit, alongside the other per-CU caches. Proposed: stop clearing it, so defaults computed for a library element (e.g.java.util.Map) in one CU are reused in later CUs. The cache-hit win is real but a vanity metric — it does not move allocation or CPU. Instrumented hit/miss counters (runtime-toggled with-Ddqfu.noclear) over a 120-file corpus that references many distinct JDK types: warm cuts misses from 18,396 to 2,331 (−87%), since the per-CU clear was flushing ~16.6k entries that each later CU re-missed. (A simpler corpus that mostly usesObject/generics shows almost nothing — 1,252 → 936 — so this needs a diverse-library-type corpus to exercise at all.) But the deterministic allocation A/B (jdk.ThreadAllocationStatistics) was −0.2%, within noise on both corpora, and on-CPUExecutionSamplecount did not move (warm side nominally higher, inside ±5–10% sampling noise). The reason the misses are nearly free to recompute:getDefaultAnnosForUsesalready canonicalizes the overwhelmingly-common empty result to the sharedAnnotationMirrorSet.emptySet()sentinel, so a miss on a type with no@DefaultQualifierForUse(≈ every JDK type) allocates only a tiny transient set that is immediately discarded — 16k misses ≈ 1 MB against a 2.2 GB total. Rejected: no measurable allocation/CPU benefit, and it trades away a correctness invariant — the cache reads element annotated types (getExplicitAnnos→getElementAnnotations) that stub/ajava loading can refine across compilation units, so a warm entry can go stale (independent of WPI). A checker that makes heavy use of@DefaultQualifierForUse(non-empty default sets) could show a different allocation profile and would be worth re-measuring before revisiting; forNullnessCheckerthe change is all cost, no measurable gain. 
General lesson: a cache-hit-rate improvement is not a performance result. When the miss path is already cheap (here, an empty-set sentinel), eliminating misses changes neither allocation nor wall clock — always confirm a hit-rate gain on the deterministic allocation / on-CPU A/B before crediting it, and exercise element-keyed caches with a realistic, diverse corpus, since trivial synthetic inputs under-fill them. -
AnnotatedTypeMirror.getEffectiveAnnotations caching. JFR-attributed self-time was ~0.05% on the alltests trace and ~0.1% on the Oscar EMR (~4000 file) trace. Not a hotspot.
-
Lazy path stack in
CFGTranslationPhaseOne.scan(June 2026, during PR #1786).scaneagerly doesnew TreePath(path, tree)for every tree it visits to maintaingetCurrentPath(), and most of those paths are never queried (onlyMethodInvocationNode, 1 of 78 node types, retains one; the other 22 of 23getCurrentPath()call sites feedTreePathUtilhelpers that extract a fact and drop the path). The proposed fix: keep aTreestack and materialize theTreePathlazily ongetCurrentPath(), allocating nothing for unqueried trees. Rejected as not worth it: the target is negligible. Pure-counting instrumentation (no behavior change; it also simulates the lazy stack's allocation count) on the all-systems corpus measured eagerAllocs = 11,665 (~373 KB) for the whole 267-file run, of which the lazy stack would save 47.9% — i.e. ~178 KB against a ~6 GB total. The JFR agrees:CFGTranslationPhaseOne.scan(line ~562) is only 0.56% ofTreePathallocation (17 samples), ≈0.01% of total. The "70% ofTreePathallocs in the CFG builder" headline is not this line — it is the body-path search atprocess/ line ~527 (see the applied "CFG-builder body-path lookup" note), which caching fixes for ~6 lines instead of a risky rewrite of dataflow's central traversal. -
Pre-sizing
AnnotationMirrorSet's backing array (June 2026, during PR #1785).AnnotationMirrorSet.<init>was the 2nd-largestObject[]source in thechecknullnessworker (18.77%, 6,901 samples, behind only the visitor maps). But it is not oversized: the set is array-backed byshadowList = new ArrayList<>(2), already a 2-elementObject[2]. With compressed oops (the realistic-heap case) and 8-byte object alignment,Object[1](20 B → padded 24 B) andObject[2](24 B) cost the same 24 bytes, so shrinking to 1 saves nothing and would force a resize on every 2-element set (common: multi-hierarchy qualifier sets, declaration-annotation sets).Object[2]is already at the alignment floor while holding two without resizing; the 18.77% is allocation volume (one tiny array per set, and sets are created constantly), which a capacity argument cannot reduce. Cutting it needs fewer set instances — empty/singleton sentinels or a lazy/specialized backing store — which is an architectural, correctness-sensitive change to a hotSet/DeepCopyablepath, not a size tweak. The empty-set case is already partly handled (theemptySet()sentinel and thegetAnnotations()isEmpty()short-circuit). -
CFAbstractStore.copyMap allocation avoidance. new HashMap<>(emptyMap) and new HashMap<>() produce identical JIT output once the map is written to; the "savings" were illusory.
-
clear() on AnnotatedTypeScanner.visitedNodes in reset(). Tried as an alternative to reallocating. An earlier note (written without fresh measurement data) claimed G1/ZGC makes clear() cheaper. Re-measured June 2026 on allNullnessTests -PmaxParallelForks=1: IdentityHashMap.clear consumed 3.42% of leaf-frame samples vs ≈1.27% net for reallocation. The pre-sizing from PR #1671 is what makes clear() lose: it enlarged the array to 128 slots that clear() must zero via an explicit Java loop, while TLAB allocation for the same array uses JVM bulk zeroing. See the applied section for the current measured numbers.
-
Converting visitedNodes from IdentityHashMap to HashMap. Identity is required for correctness: distinct ATM instances representing the same Java type must be visited separately to break cycles. Identity is also faster.
-
Aggressively clearing SubtypeVisitHistory with IdentityHashMap keys. An earlier attempt OOM'd or hung on some checker test inputs. The version that shipped (PR #1634, just prior to the window above) keeps the original HashMap<IPair<ATM, ATM>> and clears it at the start of each top-level subtype check.
-
Collections.synchronizedMap(new IdentityHashMap<>()) for the qualified-name cache. Considered for thread-safety paranoia; threading audit confirmed AT factories are confined to the javac main thread. Plain IdentityHashMap shipped instead, matching every other LRU cache on the same object.
-
==fast-path inisSupportedQualifierbefore theSetlookup. Raised on the short list (after PR #1673 interned annotation names) on the theory that a reference-equality check would skip the hash computation entirely. Investigation showed the premise is wrong twice over. First,Stringcaches its own hash code, so the hash is not recomputed across the repeated interned-name lookups on this path. Second, the backing set is already interned: it is built fromClass.getCanonicalName(), and for the packaged top-level annotation types that qualifiers always are, the canonical name equals the binary name returned byClass.getName(), which the JVM interns. SogetSupportedTypeQualifierNames().contains(annotationName(a))already matches by reference insideString.equals's identity short-circuit; there is no slowequalsto skip. An isolated JDK micro-benchmark (HotSpot 21) confirmed the currentHashSet.containsof an interned key runs at ~2.7 ns/op, identical to an interned-set variant, while a linear==scan only edges it out below ~5 qualifiers and loses past that. The linear-scan variant also gives up the publicSet<String>return type ofgetSupportedTypeQualifierNamesand adds a correctness dependency on every caller passing an interned string. Not worth it. -
constructorFromUse cache (analog of the methodFromUse/asMemberOf cache). Implemented, validated correct (all suites pass), measured flat-to-slightly-negative despite a 96.4% hit rate. The deep-copy-cache overhead floor (structural key hash + deep-copy on hit + deep-copy of the stored key ≈ 2 type-walks) roughly equals the work a hit saves, because the saved part is just the constructor asMemberOf (getAnnotatedType(ctor) is already element-cached) and constructors are infrequent (~5–10k calls). Lesson: hit rate is necessary but not sufficient; confirm with the wall-clock A/B. Revisit only if immutability removes the deep-copy tax. Full detail in the value-semantics narrative below.
-
Deferring polymorphic-qualifier resolution past the methodFromUse cache (to drop the @Poly* guard). Clean design exists (route the hook by the cached per-element poly check; no per-call cost), but payoff is negligible: an instrumented checknullness found only 0.1% of cacheable method calls (322/250,000) are on poly-declared methods, so the guard already admits 99.9% of calls. Keep the guard. (Would help a checker whose calls are dominated by polymorphic methods; not the realistic target.)
-
declarationFromElement via trees.getTree(localVar). Verified to return the identical tree (8,124/8,124 match) but a no-op: trees.getTree for a local/parameter internally calls the same TreeInfo$DeclScanner; it relocates the scan, it does not avoid it. Lesson: verifying the result matches is not verifying the cost drops; confirm the expensive leaf disappears. (The fix that did work, scanning the enclosing method subtree, is in Applied optimizations.)
-
declarationFromElement via a single-pass declaration map. Build a per-CU element → VariableTree map in one TreeScanner pass, replacing per-element scans with lookups. Correct after a fix, but flat: javac defers attribution of lambda/generic-method bodies, so the variables that dominate the cost have null symbols when the pass runs and are skipped; they later miss the map and fall back to the full scan (DeclScanner stayed 99.7% under declarationFromElement). Pre-population can't win: the expensive variables aren't attributed at any single build point.
-
Shrinking the new heavy caches (directSupertypes at cacheSize/2) to reclaim memory. Measured a ~10% wall-clock regression (it gave back essentially all of the elementType/Phase-1 gain), far worse than its 90.5%→81% hit-rate delta suggested. The new caches' +50–70 MB retained heap is the price of the perf; the right way to cut it is reducing per-entry weight (immutability), not entry count. Keep all caches at full cacheSize.
-
TypeKindas a field onAnnotatedTypeMirror— superseded by PR #1763, do not implement. The goal (avoid the heap hop throughunderlyingType.getKind()) is already met more cheaply: every subclass whose kind is constant (AnnotatedDeclaredType,AnnotatedArrayType,AnnotatedExecutableType,AnnotatedTypeVariable,AnnotatedNullType,AnnotatedWildcardType,AnnotatedIntersectionType,AnnotatedUnionType) overridesgetKind()to return the constant inline (PR #1763) — zero memory and zero indirection, strictly better than the proposed ~8 MB field. OnlyAnnotatedPrimitiveTypeandAnnotatedNoTypefall through to the base method, and for themunderlyingType.getKind()is cheap and does not force symbol completion (see the doc comment on the basegetKind()). -
Dropping
MethodInvocationNode'sTreePathfield. It is the only one of 78Nodetypes that retains aTreePath(captured cheaply fromgetCurrentPath()at CFG-build time), for two framework consumers — WPI'sisRecursiveCall(enclosingMethod) andAliasingTransfer(the invocation's parent). Investigated June 2026 (PR #1788 session) as a memory save and found not worth it: CFGs are per-compilation-unit (subcheckerSharedCFGis cleared onsetRoot,flowResultnulled), so the paths are transient, not retained program-wide. Reconstructing on demand (atypeFactory.getPath(node.getTree())) is feasible — both consumers hold the factory — but must preserve behavior for synthetic invocation nodes (desugarediterator()/hasNext()/next()/close(), whichAliasingTransferalso visits). If ever touched, do it for decoupling (WPI could read the enclosing method fromCFGMethod.getMethod()instead of walking the path), not for memory. -
Equal-store short-circuit in the analysis store merge (explored for PR #1793, June 2026). In
ForwardAnalysisImpl.mergeStores (and the two merge sites in BackwardAnalysisImpl), check newStore.equals(previousStore) before calling leastUpperBound/widenedUpperBound, and reuse the existing store when equal, skipping the LUB, which allocates a fresh store and its five maps. The intent: avoid the throwaway LUB allocation at fixpoint when a merge does not change the store. A/B (deterministic jdk.ThreadAllocationStatistics + wall clock; gen-sized-program.py and a loop-heavy variant, drift-controlled interleave of prebuilt jars): allocation −1.1% to −1.5%, consistent across sizes and on the loop corpus (real but small). Wall clock neutral-to-worse: flat at ≤300 methods, +4–5% on the loop-heavy 600-method corpus (master ~59 s → ~62 s, interleaved). Rejected: the allocation saving is below the wall-clock cost it adds. The reason is structural and worth recording, because "skip the LUB when nothing changed" looks free but is not:

  - CFAbstractStore.equals already has an O(1) size fast-path (compare the five map sizes; the size-only hashCode matches), so merges where the live-variable set changed are already rejected for free.
  - The cost the short-circuit pays is the same-size, different-value case, which is the dominant case during loop fixpoint convergence, where the variable set is stable while abstract values refine. There equals must fall through to supersetOf and walk every entry.
  - On that case the short-circuit does a double walk: the failed equals walk, then the LUB walk it could not skip. Master does one walk. That extra per-merge walk is the wall-clock regression.

Alternatives explored, both dead ends:

  - == instead of .equals (reference identity). Cheaper (a pointer compare, never a walk), but on the same Loop600 interleave allocation came back to −0.2% (master, within noise): the equal stores at merge points are distinct objects (content-equal, not reference-equal), so == fires almost never and the allocation win vanishes. Wall clock stayed at master. Net: nothing.
  - Fold the equality detection into the single LUB walk (upperBoundOrPrevious: track during upperBound's existing entry walk whether the result equals previous, and return previous when so). This removes the double walk, but upperBound still allocates newStore at line 1171 before it can know the result, so it loses the allocation saving and only restores wall-clock parity with master. No net win, added complexity.

The two goals are in tension: saving the allocation requires knowing equality before building newStore (a pre-walk = the short-circuit, with its double-walk tax), while avoiding the double walk requires building newStore first. Determining same-size equality is a full walk, of the same order as the LUB it would skip.

Revisit only with new evidence on a memory-bound workload. The whole prize is ~1.4% allocation with no CPU win; on a heap-generous compile that is invisible. It could convert to a real win only under default heap on a many-CU warm-daemon build where GC pressure dominates (the regime where PR #1791's allocation cuts pay off); measure there before reconsidering, not on single-file allocation totals. See the Short list for the one way to make the same-size equals cheap enough (a maintained content hash) and why it was not pursued.
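The double-walk cost described above can be made concrete with a toy model. This is a sketch under stated assumptions, not the framework's store: `StoreMergeSketch` is hypothetical, a "store" is a single `Map<String, Integer>` rather than five maps, the LUB is modeled as a key-wise maximum, and `entryVisits` is instrumentation added for the illustration. The differing entry is deliberately placed last so the failed equality check walks every entry.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class StoreMergeSketch {
    /** Counts entries touched by equality checks and LUB walks. */
    static int entryVisits = 0;

    /** Builds a store of n variables, all 0 except the last, which is lastValue. */
    static Map<String, Integer> store(int n, int lastValue) {
        Map<String, Integer> m = new LinkedHashMap<>();
        for (int i = 0; i < n; i++) {
            m.put("x" + i, i == n - 1 ? lastValue : 0);
        }
        return m;
    }

    /** Equality with an O(1) size fast-path, then an entry-by-entry walk. */
    static boolean storeEquals(Map<String, Integer> a, Map<String, Integer> b) {
        if (a.size() != b.size()) {
            return false; // rejected for free when the variable set changed
        }
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            entryVisits++;
            if (!e.getValue().equals(b.get(e.getKey()))) {
                return false;
            }
        }
        return true;
    }

    /** Stand-in LUB: allocates a new store and walks every entry once. */
    static Map<String, Integer> lub(Map<String, Integer> a, Map<String, Integer> b) {
        Map<String, Integer> out = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            entryVisits++;
            Integer other = b.get(e.getKey());
            if (other != null) {
                out.put(e.getKey(), Math.max(e.getValue(), other));
            }
        }
        return out;
    }

    /** Baseline merge: always one LUB walk. */
    static Map<String, Integer> mergePlain(
            Map<String, Integer> previous, Map<String, Integer> next) {
        return lub(previous, next);
    }

    /** Short-circuit merge: reuse previous when equal, otherwise fall back to the LUB. */
    static Map<String, Integer> mergeShortCircuit(
            Map<String, Integer> previous, Map<String, Integer> next) {
        if (storeEquals(next, previous)) {
            return previous; // no allocation
        }
        return lub(previous, next);
    }
}
```

For two 4-entry stores that differ only in the last value, the baseline touches 4 entries and the short-circuit touches 8; when the sizes differ, the fast-path makes both touch 4; only when the stores are content-equal does the short-circuit skip the LUB allocation.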
-
declarationFromElement eager whole-class DeclarationScanner batch (PR #1803 session). Before the visitor-path fix shipped (Applied optimizations), the first attempt located the outermost enclosing class and scanned it once with DeclarationScanner to cache every member's tree. It helped a 1500-method single class (11%) but **regressed realistic checknullness: declarationFromElement 8.4% → 14.4% inclusive, warm wall 2m15s → 2m26s (+8%)**. Cause: the eager scan recurses into every method body (once per each nullness subchecker factory), overhead that does not pay off on realistic many-small-class CUs, and javac's symbol-identity instability (queries arrive with view / baseSymbol symbols, not the declaration symbol) means parameters and locals miss the batch cache and fall back to a class-level scan anyway. The visitor-path fix replaced it: it adds zero scans, so it helps large classes and never regresses small ones. Lesson reaffirmed: a giant-single-class A/B can read as a big win while the realistic full build regresses; always A/B the full build.
-
Pointing visitorTreePath at the inference expression during type-argument inference (PR #1803 session). A hypothesis for the varargs getPath quadratic: set the factory's visitorTreePath to pathToExpression around DefaultTypeArgumentInference.inferTypeArgs. No effect: only ~8% of the hot getPath scans were under inference; the real sources were the synthetic varargs array and (for interning) the warning-report path, both fixed separately.
-
typeinference8 incorporation: removing redundant
applyInstantiationswork (two attempts, both measured-neutral, PR #1803 session). Deeply nested generic invocations (id(id(...id(x)))) make Java-8 type-argument inference super-linear in nesting depth (VariableBounds.applyInstantiationsToBoundsis ~14% self-time on thedeep-nestingshape; a depth-80 single method does not finish in 25 minutes). (1) Computingbound.applyInstantiations()once instead of twice (the change-check, then the rebuild) was neutral — the fast-path already breaks on the first changed bound, so the redundancy was ~1 bound, not 2×. (2) Deleting the redundant per-iteration full apply-pass inBoundSet.incorporateToFixedPoint(the loop that re-applies to every variable before the per-variable loop) was neutral — those calls are individually cheap no-ops. The cost is not redundant work: 69% of self-time is inapplyInstantiationsToBounds's own loops over cheapUseOfVariablebounds, i.e. the O(depth³) count of (variable × bound × fixpoint-iteration) tuples the JLS-18 incorporation fixpoint processes. The only direction that could change the complexity is in the Short list. -
MethodInvocationNode.hashCode()'sObjects.hashvarargs allocation (June 2026).hashCodeusesObjects.hash(target, arguments), the varargsObject[]antipattern called out in Applied optimizations. Raised as a possible hot-path allocation becauseNodehash/equals could back the dataflow worklists and stores. Rejected — the structural override is never reached in production. (1) Self-time: 0 of 14,812ExecutionSamples across the realistic traces. (2) Static reachability: everyNode-keyed map in the dataflow analysis —AbstractAnalysis.nodeValues/syncedFrom, the per-input analysis caches,ForwardAnalysisImpl.storesAtReturnStatements,AnalysisResult.nodeValues— is anIdentityHashMap, and theSet<Node>values (treeLookup/CFG-build) areIdentityArraySet; both useSystem.identityHashCode/==and never call the structuralhashCode/equals. The only structuralMap<Node, _>isConstantPropagationStore.contents(LinkedHashMap) — the example constant-propagation analysis, not used by any production checker. So the varargsObject[]never allocates during real checking; there is nothing to optimize. Lesson: confirm a structuralhashCodeis actually invoked (the collection must be a structural, not identity, map) before treating it as an allocation source. -
Meta-annotation read cache (June 2026, PR #1803 session). A proposed cache memoizing
annotation.getAnnotationType().asElement().getAnnotationMirrors()(e.g. an instanceIdentityHashMap<TypeElement, AnnotationMirrorSet>inAnnotatedTypeFactory) to avoid re-reading an annotation type's meta-annotations. Below the noise floor and rejected. On the 99 MBchecknullnessworker trace (11,691 on-CPU samples): allgetAnnotationMirrorsmeta-reads = 18 samples (0.15%);getDeclAnnotationWithMetaAnnotation(contracts/WPI) = 0.14%;getAnnotationWithMetaAnnotation(theCFAbstractStoremonotonic path, already guarded bygetSupportedMonotonicTypeQualifiers().isEmpty()) = 0.03% — the same magnitude as the rejectedgetEffectiveAnnotationsfalse-hotspot (0.05%). The only warm neighbor isinheritOverriddenDeclAnnos(2.59%), but its self-time isoverriddenMethods()/getDeclAnnotations(apiComplete,ElementFilter,Scopewalks), not the meta-read — so if this area is ever revisited, cache that path, not the meta-annotation lookup. -
Copy-on-write / lazy-share for
CFAbstractStoremaps (June 2026, PR #1803 session). A copy-on-write scheme forCFAbstractStore(inframework.flow, distinct from the ATM copy-on-write discussed in the Short list) — share the five backing maps between a store and its copy until a mutation forcesensureUnshared(). Conceptually valid but not worth it for this workload. On theNullnessTestslice (aggregated worker traces):createCopiedStore(the store-copy entry point) = 0.09% CPU; theCFAbstractStorecopy constructor is 2.45% ofHashMap$Nodeallocation andHashMap$Nodeis 1.34% of total allocation, so store-copy is <0.1% of total allocation. (A "copy" frame shows 11.7%, but that catches every.copy()in the framework, not store copies — misleading.) It can't pay off here because (1) stores track few locals/fields, sonew HashMap<>(other)is already cheap, and (2) most copies are immediately mutated — the fixpoint copies precisely to refine the then/else/transfer stores — soensureUnshared()fires anyway and only adds a per-mutation branch. The surface is also large and failure-silent: five maps with ~22 put/remove plus ~10 iterator-remove()sites to guard (plus a leakygetFieldValues()); a single missed guard is silent wrong dataflow everywhere (theAnnotationMirrorSet#addAllclass of bug). Real dataflow allocation lives elsewhere ([BUTF-8 Name decode ~28%,Object[]~13%,AnnotationMirrorSet~2%), none of which COW touches. The only sanctioned rescue in this basin is a maintained incremental content-hash onCFAbstractStore(see the Short list and the rejected equal-store merge short-circuit above), memory-A/B-gated. -
String→Name annotation-name migration: replace AnnotationUtils.annotationName (String) with annotationNameAsName (javac Name) + == identity comparison, framework-wide (June 2026). Branch converting the name-dispatch sites in ValueAnnotatedTypeFactory, ValueQualifierHierarchy, ValueTransfer, the Index UpperBound* checkers, Units, BaseTypeVisitor/BaseTypeValidator (qualAllowedLocations), DependentTypesHelper (annoToElements), and AnnotatedTypeFactory (aliases, declAliases, isSupportedQualifier) from value-based String comparison to Name-identity comparison, backing maps switched to IdentityHashMap<Name,…>. Premise: "avoid the Name.toString() decode + String.intern() that annotationName performs."

Measured flat, in both alloc and wall, even after completing the half-finished form. As proposed the branch slightly regresses, because annotationNameAsName had no CheckerFrameworkAnnotationMirror (CFAM) fast path (it pointer-chases getAnnotationType().asElement().getQualifiedName() where annotationName reads a cached field) and isSupportedQualifier(Name) still called name.toString() on every cache miss. To measure the approach and not the artifacts, the branch was completed: cache a Name field on CFAM next to its existing interned-String annotationName; give annotationNameAsName a CFAM fast path; replace isSupportedQualifier's getSupportedTypeQualifierNames().contains(name.toString()) with a one-time identity-backed companion Set<Name> membership test. A/B of that fully-realized version vs. master (deterministic jdk.ThreadAllocationStatistics, single forked javac):

| corpus | metric | master | realized | delta |
|---|---|---|---|---|
| nullness, inherit shape (~7.4k LoC) | alloc (median of 5) | 4488.0 MB | 4492.3 MB | +0.10% (noise) |
| nullness, repeat shape (~7.3k LoC) | alloc | 785.7 MB | 787.5 MB | +0.23% (noise) |
| Value, 8 test inputs batched | alloc | 312.6 MB | 312.9 MB | +0.10% (noise) |
| nullness, inherit | wall (2nd-best of 4) | 7.98 s | 8.18 s | noise |
| nullness, repeat | wall | 5.10 s | 5.05 s | noise |

Root cause: the targeted allocation does not happen on hot paths. CFAM, the representation the framework manipulates for the overwhelming majority of annotations, already caches its name as an @Interned String computed once in its constructor, so annotationName(am) was already a field read with zero allocation for CFAM. The toString().intern() cost applies only to raw Attribute.Compound source mirrors, and even there getAnnotationType().asElement() is two field reads (anno.type.tsym). You cannot remove garbage that is not being produced.

Ceiling proof. A deliberately annotation-saturated, checker-bound workload (400 methods × 40 explicitly-@Nullable/@NonNull/@MonotonicNonNull locals each), profiled at 2596 ExecutionSamples, shows the name-handling frames (annotationName, annotationNameAsName, areSameByName, isSupportedQualifier, Name.toString, String.intern) entirely below the sample floor (0 samples). The hot leaf there is IdentityHashMap.get (7.8%), attributed (jfr-analyze self) ~40% to ElementQualifierHierarchy.getQualifierKind, which is annotation→QualifierKind resolution, not name handling, and caching that is itself already measured-and-rejected (see "AnnotationMirror → QualifierKind second-level cache" above).

General lesson: annotation-name comparison/dispatch is not a hotspot; CFAM already caches the decoded interned name, so any String→Name re-representation is flat. This is the dispatch-level analogue of the isStringEqual false premise (see "Element and name caching"). The annotation-name allocation that does show up ([B UTF-8 decode, ~28% of dataflow allocation) is Name.toString() deep in dataflow/store machinery, not in qualifier-name dispatch; optimize there, not here. The branch also seeded three anti-patterns worth recording: a static IdentityHashMap<Name,…> (Names are interned per compilation context, so a static identity-map both leaks across compilation tasks and never hits cross-context; keep such caches instance-scoped), unused cached-Name fields hidden behind a class-wide @SuppressWarnings("UnusedVariable"), and Name-identity assumptions spread across the public javacutil surface for no measured gain.
Verdict (June 2026): the high-value frontier is exhausted — there is no known idea
that is both safe and worth a meaningful fraction of a realistic build. The leaf
self-time profile is flat (no CF leaf above ~3%), the architectural getAnnotatedType
redundancy family is closed (all sub-directions measured: rejected, or already shipped),
and Java-8 inference shipped its big win (PR #1829). Everything still open is one of:
sub-1% and audit-heavy (qualifiedNameCache), correctness-blocked (the #602 conditional
cache, the lazy JDK-stub cascade, the CFAbstractStore content hash), low-value or
fragile (the two typeinference8 resolution items), worst-case-only (the cond/inherit
super-linears — pathological depth, shallow in real code), or research-scale
(parallelism, the blocked immutability allocation win). Treat new hypotheses here as
"measure the addressable fraction first"; most prior ones died exactly there.
Index (entries are interleaved below; each is tagged with its status inline):
- Open, low-value: qualifiedNameCache backing map; typeinference8 resolution #3 (getInstantiatedVariables) and #4 (getSmallestDependencySet); the cond post-dataflow conditional cache and inherit asSuper depth (size-sweep); getAnnotatedType #6 (parallelism).
- Open but correctness/cost-blocked: CFAbstractStore content hash; lazy JDK-stub cascade; the immutability allocation win. Copy-on-write was PROTOTYPED and characterized (branch cow-prototype, an un-merged experiment; a PR was opened but will not be merged): it solves the soundness blocker (Guava-validated) with allocation −4.8%, but no configuration is wall-positive (all-caches +5.3%, elementTypeCache-only +1.8%/noise), so it is a memory/GC win only, not a wall win. See the narrative's "Copy-on-write ATMs — PROTOTYPED" entry.
- Closed, do not re-propose: PR #1829 incorporation worklist (shipped); getAnnotatedType #1 (per-analysis gvff memo), #2 (pre-flow split cache; built & rejected: unsound across override-checkers + flat), #3 (already implemented), #4 (applied-defaults), #5 (methodFromUse/constructorFromUse); typeinference8 resolution #1 (dependency graph) and #2 (saveBounds); AbstractAnalysis.getValue subnode test (see Tried and rejected).
Capture format: hot method, hypothesis, blockers.
-
Size-sweep super-linearity audit (June 2026): two unfixed super-linear costs, both low-realistic-priority. Generated eight single-dimension shapes (/tmp/gen-shapes.py: control, cond, chain, inherit, switchc, repeat, tryfin, loops), each R independent constructs of size D, and swept D, reading deterministic allocation (alloc-total.java) and the marginal Δalloc/ΔD (constant marginal = linear; doubling-per-D-doubling = quadratic). The control shape confirmed the harness reads linear as linear (flat ~11.6 MB/unit). Results:
  - cond (nested ternary b?x:(b?x:…)): severe super-linear (quadratic→cubic). Allocation 635 MB → 1.6 GB → 5.9 GB → 28 GB at D = 20/40/80/160. Inclusive profile: getAnnotatedType 89% → TypeFromExpressionVisitor.visitConditionalExpression 88% → addComputedTypeAnnotations 79%; the allocation driver is the Issue #602 conditional non-caching (conditionals are excluded from fromExpressionTreeCache, so a depth-D nested ternary is rebuilt O(D) per query × O(D) queries = O(D²); same flow-dependence wall as Short-list getAnnotatedType #3). A secondary CPU cost is TernaryExpressionNode.equals at 32% inclusive via AbstractAnalysis.getValue's structural contains (see Tried and rejected: the identity fix for that regresses allocation). Possibly-fixable angle (untested): #602 forbids caching conditionals during dataflow, but the post-dataflow visitor pass has a final, stable store where a conditional-type cache could be sound. Sizing = measure conditional-query provenance (dataflow iterations vs. post-analysis visitor). Realistic conditionals are shallow, so the everyday win is small; this is mostly worst-case (generated-code) protection.
  - inherit (depth-D class chain + D up-assignments): quadratic. 320 MB → 4.6 GB at D = 20→160; AsSuperVisitor.visitDeclared_Declared 22% + repeated getDirectSupertypes/AnnotatedDeclaredType.directSupertypes 17%. The asSuper/supertype walk is super-linear in inheritance depth (the directSupertypes cache does not flatten it). Real hierarchies are rarely deeper than ~30, so low practical priority; inherent traversal, no obvious cheap fix.
  - Linear (no action): repeat (D calls to the same method element) is flat (~4.2 MB/unit); methodAsMemberOfCache already makes repeated same-element calls linear, confirming a methodFromUse element-keyed cache buys nothing (cross-refs #5). tryfin (nested try/finally) and loops (nested for) are flat: no finally-duplication or fixpoint-nesting blowup. chain (.self() chain) and switchc (D-case switch) are only mildly super-linear (minor).
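The marginal-growth reading used in the audit above can be sketched as follows. This is a minimal illustration, not the audit's actual harness: the class and method names are hypothetical, and the input figures are the rounded cond numbers quoted above, converted to MB.

```java
import java.util.Arrays;
import java.util.Locale;

/** Sketch only: hypothetical helper, not part of the framework or its tooling. */
final class MarginalGrowth {

    /** Marginal allocation per unit of D between consecutive sweep points. */
    static double[] marginals(double[] d, double[] alloc) {
        double[] m = new double[d.length - 1];
        for (int i = 1; i < d.length; i++) {
            m[i - 1] = (alloc[i] - alloc[i - 1]) / (d[i] - d[i - 1]);
        }
        return m;
    }

    /**
     * Ratio of each marginal to the previous one. When D doubles at every
     * step, a ratio near 1 reads as linear, near 2 as quadratic, near 4 as cubic.
     */
    static double[] ratios(double[] m) {
        double[] r = new double[m.length - 1];
        for (int i = 1; i < m.length; i++) {
            r[i - 1] = m[i] / m[i - 1];
        }
        return r;
    }

    public static void main(String[] args) {
        double[] depth = {20, 40, 80, 160};
        double[] allocMb = {635, 1600, 5900, 28000}; // cond shape, rounded as logged
        double[] m = marginals(depth, allocMb);
        System.out.println("marginal MB per unit D: " + Arrays.toString(m));
        for (double r : ratios(m)) {
            System.out.printf(Locale.ROOT, "marginal ratio per doubling: %.2f%n", r);
        }
    }
}
```

On the rounded cond figures both ratios land between 2 and 4, which matches the "quadratic→cubic" reading; a flat shape such as control would give ratios near 1.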
-
typeinference8 incorporation worklist + constraint gating — SHIPPED in PR #1829 (closed; was deferred from #1805 at ~3%). −8.6%/−11.4%/−19.0% at nesting depth 8/12/20, neutral on realistic builds. Full record: Applied optimizations → PR #1829.
-
typeinference8 resolution phase: recompute and over-save (the non-incorporation costs). On a moderate-depth heavy workload (gen-sized-program.py --shape deep-nesting, depth-8 × 800 methods), inclusive time splits roughly incorporateToFixedPoint ~47% and Resolution.resolve ~34%, but most of resolution's time is re-incorporation (resolution adds an instantiation bound and re-incorporates), so the incorporation worklist above also speeds resolution up. The genuinely separate resolution costs, in priority order:
  1. Uncached dependency graph + transitive closure: MEASURED AND REJECTED (June 2026). BoundSet.getDependencies() rebuilds the whole variable-dependency graph and recomputes its transitive closure (≈O(V³)) from scratch on every Resolution.resolve call, caching nothing; it is ~7.7% of inference time on the deep-nesting workload, which suggested caching with bound-change invalidation. Instrumented it to compare each recompute's graph (an order-independent signature) to the previous one on the same BoundSet. Result: the redundancy and the cost are perfectly anti-correlated. On deep-nesting (where it is 7.7%) 0% of recomputes are redundant: each resolution step changes the graph (a variable instantiates and drops out), so a cache would invalidate every time. On real code (all-systems) 46.6% of recomputes are redundant, but there getDependencies is negligible (it does not appear in the inclusive or self-time profile at all; the inference problems are tiny). So caching saves essentially nothing on any workload: useless where it is expensive, and redundant only where it is already negligible. Do not pursue.
  2. saveBounds() snapshots all variables every resolution round: MEASURED AND REJECTED (June 2026, post-#1805). Resolution.resolveSmallestSet does new BoundSet(...) + saveBounds() (each VariableBounds.save allocates 2 EnumMaps + 6 LinkedHashSet copies, for every variable in the bound set) as rollback insurance, then discards it whenever resolveWithoutCapture succeeds, which is the common case. The standing hypothesis was that this ~3.2% could be cut by saving only the mutated subset (medium risk: must save a superset of what is mutated). Re-traced against the current baseline and the ~3.2% no longer reproduces: saveBounds/VariableBounds.save is 0% self-time on maximally inference-heavy deep-nesting workloads at both depth-8 (0/3,368 samples, the original condition) and depth-20 (1/28,193). The ~3.2% figure predated PR #1805; its allBoundsProper skip + work budget moved the cost entirely into incorporation (incorporateToFixedPoint 47–71%, applyInstantiationsToBounds 23%), and resolveSmallestSet's inclusive time is now fully accounted for by resolveWithoutCapture's re-incorporation, not the snapshot. A lazy/COW save would chase a ~0% cost. (CPU measured via settings=profile; allocation not directly TLAB-traced, but overall GC is ~1.7% on this workload, bounding the snapshot's allocation small.) Do not pursue without new evidence on a workload where the snapshot itself is shown to be a hotspot. Lesson (again): re-trace before acting on a logged percentage; #1805 overtook this one.
  3. getInstantiatedVariables() recomputed every resolution round (O(V²)). The resolve loop rebuilds the resolved-variable set (full scan + fresh LinkedHashSet) each round. Incremental maintenance is possible but complicated by backtracking (restore() un-instantiates variables, so the set can shrink). Low value alone.
  4. getSmallestDependencySet is O(V²–V³) and mutates the shared dependency sets. Each round it does dependencies.get(alpha).removeAll(resolvedSet) for every unresolved variable, re-removing already-removed elements across rounds, and mutates the cached dependency map in place. Fragile; entangled with #1 (fix together).
-
Architectural: redundant
getAnnotatedType(the biggest realistic-build lever). After the leaf-level wins were exhausted (the realisticchecknullnessself-time profile is flat — no leaf above ~3.4%, and the hot leaves are already-cached lookups called frequently:getQualifierKind,getDeclAnnotations,isSupportedQualifier), the remaining cost is architectural.getAnnotatedTypeis ~47% inclusive, and instrumentation (June 2026) showed it is called ~10× per distinct tree (all-systems redundancy 10.8×, one tree recomputed 987×; loop-heavy artificial 8.3×). Today only class/method trees are cached (classAndMethodTreeCache); expressions recomputefromExpression+addComputedTypeAnnotationsevery call. That non-caching is intentional — an expression's type can depend on context (assignment context, capture, the in-progress flow store) — but "can depend" ≠ "always differs", and 10× says most recomputations return the same thing. Directions, best risk/reward first:-
Per-analysis
getValueFromFactorymemo — MEASURED AND REJECTED (June 2026). Hypothesis:CFAbstractTransfer.getValueFromFactory(~19% inclusive) re-queries the same nodes every CFG-fixpoint iteration with a flow-stable factory value, so a per-analysis per-node memo would remove the redundancy. InstrumentedgetAnnotatedType(a flag set around the body ofgetValueFromFactory, plus a per-analysis tree→type map cleared at eachCFAbstractAnalysis.performAnalysis) and measured on real (all-systems 120 and 269 files) and artificial (straight-line, nested-loop) workloads. Two findings killed it: (a) The cacheable redundancy is small. Per-analysis dataflow redundancy is only ~2.0× on real code and 1.2–1.4× on artificial — not the ~10× headline, which conflated cross-analysis and visitor calls (a tree queried in 10 different methods is not per-analysis-cacheable). The gvff path is ~43% of allgetAnnotatedTypecalls and ~43% of those are per-analysis repeats, so the memo's theoretical ceiling is ~18% ofgetAnnotatedTypecalls — before soundness. (b) The factory value is not stable within one analysis, so the memo would be unsound. Of the repeated gvff queries, ~8% return a different type on real code and ~18% in loop-heavy code, becausegetAnnotatedTypeconsults the evolving flow store for subexpressions (the type ofa.bdepends on the refined type ofa, which changes across fixpoint iterations) — exactly the soundness crux flagged here. And the instability is highest where the redundancy is highest (loops). A per-analysis memo would cache stale/wrong types; a per-iteration memo would be sound but the within-iteration redundancy is ~0. Low value and unsound — do not pursue. -
Pre-flow expression-type ("split") cache — BUILT AND REJECTED (June 2026). Hypothesis:
addComputedTypeAnnotations(Tree, ATM) splits at the if (this.useFlow) boundary into a prefix 2a (applyQualifierParameterDefaults → tree/type annotators → defaults.annotate) and a flow suffix 2b (getInferredValueFor + applyInferredAnnotations); since flow lives only in 2b, cache the pre-flow (post-2a) type per tree and recompute only 2b. Per-factory, per-hierarchy measurement shows the pre-flow type is stable, so the cache looked sound:

| category | all-systems (120f) | loops |
|---|---|---|
| value leaf (literal / var / field / param id) | disagree 0, compl 0 / 58,447 repeats | 0 / 0 / 130,894 |
| compound expression | disagree 213 (0.69%), compl 0 / 30,949 | 0 / 0 / 33,604 |
| type-name identifier (class/enum) | excluded (≈0.7% context-dependent) | - |

(Value leaves never disagree or under-annotate; the compound ~0.69% residual is the genuinely context-dependent
NewClass/NewArray/conditional set — the #602 trees of #3 below; type-name identifiers ~0.7%, excludable by element kind.) A working cache was then implemented — a per-factoryIdentityHashMap<Tree, AnnotatedTypeMirror>of frozen pre-flow types populated just before theuseFlowblock, withgetAnnotatedTypeoverridden to returncached.deepCopy()+ re-applied 2b on a value-leaf hit, cleared per CU, toggled by-Dcf.preflowcache. Two independent findings killed it:- Unsound for any checker that overrides
addComputedTypeAnnotations.IndexTestfails (PredecrementTest.java:8, a spuriousarray.access.unsafe.low).UpperBoundAnnotatedTypeFactory(and Lower Bound, Optional, Lock, Signedness) overrideaddComputedTypeAnnotationsto add flow-dependent annotations aftersuper— Index pulls the Value Checker's type and callsaddUpperBoundTypeFromValueType(its comment cites"int i = 1; --i;", exactly the failing test). Intercepting atgetAnnotatedTypeand re-applying only the GATF-level flow step bypasses that subclass tail, so hits drop the index refinement. Caching insideaddComputedTypeAnnotationsso the subclass tail still runs would need a deep "replace annotations onto the passed-in type" and saves only 2a (notfromExpression) — more complexity, smaller win. A nullness-only diagnostic diff (cache on vs off) was clean, which is exactly why nullness-only validation missed it. - Flat-to-mixed even where it is sound (nullness). Deterministic A/B (median-of-5 alloc;
2nd-best-of-4 wall), off vs on: all-systems alloc 2658.7 → 2616.1 MB (−1.6%), wall 12.12 →
12.10 s (~0); loops alloc 1078.2 → 1100.1 MB (+2.0%, worse), wall 7.18 → 6.84 s (−4.7%). The
projected ~4–15% (raw 2a self-time) did not materialize — the per-hit
deepCopy+freeze+ map overhead roughly cancels the saved 2a, and the realistic corpus is noise.
Do not pursue. The companion "phase-scoped full cache" idea would hit the same compose-with-overrides wall. Lessons:
- Stability ≠ cacheable. Per-hierarchy stability proved the cache could be sound in isolation,
but a framework-level
getAnnotatedTypecache must compose with subclassaddComputedTypeAnnotationsoverrides, and even then thedeepCopycost makes it a wash. Validate such a change on the override-checkers (Index/Lock/Optional/Signedness), not just nullness. - Measuring per-tree type stability has two traps (both nearly produced a false "unstable
25–59%, unsound" verdict): (1) never compare via
AnnotatedTypeMirror.toString()— it conflates cross-hierarchy completeness (a literal printing asintvs@Initialized intis the Initialization annotation absent-vs-present, not a within-hierarchy change) with real disagreement; compare per hierarchy as {top-name → annotation}. (2) a checker runs severalAnnotatedTypeFactoryinstances (anullnesscompile drivesNullnessNoInit,Initialization,InitializationFieldAccess,KeyFor) — never key instrumentation or a cache onTreeinstaticstate, or you compare one type system's snapshot against another's. (Note:getTopAnnotations()is not an ordering hazard — it returns a build-onceArrayList-backedAnnotationMirrorSetover aTreeMap<QualifierKind,…>, a stable deterministic order.)
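The first measurement trap above (string comparison versus per-hierarchy comparison) can be illustrated with a small sketch. It assumes a snapshot has already been reduced to a {top-name → annotation} map; the class name, the verdict names, and the hierarchy keys are hypothetical and stand in for whatever the instrumentation actually records.

```java
import java.util.Map;

/** Sketch only: hierarchy keys and annotation strings are illustrative, not framework API. */
final class PerHierarchyCompare {

    enum Verdict { SAME, INCOMPLETE, DISAGREE }

    /**
     * Compares two snapshots of one tree's type, each given as
     * {top-qualifier name -> annotation in that hierarchy}.
     */
    static Verdict compare(Map<String, String> first, Map<String, String> second) {
        for (Map.Entry<String, String> entry : first.entrySet()) {
            String other = second.get(entry.getKey());
            // Only a hierarchy present in BOTH snapshots can genuinely disagree.
            if (other != null && !other.equals(entry.getValue())) {
                return Verdict.DISAGREE;
            }
        }
        // A hierarchy missing on one side is absent-vs-present, not instability.
        return first.keySet().equals(second.keySet()) ? Verdict.SAME : Verdict.INCOMPLETE;
    }

    public static void main(String[] args) {
        Map<String, String> early = Map.of("nullness", "@NonNull");
        Map<String, String> late =
                Map.of("nullness", "@NonNull", "initialization", "@Initialized");
        // A string comparison of the printed types would flag these as different.
        System.out.println(compare(early, late));
    }
}
```

Counting only DISAGREE as instability, and tracking INCOMPLETE separately, keeps a missing hierarchy from inflating the "unstable" percentage.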
-
Split flow-independent structure from flow-dependent annotations — MEASURED, ALREADY IMPLEMENTED (June 2026). The hypothesis was:
fromExpression(~24% inclusive) builds a deterministic-per-tree skeleton, so cache the frozen skeleton and re-apply only the annotation/default layer on a copy. This is already whatAnnotatedTypeFactory.fromExpressiondoes:fromExpressionTreeCachestores the frozenTypeFromTree.fromExpressionskeleton per tree, returnscached.deepCopy()on a hit, and the caller (getAnnotatedType) re-applies onlyaddComputedTypeAnnotationson that copy. Instrumented the build-vs-copy split (counters around the hit-copy, the miss buildTypeFromTree.fromExpression, and the putfrozenDeepCopy; single forkedjavac). The split criterion ("pays off only if build ≫ copy") is satisfied and already harvested: build is 96–98% of measuredfromExpressiontime, copy only 1.5–4%, and the cache already serves 66% (synthetic generic size sweep) to 79% (all-systems ×80) of calls as ~0.4 µsdeepCopys. The residual 98% "build" is the irreducible first construction of each distinct cacheable tree — not cacheable away. The only remaining redundancy is the Issue #602 exclusions (NewClassTree/NewArrayTree/ConditionalExpressionTree): on all-systems, ~20% of build time (an upper bound — build timing is inclusive of nested re-entrantfromExpressioncalls) is spent rebuilding just 153 distinct trees ~16× each (94% repeats). That is blocked by #602: those trees' types are flow-dependent (they change across fixpoint iterations), the same soundness crux that rejected the per-analysisgetValueFromFactorymemo (#1, where 8–18% of flow-dependent repeats return a different type). No safe caching win remains. Do not pursue. -
Cache the applied defaults (not just the
DefaultSet, which is already cached) per (element, structural shape) to skip the per-callDefaultApplierElementImpl.scan(~12% incl). MEASURED AND REJECTED (June 2026) — prototype built and A/B'd; see Per-CU tree-defaults memoization under "Tried and rejected". High hit rate (80–86%) but net neutral-to-worse on every axis, because the per-call defaulting scans are cheap in-place annotation mutations while the cache key/snapshot machinery is not. -
methodFromUse/constructorFromUseper-tree memo (~15% incl), scoped like #1. DONE / NOT OPEN — now MEASURED (June 2026), not just argued by analogy. The flow-independent(methodElt, receiverType)form already shipped as themethodAsMemberOfCache(PR #1777 — see "methodFromUse/asMemberOfcache — APPLIED" below); the additional per-tree form hits direction #1's soundness wall (the result depends on the evolving flow store). Instrumented both protected impls (recordat each return: per-tree redundancy via anIdentityHashMap, and the unsound fraction = repeats whoseexecutableType.toString()+typeArgsdiffers from the prior result for the same tree; plus a simulatedexecutableType.deepCopy()for copy cost; single forkedjavac,inferTypeArgspath only). Findings: redundancy is high and copy is cheap (the tempting part) —methodFromUse~8.0× redundancy on both the size sweep and all-systems, copy 0.5–1.6% of build;constructorFromUse~23× on all-systems, copy ~1.4% — but the unsound fraction is decisive: 43–45% ofmethodFromUserepeats and ~15% ofconstructorFromUserepeats return a different result (receiver/argument types are flow-refined and evolve across fixpoint iterations). That is worse than the per-analysis memo (#1, 8–18%) that was already rejected, so a persistent per-tree memo would serve a stale/wrong type ~45% of the time it hit. The cacheable (flow-independent) part is already themethodAsMemberOfCache; the expensive part is exactly the flow-dependent viewpoint-adaptation/inference, which is the part that varies — same structure as #3's residual. Closed by measurement. Do not re-open as "unshipped." -
Parallelize checking across classes/methods — the only constant-factor-by-core-count lever, but the factory + its caches + javac symbol state are shared mutable state; research-scale. Not pursued this session: the immutability program (delete
deepCopy, ~10% inclusive) is the other big architectural bet and is covered in its own narrative below; the notes say it is already largely harvested.
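The per-tree redundancy figures quoted in this entry (calls per distinct tree, and the most-recomputed tree) come from identity-keyed counting. A minimal sketch of such a counter follows, assuming it is held as an instance field of whichever factory is being instrumented; the class and its methods are hypothetical, and any object can stand in for a tree.

```java
import java.util.IdentityHashMap;
import java.util.Map;

/**
 * Sketch only: a hypothetical per-factory counter. It is deliberately
 * instance-scoped, since static state would mix several factories' queries.
 */
final class RedundancyCounter {
    // Identity keys: two structurally equal trees are still distinct trees.
    private final Map<Object, Integer> callsPerTree = new IdentityHashMap<>();
    private long totalCalls;

    /** Record one query for the given tree. */
    void record(Object tree) {
        callsPerTree.merge(tree, 1, Integer::sum);
        totalCalls++;
    }

    /** Average number of queries per distinct tree (1.0 means no redundancy). */
    double redundancy() {
        return callsPerTree.isEmpty() ? 0.0 : (double) totalCalls / callsPerTree.size();
    }

    /** Query count of the most frequently recomputed tree. */
    int maxCallsForOneTree() {
        int max = 0;
        for (int n : callsPerTree.values()) {
            max = Math.max(max, n);
        }
        return max;
    }

    public static void main(String[] args) {
        RedundancyCounter counter = new RedundancyCounter();
        Object treeA = new Object();
        Object treeB = new Object();
        counter.record(treeA);
        counter.record(treeA);
        counter.record(treeA);
        counter.record(treeB);
        System.out.println(counter.redundancy() + " avg, " + counter.maxCallsForOneTree() + " max");
    }
}
```

A raw ratio like this says nothing about cacheability: as directions #1 and #5 show, the repeated queries also have to return the same type before a memo is sound.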
-
ElementUtils.qualifiedNameCachebacking map. Hot method (getQualifiedNameunderliesannotationName,getBinaryName, theisXtype predicates, etc.). Today it is aCollections.synchronizedMap(new WeakHashMap<>()), and an inline TODO already asks whether anIdentityHashMapwould be better. Two separable costs on the hot path: (1) thesynchronizedMaplock on everyget/put, and (2)WeakHashMap's reference-queue expunging plus aWeakReferenceallocation perput. javacSymbols use identityequals/hashCode, so the key semantics would not change. Open questions, none answerable without measurement: the lock cannot be dropped without auditing every reachable thread, and a static cache is more exposed than the per-factory field the campaign already de-synchronized; the language server and the Gradle daemon run analyses in long-lived JVMs, where weak keys let old compilations'Symbols be collected — a strongIdentityHashMapwould retain them. JFR capture (June 2026,allNullnessTests -PmaxParallelForks=1) confirmed the lock/expunge cost:WeakHashMap.getviaCollections$SynchronizedMap.getappeared at 110/18,969 execution samples (0.58%), with callers split acrossannotationName,isSupportedQualifier,AnnotationFileElementTypes, andnormalizeAndCheck. Still needs the thread-reachability audit and daemon/LSP memory analysis before any change. -
Maintained content hash on CFAbstractStore to make same-size equals O(1). The only way to rescue the rejected equal-store merge short-circuit (see Tried and rejected): the residual cost is the same-size/different-value supersetOf walk, which a running content hash (updated incrementally on every put/remove/clearValue/insertValue and compared before the walk) could reject in O(1). Blockers, none cheap: (1) it replaces the deliberately-cheap size-only hashCode with one that must stay consistent across all store mutation sites (the exact mutable-cache-invariant hazard CLAUDE.md flags for AnnotatedTypeMirror); (2) on the loop-fixpoint case the store changes every iteration, so the hash is recomputed/invalidated each time and the saving may not materialize; (3) the ceiling is the ~1.4% allocation the short-circuit was already worth, with no CPU win. Only worth prototyping if a memory-bound checknullness A/B (default heap, warm daemon) first shows that allocation delta converting to wall clock. Spotted June 2026 (PR #1793 review) while auditing why ==/.equals could not make the store short-circuit pay off.
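For readers unfamiliar with the technique, a maintained content hash can be sketched on a single generic map. This is an illustration under stated assumptions, not a design for CFAbstractStore: HashedStore and its methods are hypothetical, it covers one map rather than five, and it assumes keys and values have stable hashCode values while stored.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Sketch only: a hypothetical single-map store with an incrementally maintained hash. */
final class HashedStore<K, V> {
    private final Map<K, V> entries = new HashMap<>();
    // Wrapping sum of per-entry hashes, so it is independent of insertion order.
    private int contentHash;

    private static int entryHash(Object key, Object value) {
        return 31 * Objects.hashCode(key) + Objects.hashCode(value);
    }

    /** Every mutation site must adjust the hash; a missed site silently breaks equality. */
    void put(K key, V value) {
        if (entries.containsKey(key)) {
            contentHash -= entryHash(key, entries.get(key));
        }
        entries.put(key, value);
        contentHash += entryHash(key, value);
    }

    void remove(K key) {
        if (entries.containsKey(key)) {
            contentHash -= entryHash(key, entries.remove(key));
        }
    }

    int contentHash() {
        return contentHash;
    }

    /** O(1) rejection on size or hash mismatch; the full walk runs only when both match. */
    boolean sameContents(HashedStore<K, V> other) {
        if (this == other) {
            return true;
        }
        if (entries.size() != other.entries.size() || contentHash != other.contentHash) {
            return false;
        }
        return entries.equals(other.entries);
    }
}
```

The sketch also makes blocker (1) concrete: the hash is only as trustworthy as the discipline of updating it in put and remove, and a real store has many more mutation sites than the two shown here.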
A June 2026 inclusive-time / co-occurrence investigation (looking for architectural redundancy rather than leaf hot spots, since the leaf profile is now flat) surfaced a further candidate, blocked by a correctness invariant, which is why the campaign left it:
- The lazy JDK-stub cascade runs the full type-annotation pipeline during
parsing, uncached. Stacks captured on
allNullnessTestsshowmaybeParseEnclosingJdkClass→annotateSupertypes→directSupertypes→addComputedTypeAnnotations→DefaultQualifierForUseTypeAnnotator→getExplicitAnnos→fromElement→maybeParseEnclosingJdkClass… repeating 3–4 times in one stack: computing the defaults/supertypes of one JDK class pulls in another class's stub, whose own defaults/supertypes are then computed, all whilestubTypes.isParsing()disables the factory caches, so the same work is redone during real checking. The staticStubUnitcache above removes the parse half of this; the resolution/defaulting half remains. Blocker: the caching-disabled-during-parsing rule exists because partially-loaded stubs yield incomplete annotations; changing when defaults are computed for JDK supertypes is correctness-sensitive and needs its own design.
Context for future sessions: the leaf self-time profile is flat (no single CF leaf above ~3.6%), so the campaign's per-leaf wins are exhausted. The remaining cost is architectural — the per-node type-computation pipeline. Use inclusive-time and allocation analysis, not leaf self-time, to make progress.
Pick the right workload. allNullnessTests is dominated by test-harness
amplification — it runs hundreds of tiny per-directory compilations in one worker
JVM, so JDK-stub work (parse + resolve) is ~28–32% inclusive there but only ~6% in a
real single compilation. For realistic venues, profile a single forked-javac
compile: :checker:checkNullness (then isolate the worker cknull-<pid>.jfr — the
file whose stacks contain GenericAnnotatedTypeFactory.performFlowAnalysis; the
launcher/daemon/shadowJar files are noise). In that worker: flow analysis ≈ 38%
inclusive, getAnnotatedType ≈ 47%, and — crucially — Object[] is ~61% of all TLAB
events, ~91% of which are IdentityHashMap backing arrays from AnnotatedTypeScanner
(reset 52%, <init> 17%) and AnnotatedTypeCopier.visit (22%). Flow analysis's own
self-time is the type pipeline (scanning, defaulting, copying, map lookups,
TreePath, symbol completion), not dataflow logic — the dataflow framework itself
(CFStore/CFValue) does not appear in self-time.
Where the wall-clock goes (from jfr-analyze.java phase on the worker, June 2026,
post-scanner-reuse). The compile is CPU-bound: ~96% on-CPU Java, GC pauses only
~1.35 s (~4%, sumOfPauses), and real I/O ≈ 0 (the many NativeMethodSamples are
99.5% EPoll.wait on the idle Gradle messaging thread — exclude them). The on-CPU Java
time splits, mutually exclusively by innermost subsystem (so the type computation that
dataflow and the visitor trigger is attributed to the type factory, not to them):
- Annotated-type computation ≈ 54%: getAnnotatedType/fromElement, defaulting, supertypes, ATM copying/scanning, plus its javacutil support (ElementUtils, AnnotationUtils, qualifier-hierarchy lookups, which make up most of the separate "Other CF" 14% bucket). This is the core cost and where the campaign focused.
- javac internals ≈ 32%, but ~77% of that is CF-triggered (forced Symbol.complete/apiComplete, Name/UTF-8 decoding via Convert.utf2chars, TreePath construction, tree walks). Only ~7% of the total is javac's autonomous front-end (parse/enter/attribute). So ~25% of all time is CF reaching into javac.
- Dataflow machinery ≈ 5% (CFG build + fixpoint + transfer/store, excluding the type lookups it calls; note this is the exclusive figure, and flow analysis is ~38% inclusive precisely because it triggers so much type computation).
- Stub/JDK annotation loading ≈ 3% (small in one compile; the ~28% monster only in the test suite). Visitor check logic itself ≈ 1%: almost all cost is producing the annotated types, not checking them.
Two takeaways for picking venues: (1) GC/allocation is not the wall-clock bottleneck on
a single compile (which is why the scanner-reuse and AnnotationMirrorSet allocation
wins did not move single-compile time — their value is GC pressure at scale); CPU is.
(2) The largest non-obvious CPU slice is CF driving javac (symbol completion + name
decoding + tree/path walks), bigger than dataflow + stubs + visitor combined.
Guava cross-check (first hotspot JFR of a large heavy-generics codebase, June 2026).
test-guava-nullness.sh (Nullness Checker on the guava module, 625 files; JFR injected
via the forked-compiler -J args in the checkerframework-local profile; 7,814
ExecutionSamples). Two findings:
- The leaf self-time profile generalizes: nothing new at the leaf. Same flat shape as checkNullness: no CF leaf above 3.85%; the top is IdentityHashMap.get/put (≈6.8% combined), the ATM traversal (AnnotatedTypeScanner.scan/visitDeclared/reduce ≈7%), Symbol.apiComplete, DefaultApplierElementImpl.scan, AnnotatedTypeCopier.visitDeclared, AnnotatedTypeMirror.createType. Allocation: [Ljava.lang.Object; 32%, ArrayList 8.7%, IdentityHashMap 5.5%, AnnotationMirrorSet 4.5%, which is the same ATM-pipeline allocation.
- New fact (not a new lever): type-argument inference is a top-2 inclusive cost on heavy generics, whereas it is negligible on all-systems/checkNullness. getAnnotatedType is 50% inclusive (as usual), but then methodFromUse 24.6% → inferTypeArgs 23.0% → InvocationTypeInference.infer 22.4% → ConstraintSet.reduceOneStep 14.1%. Decomposing the 23% inference slice (cooccur): 50% of it is under getAnnotatedType (building argument/receiver/bound types) and 28% under incorporateToFixedPoint (the machinery PR #1829 already optimized); the inference-specific self-time (TypeConstraint.hashCode, ConstraintSet, constraint-collection churn: LinkedHashMap/ListBuffer/List$2 ≈6% of allocation) is small and diffuse. So Guava reinforces rather than overturns the Short list: the biggest lever is still getAnnotatedType (now shown to dominate inference too, not just checking), inference's big win already shipped (#1829), and the resolution items (#3/#4) stay low-value because their self-time is tiny even here. Takeaway for future sessions: all-systems and checkNullness under-represent inference; profile Guava (or jspecify-conformance) for any inference work, but expect the cost to be the type pipeline it drives, not a clean inference-specific frame.
Open venues, roughly by tractability:
- Reduce ATM deep copying (AnnotatedTypeCopier.visit = 22% of Object[]). Defensive deep copies exist only because ATMs are mutable (every fromElement cache hit, many getAnnotatedType paths). ATF.getElementAnnotations (committed) was a one-caller nibble. The real lever is copy-on-write annotation sets or immutable ATMs; this also unblocks the directSupertypes cache above. Large, architectural, high value. This is now an active staged program; see "AnnotatedTypeMirror value-semantics program" below.
This subsection is the detailed methodology log for the cache campaign and the immutability program. Canonical statuses live in the top-level sections; this is the "how we got there" record. Status map:
- Shipped (see Applied optimizations): the methodAsMemberOf, directSupertypes, and elementType/Phase-1 caches (PR #1777); the smaller-scope declarationFromElement scan (PR #1780); the freeze() mechanism + the AnnotatedTypeCopier vararg-aliasing fix + freezing all eight cache masters (PR #1798), which together form the immutability program's foundation, behavior-neutral and perf-neutral.
- Tried and rejected (see that section): constructorFromUse cache, poly-deferral, declarationFromElement via trees.getTree/single-pass-map, shrinking the heavy caches; and (PR #1798 session) the cache-boundary flips (returning the shared frozen instance instead of a copy: Element boundary, methodFromUse copy-elision), hashCode caching on frozen ATMs, and the shallow-location defaulting shortcut.
- Open (see "Open items" at the end of Short list): the immutability allocation win (delete deepCopy/drop copy-on-return) is blocked (see the load-bearing-copy finding below), pending copy-on-write or eliminating redundant re-annotation (Defaulting Phase 2).
Goal of the immutability program: make ATMs effectively immutable / copy-on-write so
deepCopy/shallowCopy can be deleted and the cache boundaries stop paying the deep-copy tax
(AnnotatedTypeCopier ~2% self-time + the dominant share of Object[] allocation) — and the
+50–70 MB the shipped caches retain goes away.
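One way to make a type effectively immutable is a frozen flag that every mutation sink checks. The sketch below is illustrative only: `FreezableType` is a hypothetical stand-in rather than the real AnnotatedTypeMirror, the method names merely echo the freeze()/checkMutable() mechanism of PR #1798, and a standard exception replaces the framework's BugInCF so that the block is self-contained.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for an annotated type; not the real AnnotatedTypeMirror. */
final class FreezableType {
    private final Set<String> primaryAnnotations = new LinkedHashSet<>();
    private final List<FreezableType> components = new ArrayList<>();
    private boolean frozen;

    private void checkMutable() {
        if (frozen) {
            // Fail fast instead of silently corrupting a shared value.
            throw new IllegalStateException("attempt to mutate a frozen type");
        }
    }

    void addAnnotation(String annotation) {
        checkMutable();
        primaryAnnotations.add(annotation);
    }

    void removeAnnotation(String annotation) {
        checkMutable();
        primaryAnnotations.remove(annotation);
    }

    void clearAnnotations() {
        checkMutable();
        primaryAnnotations.clear();
    }

    /** Simplification: components are added eagerly and guarded like annotations. */
    void addComponent(FreezableType component) {
        checkMutable();
        components.add(component);
    }

    boolean hasAnnotation(String annotation) {
        return primaryAnnotations.contains(annotation);
    }

    boolean isFrozen() {
        return frozen;
    }

    /** Deep freeze. Setting the flag before recursing makes cyclic types terminate. */
    void freeze() {
        if (frozen) {
            return;
        }
        frozen = true;
        for (FreezableType component : components) {
            component.freeze();
        }
    }
}
```

A frozen instance can be shared safely, but any consumer that still mutates its result needs a copy first, which is why freezing alone does not remove the copy cost.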
Status after PR #1798: the foundation shipped; the allocation win is blocked, not merely "next".
PR #1798 makes a frozen type effectively immutable (a frozen bit; checkMutable() on the three
primary-annotation sinks addAnnotation/removeAnnotation/clearAnnotations, with
primaryAnnotations.makeUnmodifiable() as a backstop; a cycle-safe deep freeze() that freezes only
already-initialized components, with the lazy getters freezing components they create later) and
freezes every cache master, so a latent in-place mutation of a cached type now fails fast with
BugInCF instead of silently corrupting a shared value. The caches still hand out a deepCopy() on
every hit, so it is behavior-neutral and (measured, PR #1798) perf-neutral. But four independent
attempts this session showed the cache-return deepCopy is load-bearing — the dominant consumers
mutate the result, so removing the copy needs a deeper change than a boundary flip. The evidence and
the dead ends are below; the immutability program is therefore paused at its foundation, not the
"recommended next direction" it was before this session.
Validation spike (DONE, GO) — and this spike SHIPPED: see "methodFromUse/asMemberOf cache —
APPLIED in PR #1777" below. This paragraph is the precursor, NOT an open candidate. A throwaway
methodFromUse cache (non-generic methods, key (methodElt, structural receiverType, inferTypeArgs),
copy-on-store/return) on :checker:checkNullness: 66.7% hit rate, AnnotatedTypes.asMemberOf
(12.4% inclusive) eliminated on hits, net allocation down even with the copy-tax, ~5% fewer on-CPU
samples; the structural-key hashing and copy-tax stayed below the inclusive threshold (cheap). Payoff
confirmed → productionized as the methodAsMemberOfCache.
Standalone caching needs poly-handling + opt-outs — but NOT the full immutability program. Two experiments settled this; a methodological trap nearly sent us down the wrong road, so the correction is recorded carefully.
The recompute cross-check is INVALID for this computation. A natural validator — on a cache hit,
recompute computeMethodTypeAsMemberOf and assert it structurally equals the cached value — fired
across ~20 checkers (with either identity-based or Types.isSameType-based comparison). It looked
like a deep "value-identity wall." It was an artifact: an idempotency probe (compute the same
(tree, methodElt, receiverType) twice in a row and compare) showed the two results have identical
toString() but compare unequal — because substitution / capture conversion mints fresh
type-variable and captured-type instances on every call (isSameType(CAP#1, CAP#2) == false). So
the recompute cross-check can never succeed on any type-variable- or capture-bearing result,
regardless of whether the cache is actually correct. Do not use a recompute-and-compare cross-check
to validate ATM-producing caches; validate with alltests diagnostics instead. (EqualityAtmComparer
also compares underlying types by identity — line 55, ut1.equals(ut2), javac Type has no value
equals, @SuppressWarnings("TypeEquals") // TODO — which contributes, but isSameType does not fix
it because of the fresh-capture issue above.)
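The idempotency probe can be reproduced in miniature. This sketch assumes a hypothetical `Captured` class with identity equality, standing in for the fresh capture instances described above; it does not use javac types.

```java
final class CaptureProbe {
    /** Hypothetical captured type: equals/hashCode are inherited from Object (identity). */
    static final class Captured {
        private final String bound;

        Captured(String bound) {
            this.bound = bound;
        }

        @Override
        public String toString() {
            return "capture of ? extends " + bound;
        }
    }

    /** Stand-in for a computation that mints a fresh capture on every call. */
    static Captured compute(String bound) {
        return new Captured(bound);
    }

    /** The probe: compute twice in a row and compare with equals. */
    static boolean recomputeMatches(String bound) {
        return compute(bound).equals(compute(bound));
    }

    /** The two results still print identically. */
    static boolean sameRendering(String bound) {
        return compute(bound).toString().equals(compute(bound).toString());
    }
}
```

Here `recomputeMatches` is false for every input even though nothing is cached at all, so a recompute-and-compare check that fails says nothing about whether a cache is correct.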
The real breakage, validated by diagnostics (cache on, cross-check OFF), is bounded — ~9 suites:
NullnessTest (3, polymorphic qualifiers), H1H2CheckerTest/SubtypingEncryptedTest (poly),
ValueTest/ValueIgnoreRangeOverflow/ValueNonNullStringsConcatenation/ValueUncheckedDefaults
(the Value checker — its method results are call-/argument-dependent), IndexTest (MethodVal
reflection), InitializedFieldsValueTest. The ~20-checker "breadth" was the cross-check artifact, not
real. So caching the (methodElt, receiverType)-determined substitution is sound for most checkers;
it is unsound where the method type is genuinely call-dependent (polymorphic-qualifier resolution;
Value; reflection).
Decision — bounded, not the megaproject. The wall-clock-win cache is achievable with:
(1) a polymorphic-qualifier guard — skip caching when the method's declared type contains a
@Poly* qualifier (must check the declared type, not the computeMethodTypeAsMemberOf result:
methodFromUsePreSubstitution — its boolean param is literally resolvePolyQuals — resolves the
poly qualifiers to concrete ones before the result, so scanning the result misses them; cached per
element);
(2) a per-checker opt-out predicate shouldCacheMethodAsMemberOf() (default true) for genuinely
call-dependent checkers — overridden false in ValueAnnotatedTypeFactory (results computed from
argument values) and MethodValAnnotatedTypeFactory (reflection);
(3) validate via alltests diagnostics, NOT a recompute cross-check.
The copy-tax for value stability is cheap (measured). This is bounded work, far less than the
immutability rewrite. The immutability program is therefore decoupled: it remains worthwhile for the
allocation win (deleting deepCopy, ~22% of Object[]) and the clean end-state, but it is NOT a
prerequisite for the wall-clock-win cache.
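The two gates of the decision (poly guard and per-checker opt-out) can be sketched as a single predicate. Everything here is a simplified assumption: `CheckerHooks` and `MethodDecl` are hypothetical types, qualifiers are plain strings, and only the hook name is taken from the text above.

```java
import java.util.Arrays;
import java.util.List;

final class MethodCachePolicy {
    /** Hypothetical per-checker hook, modeled on shouldCacheMethodAsMemberOf(). */
    interface CheckerHooks {
        default boolean shouldCacheMethodAsMemberOf() {
            return true;
        }
    }

    /** Hypothetical method declaration: the qualifiers written on its declared type. */
    static final class MethodDecl {
        final String name;
        final List<String> declaredQualifiers;

        MethodDecl(String name, String... declaredQualifiers) {
            this.name = name;
            this.declaredQualifiers = Arrays.asList(declaredQualifiers);
        }
    }

    /** Inspects the declaration, because the computed result has poly qualifiers resolved away. */
    static boolean declaresPolyQualifier(MethodDecl method) {
        for (String qualifier : method.declaredQualifiers) {
            if (qualifier.startsWith("@Poly")) {
                return true;
            }
        }
        return false;
    }

    /** A call is cacheable only if the checker allows it and the method declares no poly qualifier. */
    static boolean isCacheable(CheckerHooks checker, MethodDecl method) {
        return checker.shouldCacheMethodAsMemberOf() && !declaresPolyQualifier(method);
    }
}
```

A call-dependent checker would override the hook to return false, which bypasses the cache for all of its methods regardless of their qualifiers.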
methodFromUse/asMemberOf cache — APPLIED in PR #1777. The cache as above is implemented in
AnnotatedTypeFactory.methodFromUse (the inner 4-arg overload): cache the
(methodElt, receiverType)-determined computeMethodTypeAsMemberOf result, keyed with a cache-local
isSameType-based structural comparison (IsSameTypeAtmComparer, so structurally-equal receivers
share an entry and distinct captures stay distinct, without touching the global ATM.equals);
deep-copy on store/return; skip declared-@Poly* methods; Value/MethodVal opt out.
Correctness: full :framework:test + :checker:test pass (0 diagnostic failures); the framework
nullness self-check passes. Performance — single-subproject slice (:checker:checkNullness, two
captures): asMemberOf inclusive 12.4% → absent; on-CPU Java samples 3,443 → ~2,690 (−22% of that
one worker); GC pause down too. A cache hit skips all of computeMethodTypeAsMemberOf (including the
getAnnotatedType(methodElt) deep-copy and fake-overrides), which is why the win exceeds asMemberOf
alone. CAVEAT — this −22% is a slice, not the build (see the full-build A/B below): it is the
on-CPU type-factory work of one forked compile, which is a minority of ./gradlew checknullness
wall-clock (10 subprojects + per-fork JVM startup + parse/enter/attribute). Always state the
combined-cache full-build number, not this slice, as the headline.
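The deep-copy-on-store/return discipline with a cache-local structural key can be sketched as follows. This is an assumption-laden toy: `CopyingCache`, `MutableType`, and the string-based `Key` are hypothetical, and string equality stands in for the isSameType-based comparison.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.function.Supplier;

final class CopyingCache {
    /** Hypothetical mutable result; callers are free to mutate what they receive. */
    static final class MutableType {
        final List<String> annotations = new ArrayList<>();

        MutableType deepCopy() {
            MutableType copy = new MutableType();
            copy.annotations.addAll(annotations);
            return copy;
        }
    }

    /** Cache-local key: its equality is defined here, not on the cached type itself. */
    private static final class Key {
        final String method;
        final String receiver;

        Key(String method, String receiver) {
            this.method = method;
            this.receiver = receiver;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (!(o instanceof Key)) {
                return false;
            }
            Key other = (Key) o;
            return method.equals(other.method) && receiver.equals(other.receiver);
        }

        @Override
        public int hashCode() {
            return Objects.hash(method, receiver);
        }
    }

    private final Map<Key, MutableType> masters = new HashMap<>();
    private int hits;
    private int misses;

    MutableType get(String method, String receiver, Supplier<MutableType> compute) {
        Key key = new Key(method, receiver);
        MutableType master = masters.get(key);
        if (master == null) {
            misses++;
            master = compute.get().deepCopy(); // copy on store: the producer keeps its own object
            masters.put(key, master);
        } else {
            hits++;
        }
        return master.deepCopy(); // copy on return: callers never see the master
    }

    int hits() {
        return hits;
    }

    int misses() {
        return misses;
    }
}
```

Because every hit still pays a `deepCopy`, the saving comes from skipping the computation, not from reduced copying.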
Deferring poly resolution past the cache — DESIGNED + PAYOFF MEASURED, NOT WORTH IT (for realistic
workloads). The idea: drop the declared-@Poly* guard so poly methods are cached too, by running
methodFromUsePreSubstitution per-call after the cache instead of inside computeMethodTypeAsMemberOf.
The clean design avoids any per-call cost: keep the cached per-element methodDeclaresPolymorphicQualifier
check and use it to route the hook (run it inside the cached compute for non-poly methods — unchanged;
defer it to a per-call copy only for poly methods), rather than to block caching. So non-poly methods
are untouched and poly methods would gain a cached asMemberOf. But the payoff is negligible: an
instrumented full ./gradlew checknullness found that only 0.1% of cacheable method calls (322 of
250,000) are on poly-declared methods — the guard already lets 99.9% of method calls into the cache
(86% hit rate). Caching the remaining 0.1% (even at their 88% would-be hit rate) is ≈ zero wall-clock,
not worth the soundness risk of reordering poly resolution after asMemberOf/viewpoint-adapt (and after
MustCall's non-owning→top adjustment, which shares the hook). The design is sound and would help a
checker whose calls are dominated by polymorphic methods, but that is not the realistic target. Keep
the guard. (Lesson, again: a "drop the guard / extend the proven cache" idea still needs its payoff
measured — the guard turned out to cost 0.1% of coverage, not the meaningful slice assumed.) Mechanism
detail retained below for whoever revisits it:
The poly guard and the type-variable non-guard are not a fundamental asymmetry; they are
an artifact of where in the methodFromUse pipeline each call-site specialization happens relative to
where the cache stores its value. The cache stores computeMethodTypeAsMemberOf (stops after
asMemberOf). Method type arguments are substituted after the cache — findTypeArguments +
typeVarSubstitutor.substitute run per call on the deepCopy() (inner methodFromUse, ~lines
2735–2747) — so the cached value is still generic in the method's type variables and two calls with the
same (methodElt, receiverType) but different (explicit or inferred) type arguments correctly diverge on
their own copies. That is exactly why the key is (methodElt, receiverType), not (…, typeArgs), and
why type variables need no guard; guarding them would needlessly disable the cache for every generic
method. Polymorphic qualifiers, by contrast, are resolved inside the cached computation — at
methodFromUsePreSubstitution(tree, …, resolvePolyQuals) (~line 2792), which reads the call-site
arguments and bakes concrete qualifiers in — so the stored value is already specialized to one call
site's arguments, which the key does not capture; hence the guard. If poly resolution were moved to a
post-cache, per-call step (the same side of the boundary as type-arg substitution), the cached value
would be poly-generic and the declared-@Poly* guard could be deleted — recovering the
Nullness/H1H2/Subtyping suites that currently bypass the cache. Larger and riskier than the guard
(it relocates methodFromUsePreSubstitution's poly handling onto the copy and must preserve the
arguments→qualifiers resolution semantics), so deferred; the guard is the bounded, sound choice for now.
Note the base methodFromUsePreSubstitution is empty and its only contract is the resolvePolyQuals
parameter, so the declared-@Poly* guard covers exactly the tree-dependent work that bakes into the
cached value; an override doing other tree-dependent work there must use the shouldCacheMethodAsMemberOf()
opt-out instead (which is why Value/MethodVal disable the cache wholesale).
directSupertypes cache — APPLIED in PR #1777. directSupertypes(type) is a
pure function of type's structure and annotations (the only hook, postDirectSuperTypes(type, supertypes), takes no tree/args; it copies the receiver's effective annotations and applies
element-based defaults), so — unlike methodFromUse — it needs no poly guard and no per-checker
opt-out. AnnotatedTypeFactory.getDirectSupertypes(AnnotatedDeclaredType) caches it, keyed on the
type with the same cache-local isSameType structural comparison; deep-copy on store/return (callers
mutate the supertypes' annotations); AnnotatedDeclaredType.directSupertypes() delegates to it.
Correctness: full :framework:test + :checker:test pass (0 failures); framework nullness
self-check passes. Performance (single-subproject slice, :checker:checkNullness):
directSupertypes 13.5% inclusive → absent; primarily an allocation win — TLAB events −13.5%
(Object[] −17.5%) — plus a modest ~5% on-CPU on that slice.
Full-build A/B — the headline numbers (June 2026). The slice figures above (−22%, −26%) badly
overstated the build-level impact because they profiled a single forked compile (~2,600 samples /
~26 s), whereas the unqualified ./gradlew checknullness runs the checker over 10 subprojects
(checker, checker-qual{,-android}, checker-util, dataflow, docs, framework, framework-perf,
framework-test, javacutil), all routed through one persistent Gradle compiler-worker JVM. Profiling
the full task (--no-daemon, JFR on every JVM via JAVA_TOOL_OPTIONS, then analyzing the one large
worker file) gives a complete trace of ~15.5–17k samples / ~155–172 s — 6× the slice. Clean A/B,
both caches applied vs. reverted (processor shadowJar rebuilt each side, identical --no-daemon run):

| metric (full ./gradlew checknullness) | baseline | with caches | delta |
|---|---|---|---|
| on-CPU Java samples (whole worker) | 17,227 | 15,555 | −9.7% |
| Type-factory phase samples | 7,121 | 5,893 | −17.2% |
| wall clock, --no-daemon | 229 s | 209 s | −8.7% |
| wall clock, warm daemon (user-observed) | ~180 s | ~157 s | ~−13% |

So the two caches are worth ~9% (cold) to ~13% (warm-daemon) end-to-end wall clock, and ~17% of the
type-factory phase specifically — a real, solid win, but roughly half what the single-worker slice
implied. TLAB allocation is down correspondingly. Both caches are decoupled from the immutability
program. Methodology lesson (do not repeat): profile ./gradlew checknullness (the full
multi-subproject task), not :checker:checkNullness (one subproject), and report the whole-worker
sample delta + wall clock, never a single phase's inclusive % as if it were the build. The
record-jfr.sh "analyze the largest file, the rest is noise" advice is correct for a single-project
task but silently undercounts here, because the largest file is the only real worker and it contains
all 10 compiles — analyze it, but know it is the whole build, not one subproject.
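The deltas in the A/B table are plain relative changes against the baseline column. A small helper (hypothetical, not part of any profiling script mentioned here) makes the arithmetic reproducible:

```java
final class AbDelta {
    /** Percent change from baseline to treatment; negative means a reduction. */
    static double percentDelta(double baseline, double treatment) {
        return (treatment - baseline) / baseline * 100.0;
    }

    /** Rounds to one decimal place, matching the table's precision. */
    static double roundTo1(double x) {
        return Math.round(x * 10.0) / 10.0;
    }
}
```

For example, `AbDelta.roundTo1(AbDelta.percentDelta(17227, 15555))` evaluates to -9.7, matching the whole-worker sample row.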
Post-cache re-profile — next venues (June 2026, full-build worker, 15,555 on-CPU samples).
With both caches applied, the jfr-analyze.java phase breakdown on the full ./gradlew checknullness
worker: Type factory 37.9% (baseline 41.3% — the two caches removed ~3.4 points of the total, i.e.
−17% within the phase), javac internals 34.7% (now the relative leader), Other CF 13.3%,
Dataflow 6.8%, Stub 4.2%, Visitor 3.0%. The leaf self-time profile is very flat — no CF leaf above
~3% (HashMap.getNode 3.4%, QualifierDefaults$DefaultApplierElementImpl.scan 2.9%,
AnnotatedTypeScanner.scan 2.8%) — so the remaining work is squarely architectural / aggregate,
not single-leaf. Re-prioritized venues:
-
CF driving javac internals: RE-MEASURED on current HEAD (Phase 1 + all caches committed), and the picture changed substantially. Fresh full ./gradlew checknullness trace: 12,182 on-CPU samples (down from ~14.7k, so the committed work landed), javac internals now the #1 phase at 37.2% (type factory 34%). The breakdown is not what the pre-cache bullet below assumed:
  - AnnotatedTypeFactory.declarationFromElement = 13.1% (1,597 samples): the single largest CF→javac cost, but no cheap fix found yet. It caches (elementToTreeCache, cleared per CU), but the miss path's default branch (local variables, parameters, resource/exception params, type params) calls TreeInfo.declarationFor(sym, root), which scans the whole compilation-unit tree to find one declaration (~95% of its self-time is JCIdent.accept/DeclScanner.scan/TreeScanner.scan; hit rate only ~9%; the scan branch is ~19.5% of misses but ~all the cost).
  - Tried trees.getTree(elt) for the 4 variable kinds: VERIFIED CORRECT, but a NO-OP, reverted. The hope: trees.getTree is the position-based path that class/method/field already use, and it does not show up in the DeclScanner cost. Correctness checked exhaustively (instrumented: 8,124/8,124 match vs the scan, 0 differ, 0 missed; :framework/:javacutil/:dataflow/:checker tests all pass), but the warm-daemon A/B was flat (total on-CPU 12,182→12,140, declarationFromElement 1,597→1,517) because trees.getTree for a local/parameter internally calls the same TreeInfo$DeclScanner: it relocates the scan, it doesn't avoid it (post-change, DeclScanner is still 100% under declarationFromElement). Lesson (again): verifying the result matches is not verifying the cost drops. For a "use a different API" change, confirm the expensive leaf disappears, not just that the output is identical.
  - Single-pass declaration map: IMPLEMENTED, MEASURED, REJECTED (defeated by deferred attribution). Built a per-CU IdentityHashMap<Element,Tree> via one TreeScanner pass over root (recording each VariableTree's .sym → tree, keyed exactly as TreeInfo.declarationFor matches), lazily on first variable query, invalidated in setRoot, with a scan fallback and a null-.sym skip. Correctness: after fixing a crash (the eager pass hit VariableTrees whose symbol is not yet set; TreeUtils.elementFromDeclaration throws on a null symbol, so it was switched to a direct null-safe .sym read), full :framework:test + :checker:test pass. But the warm-daemon A/B was flat (treatment 130–142s vs baseline 131–133s). The traced run shows why: the single pass was cheap (getRootVariableDeclarations 11 samples) but DeclScanner was still 1,327 samples, 99.7% under declarationFromElement via the fallback, i.e. the map was nearly empty. Root cause: javac defers attribution of lambda/generic-method bodies, so the variables that dominate the cost have null symbols when the single pass runs and are skipped; they get a symbol later, miss the map, and fall back to the full scan. And for the variables the map does miss, it is redundant with the existing per-element elementToTreeCache (one scan then cached either way). So pre-population can't win here: the expensive variables aren't attributed at any single build point. Reverted.
  - declarationFromElement smaller-scope scan: APPLIED in PR #1780. Instead of TreeInfo.declarationFor(sym, root) (scan the whole CU), scan only the variable's enclosing method/class subtree: TreeInfo.declarationFor(sym, trees.getTree(elt.getEnclosingElement())), falling back to the whole-CU scan if the enclosing tree is unavailable or does not contain the declaration. The key difference from the failed trees.getTree(localVar): here trees.getTree is called on the enclosing method (cheap, position-based, the path the class/method/field case already uses), not on the local (which internally scans). It attacks per-scan size, so it sidesteps the attribution-timing problem that killed the single-pass map. Also short-circuits TYPE_PARAMETER to null (it was scanning the whole CU only to return null, ~8% of default-branch calls). Correctness: full :framework:test + :checker:test pass (the fallback covers any edge case where the enclosing subtree lacks the declaration, e.g. some initializer-block locals). Performance (same-session traced A/B on full checknullness): declarationFromElement 1,695 → 1,139 (−33%), DeclScanner 1,652 → 1,099 (−33%), total on-CPU 12,284 → 11,658 (−5.1%); warm-daemon wall clock ~2–4% (noisy). The ~33% (not ~90%) reduction reflects that enclosing methods are a non-trivial fraction of their files plus some fallbacks; scoping tighter than the method (no element exists for a block) would need a different mechanism. This is the real lever on declarationFromElement that trees.getTree (relocates the scan) and the single-pass map (defeated by deferred attribution) both missed.
  - Symbol completion is now small (~2.4%): largely solved. apiComplete 1.46% + ClassSymbol.complete 0.48% + ClassFinder.complete 0.48%. The earlier "1.50%/0.97% leaders" are gone (PR #1763 getKind overrides + Phase 1 cutting createType traffic). The getKind→completion hypothesis is resolved: CFAbstractValue.canBeMissingAnnotations does sit above apiComplete (31% of apiComplete, ~0.46% of total) and createType the rest, which is real but minor.
  - Name decoding ~2.3% (Convert.utf2chars 1.43% + utf2string 0.86%). The "annotation formatting in the hot path" question is resolved, opposite to the prior guess: 36% of utf2chars is under DefaultAnnotationFormatter.isInvisibleQualified (22%) + AnnotationUtils.toStringSimple (14%), i.e. ATM.toString/CFAbstractValue.toString invoked during type-checking, not the stub parser. Update (PR #1796): the name-comparison share of this slice is addressed (interned-Name identity helpers + sameName); what remains is the formatting/stringification share: the unguarded toString was found (see the eager-error-formatting bullet below), plus ProperType.computeHashCode (hashes toString()) and SourceChecker.shouldSkipUses (Symbol.toString() per call for a regex match). Update (PR #1797, June 2026): the stringification share is now addressed. FoundRequired formatting is lazy, shouldSkipUses is cached, ProperType/Variable hash without toString(), LocalVariableNode.hashCode/equals read Name directly, and the annotation name maps use IdentityHashMap<Name>. Measured full-build (./gradlew checknullness) warm-daemon A/B: branch ~2m10s, master ~2m16–18s (~7s, consistently one-sided but near the 5–10s noise floor, as expected for a ~0.9% utf2* share). JFR: utf2chars 0.57% + utf2string 0.32% = 0.89% combined (down from the ~2.3% pre-#1796 level); remaining callers are cold stub-parsing paths, diagnostic-only formatting, and first-visit cache misses.

(The earlier pre-cache attribution of this phase was removed as superseded: symbol completion is now ~2.4% and largely solved, the getKind()→completion question is resolved per the bullet above, and the name-comparison forcers (isConstructor/isEnumSuperCall/findElement) were addressed by PR #1796.)
-
Defaulting: fuse all defaults into one type-tree traversal. APPLIED in PR #1836 (June 2026).
QualifierDefaults.applyDefaultsElement scanned the whole type tree once per default (~9.3 scans/call, ~14% inclusive on Guava), which was the maintainers' TODO ("only one iteration through the defaults should be necessary"). Now it does one traversal that applies every default at each node: DefaultApplierElementImpl.scan loops over a precedence-ordered default list (built by fusedDefaultsFor) calling the extracted per-node applyOneAtNode, and the type-variable/wildcard visitors descend once for all defaults. addMissingAnnotation only adds when the hierarchy is unannotated, so applying the defaults in order preserves precedence; it is behavior-preserving by construction. The two subtle cases are faithful: use-only defaults (TYPE_VARIABLE_USE, top-level-local LOCAL_VARIABLE) are applied at the use node and are no-ops at bound nodes, so always descending is safe; and a parametric qualifier is excluded from a type variable and its bounds via a sticky inTypeVarBound flag (replicating visitTypeVariable's early return). Subsumes the earlier top-level-only-skip. Behavior-preserving: byte-identical diagnostics on all-systems (269 files); the full :framework:test + :checker:test (56 suites, incl. SubtypingEncryptedTest which exercises parametric qualifiers, plus Value/Nullness/Index/Units/Interning/Regex/Lock/…) all pass. Perf, generics-heavy (Guava nullness, apples-to-apples JFR): applyDefaultsElement 13.80% → ~9.7% inclusive (−30%), QualifierDefaults.annotate 15.57% → ~11.7% (−25%); ~noise on all-systems where defaulting is not hot (the cost scales with field/param/return types being deep generics, which is exactly Guava). Floor reached: the residual applyDefaultsElement is now dominated by the per-node applyOneAtNode (each default is still checked at each node; the fusion only collapsed the traversal, ~9.3× → 1×, not the per-node application); cutting further needs per-node default filtering (skip defaults whose location can't match a node kind), with diminishing returns.
The fused list itself is also memoized; see the next entry (an early per-DefaultSet content cache looked useless/unsafe, but the correct structure was later built and measured).
-
Defaulting: memoize the fused default list. APPLIED (June 2026, branch cpu-experiments).
fusedDefaultsFor rebuilt the precedence-ordered list on every applyDefaultsElement call (one of the hottest paths: ~once per defaulted type). Instrumentation settled the shape: in a generics-heavy compile 99.8% of calls pass an empty scope DefaultSet (63,376 / 63,485), and there are only 2 distinct DefaultSet contents by value; the 6,086 "distinct" sets are per-element empty objects (defaultsAt allocates a fresh empty set per element to short-circuit its cache; they are content-equal). So the result is highly memoizable. Two-part cache: (1) an isEmpty() fast-path returns one of two shared constant lists (the code defaults, ±unchecked), which covers the empty case with no map and no hashing; (2) non-empty scopes go through an identity-keyed IdentityHashMap<DefaultSet, List<Default>> (×2 for conservative). Identity, not content, keying: a DefaultSet is mutated in place by addElementDefault, so a content/hashCode key would corrupt the map; defaultsAt returns a stable per-scope object shared across a scope's members, so identity hits well. All caches are cleared by invalidateFusedDefaults() from the three (and only) default-set mutators (addCheckedCodeDefault, addUncheckedCodeDefault, addElementDefault); a fusedDefaultsCached flag (set in fusedDefaultsFor) makes that a no-op while defaults are still being registered, before any cache is populated. The returned lists are shared read-only (the scanner only reads them). Why the earlier reject was wrong: the first attempt keyed on DefaultSet identity for all calls, which was useless because the 6,086 empty objects gave 6,086 keys; and a content key was dismissed as needing a per-call hash. The fix is the split (constants for empty, identity map for non-empty), which neither earlier framing found. Measured (no-memo baseline vs memo, n=3, deterministic alloc): unmarked Generic300 −0.84%, many-fields −0.51%; wall neutral (within noise).
It is an allocation-only win (~1%): the list-build was never the CPU cost (the per-node scan is), so it does not move wall time. Kept because it is the morally correct code (don't rebuild a constant 63k×), it is safe, and, crucially, it scales with JSpecify (next paragraph). Correctness: :framework:test (incl. ValueUncheckedDefaultsTest), NullnessTest, NullnessNullMarkedTest all green.
-
Why the memo matters under JSpecify — MEASURED (June 2026). @NullMarked/@NullUnmarked are aliased to @DefaultQualifier (NullnessNoInitAnnotatedTypeFactory, jspecifyNullMarkedAlias), so they populate the scope DefaultSet via defaultsAt's enclosing-element inheritance — every element under a @NullMarked scope gets a non-empty scope default. So as JSpecify spreads, the empty-fast-path's coverage shrinks while the identity cache's grows. Measured with one class-level @DefaultQualifier (= the @NullMarked alias target): empty fraction 99.8% → 73.3%, distinct content 2 → 3 (still a handful — every marked scope yields identical content). A fully-marked codebase trends empty → the library/JDK floor (unmarked external elements stay empty), so the non-empty identity cache carries the savings. The marked workload shows the larger allocation win (−1.03% vs −0.84%), still wall-neutral. Net: today the non-empty case is 0.2% of calls (not worth caching alone); under JSpecify it becomes the majority, and the distinct-content count stays tiny — exactly the regime where a result memo pays off. The split design is future-proof to marking density without betting on either case dominating. -
Defaulting: drop the vestigial scanner type parameter — APPLIED (June 2026, branch cpu-experiments). Cleanup the fusion exposed: DefaultApplierElementImpl extends AnnotatedTypeScanner<Void, AnnotationMirror>, but post-fusion the scanner reads the fused list from outer.fusedDefaults, never threading the per-default annotation as the scanner's P. Every scan/visit* override only carried an unusedQual and passed null down. The only method that passed a non-null P, the legacy applyDefault(Default) single-default method, had zero callers and was in fact broken (it set location and called visit, but scan reads fusedDefaults, which only applyDefaults(List) sets → NPE). Removed it and changed the scanner to AnnotatedTypeScanner<Void, Void> (three override signatures AnnotationMirror unusedQual → Void unusedQual). No subclasses of QualifierDefaults and no external references to DefaultApplierElement/applyDefault exist, so the change is local. Behavior-preserving by construction (the threaded values were always null); applyOneAtNode's real AnnotationMirror qual (from the fused list) is untouched. -
The defaulting walk is the largest CF-controlled leaf cluster — FEASIBILITY MEASURED (June 2026), verdict: highly memoizable, worth building. QualifierDefaults.DefaultApplierElementImpl.scan plus AnnotatedTypeScanner.visitDeclared/scan/reduce are the biggest type-factory leaf group. Note QualifierDefaults.elementDefaults already caches the per-element DefaultSet; the profiled cost is the application — applyDefaultsElement scans the whole type tree once per Default. Instrumented applyDefaultsElement on :framework:checkNullness (one fork, ≥3.0M calls, ~28M scans), keying each call on (identityHashCode(scope), structural ATM.hashCode of the input type BEFORE mutation) — a 64-bit composite, so hash-collision inflation is negligible at ~300k distinct keys:
  - scans per call ≈ 9.32 — each call triggers ~9 full type-tree scans (one per default in the set + checked/unchecked-code defaults). High multiplier: a single cache hit elides all ~9 at once.
  - repeat rate (same (scope, input-type-structure) already seen): tree-path 88.0% (1.41M calls, 168k distinct), element-path 91.6% (1.59M calls, 133k distinct). So defaulting is overwhelmingly redundant recompute, not use-site-unique — the core feasibility question is answered yes.
  - Cost model favors a cache. Per call a (scope, structural-type) cache costs ~1 ATM.hashCode walk (≈1 scan-equivalent) for the key + a deep-copy on a hit; amortized at ~90% hit that is ~2.9 scan-equivalents/call vs. ~9.3 today — roughly a 3× cut in defaulting work (defaulting is a single-digit-% slice of self-time, so expect a few % end-to-end; confirm with a full-build A/B).
  - Refinement — split by path, because the two want different keys. The element path (91.6%, annotate(Element, type) from getAnnotatedType) has an input type that is a pure function of the element, so it can be keyed on the element identity (cheap), no structural hash needed. NB (corrected): this redundancy is NOT elementCache eviction churn — see the elementCache measurement below: elementCache already hits ~92%, but it stores the type before defaults (fromElement's contract), and defaulting (annotate(Element, type)) runs after fromElement on every getAnnotatedType call regardless of the cache hit. So the element path needs its own post-defaults memoization (a new cache keyed on element identity), which enlarging elementCache would not provide. The tree path (88.0%, annotate(Tree, type)) has use-site-specific types and genuinely needs the structural (scope, type) key; it pays the uncached-ATM.hashCode key cost (the immutability-plan risk #2 — measure that the hash walk does not eat the win), but 9.3 scans/call × 88% repeat says it still pays. Both paths are real; the tree path is the novel part.
  - Soundness + validation. Defaulting only adds missing annotations and is deterministic given (scope, input-type-structure), so the structural repeats produce identical outputs; cache with copy-on-store/return, same recipe as the asMemberOf/directSupertypes caches. Validate via alltests diagnostics, never a recompute cross-check (the non-idempotency trap above).
  - Honest bounds: numbers are from one subproject (ratios should generalize, but the absolute % needs the full checknullness A/B); and the ~9.3 multiplier assumes each applyDefault ≈ a full scan — if some short-circuit, both the savings and the key/copy cost shrink together, so the favorable ratio is robust but the magnitude is not yet pinned.
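The cost-model bullet above can be checked with a few lines of arithmetic. The sketch below is one plausible reading of that model, not the instrumentation itself: the class and method names are made up for illustration, and it assumes the key hash and the on-hit deep copy each cost about one scan-equivalent while a miss pays the full ~9.32 scans. Under those assumptions a 90% hit rate gives roughly 2.8 scan-equivalents per call, close to the ~2.9 quoted; the exact accounting behind the quoted figure is not spelled out in this log.

```java
/** Back-of-envelope cost model for a (scope, structural-type) defaulting cache. */
final class DefaultingCacheCostModel {

    /**
     * Expected cost per call in scan-equivalents.
     * Assumption (illustrative, not measured): every call pays keyCost for the
     * structural hash, a hit additionally pays copyCost for the deep copy, and
     * a miss additionally pays scansPerMiss for the real defaulting walk.
     */
    static double cachedCostPerCall(
            double hitRate, double scansPerMiss, double keyCost, double copyCost) {
        return keyCost + hitRate * copyCost + (1.0 - hitRate) * scansPerMiss;
    }

    public static void main(String[] args) {
        double uncached = 9.32; // measured scans per call, from the entry above
        // Measured repeat rates: tree path 88.0%, element path 91.6%.
        for (double hit : new double[] {0.880, 0.916}) {
            double cached = cachedCostPerCall(hit, uncached, 1.0, 1.0);
            System.out.printf(
                    "hit=%.3f cached=%.2f scans/call, reduction=%.2fx%n",
                    hit, cached, uncached / cached);
        }
    }
}
```

The model deliberately ignores the copy-on-store cost on a miss, so it is a lower bound on the cached cost; it is meant only to show why the ratio stays favorable across the measured hit rates.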
- elementType cache (Phase 1) — APPLIED in PR #1777. Implemented the value-returning element-keyed cache: a new AnnotatedTypeFactory.elementTypeCache (LRU(getCacheSize()), deep-copy on store/return) memoizes the fully-computed getAnnotatedType(Element) result (post fromElement + addComputedTypeAnnotations, i.e. after type annotators + qualifier defaulting). A hit returns a deep copy and skips the whole pipeline. Cheap element-identity key (no ATM.hashCode); no poly guard needed (declaration defaulting does not resolve @Poly from arguments — like directSupertypes); overridable shouldCacheElementType() opt-out (default true) for checkers whose addComputedTypeAnnotations(Element, …) is not a pure function of the element. Not cleared between CUs (element-keyed, stable — same as elementCache). Correctness: full :framework:test + :javacutil:test + :dataflow:test + :checker:test pass (0 diagnostic failures) — no bundled checker needed the opt-out. Performance — ≈10% wall clock (worth keeping). Mechanism (single --no-daemon back-to-back, full ./gradlew checknullness): element-path defaulting roughly halved — DefaultApplierElementImpl.scan 361→247 (−32%), DefaultApplierElement.shouldBeAnnotated 135→63 (−53%) — and the type-factory phase dropped 15.2% (5,594→4,742 samples). Wall clock, the metric that matters (warm-daemon, 3–4 reps/side, median): baseline PR 1777 2m34s → isolated Phase 1 2m19s, ≈ −15 s / −10%, and the Phase-1 reps were tightly clustered (2m19s ×3–4) vs the baseline's 152–157 s spread. This is a real, consistent win — keep Phase 1. Two measurement traps this corrected (see "Measuring wall-clock effects" in the SKILL): (1) my first read called it "≈2%/noise" — that was a single --no-daemon run; cold per-fork JVM startup dilutes the type-checking gain and a single run is noise-dominated. The warm-daemon multi-rep wall-clock is the reliable measure (≈10%). (2) An intermediate A/B that mixed Phase 1 with a directSupertypes-cap experiment showed zero wall-clock change — because the two effects cancelled (see the next bullet). Never A/B two changes at once. Phase 2 (tree path, structural (scope, type) key + write-back) deferred per plan — re-profile after Phase 1 to see whether tree-path defaulting is still worth its write-back tax. -
constructorFromUse cache (analog of methodFromUse) — IMPLEMENTED, MEASURED, REJECTED. A tempting target: constructorFromUse is ~12% inclusive (even on the fully-cached branch), and a spike showed a 96.4% hit rate on (ctor, instantiated-type) with only 1.7% anonymous (skipped) and ~176 distinct keys (so ~free on memory). Implemented the full cache (same recipe as methodFromUse: structural key, deep-copy on store/return, poly guard, shouldCacheConstructorFromUse opt-out defaulting to the method opt-out, anonymous-class carve-out, plus a type.deepCopy() on the stored key because the instantiated type can alias the returned constructor's in-place-mutated return type). Correctness: full :framework:test / :javacutil:test / :dataflow:test / :checker:test pass (0 failures). But the warm-daemon wall-clock A/B (cache on vs off via the opt-out, Phase 1 constant) showed NO benefit — 2m21s vs 2m19s, i.e. flat-to-slightly-negative. Why the 96% hit rate didn't translate (the lesson): the deep-copy-cache overhead floor — a structural key hash (type.hashCode(), an uncached ATM walk) on every call + a deep-copy on hit + a deep-copy of the stored key ≈ 2 type-walks — roughly equals the work a hit saves, because the saved part is just the constructor asMemberOf (getAnnotatedType(ctor) is already Phase-1-cached) and constructors are infrequent (~5–10k calls), so the fixed overhead never amortizes. Contrast methodFromUse/directSupertypes, which save more than the tax per hit and fire far more often. Takeaways: (1) hit rate is necessary but not sufficient — always confirm with the wall-clock A/B; (2) a cache only wins when (per-hit saving − deep-copy tax) × frequency is positive, which immutability (removing the deep-copy tax) would change — so this could be worth revisiting after immutability, but not before. Reverted. -
Do NOT shrink the heavy caches to save memory — MEASURED, the cap is worth ≈10% wall clock. PR #1777's two LRU caches add ≈ +50–70 MB retained live heap on a full checknullness (measured master vs branch via post-GC jdk.GCHeapSummary "After GC": median 207→259 MB, p90 358→426 MB; both caches fill to their 2000 cap; directSupertypes stores List<AnnotatedDeclaredType>, the heaviest per entry). That footprint is JDK-independent but caused memory pressure on Java 8 CI specifically (root cause: on Java 8 the check* tasks ran in-process in the shared Gradle daemon heap, not forked — fixed separately by forking them, PR #1778). The tempting fix — halve the cache size — was tried and rejected: directSupertypes at cacheSize/2 (1000) is a ≈10% wall-clock regression (the mixed Phase-1 + directSupertypes@half A/B landed at 2m34s, i.e. the shrink gave back all of Phase 1's ≈15 s gain), far worse than its 90.5%→81% hit-rate delta suggested. Per-factory hit-rate vs cap (measured): directSupertypes 256/512/1024/2000 → 54.6/65.3/81.2/90.5%; asMemberOf → 49.3/60.0/69.8/80.2%. Conclusion: keep all caches at full cacheSize. The right way to cut their memory without losing speed is reducing per-entry weight — the immutability program (shared frozen values, no deepCopy) — not reducing entry count. (The elementCache rejected-unbounding note below is the dual: don't grow element caches either.) -
elementCache unbounding / enlarging — MEASURED, REJECTED (not worth it). Question raised: since elementCache is element-keyed, should it cache all elements (drop the LRU(2000))? The "no limit for element keys" reasoning does not transfer: unlike the Boolean/AnnotationMirrorSet-valued element caches (methodDeclaresPolyCache, cacheDeclAnnos), elementCache's value is a deep-copied full AnnotatedTypeMirror (heavy), and it is a shared base-class cache facing arbitrary downstream projects, so unbounding risks OOM on large builds. Instrumented the fromElement get/put on :framework:checkNullness (shadow LRUs at 2000 / 32000 / unbounded, aggregated across the nullness checker's factories): real LRU(2000) already hits 91.9%; LRU(32000) and unbounded both hit 92.9% — only +1.0 pp — and they are equal because the largest single factory holds just 19,398 distinct elements, well under 32000, so unbounded buys nothing over a modest bump. Verdict: not worth changing — a +1 pp hit-rate gain on a cache that already hits ~92%, against an OOM risk on large downstream projects (whose distinct-element count can exceed 32000, where unbounded would diverge). Key correction this produced: the element-path defaulting redundancy (above) is not elementCache eviction churn — elementCache stores the pre-defaults type and defaulting re-runs after fromElement on every getAnnotatedType, so enlarging elementCache would not reduce defaulting cost; the defaulting venue needs its own post-defaults cache. -
HashMap.getNode (3.38% self) is flat and distributed — no single fix. Nearest-CF split: getDeclAnnotations 27% (already cached in cacheDeclAnnos; the cost is the lookup itself, not a recompute), isSupportedQualifier 10.5% (a Set<String>.contains on the supported-qualifier names), fromElement/getDeclAnnotation/declarationFromElement the remainder. These are unavoidable map lookups on already-cached data; the only lever is reducing call frequency (fewer getDeclAnnotations/isSupportedQualifier calls per node), not the per-lookup cost. Low value as a direct target; better addressed indirectly if the defaulting/CF-into-javac venues reduce node visits. -
Annotation formatting in the hot path — APPLIED in PR #1797 (June 2026). Stack samples on the full checknullness build settled the "confirm before chasing" question: 148 samples (~1.1%) contained AnnotatedTypeMirror.toString, and the callers were not diagnostics-only. The two paths: (1) BaseTypeVisitor.checkContainsSameToString — a static SimpleAnnotatedTypeScanner whose lambda calls type.toString() and type.toString(true) on every component of every type — invoked via containsSameToString from FoundRequired.of and shouldPrintVerbose; (2) reportCommonAssignmentError/reportMethodInvocabilityError, which built FoundRequired (i.e. formatted both full types) before checker.reportError, so the formatting cost was paid even when the warning was subsequently suppressed. Fix: FoundRequired.found/required changed from String to lazy Object wrappers; shouldPrintVerbose result memoized. See the Applied optimizations entry above for the measured A/B. -
CF driving javac internals — the biggest realistic CPU lever (~25% of total). The wall-clock breakdown above attributes ~25% of all time to CF reaching into javac: forced Symbol.complete/apiComplete (from getKind/createType/CFAbstractValue.canBeMissingAnnotations/getErased/ElementUtils.isTypeElement), Name/UTF-8 decoding (Convert.utf2chars/utf2string, Utf8NameTable.equals — every time CF compares or stringifies a Name that isn't yet decoded/interned), and repeated TreePath construction/tree walks. PR #1763 (getKind() overrides), PR #1673 (interned-name caching), PR #1796 (interned-Name identity comparison — removed the name-comparison share), and PR #1797 (lazy FoundRequired formatting, shouldSkipUses cache, ProperType/Variable/LocalVariableNode hash fixes, IdentityHashMap<Name> annotation maps — removed the stringification share; combined utf2* now 0.89% on the full checknullness build) each chipped at one facet. This is bigger than dataflow + stubs + visitor combined and is the highest-leverage remaining CPU target for realistic compiles; it is incremental, not architectural — audit the remaining forcers/decoders that already have (or could cache) the needed info. Confirmed real, not the assert-guarded validateSet path (:checker:checkNullness's forked javac runs without -ea).
- Redundant type computation across the flow fixpoint (the 38%). Flow analysis recomputes node types across iterations; the self-time is the type pipeline. Memoizing flow-insensitive node types within a run could help but is hard because of flow-sensitivity. Architectural.
Investigated and rejected this session:
- Changing TreeUtils.annotationsFrom* to return AnnotationMirrorSet (so the addAnnotations callers hit the index-based overload). Rejected: it is a public-API return-type break on TreeUtils (used by downstream checkers) rippling through ~15 internal callers that declare the result as List, it shifts List (ordered, duplicates) to Set (dedups by areSame) semantics, and only ~2 of ~18 callers pass the result straight to addAnnotations — and both are cold per-tree construction paths. Net: large break + semantic risk to remove two cold iterators.
- A new AnnotationMirrorSet.singleton(anno) factory for the addMissingAnnotations(Collections.singleton(x)) sites. Rejected in favor of the existing singular addMissingAnnotation(x) (committed): the singular method allocates nothing, whereas an AnnotationMirrorSet singleton allocates an ArrayList-backed set — heavier than the JDK's immutable singleton. Rule for future: a single annotation → the singular add/addMissing/replaceAnnotation method, never a one-element collection.
PR #1798 — the immutability foundation, and why the allocation win is blocked (June 2026). This is
the session that built the freeze() mechanism and tried to cash it in for the deep-copy-removal
allocation win. The foundation shipped; the allocation win did not, and the dead ends are precise and
worth not re-walking.
What shipped (PR #1798), behavior- and perf-neutral. A frozen bit on AnnotatedTypeMirror;
checkMutable() throwing BugInCF on the three primary-annotation sinks
(addAnnotation/removeAnnotation/clearAnnotations — every other annotation mutator routes through
them), with primaryAnnotations.makeUnmodifiable() as a backstop for the getAnnotationsField() and
AnnotatedDeclaredTypeNoHierarchy.addAnnotation paths; a cycle-safe deep freeze() (the frozen bit
is the visited marker) that freezes only already-initialized components, with the lazy getters
freezing components they create after the owner is frozen; and freezing the master stored at all eight
caches (elementCache, elementTypeCache, classAndMethodTreeCache, from{Member,Expression,Type}TreeCache,
methodAsMemberOfCache, directSupertypesCache). The caches still deepCopy() on every hit, so this
is behavior-neutral. Structural setters are deliberately left unguarded — the corruption vector is
annotation mutation, and deep freeze() already freezes every reachable component's annotations;
guarding the structural setters would need a raw-setter split of BoundsInitializer (the bound setters
are called on the frozen owner during lazy init) for no safety gain. A/B (deterministic
jdk.ThreadAllocationStatistics, median of 3, + wall + on-CPU): allocation −0.09% on a 300-method
generic file and +0.07% on a 400-vararg-method file (both within the ~0.15% band); freeze() does
not appear in 1,725 on-CPU samples on a 1500-method compile (<0.06%); wall within noise. The frozen
boolean adds no per-object allocation (it fits existing object padding — total allocation did not move).
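As a rough illustration of the mechanism just described, the sketch below models a frozen bit, a checkMutable() guard on the annotation sink, and a cycle-safe deep freeze() that reuses the frozen bit as its visited marker. It is a toy stand-in, not the PR #1798 code: FreezableType and its members are hypothetical names, IllegalStateException stands in for BugInCF, and lazy component creation is not modeled.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for an annotated type node; every name here is illustrative. */
final class FreezableType {
    private final List<String> annotations = new ArrayList<>();
    private final List<FreezableType> components = new ArrayList<>();
    private boolean frozen;

    /** Structural setter: deliberately left unguarded, as in the entry above. */
    void addComponent(FreezableType component) {
        components.add(component);
    }

    /** Annotation sink: guarded, so a frozen node rejects the mutation. */
    void addAnnotation(String annotation) {
        checkMutable();
        annotations.add(annotation);
    }

    boolean isFrozen() {
        return frozen;
    }

    private void checkMutable() {
        if (frozen) {
            // The real code throws BugInCF; a JDK exception keeps the sketch self-contained.
            throw new IllegalStateException("Attempted to mutate a frozen type");
        }
    }

    /**
     * Deep freeze. The frozen bit is set before recursing, so it doubles as the
     * visited marker and the walk terminates on cyclic type graphs.
     */
    void freeze() {
        if (frozen) {
            return;
        }
        frozen = true;
        for (FreezableType component : components) {
            component.freeze();
        }
    }
}
```

Setting the bit before descending is what makes a separate visited set unnecessary, which is consistent with the measurement that the flag adds no per-object allocation.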
The flush traced to ONE copier bug, not pervasive aliasing. Freezing the masters initially flushed
MethodValInferenceTest + ~12 NullnessTest cases as BugInCF ("Attempted to mutate a frozen
AnnotatedTypeMirror"), which looked like the construction pipeline embedding cached substructure
everywhere. It was a single bug: AnnotatedTypeCopier.visitExecutable did
copy.setVarargType(original.getVarargType()) — aliasing the original's vararg AnnotatedArrayType
into the "copy" instead of copying it, so deepCopy() of an executable type was not fully independent
and shared its whole vararg subtree (Object[]/Class<?>[]/LinkOption[] and everything reachable —
which is why every flushed underlying type was an array or an array-subtree node). Defaulting then
mutated that shared subtree. Fixed with copy.setVarargType((AnnotatedArrayType) visit(original.getVarargType(), originalToCopy)) — the originalToCopy map returns the already-made parameter copy when the vararg is
the last parameter (the common case, so the fix is allocation-neutral), else a fresh copy. With that one
fix, freezing all eight masters is green on the full suite. Lesson: when freezing flushes a cluster of
mutations, look for a shared copier/construction bug before assuming pervasive aliasing.
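The shape of the copier bug and its fix can be shown on a reduced model. This is a minimal sketch with hypothetical classes (ArrType, ExecType, ExecCopier), not the real AnnotatedTypeCopier; it assumes only what the paragraph states, namely that the copier threads an identity-keyed originalToCopy map and that the vararg type is usually the same object as the last parameter.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for an array type carrying annotations. */
final class ArrType {
    final List<String> annos = new ArrayList<>();
}

/** Hypothetical stand-in for an executable type. */
final class ExecType {
    final List<ArrType> params = new ArrayList<>();
    ArrType varargType; // commonly the same object as the last parameter
}

final class ExecCopier {
    /** Copies a node once; later visits of the same original return the same copy. */
    static ArrType visit(ArrType original, Map<ArrType, ArrType> originalToCopy) {
        ArrType existing = originalToCopy.get(original);
        if (existing != null) {
            return existing;
        }
        ArrType copy = new ArrType();
        copy.annos.addAll(original.annos);
        originalToCopy.put(original, copy);
        return copy;
    }

    static ExecType deepCopy(ExecType original) {
        Map<ArrType, ArrType> originalToCopy = new IdentityHashMap<>();
        ExecType copy = new ExecType();
        for (ArrType param : original.params) {
            copy.params.add(visit(param, originalToCopy));
        }
        // Buggy form: copy.varargType = original.varargType;  // aliases the original
        // Fixed form: route the vararg through the same visit, so the copy is independent.
        if (original.varargType != null) {
            copy.varargType = visit(original.varargType, originalToCopy);
        }
        return copy;
    }
}
```

Because the map already holds the last parameter's copy, the fixed form allocates nothing extra in the common case, which matches the allocation-neutral observation above.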
The aliasing was benign to results — the only symptom is the freeze crash. The vararg type is consumed
read-only (PropagationTreeAnnotator, BaseTypeVisitor); its only post-copy mutator is qualifier
defaulting, which is idempotent (addMissingAnnotation), so the shared subtree always got the same
annotations and no wrong diagnostic ever resulted — which is why it was latent and master's suite was
green. Confirmed three ways on one program (JDK vararg calls Arrays.asList/String.format/Class.getMethod):
clean on master, BugInCF on freeze-without-fix, clean on freeze+fix. Consequence: the regression test
(checker/tests/nullness/VarargCacheAliasing.java, PR #1798) only demonstrates the bug with the freeze
enforcement present; a standalone fix would need a unit test asserting deepCopy independence. This is
also why PR #1798 keeps the fix and the freeze work in one change.
The load-bearing-copy finding — four attempts, all confirming the cache-return deepCopy cannot just be
dropped. The whole point of freezing masters was to then return the shared frozen instance and delete the
copy. It does not work, because the dominant consumers mutate what they get back:
- Element-boundary flip (getAnnotatedType(Element) returns the frozen master): flushed DefaultInferredTypesApplier (flow refinement, 60), constructorFromUse (type = getAnnotatedType(elt); type.clearAnnotations(), 25), CommitmentTypeAnnotator, DefaultQualifierPolymorphism, ValueTreeAnnotator, ... The results feed the always-mutating tree pipeline (visitIdentifier/visitMemberSelect/asMemberOf → addComputedTypeAnnotations), so fixing each site means a deepCopy() before the mutation — which moves the copy to the consumer rather than removing it. The flip saves a copy only for read-only direct consumers, the minority.
- methodFromUse copy-elision (skip the on-hit deepCopy since typeVarSubstitutor.substitute copies again for generic methods): type-argument inference mutates the method type in place — findTypeArguments → DefaultTypeArgumentInference.inferTypeArgs → typeinference8.Resolution.resolveWithLowerBounds calls replaceAnnotations on a component of preType. So the pre-inference copy is load-bearing.
- hashCode caching on frozen ATMs (the perf-notes' standing "can't cache because mutable" item): instrumentation shows 0.0% of hashCode() calls land on frozen types (0 of 185k / 370k on the size sweep) — every hot hash target is a mutable working copy, because the caches return copies. Worthless in the current architecture, and doubly blocked (it would need the boundary flip, which itself does not pay).
- Shallow-location defaulting shortcut (skip the recursive descent for the top-level-only locations FIELD/PARAMETER/RETURN/RECEIVER/RESOURCE_VARIABLE/EXCEPTION_PARAMETER/CONSTRUCTOR_RESULT — but NOT LOCAL_VARIABLE, which has a type-variable-use special case): cut scan calls only 10.2% (586k→527k), and those saved scans are over cheap shallow types (Object parameters); allocation flat. The expensive defaulting is the deep OTHERWISE + bound-location traversals over generic types, which the shortcut does not touch. (Separately measured: addMissingAnnotation is 74% no-op, and applyDefaultsElement does N full scans, one per Default — so a full single-pass merge could pay, but it is a high-risk refactor of the recursive bound logic with a ~2% ceiling.)
Where this leaves the program. A boundary flip is achievable — but it relocates the copy to each mutating consumer rather than removing it, so the realized win is only the read-only-consumer fraction. Two findings refine the earlier "blocked" verdict:
- The cross-cutting blocker was one latent bug, now fixed (PR #1798): side-effecting equality. ValueAnnotatedTypeFactory's arePrimaryAnnosEqual override normalized its operands by mutating them (replaceAnnotation) before comparing. That fired during every cache flip (it runs in subtyping/equality, which all cache results flow through). Made non-mutating (compute the canonical annotations, compare without mutating). It is the prerequisite for any flip and a correctness fix on its own.
- classAndMethodTreeCache flip shipped (PR #1798) — green, but modest. With the equality fix plus copy-on-frozen at the ~6 mutating consumers it flushed (getMethodReturnType, getSelfType, the getAnnotatedType(Tree) pipeline choke-point, constructorFromUse's enclosing type, and ValueVisitor.checkOverride), the flip is green on the full suite. But deterministic A/B is ~−1% on a method-heavy file and ~0% on realistic code — classAndMethodTreeCache is low-volume. Shipped for GC-relief + to establish the copy-on-frozen consumer-fix pattern.
- The high-volume elementTypeCache is mutation-dominated, so likely also modest. Its flip flushed 108 events; its dominant consumer is asMemberOf (every method call via methodFromUse), which mutates the result (poly resolution, substitution, postAsMemberOf). Flipping it needs asMemberOf to copy-on-frozen on its alias-return paths, which moves the copy back — limiting the win to read-only element-type queries. Not pursued: large fix set, likely-modest win.
So the per-cache lesson: the flip is mechanically unblockable (copy-on-frozen at the enumerated mutating consumers; the freeze tripwire makes it incrementally safe), but the high-volume caches' hot consumers mutate, so the realized win is small. The larger allocation win still needs copy-on-write (mutator returns a fresh node sharing unchanged children — though the whole-tree re-annotators like defaulting/flow get no benefit) or eliminating redundant re-annotation (Defaulting Phase 2). Prototype + JFR-A/B before more flips. The higher-leverage perf target remains CF→javac internals (see the open venues).
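The side-effecting-equality fix listed above follows a general pattern: compute canonical forms into fresh values and compare those, leaving both operands untouched. The sketch below illustrates that pattern only. CanonicalEquality, its alias rule, and the string-based annotation model are invented for the example and do not reflect how ValueAnnotatedTypeFactory actually normalizes annotations.

```java
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only: equality up to normalization, without mutating the operands. */
final class CanonicalEquality {

    /**
     * Hypothetical normalization rule: an "alias:X" annotation canonicalizes to "X".
     * Returns a NEW set and never writes to its argument.
     */
    static Set<String> canonical(Set<String> annos) {
        Set<String> out = new TreeSet<>();
        for (String anno : annos) {
            out.add(anno.startsWith("alias:") ? anno.substring("alias:".length()) : anno);
        }
        return out;
    }

    /** Compares canonical forms; safe to call on frozen or shared operands. */
    static boolean arePrimaryAnnosEqual(Set<String> left, Set<String> right) {
        return canonical(left).equals(canonical(right));
    }
}
```

In a usage such as `arePrimaryAnnosEqual(Set.of("alias:A"), Set.of("A"))` the operands are immutable sets, so any hidden write would fail immediately; that is the same tripwire role the freeze plays for cached masters.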
A fresh full-checknullness trace (11,009 on-CPU samples; javac internals 35.8% / type factory 34.6%)
has a flat leaf profile (no leaf > ~3%) — the per-leaf hot spots are mined out. Remaining
CF-controllable clusters and their state, highest-leverage first:
- Immutability program — foundation + first flip shipped (PR #1798); remaining win small per cache. Shipped: the freeze() mechanism, the AnnotatedTypeCopier vararg-aliasing fix, freezing all eight cache masters, the non-mutating-equality fix, and the first boundary flip (classAndMethodTreeCache returns the shared frozen value, with copy-on-frozen at its mutating consumers). The flip is green but ~−1%/~0% (low-volume cache). The flip technique is mechanically unblockable (copy-on-frozen at the enumerated mutating consumers), but the high-volume caches' dominant consumers mutate (elementTypeCache → asMemberOf), so their realized win is also likely modest. The larger win needs copy-on-write or eliminating redundant re-annotation, not more boundary flips — see the narrative ("Where this leaves the program") and Tried and rejected. Re-open with a copy-on-write prototype, measured.
- Defaulting Phase 2 (tree-path memoization). Measured 88% (scope, type) repeat on the tree path, ~9.3 scans/call; per-CU clearing bounds the memory. Gate on a within-CU-repeat measurement first, and note it carries the same write-back tax that sank the constructorFromUse cache (real flat-risk). PR #1798 also measured the cheaper cache-free variant (a "shallow-location" shortcut) and found it negligible — see Tried and rejected; the deep OTHERWISE/bound traversals are where the cost is, and merging those is the risky part.
- getPath / TreePath construction (~3.2%) — largely addressed by PR #1786 + #1788. 68% of TreePath.<init> was under AnnotatedTypeFactory.getPath's slow path (uncached TreePath.getPath(root, tree) scans on cache miss + heuristic failure). PR #1786 caches the per-body lookup; PR #1788 makes TreePathCacher lazy and routes getPath through it, removing most of that allocation. Residual — RESOLVED by PR #1789. A single class with very many methods still allocated super-linearly after #1786/#1788 (1500 methods 4.9 GB → 3000 11.8 GB → 6000 32.1 GB, ~2.4–2.7× per doubling). An alloc-by-nearest-CF-frame capture (via a gen-sized-program.py size sweep) traced it to getPath searches that rescanned the whole class per lookup; PR #1789 starts those searches from the tightest known path, making it linear (6000 methods 32.1 GB → 14.8 GB). See "Linear getPath searches" in Applied optimizations.
- declarationFromElement residual (~5–7%). Still the largest single javac-interaction cost after the smaller-scope scan; residual is method-subtree scanning. The cheap levers are exhausted (scoping tighter than a method has no element; trees.getTree and the single-pass map were rejected).
- Small / blocked: ElementUtils.qualifiedNameCache (synchronizedMap + WeakHashMap lock/expunge, ~0.58%, blocked on a thread-reachability + daemon-memory audit — see Short list above). Annotation formatting in the hot path is now resolved (PR #1797 lazy FoundRequired); remaining utf2* at 0.89% is cold-path / first-visit-miss only.
Copy-on-write ATMs — PROTOTYPED (June 2026, un-merged experiment): solves the soundness blocker,
allocation win real, but wall-clock-negative. Recorded in PR #1835; the COW code lives on branch
cow-prototype (a PR was opened but will not be merged — kept as a reference implementation). The
blocker (above) was that returning a shared frozen cache master crashes
when a consumer reparents a frozen child into a fresh non-frozen result and mutates it (root-level
deepCopy() guards can't catch a non-frozen root holding a frozen child; Guava found what alltests + 9 fixes missed). A working COW prototype was built (branch cow-prototype, gated by -Dcf.cow): the six post-pipeline caches (elementType, element, fromMember, fromExpression, fromType, methodAsMemberOf) return cowCopy() — a non-frozen shallowCopy() that shares the master's frozen children — instead of deepCopy(); the ~13 child accessors (getUpperBound/getTypeArguments/…) plus the three fixupBoundAnnotations (which mutate bound fields directly, bypassing the accessors — the second class of reparenting path) lazily unshare a frozen child of a non-frozen parent (cowChild/cowChildren), so a mutation copies only the spine it touches and a read-only hit copies one node. A per-node cowDirty flag (set by cowCopy, checked in the accessors) keeps the COW scan off the hot path for the majority of (non-cache) types.
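The copy-on-access idea in the paragraph above can be reduced to a few lines. The following is a simplified sketch under stated assumptions, not the cow-prototype code: CowType, getChild, and the string annotations are hypothetical, a single accessor stands in for the ~13 real ones, and the field-level fixupBoundAnnotations path is not modeled.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of copy-on-write over frozen cache masters; all names are illustrative. */
final class CowType {
    final List<String> annos = new ArrayList<>();
    private final List<CowType> children = new ArrayList<>();
    private boolean frozen;
    private boolean cowDirty; // true only on nodes produced by cowCopy()

    void addChild(CowType child) {
        children.add(child);
    }

    void addAnnotation(String anno) {
        if (frozen) {
            throw new IllegalStateException("Attempted to mutate a frozen type");
        }
        annos.add(anno);
    }

    boolean isFrozen() {
        return frozen;
    }

    /** Cycle-safe deep freeze: the frozen bit is the visited marker. */
    void freeze() {
        if (frozen) {
            return;
        }
        frozen = true;
        for (CowType child : children) {
            child.freeze();
        }
    }

    /** Non-frozen shallow copy that still shares the master's frozen children. */
    CowType cowCopy() {
        CowType copy = new CowType();
        copy.annos.addAll(annos);
        copy.children.addAll(children);
        copy.cowDirty = true;
        return copy;
    }

    /** Accessor: lazily unshares a frozen child of a non-frozen COW parent. */
    CowType getChild(int index) {
        CowType child = children.get(index);
        if (cowDirty && !frozen && child.frozen) {
            child = child.cowCopy();
            children.set(index, child);
        }
        return child;
    }
}
```

Note that the unsharing happens on access rather than on mutation, so a consumer that walks the whole result still copies every node it visits, one at a time; that is the per-access cost the measurements in this entry attribute the wall-clock regression to.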
- Soundness: complete and validated. Passes the regression test
ElementTypeCacheWildcardBound, all-systems (269 files, byte-identical diagnostics to COW-off), and — the decisive test — a full Guava nullness build (BUILD SUCCESS, 0 crashes), the venue that caught the original flip crash. COW-on-access is the complete fix the narrative predicted: it makes all six caches flippable and the whole reparenting bug class disappears. Two non-obvious lessons: (a) the executable type'sgetTypeVariablesand the union/intersectiongetAlternatives/getBoundsare easy accessors to miss; (b) thefixupBoundAnnotationsfield-level mutation is a second reparenting surface that COW-on-access alone (intercepting only the public getters) does not cover — it must be routed throughcowChild/cowChildrentoo. - Allocation: −4.8% (real). Deterministic
ThreadAllocationStatistics, all-systems 269, median of 5: 5709.6 → 5434.3 MB; loops −3.6%. Larger than the ~1% the oldelementType+classAndMethodboundary flips got, because COW flips all six caches andshallowCopyis far cheaper thandeepCopyon read-only hits. - Wall clock: +5.3% (regression). all-systems 269, min of 3: 19.81 → 20.86 s. The per-access COW
machinery costs more CPU than the allocation reduction returns. The
cowDirty flag fixed the one hot frame (cowChildren 3.74% → 0.50% self-time), but the residual is diffuse: the distributed per-access checks plus per-child cowCopy allocations across the walk (note that shallowCopy for type-variables/wildcards is itself a deepCopy, so a generics-heavy type deep-copies each bound on access). This confirms the post-mortem: by now the copier is cheap, so eliminating it does not pay in wall clock. COW is an allocation / GC-pressure win (valuable at scale, on tight heaps), not a single-compile wall win.
- Wall optimization attempted: COW cannot be made wall-positive (it is a memory win, not a wall
win). Two levers tried (branch
cow-prototype, second commit): (1) gate every child accessor with cowActive() (= COW && cowDirty) so non-cache types skip the cowChild call and the field write (this removed the per-access write but did not move wall, so the write was not the cost); (2) flip only elementTypeCache (the one cache with read-only-majority consumers, 65–88%) and revert the full-walk caches (element/fromMember/fromExpression/fromType/methodAsMemberOf) to deepCopy. Results (all-systems 269): all-six COW = alloc −4.8% / wall +5.3%; elementTypeCache-only = alloc −0.2% / wall +1.8% (noise). No configuration is wall-positive, for two structural reasons: (a) the copier is already cheap (~1–2%, per the post-mortem above), so eliminating it cannot beat the global per-accessor cowActive() tax COW imposes on every type, not just cache results; and (b) the cache consumers (defaulting/annotators) fully walk the result, so the read-only-skip benefit never materializes and piecemeal per-child cowCopy is slower than one batched deepCopy.
- Downstream / GC-bound win? TESTED, no. Hypothesis: the −4.8% allocation has no wall value on a
roomy single compile (GC ≈4%) but should pay off on a memory-pressured build. Heap sweep (all-systems
269, all-six COW): the wall penalty shrinks as the heap tightens (-Xmx512m +8.1%, 320m +0.5%,
256m +0.1%), which looks like the GC story, but a clean GC measurement at -Xmx256m shows COW
does not reduce GC: 0.77 s vs 0.76 s pauses, 225 collections both. So the gap-closing is
tight-heap measurement variance (the baseline slows), not a COW GC saving — the −4.8% allocation
does not convert to fewer collections. Root cause: CF compilation is CPU-bound (~96% on-CPU, GC
≤4% even on the large
checknullness build), so the GC ceiling is ~4% and COW captures ~none of it (−4.8% alloc → ≈−0.2% wall), nowhere near the +5% CPU tax. (COW reduces transient churn, not retained heap, since the cache masters are unchanged, so it does not relieve the OOM/footprint pressure either; that needs per-entry-weight reduction, the original immutability goal.) No downstream timing win.
- Verdict. COW is the correct, complete solution to the soundness blocker (Guava-validated) and
delivers the allocation win (−4.8%), so it is the right tool if the goal is GC pressure / peak
memory at scale or a clean immutable end-state. It is not a wall-clock win — for wall, the
existing
deepCopy is already optimal. Branch cow-prototype is kept as the reference implementation. A wall win in this region, if one exists, is not here (copier already harvested): it is in not producing the types (the getAnnotatedType-redundancy family, already closed), not in copying them more cheaply.
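
The COW-on-access mechanism described in the entries above can be sketched in miniature. This is an illustrative sketch only, not the code on the cow-prototype branch: the Node class, its single getChild accessor, and the string qualifier are hypothetical stand-ins for the annotated-type classes and their ~13 child accessors, and the field-level fixupBoundAnnotations path is not modeled. It assumes a frozen cache master, a cowCopy that shares the master's frozen children, and an accessor that unshares a frozen child only when the parent is a non-frozen COW copy.

```java
import java.util.ArrayList;
import java.util.List;

final class CowSketch {
    /** Hypothetical stand-in for an annotated type node. */
    static final class Node {
        private String qualifier;
        private final List<Node> children;
        private boolean frozen;   // cache masters are frozen and must never be mutated
        private boolean cowDirty; // set by cowCopy: this node may still share frozen children

        Node(String qualifier, List<Node> children) {
            this.qualifier = qualifier;
            this.children = new ArrayList<>(children); // own list, shared child references
        }

        /** Freezes the whole subtree so it can serve as a cache master. */
        Node freeze() {
            frozen = true;
            for (Node child : children) {
                child.freeze();
            }
            return this;
        }

        /** Non-frozen shallow copy that shares this node's (frozen) children. */
        Node cowCopy() {
            Node copy = new Node(qualifier, children);
            copy.cowDirty = true;
            return copy;
        }

        /** Child accessor: lazily unshares a frozen child of a non-frozen COW copy. */
        Node getChild(int index) {
            if (cowDirty && !frozen) { // ordinary (non-cache) nodes skip the COW check
                Node child = children.get(index);
                if (child.frozen) {
                    children.set(index, child.cowCopy());
                }
            }
            return children.get(index);
        }

        String getQualifier() {
            return qualifier;
        }

        void setQualifier(String newQualifier) {
            if (frozen) {
                throw new IllegalStateException("attempt to mutate a frozen cache master");
            }
            qualifier = newQualifier;
        }
    }

    public static void main(String[] args) {
        Node master = new Node("@Nullable", List.of(
                new Node("@NonNull", List.of(new Node("@NonNull", List.of()))),
                new Node("@Nullable", List.of()))).freeze();

        Node hit = master.cowCopy();                           // cache hit: copies one node
        hit.getChild(0).getChild(0).setQualifier("@Nullable"); // copies only the touched spine

        System.out.println(master.getChild(0).getChild(0).getQualifier()); // @NonNull
        System.out.println(hit.getChild(0).getChild(0).getQualifier());    // @Nullable
    }
}
```

The sketch also shows where the wall-clock cost comes from: every accessor call pays the cowDirty check, and a consumer that walks the whole result ends up copying each node one at a time instead of in a single batched deep copy.
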
Use checker/bin-devel/record-jfr.sh
for trace capture and
.claude/skills/cf-performance/jfr-analyze.java
for analysis; see
.claude/skills/cf-performance/SKILL.md
for the analysis pipeline and the known pitfalls (the silent 10 ms
MIN_SAMPLE_PERIOD floor, Maven multi-module filename handling). Always
re-capture on the same workload after applying a patch to confirm the
targeted self-time percentage moved. A patch that passes tests but doesn't
move the profile is wrong by definition.
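
The re-capture check above comes down to comparing one frame's self-time share before and after the patch. A minimal sketch, assuming sample counts taken from two recordings of the same workload; the class name, the method names, the sample counts, and the 0.25-point noise threshold are all hypothetical and are not part of jfr-analyze.java:

```java
final class ProfileDeltaSketch {
    /** Self-time share of one frame, as a percentage of all execution samples. */
    static double selfTimePercent(long frameSamples, long totalSamples) {
        if (totalSamples <= 0) {
            throw new IllegalArgumentException("recording contains no execution samples");
        }
        return 100.0 * frameSamples / totalSamples;
    }

    /** True if the targeted frame dropped by more than the assumed noise band. */
    static boolean moved(double beforePercent, double afterPercent, double noisePercent) {
        return beforePercent - afterPercent > noisePercent;
    }

    public static void main(String[] args) {
        // Hypothetical counts for a single targeted frame on the same workload.
        double before = selfTimePercent(374, 10_000);
        double after = selfTimePercent(50, 10_000);
        System.out.printf("before %.2f%%, after %.2f%%, moved: %b%n",
                before, after, moved(before, after, 0.25));
    }
}
```
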
Three tooling-reliability bugs were found and fixed in June 2026 while auditing whether the profiler gave trustworthy data; all three silently produced misleading traces:
- stackdepth was being ignored. It was passed inside -XX:StartFlightRecording=, where it is not a valid option (the JVM warns "The .jfc option/setting 'stackdepth' doesn't exist." and falls back to depth 64). It is a -XX:FlightRecorderOptions option and must be set there; record-jfr.sh now does.
- Same-filename clobbering. JAVA_TOOL_OPTIONS/GRADLE_OPTS reach every JVM the build spawns (launcher, daemon, test worker, forked javac). Pointing them all at one filename= made them overwrite each other and corrupt the constant pool: traces came back with <null> thread names and a leaderboard dominated by the launcher's idle frames (EPoll.wait ~49%, ProcessHandleImpl.waitForProcessExit0 ~21%) rather than type-checking. record-jfr.sh now uses the JFR %p filename token so each JVM writes its own file; the largest is the worker.
- jfr print/jfr view crash on JDK 25 with a StringIndexOutOfBoundsException in ValueFormatter.formatMethod/PrettyWriter.formatMethod, making the documented jfr view hot-methods pipeline unusable. jfr-analyze.java reads the recording via jdk.jfr.consumer.RecordingFile and avoids the broken formatter. It also computes self-time from jdk.ExecutionSample only: including jdk.NativeMethodSample floods the leaderboard with idle native frames.
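
For orientation, the approach in the last item (read the recording through jdk.jfr.consumer.RecordingFile and count only jdk.ExecutionSample top frames) might look roughly like the sketch below. It is not the contents of jfr-analyze.java; the class and method names are hypothetical, and it assumes the top frame of each sample's stack trace is what counts as self-time.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordedFrame;
import jdk.jfr.consumer.RecordedStackTrace;
import jdk.jfr.consumer.RecordingFile;

final class SelfTimeSketch {
    /** Counts jdk.ExecutionSample events by top frame; all other event types are ignored. */
    static Map<String, Long> topFrameCounts(Path recording) throws IOException {
        Map<String, Long> counts = new HashMap<>();
        try (RecordingFile file = new RecordingFile(recording)) {
            while (file.hasMoreEvents()) {
                RecordedEvent event = file.readEvent();
                if (!"jdk.ExecutionSample".equals(event.getEventType().getName())) {
                    continue; // skips jdk.NativeMethodSample and everything else
                }
                RecordedStackTrace trace = event.getStackTrace();
                if (trace == null || trace.getFrames().isEmpty()) {
                    continue;
                }
                RecordedFrame top = trace.getFrames().get(0); // index 0 is the top frame
                String key = top.getMethod().getType().getName() + "." + top.getMethod().getName();
                counts.merge(key, 1L, Long::sum);
            }
        }
        return counts;
    }

    /** Converts raw sample counts to percentages of the total sample count. */
    static Map<String, Double> selfTimePercent(Map<String, Long> counts) {
        long total = counts.values().stream().mapToLong(Long::longValue).sum();
        Map<String, Double> percent = new HashMap<>();
        if (total == 0) {
            return percent;
        }
        counts.forEach((frame, n) -> percent.put(frame, 100.0 * n / total));
        return percent;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java SelfTimeSketch.java <recording.jfr>");
            return;
        }
        selfTimePercent(topFrameCounts(Path.of(args[0]))).entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(20)
                .forEach(e -> System.out.printf("%6.2f%%  %s%n", e.getValue(), e.getKey()));
    }
}
```

Because each JVM writes its own file under the %p scheme, a tool like this would be pointed at the worker's recording (the largest file) rather than the launcher's.
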