Conversation

@ChrisHegarty
Contributor

@ChrisHegarty ChrisHegarty commented Oct 9, 2025

Fixes a potential hang with initialisation of TermsEnum and BaseTermsEnum, by simply removing the dependency and replicating the small amount of code.
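As a rough illustration of what "removing the dependency" can look like, here is a minimal sketch. It is not the actual Lucene change: `Parent` and `Child` are hypothetical stand-ins for TermsEnum and BaseTermsEnum, and `next()` is an invented method. The point is that the static `EMPTY` extends `Parent` directly and repeats the small amount of default code, so initialising `Parent` never needs `Child`.

```java
public class DecoupledEmptySketch {
  // Hypothetical stand-in for TermsEnum.
  abstract static class Parent {
    abstract String next();

    // EMPTY is an anonymous subclass of Parent itself, not of Child, so
    // Parent's static initializer no longer triggers initialization of Child.
    static final Parent EMPTY =
        new Parent() {
          @Override
          String next() {
            return null; // duplicated "no more terms" default
          }
        };
  }

  // Hypothetical stand-in for BaseTermsEnum: convenience defaults for other subclasses.
  abstract static class Child extends Parent {
    @Override
    String next() {
      return null;
    }
  }

  /** Simple name of the direct superclass of the EMPTY singleton. */
  static String emptySuperclassName() {
    return Parent.EMPTY.getClass().getSuperclass().getSimpleName();
  }

  public static void main(String[] args) {
    System.out.println("EMPTY extends " + emptySuperclassName());
  }
}
```

The anonymous class can only be initialised from inside `Parent`'s own static initializer, so no other thread can hold its initialization lock first.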

Problem description: the static TermsEnum.EMPTY initialises to an implementation of BaseTermsEnum. TermsEnum is a superclass of BaseTermsEnum, so there is a cyclic initialisation dependency between these classes. If one thread is initialising a subclass of BaseTermsEnum, it may hold the initialisation lock on BaseTermsEnum while waiting for TermsEnum; if another thread is initialising TermsEnum at the same time, it waits for BaseTermsEnum, and neither can proceed. E.g.

"main" #3 [8963] prio=5 os_prio=31 cpu=30.57ms elapsed=2.19s tid=0x0000000123815a00 nid=8963 in Object.wait()  [0x000000016b24e000]
   java.lang.Thread.State: RUNNABLE
	at TermsEnumClinitHang.main(TermsEnumClinitHang.java:22)
	- waiting on the Class initialization monitor for org.apache.lucene.index.TermsEnum

"TEST_THREAD" #25 [27395] prio=5 os_prio=31 cpu=1.54ms elapsed=2.17s tid=0x000000012382e800 nid=27395 in Object.wait()  [0x000000016d632000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.lucene.index.TermsEnum.<clinit>(TermsEnum.java:199)
	- waiting on the Class initialization monitor for org.apache.lucene.index.BaseTermsEnum
	at TermsEnumClinitHang$1.run(TermsEnumClinitHang.java:15)
...
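A self-contained sketch of the same pattern, for readers who want to see the hang without Lucene on the classpath. This is not the TermsEnumClinitHang reproducer referenced in the trace: `Parent` and `Child` are hypothetical stand-ins for TermsEnum and BaseTermsEnum, and the 300 ms pause is an assumption used only to make the interleaving likely. Daemon threads and a join timeout let the program exit even though the two threads stay stuck.

```java
import java.util.concurrent.CountDownLatch;

public class ClinitDeadlockDemo {
  // Counted down once a thread is inside Parent's static initializer.
  static final CountDownLatch PARENT_INIT_STARTED = new CountDownLatch(1);

  // Hypothetical stand-in for TermsEnum.
  abstract static class Parent {
    static final Parent EMPTY;

    static {
      PARENT_INIT_STARTED.countDown();
      pause(300); // assumed long enough for the other thread to begin initializing Child
      EMPTY = new Child(); // needs Child, whose initialization the other thread has started
    }
  }

  // Hypothetical stand-in for BaseTermsEnum.
  static class Child extends Parent {}

  static void pause(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  /** Returns true if both threads are still blocked in class initialization after the timeout. */
  static boolean reproduce() {
    Thread viaParent = new Thread(() -> System.out.println(Parent.EMPTY), "via-parent");
    Thread viaChild =
        new Thread(
            () -> {
              try {
                PARENT_INIT_STARTED.await();
              } catch (InterruptedException e) {
                return;
              }
              new Child(); // begins Child init, then waits for its superclass Parent
            },
            "via-child");
    viaParent.setDaemon(true); // daemon threads so the JVM can still exit
    viaChild.setDaemon(true);
    viaParent.start();
    viaChild.start();
    try {
      viaParent.join(2000);
      viaChild.join(2000);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return viaParent.isAlive() && viaChild.isAlive();
  }

  public static void main(String[] args) {
    System.out.println("both threads stuck: " + reproduce());
  }
}
```

Taking a thread dump while the program waits should show two threads in a state similar to the trace above, each waiting on the other class's initialization.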

fixes #15317

@github-actions
Contributor

github-actions bot commented Oct 9, 2025

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@github-actions github-actions bot added this to the 10.4.0 milestone Oct 9, 2025
Contributor

@dweiss dweiss left a comment

Ugh. Vicious. I liked the static before-everything initializer that you mentioned - this was explicit and could be documented appropriately as to why it's there. This solution is also fine although I'm a bit worried that, over time, somebody (or some-ai-thing) may have an idea to refactor it back to remove duplicated code...

Member

@rmuir rmuir left a comment

+1 for this fix.

I think there's a possible static detector for this: https://errorprone.info/bugpattern/ClassInitializationDeadlock

Unfortunately the lint is disabled right now. I will look into it separately and see if we can turn it on and fix the other occurrences.

@ChrisHegarty
Contributor Author

ChrisHegarty commented Oct 9, 2025

This solution is also fine although I'm a bit worried that, over time, somebody (or some-ai-thing) may have an idea to refactor it back to remove duplicated code.

A manual test can detect this trivially, but it needs to run in its own JVM so that it can control the order in which classes get loaded and initialised, so it is not really suited to our testing infrastructure. I think the static detector that Robert mentioned above is the way to prevent these things creeping back in.

@dweiss
Contributor

dweiss commented Oct 9, 2025

What I meant was this -

  static {
    try {
      MethodHandles.lookup().ensureInitialized(TermsEnum.class);
    } catch (IllegalAccessException e) {
      throw new RuntimeException(e);
    }
  }

but maybe it's too much voodoo.
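For readers unfamiliar with the idiom, here is a minimal sketch of how such eager initialisation behaves when it runs on a single thread before anything else touches the classes. `Parent`, `Child` and the `EVENTS` list are hypothetical names used only for illustration, and `MethodHandles.Lookup.ensureInitialized` requires Java 15 or later. On one thread the cycle is harmless, because the JVM lets the thread that is already initialising `Parent` proceed when `Child` asks for its superclass.

```java
import java.lang.invoke.MethodHandles;
import java.util.ArrayList;
import java.util.List;

public class EagerInitWorkaround {
  // Records the order in which the static initializers run.
  static final List<String> EVENTS = new ArrayList<>();

  // Hypothetical stand-in for TermsEnum.
  abstract static class Parent {
    static final Parent EMPTY;

    static {
      EVENTS.add("Parent:start");
      EMPTY = new Child(); // same thread, so Child's init sees Parent as "in progress by me"
      EVENTS.add("Parent:done");
    }
  }

  // Hypothetical stand-in for BaseTermsEnum.
  static class Child extends Parent {
    static {
      EVENTS.add("Child:init");
    }
  }

  /** Forces Parent (and, as a side effect, Child) to initialize on the calling thread. */
  static void initializeEagerly() {
    try {
      MethodHandles.lookup().ensureInitialized(Parent.class);
    } catch (IllegalAccessException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    initializeEagerly(); // call once at startup, before worker threads exist
    System.out.println(EVENTS);
  }
}
```

Once both classes are fully initialised, later threads never contend for their initialization locks, which is why the call only helps if it runs before any concurrent use.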

@ChrisHegarty
Contributor Author

but maybe it's too much voodoo.

Yeah, I consider it too much voodoo to add in our source base, given that the source code change is so trivial.

I've added a comment in the code warning against refactoring this class in a way that would reintroduce the problem. This should be enough until we get the static analysis.

My primary motivation for mentioning the voodoo workaround is to help users of Lucene if they run into this problem, until we can get the fix into 10.4. For example, we've seen this deadlock several times in Elasticsearch, but it has only come to my attention now. So we'll apply the voodoo workaround there until the next 10.x release.

@rmuir
Member

rmuir commented Oct 9, 2025

Now I remember why this check isn't enabled: it is exactly as the comment states, some real work. The first error you see from the check against main is promising:

> Task :lucene:core:compileJava
/home/rmuir/workspace/lucene/lucene/core/src/java/org/apache/lucene/index/TermsEnum.java:200: error: [ClassInitializationDeadlock] Possible class initialization deadlock: BaseTermsEnum is a subclass of the containing class TermsEnum
      new BaseTermsEnum() {
          ^
    (see https://errorprone.info/bugpattern/ClassInitializationDeadlock)

Unfortunately, there are 82 errors in lucene-core alone. Here's a condensed version:

/home/rmuir/workspace/lucene/lucene/core/src/java/org/apache/lucene/store/IOContext.java:53: error: [ClassInitializationDeadlock] Possible class initialization deadlock: DefaultIOContext is a subclass of the containing class IOContext
  IOContext DEFAULT = new DefaultIOContext();
                          ^
    (see https://errorprone.info/bugpattern/ClassInitializationDeadlock)
/home/rmuir/workspace/lucene/lucene/core/src/java/org/apache/lucene/store/IOContext.java:61: error: [ClassInitializationDeadlock] Possible class initialization deadlock: DefaultIOContext is a subclass of the containing class IOContext
  IOContext READONCE = new DefaultIOContext(DataAccessHint.SEQUENTIAL, ReadOnceHint.INSTANCE);
                           ^
    (see https://errorprone.info/bugpattern/ClassInitializationDeadlock)
/home/rmuir/workspace/lucene/lucene/core/src/java/org/apache/lucene/util/packed/BulkOperation.java:25: error: [ClassInitializationDeadlock] Possible class initialization deadlock: BulkOperationPacked1 is a subclass of the containing class BulkOperation
        new BulkOperationPacked1(),
            ^
... (repeated for different N of BulkOperationPackedN, BulkOperationPackedSingleBlock, etc) ...

/home/rmuir/workspace/lucene/lucene/core/src/java/org/apache/lucene/analysis/CharArrayMap.java:35: error: [ClassInitializationDeadlock] Possible class initialization deadlock: EmptyCharArrayMap<V> is a subclass of the containing class CharArrayMap<V> (via UnmodifiableCharArrayMap<V>, which can be initialized from outside the current file)
  private static final CharArrayMap<?> EMPTY_MAP = new EmptyCharArrayMap<>();
                                                       ^
    (see https://errorprone.info/bugpattern/ClassInitializationDeadlock)

@rmuir
Member

rmuir commented Oct 9, 2025

The best thought I have is to @SuppressWarnings all the other existing problems with a // TODO and enable the check to stop any new instances of this from entering the codebase. We'll have to work off the existing problems from there...
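A minimal sketch of what such a suppression might look like, assuming Error Prone's usual convention of suppressing a check by its name. `Shape` and `Circle` are hypothetical classes, not code from Lucene, and the TODO wording is only an example.

```java
public class SuppressionSketch {
  abstract static class Shape {
    // TODO: break the Shape -> Circle initialization cycle and remove this suppression.
    @SuppressWarnings("ClassInitializationDeadlock")
    static final Shape DEFAULT = new Circle();
  }

  // A non-private subclass that other code can initialize first,
  // which is the shape of dependency the check reports.
  static final class Circle extends Shape {}

  /** Simple name of the class backing the DEFAULT singleton. */
  static String defaultShapeName() {
    return Shape.DEFAULT.getClass().getSimpleName();
  }

  public static void main(String[] args) {
    System.out.println(defaultShapeName());
  }
}
```

The annotation only silences the report. The cycle itself remains until the TODO is addressed.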

@ChrisHegarty
Contributor Author

Unfortunately, there are 82 errors in lucene-core alone.

Oh boy! This is quite a bit more than I would have expected! :-(

@rmuir
Member

rmuir commented Oct 10, 2025

It turned out the count was just "multiplied" by the code generator; there are not that many unique violations: #15322

@ChrisHegarty ChrisHegarty merged commit aa4f143 into apache:main Oct 10, 2025
12 checks passed
@ChrisHegarty ChrisHegarty deleted the termsEnum_clinit_fix branch October 10, 2025 07:20
ChrisHegarty added a commit that referenced this pull request Oct 10, 2025
Fixes a potential hang with initialisation of TermsEnum and BaseTermsEnum, by simply removing the dependency and replicating the small amount of code.

Problem description: the static TermsEnum.EMPTY initialises to an implementation of BaseTermsEnum. TermsEnum is a superclass of BaseTermsEnum, so there is a clear dependency between these classes. If a subclass of BaseTermsEnum is initialising it may grab the lock on BaseTermsEnum, and prevent TermsEnum from initialising.

---------

Co-authored-by: Robert Muir <[email protected]>

Development

Successfully merging this pull request may close these issues.

Potential hang initialisation TermsEnum and BaseTermsEnum

3 participants