Avoid unnecessary determinization in index pattern conflict checks #128362

Open
wants to merge 5 commits into
base: main
from

Conversation

jimczi
Contributor

@jimczi jimczi commented May 23, 2025

Starting with Lucene 10, `CharacterRunAutomaton` is no longer determinized automatically.
In Elasticsearch 9, we adapted to this by eagerly determinizing automatons early (via `Regex#simpleMatchToAutomaton`). However, this introduced a regression: operations like index template conflict checks, which only require intersection testing, now pay the cost of determinization, an expensive step that wasn’t needed before. In some cases, especially when many wildcard patterns are involved, determinization can even fail due to state explosion.

This change removes the unnecessary determinization from the index pattern conflict check, restoring the pre-9.0 behavior and allowing valid index templates with many patterns to be registered again.
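
To illustrate why an intersection test does not need a deterministic automaton, here is a minimal pure-Java sketch. It is not the code in this PR and does not use Lucene: `PatternOverlap`, `overlaps`, and `anyOverlap` are hypothetical names introduced only for this example. It assumes simple patterns where `*` matches any character sequence, and it searches the product of the two patterns' positions directly, which is the same idea as intersecting two non-deterministic automata and checking whether the result is empty.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper for illustration only; not part of Elasticsearch or Lucene. */
final class PatternOverlap {

    /** Returns true if at least one string is matched by both simple patterns. */
    static boolean overlaps(String a, String b) {
        int n = a.length();
        int m = b.length();
        // seen[i][j] marks product states (position i in a, position j in b) already queued.
        boolean[][] seen = new boolean[n + 1][m + 1];
        Deque<int[]> pending = new ArrayDeque<>();
        push(seen, pending, 0, 0);
        while (!pending.isEmpty()) {
            int[] state = pending.pop();
            int i = state[0];
            int j = state[1];
            if (i == n && j == m) {
                return true; // both patterns fully consumed on a common string
            }
            boolean starA = i < n && a.charAt(i) == '*';
            boolean starB = j < m && b.charAt(j) == '*';
            if (starA) {
                push(seen, pending, i + 1, j);            // '*' in a matches the empty string
                if (j < m) push(seen, pending, i, j + 1); // '*' in a absorbs the next symbol of b
            }
            if (starB) {
                push(seen, pending, i, j + 1);            // '*' in b matches the empty string
                if (i < n) push(seen, pending, i + 1, j); // '*' in b absorbs the next symbol of a
            }
            if (!starA && !starB && i < n && j < m && a.charAt(i) == b.charAt(j)) {
                push(seen, pending, i + 1, j + 1);        // identical literals advance together
            }
        }
        return false;
    }

    /** Returns true if any pattern in one set overlaps any pattern in the other. */
    static boolean anyOverlap(String[] first, String[] second) {
        for (String a : first) {
            for (String b : second) {
                if (overlaps(a, b)) {
                    return true;
                }
            }
        }
        return false;
    }

    private static void push(boolean[][] seen, Deque<int[]> pending, int i, int j) {
        if (!seen[i][j]) {
            seen[i][j] = true;
            pending.push(new int[] { i, j });
        }
    }
}
```

The search visits at most `(a.length() + 1) * (b.length() + 1)` product states per pair, so the cost stays polynomial in the pattern lengths, whereas determinizing a union of many `*x*` patterns can blow up in the number of states.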

closes: #127972

@jimczi jimczi requested a review from a team as a code owner May 23, 2025 10:13
@jimczi jimczi added >bug :Data Management/Indices APIs APIs to create and manage indices and templates v9.1.0 labels May 23, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label May 23, 2025
@elasticsearchmachine
Collaborator

Hi @jimczi, I've created a changelog YAML for you.

/**
* Return a deterministic Automaton that matches the union of the provided patterns.
*/
public static Automaton simpleMatchToAutomaton(String... patterns) {
Member

Small drive-by comment: should we add a note to the javadocs on when one or the other should be used?

Contributor

+1, Luca. I've raised a similar question internally within the relevant discussion thread.

Contributor Author

++, 268dc56
I also added the variant for the single-pattern case, to be complete.

Contributor

@pawankartik-elastic pawankartik-elastic left a comment

LGTM overall. Leaving 2 tiny comments.

Automaton b = Regex.simpleMatchToNonDeterminizedAutomaton(Arrays.copyOfRange(patterns, patterns.length / 2, patterns.length));
assertFalse(Operations.isEmpty(Operations.intersection(a, b)));
IllegalArgumentException exc = expectThrows(IllegalArgumentException.class, () -> assertMatchesAll(a, "my_test"));
assertThat(exc.getMessage(), containsString("deterministic"));
Contributor

Nitpick: can we tighten the assertion on the exception's error message? Readers like me could be curious about what kind of message to expect.

@@ -250,4 +254,14 @@ public void testThousandsAndLongPattern() throws IOException {
assertTrue(predicate.test(patterns[i]));
}
}

public void testIntersectNonDeterminizedAutomaton() {
String[] patterns = randomArray(20, 100, size -> new String[size], () -> "*" + randomAlphanumericOfLength(10) + "*");
Contributor

@pawankartik-elastic pawankartik-elastic May 27, 2025

Could we add a brief note on how these size-related values were chosen? If I'm not mistaken, they relate to the default determinization limit that comes from Lucene.

Member

@dakrone dakrone left a comment

LGTM, thanks for tagging us

Labels
>bug :Data Management/Indices APIs APIs to create and manage indices and templates Team:Data Management Meta label for data/management team v9.1.0
Development

Successfully merging this pull request may close these issues.

too_complex_to_determinize_exception is thrown in a few cases due to the patterns in index template
5 participants