Skip to content

Fix concurrency issue in Text class #128403

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

Conversation

cbuescher
Copy link
Member

@cbuescher cbuescher commented May 23, 2025

After updating the Text class to use ByteBuffer in #127666 we saw test failures where similar Text instances are shared between different threads and tested for equals(). The reason is that calling bytes() lazily materializes the internal ByteBuffer. That method is what the equals method calls on both instances it tests. Apparently this can leads to race conditions when instances are shared across threads. Making the internal bytes representation volatile fixes the problem.

Closes #127971
Closes #128029

After updating the Text class to use ByteBuffer in elastic#127666 we saw test
failures where similar Text instances are shared between different
threads and tested for equals(). The reason is that calling bytes()
lazily materializes the internal ByteBuffer. That method is what the
equals method calls on both instances it tests. Apparently this can
leads to race conditions when instances are shared across threads.
Making the internal `bytes` representation volatile fixes the problem.

Closes elastic#128029
@cbuescher cbuescher requested a review from jordan-powers May 23, 2025 19:33
@cbuescher cbuescher requested a review from a team as a code owner May 23, 2025 19:33
@cbuescher cbuescher added >test Issues or PRs that are addressing/adding tests :Core/Infra/Core Core issues without another label v8.19.0 v9.1.0 labels May 23, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Core/Infra Meta label for core/infra team label May 23, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@cbuescher
Copy link
Member Author

@parkertimmins please keep me honest if you think this is the right fix for this problem.
I was able to reproduce the problem with 1000 iterations and a particular seed main, i.e.

./gradlew ":server:test" --tests "org.elasticsearch.search.SearchHitsTests.testConcurrentEquals" -Dtests.seed=8FE8D9203AB75971:6C073DA4F1474D7E -Dtests.iters=1000

I don't know what performance implications making that internal field "volatile" has, but I think its likely that we share similar Text instances across threads at some point and the observed diverging behaviour of "equals" is very puzzling in these cases.

While debugging I also found that "toXContent" works differently depending on whether the Text is initialized using the "text" or the "bytes" field, if no other method with a side effect is called before serializing the object to xContent.
i.e. the following test fails when run repeatedly:

@Repeat(iterations = 100)
public void testTextToXContent() throws IOException {
    String randomText = randomBoolean() ?  randomAlphaOfLengthBetween(10, 30) : randomRealisticUnicodeOfCodepointLengthBetween(10, 30);
    Text text1 = new Text(randomText);
    Text text2 = new Text( StandardCharsets.UTF_8.encode(randomText));
    XContentBuilder b1 = XContentFactory.jsonBuilder();
    XContentBuilder b2 = XContentFactory.jsonBuilder();
    text1.toXContent(b1, ToXContent.EMPTY_PARAMS);
    text2.toXContent(b2, ToXContent.EMPTY_PARAMS);
    assertEquals(Strings.toString(b1), Strings.toString(b2));
}

I can open another issue for that if you like, I find that behavior as surprising and confusing at the issue causing the failures in https://github.com/elastic/elasticsearch/issues/128971 and #128029

@parkertimmins
Copy link
Contributor

@cbuescher I spent some time looking at this. Unfortunately, I still can't get it to reproduce on my machine. My one issue with the fix is that I don't understand why the same issue wasn't occurring before. It looks like the ByteReference was also getting set lazily. That said, if it works, it seems fine to me. It's hard for me to believe that changing to volatile would be a significant performance regression.

@cbuescher
Copy link
Member Author

I still can't get it to reproduce on my machine.

That's interesting, did you run the same seed as mentioned above on "main" and used "-Dtests.iters=1000"? That fails pretty consistently on my machine (M1 Mac, Oracle java 21.0.4 2024-07-16 LTS).

It looks like the ByteReference was also getting set lazily.

I don't understand all the details of that change in ##127666 either but reverting it also seems to fix the reproducability of the error.

That said, if it works, it seems fine to me. It's hard for me to believe that changing to volatile would be a significant performance regression.

If its just the test, we can fix that another way e.g. by making sure the side-effects of materializing both data fields in the Text class happen before passing the instances of to the concurrent thread. But because this behavior is quite surprising to me I wouldn't want to rush this PR in without understanding it better. Maybe we should see if this reproduces somewhere else and try to understand whats happening better before merging this.
Another thing is the surprising behavior of "toContent" I posted the test above for. Strangely the "toString()" output of the instances in question here look different in the error cases, i.e. something like

testinstance:

  "highlight" : {
                    "BmFkfGMAUCDFXyJl" : [                     "\uD800\uDD47\uD800\uDD82\uD800\uDD8D\uD800\uDD42\uD800\uDD60\uD800\uDD4D\uD800\uDD67\uD800\uDD4B\uD800\uDD55\uD800\uDD49\uD800\uDD70",   <---- (Text array element in HighlightField)
                      "mJQjlNFDYpLLAEwLt"
                    ]

copy:

"BmFkfGMAUCDFXyJl" : [
                      "𐅇𐆂𐆍𐅂𐅠𐅍𐅧𐅋𐅕𐅉𐅰",      <---- (Text array element)
                      "mJQjlNFDYpLLAEwLt"
                    ],

I don't quite understand why I see the first rendering of the original value, but it also seems to be related to non-ASCII characters.

@ldematte
Copy link
Contributor

Re: the first issue: I investigated a bit this, because I could not wrap my head around why this was working before without volatile.
I think volatile is a red herring, and the issue lies in the use of ByteBuffer.
(I am able to reproduce the problem with volatile too -- I think that using volatile simply makes this less likely because it forces a different ordering)

This test:

var text = new Text("test");
var bytes = text.bytes();
var copy = Arrays.copyOf(bytes.array(), bytes.limit());
var text2 = new Text(ByteBuffer.wrap(copy));

assertEquals(text, text2);

var x = text2.string();
assertEquals(text, text2);

Fails the second assertEquals.
This is because StandardCharsets.UTF_8.decode in

public String string() {
  if (text == null) {
     text = StandardCharsets.UTF_8.decode(bytes).toString();
   }
   return text;
}

"consumes" the bytes - it makes the internal position advance.
I think the solution here is to always wrap the bytes in a new ByteBuffer

text = StandardCharsets.UTF_8.decode(ByteBuffer.wrap(bytes.array())).toString();

That should be done always, e.g. (especially) in ByteBuffer bytes(), or the caller might use it and "consume" it.

@ldematte
Copy link
Contributor

Possibly the second issue can be related; if something tries to use bytes() after text has been rendered, the internal position points somewhere else.
Probably a better alternative is to not use a ByteBuffer, but use something that does not have all its semantics, it's just e holder for array + offset + length (that we can use easily in our stuff, e.g. BytesReference).
Looking at how we use it today, this is probably even more efficient (and more robust!)

@ldematte
Copy link
Contributor

Most importantly, this looks like a genuine bug, not just a test failure. I would consider fixing this a blocker for 8.19.0 and 9.1.0

@cbuescher
Copy link
Member Author

I think volatile is a red herring, and the issue lies in the use of ByteBuffer.
(I am able to reproduce the problem with volatile too -- I think that using volatile simply makes this less likely because it forces a different ordering)

Thanks @ldematte for taking a second look, I think its important to not rush to a solution and the "volatile" was just a first hypothesis of mine but glad you found another explanation. If you think this is not "just" a test issue but more of a general issue maybe we should close this PR and instead open a proper bug for it then.

@ldematte
Copy link
Contributor

++
I think this need to be fixed properly; I'm worried about the ramifications of exposing a possibly "consumed" ByteBuffer, but I have no idea how serious this is.

@ldematte
Copy link
Contributor

We decided to revert the PR and fix it with a clear mind and more thoroughly. Here is the revert PR: #128484
I think this can be closed, and when the PR is merged we can close the associated CI failures too.

@cbuescher cbuescher closed this May 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Core/Infra/Core Core issues without another label Team:Core/Infra Meta label for core/infra team >test Issues or PRs that are addressing/adding tests v8.19.0 v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CI] SearchHitsTests testConcurrentSerialization failing [CI] SearchHitsTests testConcurrentEquals failing
4 participants