Fix concurrency issue in Text class #128403
After updating the Text class to use ByteBuffer in elastic#127666, we saw test failures where similar Text instances are shared between different threads and tested for equals(). The reason is that calling bytes() lazily materializes the internal ByteBuffer, and bytes() is what the equals method calls on both instances it compares. Apparently this can lead to race conditions when instances are shared across threads. Making the internal `bytes` representation volatile fixes the problem. Closes elastic#128029
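For readers unfamiliar with the pattern, here is a minimal sketch of lazy materialization behind a volatile field, which is the change this description proposes. `LazyTextSketch` is a hypothetical stand-in, not the Elasticsearch `Text` class, and its field and method names are assumptions chosen for illustration only.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a text holder; NOT the actual Elasticsearch Text class. */
final class LazyTextSketch {
    private final String text;

    // volatile: once one thread stores the buffer, other threads reading the
    // field afterwards observe a fully initialized object.
    private volatile ByteBuffer bytes;

    LazyTextSketch(String text) {
        this.text = text;
    }

    /** Materializes the UTF-8 bytes on first use. */
    ByteBuffer bytes() {
        ByteBuffer local = bytes; // read the volatile field once
        if (local == null) {
            // Two threads may both get here and encode the string; that is
            // harmless because both produce equal content.
            local = ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8));
            bytes = local;
        }
        return local;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof LazyTextSketch)) {
            return false;
        }
        // ByteBuffer.equals compares the remaining bytes of both buffers.
        return bytes().equals(((LazyTextSketch) o).bytes());
    }

    @Override
    public int hashCode() {
        return bytes().hashCode();
    }
}
```

Note that volatile only addresses visibility of the field itself. It does not protect the mutable position and limit of a ByteBuffer that several callers share.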
Pinging @elastic/es-core-infra (Team:Core/Infra)
@parkertimmins please keep me honest if you think this is the right fix for this problem.
I don't know what performance implications making that internal field "volatile" has, but I think it's likely that we share similar Text instances across threads at some point, and the observed diverging behaviour of "equals" is very puzzling in these cases. While debugging I also found that "toXContent" works differently depending on whether the Text is initialized using the "text" or the "bytes" field, if no other method with a side effect is called before serializing the object to xContent.
I can open another issue for that if you like. I find that behavior as surprising and confusing as the issue causing the failures in https://github.com/elastic/elasticsearch/issues/128971 and #128029
@cbuescher I spent some time looking at this. Unfortunately, I still can't get it to reproduce on my machine. My one issue with the fix is that I don't understand why the same issue wasn't occurring before. It looks like the ByteReference was also getting set lazily. That said, if it works, it seems fine to me. It's hard for me to believe that changing to volatile would be a significant performance regression.
That's interesting. Did you run the same seed as mentioned above on "main" and use "-Dtests.iters=1000"? That fails pretty consistently on my machine (M1 Mac, Oracle java 21.0.4 2024-07-16 LTS).
I don't understand all the details of the change in #127666 either, but reverting it also seems to make the error stop reproducing.
If it's just the test, we can fix that another way, e.g. by making sure the side effects of materializing both data fields in the Text class happen before passing the instances to the concurrent thread. But because this behavior is quite surprising to me, I wouldn't want to rush this PR in without understanding it better. Maybe we should see if this reproduces somewhere else and try to understand what's happening before merging this.
testinstance:
copy:
I don't quite understand why I see the first rendering of the original value, but it also seems to be related to non-ASCII characters.
Re: the first issue: I investigated this a bit, because I could not wrap my head around why this was working before without volatile. This test:
Fails the second
"consumes" the bytes - it makes the internal position advance.
That should be done always, e.g. (especially) in |
The second issue may be related: if something tries to use bytes() after the text has been rendered, the internal position points somewhere else.
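The "consuming" behaviour described here is a general property of `java.nio.ByteBuffer`, and it can be sketched in isolation. The example below is an illustration only, not the code from the Text class: `BufferPositionDemo` and its two helper methods are hypothetical names. It shows that decoding a buffer advances its position, and that decoding a `duplicate()` leaves the original untouched.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo class; not part of Elasticsearch. */
final class BufferPositionDemo {

    /** Decodes the remaining bytes; this advances the position of the given buffer. */
    static String decodeConsuming(ByteBuffer buffer) {
        return StandardCharsets.UTF_8.decode(buffer).toString();
    }

    /** Decodes through a duplicate, which shares content but has its own position. */
    static String decodePreserving(ByteBuffer buffer) {
        return StandardCharsets.UTF_8.decode(buffer.duplicate()).toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = "h\u00e9llo".getBytes(StandardCharsets.UTF_8); // 6 bytes
        ByteBuffer fresh = ByteBuffer.wrap(utf8);
        ByteBuffer used = ByteBuffer.wrap(utf8);

        decodePreserving(used);
        System.out.println(used.remaining());   // 6: original position untouched
        System.out.println(fresh.equals(used)); // true

        decodeConsuming(used);
        System.out.println(used.remaining());   // 0: position moved to the limit
        System.out.println(fresh.equals(used)); // false: equals compares remaining bytes only
    }
}
```

Because `ByteBuffer.equals` only compares the remaining bytes, two buffers with identical content stop being equal once one of them has been read, which matches the kind of diverging `equals` result discussed in this thread.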
Most importantly, this looks like a genuine bug, not just a test failure. I would consider fixing this a blocker for 8.19.0 and 9.1.0.
Thanks @ldematte for taking a second look. I think it's important not to rush to a solution; the "volatile" was just a first hypothesis of mine, but I'm glad you found another explanation. If you think this is not "just" a test issue but more of a general issue, maybe we should close this PR and open a proper bug for it instead.
++
We decided to revert the PR and fix it with a clear mind and more thoroughly. Here is the revert PR: #128484
After updating the Text class to use ByteBuffer in #127666, we saw test failures where similar Text instances are shared between different threads and tested for equals(). The reason is that calling bytes() lazily materializes the internal ByteBuffer, and bytes() is what the equals method calls on both instances it compares. Apparently this can lead to race conditions when instances are shared across threads. Making the internal `bytes` representation volatile fixes the problem.
Closes #127971
Closes #128029