-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Revise strategy for opening an index for reading #14607
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Revise strategy for opening an index for reading #14607
Conversation
…tsInfos under the right conditions
|
As states on the dev list, I don't think we should do this. |
Can you please elaborate why? I have tried to address the concern you brought up on the mailing list via the implementation in this PR. The primary premise behind the API is that IF all segments of an index are created by the LATEST version, the index in all respects is LATEST. "indexCreatedVersionMajor" should ideally not block a Lucene upgrade in that case. Happy to learn if I am missing anything. |
|
Because that's not what this variable means. If you want to change index backwards compatibility policy, lying about the version is not the way. -1 |
|
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution! |
|
An index where all segments have minVersion=LATEST and version=LATEST is essentially an index with version LATEST. Please note that this API does NOT require Lucene to perform the upgrade/reindexing part. That will be done by the upper layer (Solr, Elasticsearch, etc) , say, through a custom merge policy. The requirement from the Lucene API is "If all the segments have minVersion=version=Version.LATEST, please allow the user to work with the index irrespective of what the SegmentInfos.indexCreatedVersionMajor says". [Approach 2] For a fresh index, currentIndexMajorVersion=indexCreatedVersionMajor=Version.LATEST Upon an upgrade, if the upper layer (Solr, Elasticsearch, etc) successfully converts all segments to Version.LATEST, calling the IndexWriter.commitAndUpdateVersionCreatedMajor() (which will now be called commitAndUpdateCurrentMajorVersion() ) should set currentIndexMajorVersion=Version.LATEST, while indexCreatedVersionMajor remains untouched. SegmentInfos.readCommit() now relies on SegmentInfos.currentIndexMajorVersion for codec compatibility checks. @jpountz It would be valuable to get your thoughts on this too. Thank you. Update (09/19/2025): |
|
[Approach 3] Do we HAVE to rely on indexCreatedVersionMajor to decide index compatibility? If in SegmentInfos.readCommit() there could be a check over the list of SegmentInfo and if all segments have minVersion and version >= MIN_SUPPORTED_MAJOR, it should be good to open. Considering that even for an average large index this would mean a few hundred iterations, would this one time check at startup time be a worthy tradeoff? |
|
Yes, I think your "Approach 3" is nice since it doesn't have us lying about the min created version, and it doesn't require rewriting anything immutable. It just changes our readers so they can read an index whose segments were written by compatible Lucene versions. It still prevents from reading indexes that were originally written with an index version < MIN_SUPPORTED_MAJOR. |
|
Appreciate your inputs @msokolov . I will rework this PR with Approach 3 soon.
I guess you meant this already but just in the interest of complete clarity, it DOES allow reading indexes originally written with an index version < MIN_SUPPORTED_MAJOR provided that ALL of the current segments of the index have version and minVersion >= MIN_SUPPORTED_MAJOR (as opposed to relying on "indexCreatedVersionMajor" property in SegmentInfos today). |
This doesnt work. Sometimes we need to force a reindex. That's because the original thing encoded was lossy (e.g. norms) and we need to change the encoding, or because it was indexed incorrectly. all the merging in the world can't fix that. |
For each older segment which is >=MIN_SUPPORTED_MAJOR but <LATEST, if it can be excluded from merges (eg: via a custom merge policy) and, in parallel, can be read and reindexed outside of Lucene (Eg: in Solr, etc ), the result would be a LOSSLESS index where ALL segments are LATEST version. Of course this only applies to any index where all fields are either stored or docValues enabled. Such indexes can benefit from this kind of in-place reindexing giving the ability to keep "upgrading" even in case of any back incompatible changes like the norms change that happened in 7.x (as long as Lucene maintains readability for LATEST-1 version segments). I did mention this basic premise for this PR on the mailing list, but in hindsight, could have re-stressed upon it. The Solr side effort is being tracked under https://issues.apache.org/jira/browse/SOLR-17725 . Once Solr can achieve the state as mentioned above (aka all segments have been converted to LATEST version), today the only thing preventing the index from opening is the check in SegmentInfos based on indexCreatedVersionMajor (https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/SegmentInfos.java#L349). The three approaches outlined in this PR offer alternatives to this check. |
|
Right, I forgot the key fact here that also came out during your presentation at ApacheCoC. Namely in the system where you want to use this, the documents have been completely reindexed from source that is stored in the index in a stored field (IIRC) so there can be no question of lingering data that was indexed by a previous version. So all the individual segments are "new version" segments and it is only the index-wide min-segment version that is preventing the reader from opening the index, which is really now just a piece of "ghost metadata" that preserves the fact that at one time there were some segments in this index that had been created by an older version of Lucene. I think the important restriction to keep in place is that readers will still not be able to open indexes where any segment was originally written with an index version < MIN_SUPPORTED_MAJOR. The change is to have the reader make the decision based on the aggregate of the segment-level metadata (in SegmentInfo) rather than looking at the index-wide metadata (in SegmentInfos, I think). |
|
@msokolov That's absolutely correct. You summarized every bit perfectly. Thank you. So the breakdown of tasks for the complete solution looks like this:
Also, another requirement for the overall solution to work is that the user is not jumping across incompatible major versions. Eg: If an index is created in Lucene 11 (with MIN_SUPPORTED_MAJOR=10) and a breaking change happens in Lucene 13 (while MIN_SUPPORTED_MAJOR is still 10), then before moving to Lucene 14 (with now MIN_SUPPORTED_MAJOR=13), user would need to reindex on Lucene 13 via the above process. They can't jump versions, say from Lucene 12 to Lucene 14 and expect to reindex on 14 (since the index won't open for reading on 14). |
… instead of indexCreatedVersionMajor
|
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR. |
|
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR. |
|
@msokolov @mikemccand Would appreciate if you could take a look please.. |
| SegmentInfo info = | ||
| codec.segmentInfoFormat().read(directory, segName, segmentID, IOContext.READONCE); | ||
| info.setCodec(codec); | ||
| Version segMinVersion = info.getMinVersion(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it a functional change to move this version checking early in the function? It makes reviewing harder since the diffs look bigger and one has to carefully read and scroll around to compare by eyeball, so if it's not needed, let's leave it where it was.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that was exactly the intention- to move the check early in the function to fail fast. Agree it makes the reviewing trickier. Will incorporate the suggestion.
| Version segMinVersion = info.getMinVersion(); | ||
| Version segmentVersion = info.getVersion(); | ||
|
|
||
| if (!segmentVersion.onOrAfter(infos.minSegmentLuceneVersion)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we prefer == false to ! for clarity - it helps to avoid mistakes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. I generally find the ! to be more readable in a natural language flow. Will follow Lucene's convention with == false.
| if (segMinVersion.major < minSupportedMajorVersion) { | ||
| throw new IndexFormatTooOldException( | ||
| input, | ||
| "Index has segment traces from Lucene version " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure what you mean by "traces" here? Maybe like "traces of an older version"? I think it will be clearer to just say "Index segment derived from version "?
| + minSupportedMajorVersion | ||
| + "). To resolve this issue: (1) Re-index your data using Lucene " | ||
| + minSupportedMajorVersion | ||
| + ".x or later (preferably " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's strike "preferably" and say explicitly what is allowed
|
@msokolov Thanks for taking the time to review this. I have refactored the PR with your suggestions. |
Description
This PR is created after discussing the idea on the lucene dev list.
https://lists.apache.org/thread/gk3kwplon73llz356szz1mn3myn3nnm3
The revised proposal [Approach 2] (#14607 (comment)) keeps the SegmentInfos.indexCreatedVersionMajor untouched. Instead the proposal is to introduce a new property currentIndexMajorVersion which can be updated as appropriate to facilitate index back compatibility.
Update (09/19/2025): There is also an [Approach 3] presented in #14607 (comment)
Update (09/23/2025) - Summary:
Problem statement
If all the segments have minVersion and version >= MIN_SUPPORTED_MAJOR , please allow the user to work with the index irrespective of when the index was originally created (i.e. even if SegmentInfos.indexCreatedVersionMajor<MIN_SUPPORTED_MAJOR)
Example scenario
User creates an index on Lucene 10 (also, MIN_SUPPORTED_MAJOR=10). Say it stays the same until Lucene 13. In Lucene 13 there is a change which requires reindexing. This means that when Lucene 14 releases, MIN_SUPPORTED_MAJOR moves up to 13.
However, while on Lucene 13 if user is able to reindex in a way that at the end of the process, for each segment, version=minVersion=13 (refer #14607 (comment) for how this can be achieved outside of Lucene), then upon moving to Lucene 14, this index should be readable even though index was originally created on 10.
For more context
#14607 (comment)
#14607 (comment)
Proposed approaches
(All approaches assume the prerequisite that all segments of the index have been converted outside of Lucene to have version=minVersion=LATEST )
[Approach 1] Rewrite SegmentInfos.indexCreatedVersionMajor to LATEST upon commit. (PR implementation as of 09/23/2025)
[Approach 2] Keep indexCreatedVersionMajor untouched. Introduce a mutable parallel property in SegmentInfos (say "currentIndexMajorVersion") and use this for deciding whether an index should be opened.
[Approach 3] Look at the list of SegmentInfo (instead of indexCreatedVersionMajor) at read time to determine compatibility.