Skip to content

Conversation

@rahulgoswami
Copy link

@rahulgoswami rahulgoswami commented May 4, 2025

Description

This PR is created after discussing the idea on the lucene dev list.

https://lists.apache.org/thread/gk3kwplon73llz356szz1mn3myn3nnm3

The revised proposal [Approach 2] (#14607 (comment)) keeps the SegmentInfos.indexCreatedVersionMajor untouched. Instead the proposal is to introduce a new property currentIndexMajorVersion which can be updated as appropriate to facilitate index back compatibility.

Update (09/19/2025): There is also an [Approach 3] presented in #14607 (comment)

Update (09/23/2025) - Summary:

Problem statement
If all the segments have minVersion and version >= MIN_SUPPORTED_MAJOR , please allow the user to work with the index irrespective of when the index was originally created (i.e. even if SegmentInfos.indexCreatedVersionMajor<MIN_SUPPORTED_MAJOR)

Example scenario
User creates an index on Lucene 10 (also, MIN_SUPPORTED_MAJOR=10). Say it stays the same until Lucene 13. In Lucene 13 there is a change which requires reindexing. This means that when Lucene 14 releases, MIN_SUPPORTED_MAJOR moves up to 13.
However, while on Lucene 13 if user is able to reindex in a way that at the end of the process, for each segment, version=minVersion=13 (refer #14607 (comment) for how this can be achieved outside of Lucene), then upon moving to Lucene 14, this index should be readable even though index was originally created on 10.

For more context
#14607 (comment)
#14607 (comment)

Proposed approaches
(All approaches assume the prerequisite that all segments of the index have been converted outside of Lucene to have version=minVersion=LATEST )

[Approach 1] Rewrite SegmentInfos.indexCreatedVersionMajor to LATEST upon commit. (PR implementation as of 09/23/2025)

[Approach 2] Keep indexCreatedVersionMajor untouched. Introduce a mutable parallel property in SegmentInfos (say "currentIndexMajorVersion") and use this for deciding whether an index should be opened.

[Approach 3] Look at the list of SegmentInfo (instead of indexCreatedVersionMajor) at read time to determine compatibility.

@rmuir
Copy link
Member

rmuir commented May 4, 2025

As states on the dev list, I don't think we should do this.

@rahulgoswami
Copy link
Author

As states on the dev list, I don't think we should do this.

Can you please elaborate why? I have tried to address the concern you brought up on the mailing list via the implementation in this PR.

The primary premise behind the API is that IF all segments of an index are created by the LATEST version, the index in all respects is LATEST. "indexCreatedVersionMajor" should ideally not block a Lucene upgrade in that case.

Happy to learn if I am missing anything.

@rmuir
Copy link
Member

rmuir commented May 5, 2025

Because that's not what this variable means. If you want to change index backwards compatibility policy, lying about the version is not the way. -1

@github-actions
Copy link
Contributor

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label May 21, 2025
@rahulgoswami
Copy link
Author

rahulgoswami commented Sep 8, 2025

An index where all segments have minVersion=LATEST and version=LATEST is essentially an index with version LATEST.

Please note that this API does NOT require Lucene to perform the upgrade/reindexing part. That will be done by the upper layer (Solr, Elasticsearch, etc) , say, through a custom merge policy. The requirement from the Lucene API is "If all the segments have minVersion=version=Version.LATEST, please allow the user to work with the index irrespective of what the SegmentInfos.indexCreatedVersionMajor says".

[Approach 2]
If the concern is to maintain tracking data regarding the origins of the index, SegmentInfos.indexCreatedVersionMajor could be left untouched. Instead a new property "currentIndexMajorVersion" could be introduced which sits parallel to indexCreatedVersionMajor

For a fresh index, currentIndexMajorVersion=indexCreatedVersionMajor=Version.LATEST

Upon an upgrade, if the upper layer (Solr, Elasticsearch, etc) successfully converts all segments to Version.LATEST, calling the IndexWriter.commitAndUpdateVersionCreatedMajor() (which will now be called commitAndUpdateCurrentMajorVersion() ) should set currentIndexMajorVersion=Version.LATEST, while indexCreatedVersionMajor remains untouched.

SegmentInfos.readCommit() now relies on SegmentInfos.currentIndexMajorVersion for codec compatibility checks.

@jpountz It would be valuable to get your thoughts on this too. Thank you.

Update (09/19/2025):
Since #15012 got merged earlier today, rewording certain parts of the requirement..."if all the segments have minVersion and version >= MIN_SUPPORTED_MAJOR , please allow the user to work with the index irrespective of what the SegmentInfos.indexCreatedVersionMajor says"

@github-actions github-actions bot removed the Stale label Sep 9, 2025
@rahulgoswami
Copy link
Author

rahulgoswami commented Sep 19, 2025

[Approach 3]
This is inspired by @msokolov 's idea at the recently concluded Community Over Code where I gave a talk on this topic ("Achieving zero downtime and zero cost index upgrade in Apache Solr").

Do we HAVE to rely on indexCreatedVersionMajor to decide index compatibility?

If in SegmentInfos.readCommit() there could be a check over the list of SegmentInfo and if all segments have minVersion and version >= MIN_SUPPORTED_MAJOR, it should be good to open. Considering that even for an average large index this would mean a few hundred iterations, would this one time check at startup time be a worthy tradeoff?

@msokolov
Copy link
Contributor

Yes, I think your "Approach 3" is nice since it doesn't have us lying about the min created version, and it doesn't require rewriting anything immutable. It just changes our readers so they can read an index whose segments were written by compatible Lucene versions. It still prevents from reading indexes that were originally written with an index version < MIN_SUPPORTED_MAJOR.

@rahulgoswami
Copy link
Author

rahulgoswami commented Sep 23, 2025

Appreciate your inputs @msokolov . I will rework this PR with Approach 3 soon.

It still prevents from reading indexes that were originally written with an index version < MIN_SUPPORTED_MAJOR.

I guess you meant this already but just in the interest of complete clarity, it DOES allow reading indexes originally written with an index version < MIN_SUPPORTED_MAJOR provided that ALL of the current segments of the index have version and minVersion >= MIN_SUPPORTED_MAJOR (as opposed to relying on "indexCreatedVersionMajor" property in SegmentInfos today).

@rmuir
Copy link
Member

rmuir commented Sep 23, 2025

I guess you meant this already but just in the interest of complete clarity, it DOES allow reading indexes originally written with an index version < MIN_SUPPORTED_MAJOR provided that ALL of the current segments of the index have version and minVersion >= MIN_SUPPORTED_MAJOR (as opposed to relying on "indexCreatedVersionMajor" property in SegmentInfos today).

This doesnt work. Sometimes we need to force a reindex. That's because the original thing encoded was lossy (e.g. norms) and we need to change the encoding, or because it was indexed incorrectly. all the merging in the world can't fix that.

@rahulgoswami
Copy link
Author

rahulgoswami commented Sep 23, 2025

This doesnt work. Sometimes we need to force a reindex. That's because the original thing encoded was lossy (e.g. norms) and we need to change the encoding, or because it was indexed incorrectly. all the merging in the world can't fix that.

For each older segment which is >=MIN_SUPPORTED_MAJOR but <LATEST, if it can be excluded from merges (eg: via a custom merge policy) and, in parallel, can be read and reindexed outside of Lucene (Eg: in Solr, etc ), the result would be a LOSSLESS index where ALL segments are LATEST version. Of course this only applies to any index where all fields are either stored or docValues enabled.

Such indexes can benefit from this kind of in-place reindexing giving the ability to keep "upgrading" even in case of any back incompatible changes like the norms change that happened in 7.x (as long as Lucene maintains readability for LATEST-1 version segments).

I did mention this basic premise for this PR on the mailing list, but in hindsight, could have re-stressed upon it. The Solr side effort is being tracked under https://issues.apache.org/jira/browse/SOLR-17725 .

Once Solr can achieve the state as mentioned above (aka all segments have been converted to LATEST version), today the only thing preventing the index from opening is the check in SegmentInfos based on indexCreatedVersionMajor (https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/SegmentInfos.java#L349).

The three approaches outlined in this PR offer alternatives to this check.

@msokolov
Copy link
Contributor

msokolov commented Sep 24, 2025

Right, I forgot the key fact here that also came out during your presentation at ApacheCoC. Namely in the system where you want to use this, the documents have been completely reindexed from source that is stored in the index in a stored field (IIRC) so there can be no question of lingering data that was indexed by a previous version. So all the individual segments are "new version" segments and it is only the index-wide min-segment version that is preventing the reader from opening the index, which is really now just a piece of "ghost metadata" that preserves the fact that at one time there were some segments in this index that had been created by an older version of Lucene.

I think the important restriction to keep in place is that readers will still not be able to open indexes where any segment was originally written with an index version < MIN_SUPPORTED_MAJOR. The change is to have the reader make the decision based on the aggregate of the segment-level metadata (in SegmentInfo) rather than looking at the index-wide metadata (in SegmentInfos, I think).

@rahulgoswami
Copy link
Author

rahulgoswami commented Sep 25, 2025

@msokolov That's absolutely correct. You summarized every bit perfectly. Thank you.

So the breakdown of tasks for the complete solution looks like this:

  1. Reindex all older segments with a custom merge policy (to prevent them from merging with LATEST version segments during the reindexing process) ==> to be done by Solr ==> Results in ALL segments being LATEST version
  2. At index read time, perform a check to ensure none of the segments are < MIN_SUPPORTED_MAJOR. Don't rely on the immutable metadata ("indexCreatedVersionMajor") in SegmentInfos ==> to be done by Lucene.

Also, another requirement for the overall solution to work is that the user is not jumping across incompatible major versions. Eg: If an index is created in Lucene 11 (with MIN_SUPPORTED_MAJOR=10) and a breaking change happens in Lucene 13 (while MIN_SUPPORTED_MAJOR is still 10), then before moving to Lucene 14 (with now MIN_SUPPORTED_MAJOR=13), user would need to reindex on Lucene 13 via the above process. They can't jump versions, say from Lucene 12 to Lucene 14 and expect to reindex on 14 (since the index won't open for reading on 14).

@rahulgoswami rahulgoswami changed the title Update created version major Revise condition for opening an index for reading Sep 26, 2025
@rahulgoswami rahulgoswami changed the title Revise condition for opening an index for reading Revise strategy for opening an index for reading Sep 26, 2025
@github-actions
Copy link
Contributor

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@github-actions
Copy link
Contributor

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@github-actions github-actions bot added this to the 11.0.0 milestone Oct 14, 2025
@rahulgoswami rahulgoswami marked this pull request as draft October 14, 2025 04:12
@rahulgoswami rahulgoswami marked this pull request as ready for review October 16, 2025 04:22
@rahulgoswami
Copy link
Author

@msokolov @mikemccand Would appreciate if you could take a look please..

SegmentInfo info =
codec.segmentInfoFormat().read(directory, segName, segmentID, IOContext.READONCE);
info.setCodec(codec);
Version segMinVersion = info.getMinVersion();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it a functional change to move this version checking early in the function? It makes reviewing harder since the diffs look bigger and one has to carefully read and scroll around to compare by eyeball, so if it's not needed, let's leave it where it was.

Copy link
Author

@rahulgoswami rahulgoswami Oct 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that was exactly the intention- to move the check early in the function to fail fast. Agree it makes the reviewing trickier. Will incorporate the suggestion.

Version segMinVersion = info.getMinVersion();
Version segmentVersion = info.getVersion();

if (!segmentVersion.onOrAfter(infos.minSegmentLuceneVersion)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we prefer == false to ! for clarity - it helps to avoid mistakes

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. I generally find the ! to be more readable in a natural language flow. Will follow Lucene's convention with == false.

if (segMinVersion.major < minSupportedMajorVersion) {
throw new IndexFormatTooOldException(
input,
"Index has segment traces from Lucene version "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure what you mean by "traces" here? Maybe like "traces of an older version"? I think it will be clearer to just say "Index segment derived from version "?

+ minSupportedMajorVersion
+ "). To resolve this issue: (1) Re-index your data using Lucene "
+ minSupportedMajorVersion
+ ".x or later (preferably "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's strike "preferably" and say explicitly what is allowed

@rahulgoswami
Copy link
Author

@msokolov Thanks for taking the time to review this. I have refactored the PR with your suggestions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants