Releases: apache/lucene
10.3.1
9.12.3
Bug fixes
- Support for JDK24+
- Reduce sharedArenaMaxPermits from 1024 to 64
- Use READONCE to read segment infos
- Fix rare spin-loop in RefCountedSharedArena
- Ensure vector queries handle advanceShallow correctly
- Fix failure due to hole bridge being coplanar with polygon edge
10.3.0
Lucene 10.3 brings major performance improvements.
Lexical search is now vectorized to take better advantage of SIMD instructions, more efficient memory access patterns and CPU pipelining, and to amortize the cost of virtual function calls. Lucene's nightly benchmarks report a 40% speedup compared with Lucene 10.2 when computing top-100 hits by score on disjunctive and conjunctive queries.
Vector search now better parallelizes fetching vectors into the CPU cache. Lucene's nightly benchmarks report a 15%-20% speedup compared with Lucene 10.2.
The terms dictionary performs about 30% faster than in Lucene 10.2 on primary-key lookups according to Lucene's nightly benchmarks. This should help speed up workloads that rely on terms dictionary lookup performance including primary-key lookups, indexing operations that specify an ID and TermInSet queries.
New Features
- Supports reranking with late interaction model multi-vectors, full precision vector similarity scores, or any provided DoubleValuesSource, enabling improved ranking of search results.
- Adds a MultiIndexMergeScheduler, a multi-tenant wrapper that allows sharing a common merge scheduler across multiple instances.
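As an illustration of the reranking item above: late-interaction models commonly score a document by summing, over the query vectors, the best similarity each one finds among the document's vectors (often called MaxSim). The sketch below is plain Java with hypothetical names. It assumes dot-product similarity, and it is not Lucene's reranking API.

```java
/** Sketch of a late-interaction ("MaxSim") score; names are hypothetical, not Lucene's API. */
final class MaxSimSketch {

  /** Dot product of two equal-length vectors. */
  static float dot(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("dimension mismatch");
    }
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** For each query vector, take its best match among the document vectors, then sum. */
  static float maxSim(float[][] query, float[][] doc) {
    float total = 0f;
    for (float[] q : query) {
      float best = Float.NEGATIVE_INFINITY;
      for (float[] d : doc) {
        best = Math.max(best, dot(q, d));
      }
      total += best;
    }
    return total;
  }
}
```

A reranker built this way would typically fetch a first-pass top-k cheaply and apply `maxSim` only to those candidates, since the score costs one dot product per query/document vector pair.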
API Changes
- Adds API to fetch the size of off-heap memory required by a KNN field. This size can be used to help determine the memory requirements for optimal search performance, which can be greatly affected by page faults when not enough memory is available.
- RandomVectorScorer now supports a bulk scoring interface.
- LeafReader#searchNearestVectors now accepts an AcceptDocs instance instead of a Bits instance to identify document IDs to filter.
- Collectors can now take advantage of pre-aggregated data to speed up faceting using LeafCollector#collectRange.
Improvements and Optimizations
- Adds optimistic knn search to vector queries. Optimistic knn search addresses a major issue where results could be inconsistent because of race conditions in the shared queue previously used for multi-segment search.
- Faster vector search on HNSW graphs through GroupVarInt encoding.
- Searcher managers now support 'Adaptive Refresh', enabling users to control the commit points they refresh on. This helps with graceful handling of large replication payloads in segment-replicated systems.
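For readers unfamiliar with group-varint coding, mentioned in the HNSW item above, here is a minimal plain-Java sketch of the general technique. It packs four ints behind a single tag byte that records each value's byte length. The class name and byte layout are assumptions for illustration and may differ from Lucene's actual on-disk format.

```java
import java.io.ByteArrayOutputStream;

/** Sketch of group-varint coding for blocks of four ints; not Lucene's exact on-disk layout. */
final class GroupVarIntSketch {

  /** Number of bytes (1..4) needed to store v as an unsigned value. */
  static int byteLength(int v) {
    if ((v & 0xFFFFFF00) == 0) return 1;
    if ((v & 0xFFFF0000) == 0) return 2;
    if ((v & 0xFF000000) == 0) return 3;
    return 4;
  }

  /** Encodes exactly four ints: one tag byte with four 2-bit lengths, then the payload bytes. */
  static byte[] encode(int[] values) {
    if (values.length != 4) {
      throw new IllegalArgumentException("expected 4 values");
    }
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int tag = 0;
    for (int i = 0; i < 4; i++) {
      tag |= (byteLength(values[i]) - 1) << (i * 2);
    }
    out.write(tag);
    for (int v : values) {
      for (int b = 0, n = byteLength(v); b < n; b++) {
        out.write(v >>> (8 * b)); // little-endian, low byte first
      }
    }
    return out.toByteArray();
  }

  /** Decodes four ints; lengths come from the tag byte, so no continuation bit is checked per byte. */
  static int[] decode(byte[] in) {
    int[] values = new int[4];
    int tag = in[0] & 0xFF;
    int pos = 1;
    for (int i = 0; i < 4; i++) {
      int n = ((tag >>> (i * 2)) & 0x3) + 1;
      int v = 0;
      for (int b = 0; b < n; b++) {
        v |= (in[pos++] & 0xFF) << (8 * b);
      }
      values[i] = v;
    }
    return values;
  }
}
```

Compared with classic variable-length ints, the decoder learns all four lengths from one byte up front, which tends to be friendlier to branch prediction when reading long runs of small values such as neighbor lists.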
Runtime Behavior Changes and Bug Fixes
- The default ReadAdvice has been changed from RANDOM to NORMAL. MMapDirectory will no longer set any specific read advice out-of-the-box.
- The default RefCountedSharedArena.DEFAULT_MAX_PERMITS is reduced to 64. Also fixes an infinite loop when RefCountedSharedArena's underlying Arena#close fails due to concurrent usage of segments.
- Uses READONCE when reading segment infos, to fix mmap leaks on segment info files. Includes fixes for multiple other resource leaks.
9.12.2
Bug fixes
- Reduce NeighborArray on-heap memory during HNSW graph building
- Fix IndexSortSortedNumericDocValuesRangeQuery for int sort
- ValueSource.fromDoubleValuesSource(dvs).getSortField() would throw errors when used if the DoubleValuesSource needed scores
- Disable connectedComponents logic in HNSW graph building.
10.2.2
Bug fixes
- Reduce NeighborArray on-heap memory during HNSW graph building
- Fix IndexSortSortedNumericDocValuesRangeQuery for int sort
- ValueSource.fromDoubleValuesSource(dvs).getSortField() would throw errors when used if the DoubleValuesSource needed scores
10.2.1
This patch release contains bug fixes that are highlighted below.
- Fix DISIDocIdStream::count so that it does not try to count beyond max.
- Correct TermOrdValComparator competitive iterator so that it forces sparse field iteration to be at least scoring window baseline when doing intoBitSet.
- Provide better impacts for fields indexed with IndexOptions.DOCS
- Fixed lead cost computations for bulk scorers of conjunctive queries that mix MUST and FILTER clauses, and disjunctive queries that configure a minimum number of matching SHOULD clauses.
10.2.0
Lucene 10.2 includes major search-time performance improvements for a wide variety of queries. This is most notably due to:
- Improved storage format of doc IDs in BKD trees for faster decoding.
- More vectorization when processing PointRangeQuerys and non-scoring BooleanQuerys.
- Encoding of dense blocks of postings lists as bit sets instead of FOR-delta. This change also saves a bit of storage.
- Merging matches of dense conjunctive clauses using bitwise ANDs. This especially helps on postings blocks that are encoded as bit sets.
- Implementing the ACORN-1 algorithm for pre-filtered vector searches.
Searches that don't require scores and match many docs should generally see good speedups, depending on how expensive the Collector is. Compared with Lucene 10.1.0, Lucene's nightly benchmarks report the following speedups when counting the number of hits of the following queries:
* Disjunctions of term queries: 77% to 4x faster
* Conjunctions of term queries: 38% to 5x faster
* Filtered disjunctions of term queries: 2.5x to 4x faster
* Filtered PointRangeQuery: 3.5x faster
And the following speedup when computing top-100 hits:
* Pre-filtered vector search: 3.5x faster
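To make the bit-set items above concrete, the following plain-Java sketch shows why dense blocks encoded as bit sets intersect cheaply: one bitwise AND handles 64 doc IDs, and counting hits needs only a population count per word. The class and method names are hypothetical and this is not Lucene's internal code.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of intersecting two dense doc-ID blocks stored as bit sets; not Lucene's internal code. */
final class BitSetConjunctionSketch {

  /** Sets one bit per doc ID; docs must be in [0, 64 * numWords). */
  static long[] toBits(int[] docs, int numWords) {
    long[] bits = new long[numWords];
    for (int doc : docs) {
      bits[doc >>> 6] |= 1L << (doc & 63);
    }
    return bits;
  }

  /** Counts matching docs without materializing them: AND, then popcount, per 64-doc word. */
  static int count(long[] a, long[] b) {
    int hits = 0;
    int words = Math.min(a.length, b.length);
    for (int w = 0; w < words; w++) {
      hits += Long.bitCount(a[w] & b[w]);
    }
    return hits;
  }

  /** ANDs the two blocks word by word and lists the surviving doc IDs in increasing order. */
  static List<Integer> intersect(long[] a, long[] b) {
    List<Integer> matches = new ArrayList<>();
    int words = Math.min(a.length, b.length);
    for (int w = 0; w < words; w++) {
      long both = a[w] & b[w];
      while (both != 0) {
        int bit = Long.numberOfTrailingZeros(both);
        matches.add((w << 6) + bit);
        both &= both - 1; // clear the lowest set bit
      }
    }
    return matches;
  }
}
```

For example, intersecting docs {1, 5, 64, 70, 127} with {5, 64, 100, 127} yields [5, 64, 127] after just two AND operations.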
Changes in Runtime Behavior
- TieredMergePolicy's default floor segment size was increased from 2MB to 16MB. This is expected to result in slightly slower indexing and about 10 fewer segments per index for applications that flush frequently. This should in turn help speed up queries that have a high per-segment overhead such as multi-term queries, point queries and vector search.
New Features
- Added TopDocs#rrf to combine multiple TopDocs instances using reciprocal rank fusion.
- Added SeededKnnVectorQuery, an optimization to KnnVectorQuery that allows selecting better entry points for vector search using a seed Query.
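Reciprocal rank fusion, mentioned in the TopDocs#rrf item above, combines rankings using only each document's rank and not its raw score, which is why it suits mixing lexical and vector results. The sketch below shows the general formula in plain Java over lists of doc IDs. The class, method, and parameter names are hypothetical, the tie-breaking rule is an assumption, and this is not the TopDocs#rrf signature.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of reciprocal rank fusion over ranked lists of doc IDs; not the TopDocs#rrf signature. */
final class RrfSketch {

  /** Each list contributes 1 / (k + rank) per doc, with rank starting at 1; higher sums rank first. */
  static List<Integer> fuse(List<List<Integer>> rankings, int k, int topN) {
    Map<Integer, Double> scores = new HashMap<>();
    for (List<Integer> ranking : rankings) {
      for (int i = 0; i < ranking.size(); i++) {
        scores.merge(ranking.get(i), 1.0 / (k + i + 1), Double::sum);
      }
    }
    List<Integer> docs = new ArrayList<>(scores.keySet());
    // Sort by descending fused score, breaking ties by ascending doc ID for determinism.
    docs.sort((x, y) -> {
      int byScore = Double.compare(scores.get(y), scores.get(x));
      return byScore != 0 ? byScore : Integer.compare(x, y);
    });
    return docs.subList(0, Math.min(topN, docs.size()));
  }
}
```

With a lexical ranking [1, 2, 3], a vector ranking [3, 1, 4] and k = 60, the fused top 3 is [1, 3, 2]: docs 1 and 3 appear in both lists, and doc 1 has the better combined ranks.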
Improvements
- RegexpQuery support for Unicode case-insensitive characters and ranges.
Optimizations
- Java 24 vector API support
- Efficiency improvements to Automaton and RegExp
- Faster merging of HNSW graphs, which translated into a 25% indexing speedup in Lucene's nightly benchmarks.
- Conjunctive queries can now skip applying clauses when they have long runs of matching docs, a case which is not uncommon when an index sort is configured.
- Reduce heap usage during BKD tree merges.
10.1.0
New Features
- Add IndexInput::isLoaded to determine if the contents of an input is resident in physical memory
- FeatureField now supports storing term vectors.
Improvements
- TieredMergePolicy now allows merging up to maxMergeAtOnce segments for merges below the floor segment size, even if maxMergeAtOnce is greater than segmentsPerTier. This makes it more efficient to configure TieredMergePolicy to merge segments aggressively by configuring a high value of floorSegmentSize (e.g. 64MB), a low value of segmentsPerTier (e.g. 4) and a high value of maxMergeAtOnce (e.g. 32).
Optimizations
- Many speedups to top-k query evaluation, in particular: top-level disjunctions, filtered disjunctions, conjunctions, DisjunctionMaxQuery.
- Speedup to exhaustive evaluation of conjunctive queries by vectorizing the intersection of postings lists.
- Reduced contention for top-k query evaluation when IndexSearcher is configured with an executor.
9.12.1
Improvements
- Allow easier configuration of the Panama vectorization provider with newer Java versions. Set the org.apache.lucene.vectorization.upperJavaFeatureVersion system property to increase the set of Java versions that Panama vectorization will provide optimized implementations for.
Bug fixes
- Fixed backwards compatibility bug that caused sparse (not all documents have a vector) KNN indices written with 9.0.0 to silently (with no exception) return terrible recall results when searched by any 9.x release.
- Improve Tessellator logic when two holes share the same vertex with the polygon, which was failing on valid polygons.
- Fix backwards compatibility bug that caused 9.12.0 to incorrectly throw IllegalStateException when trying to open an IndexReader on an index created with quantized (int4, int7, int8) KNN vectors using Lucene99HnswScalarQuantizedVectorsFormat.
10.0.0
System requirements
- Lucene 10.0 requires JDK 21 or newer
API changes
- KNN vector values now have a random-access API.
- Deprecated APIs have been removed and a number of API changes have been made. Please consult the migration guide for an extensive list and actions to take to migrate to 10.0.
New Features
- A new IndexInput#prefetch API has been added, allowing query evaluation logic to let the Directory know about regions of data that are about to be read. This helps perform I/O concurrently under the hood. MMapDirectory implements this API using the madvise system call and the MADV_WILLNEED flag on Linux and Mac OS.
- Lucene now supports sparse indexing on doc values via FieldType#setDocValuesSkipIndexType. The sparse index will record the minimum and maximum values per block of doc IDs. Used in conjunction with index sorting to cluster similar documents together, this allows for very space-efficient and CPU-efficient filtering.
- Search concurrency is now decoupled from the index geometry, so that an index can be searched using any number of threads, regardless of its number of segments.
- Kmeans clustering on vectors
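The sparse doc-values index described above records a minimum and maximum per block of doc IDs. A rough plain-Java sketch of how such a structure lets a range filter skip whole blocks is shown below. The class name, the fixed block size, and the `blocksScanned` counter are assumptions for illustration and do not reflect Lucene's file format or API.

```java
/** Sketch of a per-block min/max skip index over a doc-values column; not Lucene's file format. */
final class MinMaxSkipSketch {
  final long[] values;   // one value per doc ID
  final int blockSize;
  final long[] blockMin;
  final long[] blockMax;
  int blocksScanned;     // instrumentation: blocks read value by value in the last query

  MinMaxSkipSketch(long[] values, int blockSize) {
    this.values = values;
    this.blockSize = blockSize;
    int numBlocks = (values.length + blockSize - 1) / blockSize;
    blockMin = new long[numBlocks];
    blockMax = new long[numBlocks];
    for (int b = 0; b < numBlocks; b++) {
      long min = Long.MAX_VALUE;
      long max = Long.MIN_VALUE;
      int end = Math.min(values.length, (b + 1) * blockSize);
      for (int doc = b * blockSize; doc < end; doc++) {
        min = Math.min(min, values[doc]);
        max = Math.max(max, values[doc]);
      }
      blockMin[b] = min;
      blockMax[b] = max;
    }
  }

  /** Counts docs with lo <= value <= hi, consulting the per-block min/max before touching values. */
  int countInRange(long lo, long hi) {
    int count = 0;
    blocksScanned = 0;
    for (int b = 0; b < blockMin.length; b++) {
      if (blockMax[b] < lo || blockMin[b] > hi) {
        continue; // no doc in this block can match
      }
      int start = b * blockSize;
      int end = Math.min(values.length, start + blockSize);
      if (blockMin[b] >= lo && blockMax[b] <= hi) {
        count += end - start; // every doc in this block matches
        continue;
      }
      blocksScanned++;
      for (int doc = start; doc < end; doc++) {
        if (values[doc] >= lo && values[doc] <= hi) {
          count++;
        }
      }
    }
    return count;
  }
}
```

When the index is sorted on the same field, neighboring docs have similar values, so most blocks fall entirely inside or entirely outside the range and only the boundary blocks need a value-by-value scan.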
Improvements
- Lucene now opens files with the MADV_RANDOM advice by default on Linux and Mac OS. This results in better efficiency for indexes that exceed the size of the page cache, but can make it slower to load indexes in the page cache. It is possible to revert to the MADV_NORMAL read advice by default by passing -Dorg.apache.lucene.store.defaultReadAdvice=NORMAL as a JVM startup flag.
- Snowball dictionaries have been upgraded, resulting in improved tokenization. This may require reindexing to ensure consistency of search results with pre-10.0 indexes.
- The expressions module now uses MethodHandles and Dynamic Class-File Constants (JEP 309) in combination with hidden classes (JEP 371) to implement a strict and type-safe call to external functions. This makes it easier to extend expressions with custom functions in a secure way, because runtime linking of custom functions is no longer the responsibility of the expressions scripting engine. In addition, the hidden classes created by the expressions engine no longer suffer from global classloader locks.
... plus a multitude of helpful bug fixes!