Conversation

0ctopus13prime
Collaborator

Description

RFC: #2875

This PR introduces a new build structure in k-NN for SIMD-based computation, implementing the V1 design described in the RFC linked above.
V1 relies on Faiss distance calculation functions for similarity scoring. The structure introduced here is designed to make it easy to add bulk SIMD operations in subsequent versions. Starting with V1 helps reduce the initial review complexity for maintainers.

The core concept is to extract mapped memory pointers from MemorySegmentIndexInput and leverage native SIMD acceleration to enhance search performance.

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • [O] New functionality includes testing.
  • [O] New functionality has been documented.
  • [O] API changes companion pull request created.
  • [O] Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@0ctopus13prime
Collaborator Author

0ctopus13prime commented Oct 6, 2025

Performance improvement summarization

With early termination + native scoring for FP16,
I'm seeing a 70% QPS improvement over Faiss C++ in the multi-segment scenario on Cohere-10M, with a 1-2% recall drop.
For the single-segment case, QPS is slightly lower than Faiss C++, which I suspect is due to the overhead of the JNI call plus the reflection logic that extracts MemorySegment[].

This matches what we saw in the POC performance benchmark in the RFC.
Therefore, once BulkSimd + prefetch optimizations (e.g. V2) come into play, I expect it to improve further.

@@ -0,0 +1,562 @@
#include <algorithm>
Collaborator Author

oops, mistakenly included V2.
This will be removed in the next revision.

@navneet1v
Collaborator

@0ctopus13prime can we fix the failing CIs? It looks like some tests have failed.

@navneet1v
Collaborator

Therefore, once BulkSimd + prefetch optimization (e.g. V2) comes in play, I think it will be improved further.

@0ctopus13prime why are we not implementing V2 here and going with V1 instead?

@0ctopus13prime
Collaborator Author

@navneet1v
yeah, I think it's a flaky test; KNN1030CodecTests complains that Settings was null. Probably some initialization is missing.
For Windows, it's a MinGW compilation issue, and I'm fixing it right now.

Also, I'm taking an iterative path to get to V2. Including V2 in this PR would make it too heavy to review, as 36 files have already been modified 😅
Since V2 will only add intrinsics for each chip, it will be much easier to review than this version.

@0ctopus13prime 0ctopus13prime force-pushed the use-faise-scoring-v1 branch 5 times, most recently from 060028c to 2ed780f Compare October 6, 2025 16:47
@0ctopus13prime 0ctopus13prime changed the base branch from main to feature/fp16-faiss-bulk October 6, 2025 16:52
Signed-off-by: Doo Yong Kim <[email protected]>
@0ctopus13prime
Collaborator Author

Hi @Vikasht34 @shatejas
Fixed the build failure on Windows. Please take a look!

private final KnnCollectorManager knnCollectorManager;

// Ported from Lucene: since 10.3.0, TopKnnCollectorManager no longer uses MultiLeafKnnCollector.
private static class MultiLeafTopKnnCollectorManager implements KnnCollectorManager {
Collaborator Author

@0ctopus13prime 0ctopus13prime Oct 7, 2025

This will not go into main; it's for benchmarking.

@Vikasht34
Collaborator

Will look into the PR tomorrow morning.

Collaborator

@shatejas shatejas left a comment


Still working through the PR - cpp review is pending

@shatejas
Collaborator

shatejas commented Oct 9, 2025

Looks good overall, please remove v2 files in the next rev

this.addressAndSize = mmapVectorValues.getAddressAndSize();
this.maxOrd = knnVectorValues.size();
this.nativeFunctionTypeOrd = similarityFunctionType.ordinal();
SimdVectorComputeService.saveSearchContext(query, addressAndSize, nativeFunctionTypeOrd);
Collaborator

I am wondering if the use of thread local should be minimized to queryVectorSimdAligned. Every other parameter in the search context can be local to the thread. This increases maintainability of the code while keeping the optimization. Let me know what you think.

Collaborator Author

Sure, sounds good.
But I think it's better to save the list of addresses so the vector pointer for distance calculation can be resolved quickly.
Other than that, most of the rest are native types, so we can pass them as parameters.
Will update in the next rev.

Collaborator Author

Hi @shatejas,
I tried to reflect your comment, but I think we need to save the similarity function in the context so we can create and keep the Faiss distance calculator. Otherwise, we would need to create the distance calculator every time a node is visited.

Signed-off-by: Dooyong Kim <[email protected]>
Collaborator

@navneet1v navneet1v left a comment


Will need more time to review the PR, but publishing my initial comments.

annConfig.getQuantizationConfig(),
annConfig.getModelId()
);
log.info("Field [{}] memory_optimized_search_enabled={}", name, memoryOptimizedSearchAvailable);
Collaborator

Should this be a debug log? Also, we should see how we can surface this information in our explain API.

Collaborator Author

I think this should stay at info level, though the message format should be revised.
Adding information to the explain API is a separate scope and can be dealt with in another PR.

Comment on lines +3 to +13
#if defined(__GNUC__) || defined(__clang__)
#define LIKELY(x) (__builtin_expect(!!(x), 1))
#define UNLIKELY(x) (__builtin_expect(!!(x), 0))
#elif defined(_MSC_VER)
// MSVC doesn't have __builtin_expect; just pass through
#define LIKELY(x) (x)
#define UNLIKELY(x) (x)
#else
#define LIKELY(x) (x)
#define UNLIKELY(x) (x)
#endif
Collaborator

Can you please add proper docs on what these macros do?

Collaborator Author

sure, will do
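For reference, a possible documented form of the macros, sketched here (the comments are mine, not taken from the PR):

```cpp
#include <cassert>  // for the usage check below

// LIKELY/UNLIKELY are branch-prediction hints. On GCC and Clang they expand
// to __builtin_expect, which tells the optimizer which way a branch usually
// goes so the hot path can be laid out fall-through; on other compilers they
// are no-ops. The !!(x) normalizes any truthy value to exactly 0 or 1.
#if defined(__GNUC__) || defined(__clang__)
// Hint: the condition is almost always true (hot path).
#define LIKELY(x) (__builtin_expect(!!(x), 1))
// Hint: the condition is almost always false (cold path, e.g. error checks).
#define UNLIKELY(x) (__builtin_expect(!!(x), 0))
#else
// MSVC and others: no equivalent builtin; evaluate the condition as-is.
#define LIKELY(x) (x)
#define UNLIKELY(x) (x)
#endif
```

Either way the condition's value is unchanged; only code layout is affected, so the macros are safe to use in ordinary `if` statements.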

// FP16
//

using BulkScoreTransform = void (*)(float*/*scores*/, int32_t/*num scores to transform*/);
Collaborator

Suggested change
using BulkScoreTransform = void (*)(float*/*scores*/, int32_t/*num scores to transform*/);
// scores and num scores to transform*
using BulkScoreTransform = void (*)(float*, int32_t);

Collaborator Author

Hm.. I think removing the variable names can confuse users about what each parameter is for.
Will keep the variable names and add the comment.

//

using BulkScoreTransform = void (*)(float*/*scores*/, int32_t/*num scores to transform*/);
using ScoreTransform = float (*)(float/*score*/);
Collaborator

same as above

Collaborator Author

I think it's better to keep the variable names. Will add the comment.

#include <cstdint>

// This class is responsible to convert Faiss distance value to Lucene similarity score.
struct FaissScoreToLuceneScoreTransform final {
Collaborator

Why can't we move this code to Java? Is there a reason we need to keep these translations on the C++ side? We already have multiple score translations on the Java side. Also, once we are in Java we can use Lucene's VectorUtil class to convert IP to MaxIP.

https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/VectorUtil.java#L367-L376

Also, do we really need this conversion from IP to MaxIP? Don't we need to convert back to IP and then do the Faiss-based score translation to keep scores consistent with non-memory-optimized search?

Collaborator Author

This is a bulk conversion optimization on the C++ side, which shows better performance; it's not about which API we use.

And we do need the conversion from IP to MaxIP. One of the reasons is radial search: we convert the radius to a MaxIP score for radial search when memory-optimized search is enabled.
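For readers following along, the IP-to-MaxIP scaling being discussed follows Lucene's max-inner-product convention; the sketch below is mine (the function names mirror those in the diff, the bodies are illustrative):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of an IP -> MaxIP transform in the style of Lucene's
// VectorUtil#scaleMaxInnerProductScore: negative inner products are squashed
// into (0, 1), non-negative ones shift to [1, inf), so every score is
// positive and the ordering of raw inner products is preserved.
static float ipToMaxIpTransform(float innerProduct) noexcept {
    if (innerProduct < 0.0f) {
        return 1.0f / (1.0f - innerProduct);
    }
    return innerProduct + 1.0f;
}

// Bulk variant matching the BulkScoreTransform signature from the diff.
static void ipToMaxIpTransformBulk(float* scores, int32_t numScores) {
    for (int32_t i = 0; i < numScores; ++i) {
        scores[i] = ipToMaxIpTransform(scores[i]);
    }
}
```

A radial-search radius can be pushed through the same scalar transform, so that the threshold and the per-vector scores stay in the same MaxIP space.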

Comment on lines +40 to +44
// Transform Faiss L2 distance to be bounded (0, 1]
static float l2Transform(float l2Distance) noexcept {
return 1.0f / (1.0f + l2Distance);
}

Collaborator

same as above

Collaborator Author

Addressed the comment.

Comment on lines +159 to +160
// Thread static local SimdVectorSearchContext
thread_local SimdVectorSearchContext THREAD_LOCAL_SIMD_VEC_SRCH_CTX {};
Collaborator

What is the benefit of storing the SimdVectorSearchContext in a thread local here?

And if I am not wrong, this assumes that every Java search thread gets mapped to a new C++ thread, right?

Collaborator Author

A Java thread is itself a native thread; there's no distinction between a C++ thread and a Java thread. They are both plain POSIX threads.

The thread local is there to prevent frequent memory allocation and to reuse memory space as much as possible. From the benchmark, I found memory allocation was the bottleneck. (We could call a JNI function to get the raw pointer from float[] every time, but that drags down performance.) So the thread local keeps the memory space around to minimize allocations.

THREAD_LOCAL_SIMD_VEC_SRCH_CTX.tmpBuffer = {};

// Allocate query vector space
if (THREAD_LOCAL_SIMD_VEC_SRCH_CTX.queryVectorByteSize < queryByteSize) {
Collaborator

what would be the case in which this condition will not be true?

Collaborator Author

@0ctopus13prime 0ctopus13prime Oct 14, 2025

Imagine a scenario where one field has 768 dimensions and another field has 1024 dimensions.
A query for the first field will initially allocate 768 * 4 bytes; if the next query targets the second field, the internal space needs to grow, so the condition is true. Once the buffer has grown, subsequent queries that fit in the existing space (e.g. back to the 768-dimension field) will find the condition false and reuse it.
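A minimal sketch of the grow-only, thread-local buffer pattern being described (the names are illustrative, not the actual SimdVectorSearchContext fields):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct SearchContextSketch {
    std::vector<uint8_t> queryVector;  // reused across queries on this thread
    size_t queryVectorByteSize = 0;    // capacity high-water mark in bytes

    // Grow only when the incoming query needs more space than any earlier
    // query on this thread did; otherwise reuse the existing allocation,
    // so the hot path performs no allocation at all.
    void ensureCapacity(size_t queryByteSize) {
        if (queryVectorByteSize < queryByteSize) {
            queryVector.resize(queryByteSize);
            queryVectorByteSize = queryByteSize;
        }
    }
};

// One context per thread: a 768-dim query allocates 768 * 4 bytes, a later
// 1024-dim query grows it once, and every query after that allocates nothing.
thread_local SearchContextSketch threadLocalCtx{};
```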


if(NOT DEFINED AVX512_SPR_ENABLED)
# Check if the system is Intel(R) Sapphire Rapids or a newer-generation processor
execute_process(COMMAND bash -c "lscpu | grep -q 'GenuineIntel' && lscpu | grep -i 'avx512_fp16' | grep -i 'avx512_bf16' | grep -i 'avx512_vpopcntdq'" OUTPUT_VARIABLE SPR_FLAGS OUTPUT_STRIP_TRAILING_WHITESPACE)
Collaborator

is this what we wanted to do here?

if(NOT DEFINED AVX512_SPR_ENABLED)
    set(AVX512_SPR_ENABLED false)
    
    if(AVX512_ENABLED AND ${CMAKE_SYSTEM_NAME} STREQUAL "Linux")
        find_program(LSCPU_PROGRAM lscpu)
        if(LSCPU_PROGRAM)
            execute_process(
                COMMAND ${LSCPU_PROGRAM}
                OUTPUT_VARIABLE CPU_INFO
                ERROR_QUIET
                OUTPUT_STRIP_TRAILING_WHITESPACE
            )
            if(CPU_INFO MATCHES "GenuineIntel" AND 
               CPU_INFO MATCHES "avx512_fp16" AND 
               CPU_INFO MATCHES "avx512_bf16" AND 
               CPU_INFO MATCHES "avx512_vpopcntdq")
                set(AVX512_SPR_ENABLED true)
            endif()
        endif()
    endif()
endif()


if(NOT DEFINED AVX512_SPR_ENABLED)
# Check if the system is Intel(R) Sapphire Rapids or a newer-generation processor
execute_process(COMMAND bash -c "lscpu | grep -q 'GenuineIntel' && lscpu | grep -i 'avx512_fp16' | grep -i 'avx512_bf16' | grep -i 'avx512_vpopcntdq'" OUTPUT_VARIABLE SPR_FLAGS OUTPUT_STRIP_TRAILING_WHITESPACE)
Collaborator

Did you mean RESULT_VARIABLE?

include(CheckCXXSourceCompiles)

# Allow user overrides
if(NOT DEFINED AVX2_ENABLED)
Collaborator

So we are setting AVX2, AVX512, and SPR to true by default. Should they default to false?


if(NOT DEFINED AVX512_SPR_ENABLED)
# Check if the system is Intel(R) Sapphire Rapids or a newer-generation processor
execute_process(COMMAND bash -c "lscpu | grep -q 'GenuineIntel' && lscpu | grep -i 'avx512_fp16' | grep -i 'avx512_bf16' | grep -i 'avx512_vpopcntdq'" OUTPUT_VARIABLE SPR_FLAGS OUTPUT_STRIP_TRAILING_WHITESPACE)
Collaborator

Should we guard the `COMMAND bash -c "lscpu ..."` with an if like the one below?

if(CMAKE_HOST_SYSTEM_NAME STREQUAL "Linux" AND NOT CMAKE_CROSSCOMPILING)

set(SIMD_OPT_LEVEL "")
set(SIMD_FLAGS "")

if(${CMAKE_SYSTEM_NAME} STREQUAL "Windows" OR (NOT AVX2_ENABLED AND NOT AVX512_ENABLED AND NOT AVX512_SPR_ENABLED))
Collaborator

MSVC now supports AVX2 and AVX-512; do we plan to enable them there?

}

// Transform score values if it needs to
return BulkScoreTransformFunc(scores, numVectors);
Collaborator

Why are we returning the result of a void function here?

#include <stdint.h>
#include <cmath>

#include "simd_similarity_function_common.cpp"
Collaborator

Since we are pulling a .cpp file into another .cpp file, will it create build churn? Can we split them into headers?

// FP16
//
// 1. Max IP
DefaultFP16SimilarityFunction<FaissScoreToLuceneScoreTransform::ipToMaxIpTransformBulk, FaissScoreToLuceneScoreTransform::ipToMaxIpTransform> DEFAULT_FP16_MAX_INNER_PRODUCT_SIMIL_FUNC;
Collaborator

Should this be static, and the other one as well?


float calculateSimilarity(SimdVectorSearchContext* srchContext, const int32_t internalVectorId) final {
// Prepare distance calculation
auto vector = reinterpret_cast<uint8_t*>(srchContext->getVectorPointer(internalVectorId));
Collaborator

Can we add a null check in case vector is null for internalVectorId?

const int32_t numVectors) final {

// Prepare similarity calculation
auto func = dynamic_cast<faiss::ScalarQuantizer::SQDistanceComputer*>(srchContext->faissFunction.get());
Collaborator

Since we are doing this dynamic_cast here as well as in the function below, it would be slow and unnecessary if the context always carries an SQDistanceComputer. Can we either store a typed pointer in the context once, or check once at entry and then static_cast inside?
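The cached-typed-pointer approach being suggested could look roughly like this (stand-in stub types, not the real Faiss classes):

```cpp
#include <memory>
#include <utility>

// Stand-ins for faiss::DistanceComputer and
// faiss::ScalarQuantizer::SQDistanceComputer.
struct DistanceComputerStub {
    virtual ~DistanceComputerStub() = default;
};
struct SQDistanceComputerStub : DistanceComputerStub {};

struct ContextSketch {
    std::unique_ptr<DistanceComputerStub> faissFunction;
    // Typed view of faissFunction, resolved once when the context is set up.
    SQDistanceComputerStub* sqFunction = nullptr;

    void setFunction(std::unique_ptr<DistanceComputerStub> fn) {
        faissFunction = std::move(fn);
        // Single dynamic_cast at setup; the hot scoring path then reuses the
        // cached typed pointer instead of casting on every batch.
        sqFunction = dynamic_cast<SQDistanceComputerStub*>(faissFunction.get());
    }
};
```

Alternatively, one dynamic_cast check at the entry of the scoring call followed by static_cast inside the loop gives the same effect without changing the context layout.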
