Conversation

0ctopus13prime
Collaborator

Description

RFC: #2875

This PR introduces a new build structure in k-NN for SIMD-based computation, implementing the V1 design described in the RFC linked above.
V1 relies on Faiss distance calculation functions for similarity scoring. The structure introduced here is designed to make it easy to add bulk SIMD operations in subsequent versions. Starting with V1 helps reduce the initial review complexity for maintainers.

The core concept is to extract mapped memory pointers from MemorySegmentIndexInput and leverage native SIMD acceleration to enhance search performance.

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • [O] New functionality includes testing.
  • [O] New functionality has been documented.
  • [O] API changes companion pull request created.
  • [O] Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@0ctopus13prime
Collaborator Author

0ctopus13prime commented Oct 6, 2025

Performance improvement summarization

With early termination + native scoring for FP16,
I'm seeing a 70% QPS improvement over Faiss C++ in the multi-segment scenario on Cohere-10M, with a 1-2% recall drop.
For the single-segment case, QPS is slightly lower than Faiss C++, which I suspect is due to the overhead of the JNI call plus the reflection logic that extracts MemorySegment[].

This matches what we saw in the POC performance benchmark in the RFC.
Therefore, once BulkSimd + prefetch optimizations (e.g. V2) come into play, I expect it to improve further.

@@ -0,0 +1,562 @@
#include <algorithm>
Collaborator Author

oops, mistakenly included V2.
This will be removed in the next revision.

@navneet1v
Collaborator

@0ctopus13prime can we fix the failing CIs? It looks like some tests have failed.

@navneet1v
Collaborator

Therefore, once BulkSimd + prefetch optimization (e.g. V2) comes in play, I think it will be improved further.

@0ctopus13prime why are we not implementing V2 here and going with V1 instead?

@0ctopus13prime
Collaborator Author

@navneet1v
yeah, I think it's a flaky test; KNN1030CodecTests complains that Settings was null. Probably some initialization is missing.
For Windows, it's a MinGW compilation issue, and I'm fixing it right now.

Also, I'm taking an iterative path to get to V2. Including V2 in this PR would make it too heavy to review, as 36 files have already been modified 😅
Since V2 will only add intrinsics for each chip, it will be much easier to review than this version.

@0ctopus13prime 0ctopus13prime force-pushed the use-faise-scoring-v1 branch 5 times, most recently from 060028c to 2ed780f Compare October 6, 2025 16:47
@0ctopus13prime 0ctopus13prime changed the base branch from main to feature/fp16-faiss-bulk October 6, 2025 16:52
Signed-off-by: Doo Yong Kim <[email protected]>
@0ctopus13prime
Collaborator Author

Hi @Vikasht34 @shatejas
Fixed the build failure on Windows. Please take a look!

private final KnnCollectorManager knnCollectorManager;

// Ported from Lucene: since 10.3.0, TopKnnCollectorManager no longer uses MultiLeafKnnCollector.
private static class MultiLeafTopKnnCollectorManager implements KnnCollectorManager {
Collaborator Author

@0ctopus13prime 0ctopus13prime Oct 7, 2025

This will not go into main; it's for benchmarking.

@Vikasht34
Collaborator

Will look into the PR tomorrow morning.

Collaborator

@shatejas shatejas left a comment


Still working through the PR - cpp review is pending

@shatejas
Collaborator

shatejas commented Oct 9, 2025

Looks good overall, please remove v2 files in the next rev

this.addressAndSize = mmapVectorValues.getAddressAndSize();
this.maxOrd = knnVectorValues.size();
this.nativeFunctionTypeOrd = similarityFunctionType.ordinal();
SimdVectorComputeService.saveSearchContext(query, addressAndSize, nativeFunctionTypeOrd);
Collaborator

I am wondering if the use of thread local should be minimized to queryVectorSimdAligned. Every other parameter in the search context can be local to the thread. This increases maintainability of the code while keeping the optimization. Let me know what you think.

Collaborator Author

Sure, sounds good.
But I think it's better to save the list of addresses so the vector pointer for distance calculation can be resolved quickly.
Other than that, most of the rest are native types, so we can pass them as parameters.
Will update in the next rev.

Collaborator Author

Hi @shatejas,
I tried to reflect your comment, but I think we need to save the similarity function in the context so we can create and keep the Faiss distance calculator. Otherwise, we would need to create the distance calculator every time a node is visited.

Signed-off-by: Dooyong Kim <[email protected]>
Collaborator

@navneet1v navneet1v left a comment


Will need more time to review the PR, but publishing my initial comments.

annConfig.getQuantizationConfig(),
annConfig.getModelId()
);
log.info("Field [{}] memory_optimized_search_enabled={}", name, memoryOptimizedSearchAvailable);
Collaborator

Should this be a debug log? Also, we should see how we can surface this information in our explain API.

Collaborator Author

I think this should stay at info level, though the message format should be revised.
Adding information to the explain API is a separate scope and can be dealt with in another PR.

Comment on lines +3 to +13
#if defined(__GNUC__) || defined(__clang__)
#define LIKELY(x) (__builtin_expect(!!(x), 1))
#define UNLIKELY(x) (__builtin_expect(!!(x), 0))
#elif defined(_MSC_VER)
// MSVC doesn't have __builtin_expect; just pass through
#define LIKELY(x) (x)
#define UNLIKELY(x) (x)
#else
#define LIKELY(x) (x)
#define UNLIKELY(x) (x)
#endif
Collaborator

Can you please add proper docs on what these macros do?

Collaborator Author

sure, will do
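For reference, a possible documented form of the macros, sketched here (the comments are mine, not taken from the PR):

```cpp
#include <cassert>  // for the usage check below

// LIKELY/UNLIKELY are branch-prediction hints. On GCC and Clang they expand
// to __builtin_expect, which tells the optimizer which way a branch usually
// goes so the hot path can be laid out fall-through; on other compilers they
// are no-ops. The !!(x) normalizes any truthy value to exactly 0 or 1.
#if defined(__GNUC__) || defined(__clang__)
// Hint: the condition is almost always true (hot path).
#define LIKELY(x) (__builtin_expect(!!(x), 1))
// Hint: the condition is almost always false (cold path, e.g. error checks).
#define UNLIKELY(x) (__builtin_expect(!!(x), 0))
#else
// MSVC and others: no equivalent builtin; evaluate the condition as-is.
#define LIKELY(x) (x)
#define UNLIKELY(x) (x)
#endif
```

Either way the condition's value is unchanged; only code layout is affected, so the macros are safe to use in ordinary `if` statements.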

// FP16
//

using BulkScoreTransform = void (*)(float*/*scores*/, int32_t/*num scores to transform*/);
Collaborator

Suggested change
using BulkScoreTransform = void (*)(float*/*scores*/, int32_t/*num scores to transform*/);
// scores and num scores to transform*
using BulkScoreTransform = void (*)(float*, int32_t);

Collaborator Author

Hm.. I think removing the variable names can confuse users about what each parameter is for.
Will keep the variable names and add the comment.

//

using BulkScoreTransform = void (*)(float*/*scores*/, int32_t/*num scores to transform*/);
using ScoreTransform = float (*)(float/*score*/);
Collaborator

same as above

Collaborator Author

I think it's better to keep the variable names. Will add the comment.

#include <cstdint>

// This class is responsible to convert Faiss distance value to Lucene similarity score.
struct FaissScoreToLuceneScoreTransform final {
Collaborator

Why can't we move this code to Java? Is there a reason we need to keep these translations on the C++ side? We already have multiple score translations on the Java side. Also, once we are in Java we can use Lucene's VectorUtil class to convert IP to MaxIP.

https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/VectorUtil.java#L367-L376

Also, do we really need this conversion from IP to MaxIP? Don't we need to convert back to IP and then do the Faiss-based score translation to keep scores consistent with non-memory-optimized search?

Collaborator Author

This is a bulk conversion optimization on the C++ side, which shows better performance; it's not about which API we use.

And we do need the conversion from IP to MaxIP. One of the reasons is radial search: we convert the radius to a MaxIP score for radial search when memory-optimized search is enabled.
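For readers following along, the IP-to-MaxIP scaling being discussed follows Lucene's max-inner-product convention; the sketch below is mine (the function names mirror those in the diff, the bodies are illustrative):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of an IP -> MaxIP transform in the style of Lucene's
// VectorUtil#scaleMaxInnerProductScore: negative inner products are squashed
// into (0, 1), non-negative ones shift to [1, inf), so every score is
// positive and the ordering of raw inner products is preserved.
static float ipToMaxIpTransform(float innerProduct) noexcept {
    if (innerProduct < 0.0f) {
        return 1.0f / (1.0f - innerProduct);
    }
    return innerProduct + 1.0f;
}

// Bulk variant matching the BulkScoreTransform signature from the diff.
static void ipToMaxIpTransformBulk(float* scores, int32_t numScores) {
    for (int32_t i = 0; i < numScores; ++i) {
        scores[i] = ipToMaxIpTransform(scores[i]);
    }
}
```

A radial-search radius can be pushed through the same scalar transform, so that the threshold and the per-vector scores stay in the same MaxIP space.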

Comment on lines +40 to +44
// Transform Faiss L2 distance to be bounded (0, 1]
static float l2Transform(float l2Distance) noexcept {
return 1.0f / (1.0f + l2Distance);
}

Collaborator

same as above

Collaborator Author

Addressed the comment.

Comment on lines +159 to +160
// Thread static local SimdVectorSearchContext
thread_local SimdVectorSearchContext THREAD_LOCAL_SIMD_VEC_SRCH_CTX {};
Collaborator

What is the benefit of storing the SimdVectorSearchContext in a thread local here?

And if I am not wrong, this assumes that every Java search thread gets mapped to a new C++ thread, right?

Collaborator Author

A Java thread is itself a native thread; there's no distinction between a C++ thread and a Java thread. They are both plain POSIX threads.

The thread local is there to prevent frequent memory allocation and to reuse memory space as much as possible. From the benchmark, I found memory allocation was the bottleneck. (We could call a JNI function to get the raw pointer from float[] every time, but that drags down performance.) So the thread local keeps the memory space around to minimize allocations.

THREAD_LOCAL_SIMD_VEC_SRCH_CTX.tmpBuffer = {};

// Allocate query vector space
if (THREAD_LOCAL_SIMD_VEC_SRCH_CTX.queryVectorByteSize < queryByteSize) {
Collaborator

what would be the case in which this condition will not be true?

Collaborator Author

@0ctopus13prime 0ctopus13prime Oct 14, 2025

Imagine a scenario where one field has 768 dimensions and another field has 1024 dimensions.
A query for the first field will initially allocate 768 * 4 bytes; if the next query targets the second field, the internal space needs to grow, so the condition is true. Once the buffer has grown, subsequent queries that fit in the existing space (e.g. back to the 768-dimension field) will find the condition false and reuse it.
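A minimal sketch of the grow-only, thread-local buffer pattern being described (the names are illustrative, not the actual SimdVectorSearchContext fields):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct SearchContextSketch {
    std::vector<uint8_t> queryVector;  // reused across queries on this thread
    size_t queryVectorByteSize = 0;    // capacity high-water mark in bytes

    // Grow only when the incoming query needs more space than any earlier
    // query on this thread did; otherwise reuse the existing allocation,
    // so the hot path performs no allocation at all.
    void ensureCapacity(size_t queryByteSize) {
        if (queryVectorByteSize < queryByteSize) {
            queryVector.resize(queryByteSize);
            queryVectorByteSize = queryByteSize;
        }
    }
};

// One context per thread: a 768-dim query allocates 768 * 4 bytes, a later
// 1024-dim query grows it once, and every query after that allocates nothing.
thread_local SearchContextSketch threadLocalCtx{};
```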


if(NOT DEFINED AVX512_SPR_ENABLED)
# Check if the system is Intel(R) Sapphire Rapids or a newer-generation processor
execute_process(COMMAND bash -c "lscpu | grep -q 'GenuineIntel' && lscpu | grep -i 'avx512_fp16' | grep -i 'avx512_bf16' | grep -i 'avx512_vpopcntdq'" OUTPUT_VARIABLE SPR_FLAGS OUTPUT_STRIP_TRAILING_WHITESPACE)
Collaborator

is this what we wanted to do here?

if(NOT DEFINED AVX512_SPR_ENABLED)
    set(AVX512_SPR_ENABLED false)
    
    if(AVX512_ENABLED AND ${CMAKE_SYSTEM_NAME} STREQUAL "Linux")
        find_program(LSCPU_PROGRAM lscpu)
        if(LSCPU_PROGRAM)
            execute_process(
                COMMAND ${LSCPU_PROGRAM}
                OUTPUT_VARIABLE CPU_INFO
                ERROR_QUIET
                OUTPUT_STRIP_TRAILING_WHITESPACE
            )
            if(CPU_INFO MATCHES "GenuineIntel" AND 
               CPU_INFO MATCHES "avx512_fp16" AND 
               CPU_INFO MATCHES "avx512_bf16" AND 
               CPU_INFO MATCHES "avx512_vpopcntdq")
                set(AVX512_SPR_ENABLED true)
            endif()
        endif()
    endif()
endif()


if(NOT DEFINED AVX512_SPR_ENABLED)
# Check if the system is Intel(R) Sapphire Rapids or a newer-generation processor
execute_process(COMMAND bash -c "lscpu | grep -q 'GenuineIntel' && lscpu | grep -i 'avx512_fp16' | grep -i 'avx512_bf16' | grep -i 'avx512_vpopcntdq'" OUTPUT_VARIABLE SPR_FLAGS OUTPUT_STRIP_TRAILING_WHITESPACE)
Collaborator

Did you mean RESULT_VARIABLE?

include(CheckCXXSourceCompiles)

# Allow user overrides
if(NOT DEFINED AVX2_ENABLED)
Collaborator

So we are setting AVX2, AVX512, and SPR to true by default. Should they default to false?


if(NOT DEFINED AVX512_SPR_ENABLED)
# Check if the system is Intel(R) Sapphire Rapids or a newer-generation processor
execute_process(COMMAND bash -c "lscpu | grep -q 'GenuineIntel' && lscpu | grep -i 'avx512_fp16' | grep -i 'avx512_bf16' | grep -i 'avx512_vpopcntdq'" OUTPUT_VARIABLE SPR_FLAGS OUTPUT_STRIP_TRAILING_WHITESPACE)
Collaborator

Should we guard the `COMMAND bash -c "lscpu ..."` with an if like the one below?

if(CMAKE_HOST_SYSTEM_NAME STREQUAL "Linux" AND NOT CMAKE_CROSSCOMPILING)

set(SIMD_OPT_LEVEL "")
set(SIMD_FLAGS "")

if(${CMAKE_SYSTEM_NAME} STREQUAL "Windows" OR (NOT AVX2_ENABLED AND NOT AVX512_ENABLED AND NOT AVX512_SPR_ENABLED))
Collaborator

MSVC now supports AVX2 and AVX-512; do we plan to enable them there?

}

// Transform score values if it needs to
return BulkScoreTransformFunc(scores, numVectors);
Collaborator

Why are we returning the result of a void function here?

#include <stdint.h>
#include <cmath>

#include "simd_similarity_function_common.cpp"
Collaborator

Since we are pulling a .cpp file into another .cpp file, will it create build churn? Can we split them into headers?

// FP16
//
// 1. Max IP
DefaultFP16SimilarityFunction<FaissScoreToLuceneScoreTransform::ipToMaxIpTransformBulk, FaissScoreToLuceneScoreTransform::ipToMaxIpTransform> DEFAULT_FP16_MAX_INNER_PRODUCT_SIMIL_FUNC;
Collaborator

Should this be static, and the other one as well?


float calculateSimilarity(SimdVectorSearchContext* srchContext, const int32_t internalVectorId) final {
// Prepare distance calculation
auto vector = reinterpret_cast<uint8_t*>(srchContext->getVectorPointer(internalVectorId));
Collaborator

Can we add a null check in case vector is null for internalVectorId?

const int32_t numVectors) final {

// Prepare similarity calculation
auto func = dynamic_cast<faiss::ScalarQuantizer::SQDistanceComputer*>(srchContext->faissFunction.get());
Collaborator

Since we are doing this dynamic_cast here as well as in the function below, it would be slow and unnecessary if the context always carries an SQDistanceComputer. Can we either store a typed pointer in the context once, or check once at entry and then static_cast inside?
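The cached-typed-pointer approach being suggested could look roughly like this (stand-in stub types, not the real Faiss classes):

```cpp
#include <memory>
#include <utility>

// Stand-ins for faiss::DistanceComputer and
// faiss::ScalarQuantizer::SQDistanceComputer.
struct DistanceComputerStub {
    virtual ~DistanceComputerStub() = default;
};
struct SQDistanceComputerStub : DistanceComputerStub {};

struct ContextSketch {
    std::unique_ptr<DistanceComputerStub> faissFunction;
    // Typed view of faissFunction, resolved once when the context is set up.
    SQDistanceComputerStub* sqFunction = nullptr;

    void setFunction(std::unique_ptr<DistanceComputerStub> fn) {
        faissFunction = std::move(fn);
        // Single dynamic_cast at setup; the hot scoring path then reuses the
        // cached typed pointer instead of casting on every batch.
        sqFunction = dynamic_cast<SQDistanceComputerStub*>(faissFunction.get());
    }
};
```

Alternatively, one dynamic_cast check at the entry of the scoring call followed by static_cast inside the loop gives the same effect without changing the context layout.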
