-
Notifications
You must be signed in to change notification settings - Fork 169
Native scoring for FP16 V1 implementation. #2922
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: feature/fp16-faiss-bulk
Are you sure you want to change the base?
Native scoring for FP16 V1 implementation. #2922
Conversation
Signed-off-by: Dooyong Kim <[email protected]>
Signed-off-by: Dooyong Kim <[email protected]>
359665b
to
546cfd9
Compare
Performance improvement summarizationWith early termination + Native scoring for FP16, This is what we saw in the POC performance benchmark in the RFC. |
@@ -0,0 +1,562 @@ | |||
#include <algorithm> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, mistakenly included V2.
This will be removed in the next revision.
jni/src/simd/similarity_function/x86_avx512_similarity_function.cpp
Outdated
Show resolved
Hide resolved
jni/src/simd/similarity_function/arm_simd_similarity_function.cpp
Outdated
Show resolved
Hide resolved
546cfd9
to
2acabbe
Compare
@0ctopus13prime can we fix the CIs which are failing seems like some tests have failed. |
@0ctopus13prime why we are not implementing V2 here and going with V1? |
@navneet1v Also I'm taking an iterative way to get to V2. To have V2 in this PR, it's going to be too much heavy for review as it's already 36 files have been modified 😅 |
060028c
to
2ed780f
Compare
2ed780f
to
1b651ab
Compare
1b651ab
to
4e6572d
Compare
Signed-off-by: Dooyong Kim <[email protected]>
4e6572d
to
189670b
Compare
Signed-off-by: Doo Yong Kim <[email protected]>
Hi @Vikasht34 @shatejas |
private final KnnCollectorManager knnCollectorManager; | ||
|
||
// Ported from Lucene as since 10.3.0, TopKnnCollectorManager does not use MultiLeafKnnCollector no more. | ||
private static class MultiLeafTopKnnCollectorManager implements KnnCollectorManager { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will not be put into main, it's for benchmarking.
Will look into PR tomorrow morning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still working through the PR - cpp review is pending
src/main/java/org/opensearch/knn/memoryoptsearch/faiss/WrappedFloatVectorValues.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/memoryoptsearch/faiss/WrappedFloatVectorValues.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/memoryoptsearch/MemorySegmentAddressExtractorJDK21.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/memoryoptsearch/MemorySegmentAddressExtractorJDK21.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/memoryoptsearch/MemorySegmentAddressExtractorJDK22.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/memoryoptsearch/faiss/FaissIndexScalarQuantizedFlat.java
Outdated
Show resolved
Hide resolved
Looks good overall, please remove v2 files in the next rev |
this.addressAndSize = mmapVectorValues.getAddressAndSize(); | ||
this.maxOrd = knnVectorValues.size(); | ||
this.nativeFunctionTypeOrd = similarityFunctionType.ordinal(); | ||
SimdVectorComputeService.saveSearchContext(query, addressAndSize, nativeFunctionTypeOrd); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering if the use of thread local should be minimized to queryVectorSimdAligned
. Everything other parameter in search context can be local to the thread. This increases maintaibility of the code while keeping the optimization. let me know what you think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, sounds good.
But I think it's better to save a list of address so the pointer of vector calculation can be done quickly.
Other than that, I think mostly are native types, we can pass them as parameters.
Will update in the next rev.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @shatejas,
I tried to reflect your comment, but I think we need to save the similarity function in context to create and keep Faiss distance calculator. Otherwise, we need to create the distance calculator each time whenever nodes visited.
src/main/java/org/opensearch/knn/memoryoptsearch/MemorySegmentAddressExtractorJDK21.java
Outdated
Show resolved
Hide resolved
Signed-off-by: Dooyong Kim <[email protected]>
561e762
to
8bead91
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will need more time to review the PR. but publishing my initial comments.
annConfig.getQuantizationConfig(), | ||
annConfig.getModelId() | ||
); | ||
log.info("Field [{}] memory_optimized_search_enabled={}", name, memoryOptimizedSearchAvailable); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be a debug log? Also, we should see how we can add this information in our explain API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be at info level, the message format should be revised though.
I think adding information in explain API is another scope, can be dealt with in a separate PR
#if defined(__GNUC__) || defined(__clang__) | ||
#define LIKELY(x) (__builtin_expect(!!(x), 1)) | ||
#define UNLIKELY(x) (__builtin_expect(!!(x), 0)) | ||
#elif defined(_MSC_VER) | ||
// MSVC doesn't have __builtin_expect; just pass through | ||
#define LIKELY(x) (x) | ||
#define UNLIKELY(x) (x) | ||
#else | ||
#define LIKELY(x) (x) | ||
#define UNLIKELY(x) (x) | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please add proper docs on what are these macros doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, will do
// FP16 | ||
// | ||
|
||
using BulkScoreTransform = void (*)(float*/*scores*/, int32_t/*num scores to transform*/); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using BulkScoreTransform = void (*)(float*/*scores*/, int32_t/*num scores to transform*/); | |
// scores and num scores to transform* | |
using BulkScoreTransform = void (*)(float*, int32_t); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm.. I think removing variable name can give user confusions on what each parameter is for.
Will keep variable name, and add the comment.
// | ||
|
||
using BulkScoreTransform = void (*)(float*/*scores*/, int32_t/*num scores to transform*/); | ||
using ScoreTransform = float (*)(float/*score*/); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better to keep the variable names. Will add the comment.
#include <cstdint> | ||
|
||
// This class is responsible to convert Faiss distance value to Lucene similarity score. | ||
struct FaissScoreToLuceneScoreTransform final { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why cannot we move this code to java? is there a reason we need to keep these translations in C++ side? Reason being we already have multiple translations of scores on java side. Also, once we are in java we can actually use Lucene VectorUtil class to covert IP to MaxIP.
Also, do we really need this conversion from IP to MaxIP? do we not need to convert back to IP and then do faiss based score translation to keep scores consistent with non-memory optimized search.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is bulk conversion optimization in C++ side which is showing better performance, it's not about using which API.
And we do need conversion from IP to MaxIP. One of the reason is for radial search. We are converting radius to MaxIP score for radial search when memory optimized search is enabled.
// Transform Faiss L2 distance to be bounded (0, 1] | ||
static float l2Transform(float l2Distance) noexcept { | ||
return 1.0f / (1.0f + l2Distance); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed the comment.
// Thread static local SimdVectorSearchContext | ||
thread_local SimdVectorSearchContext THREAD_LOCAL_SIMD_VEC_SRCH_CTX {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the benefit of storing the SimdVectorSearchContext in thread local here?
and if I am not wrong, this assumes that every java search thread is getting mapped to a new C++ thread right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Java thread is itself a native thread, there's no distinction between C++ thread and Java thread. They are just plain POSIX thread.
Having the thread local is prevent frequent memory allocation and reuse memory space as possible. From the benchmark, I found memory allocation was the bottleneck. (We can call JNI function to get raw pointer every time from float[]
, but it drags down performance) So, having the thread local to keep the memory space to minimize the allocations possible.
THREAD_LOCAL_SIMD_VEC_SRCH_CTX.tmpBuffer = {}; | ||
|
||
// Allocate query vector space | ||
if (THREAD_LOCAL_SIMD_VEC_SRCH_CTX.queryVectorByteSize < queryByteSize) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what would be the case in which this condition will not be true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Imagine a scenario where one field has 768 dimensions, and another field can have 1024 dimensions.
In that case, a query for the first field will allocate 768 * 4 bytes initially, and if it's the next query is targeting for the second query, then the internal space needs to be grow. In that case, the condition will be evaluated as false.
|
||
if(NOT DEFINED AVX512_SPR_ENABLED) | ||
# Check if the system is Intel(R) Sapphire Rapids or a newer-generation processor | ||
execute_process(COMMAND bash -c "lscpu | grep -q 'GenuineIntel' && lscpu | grep -i 'avx512_fp16' | grep -i 'avx512_bf16' | grep -i 'avx512_vpopcntdq'" OUTPUT_VARIABLE SPR_FLAGS OUTPUT_STRIP_TRAILING_WHITESPACE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this what we wanted to do here?
if(NOT DEFINED AVX512_SPR_ENABLED)
set(AVX512_SPR_ENABLED false)
if(AVX512_ENABLED AND ${CMAKE_SYSTEM_NAME} STREQUAL "Linux")
find_program(LSCPU_PROGRAM lscpu)
if(LSCPU_PROGRAM)
execute_process(
COMMAND ${LSCPU_PROGRAM}
OUTPUT_VARIABLE CPU_INFO
ERROR_QUIET
OUTPUT_STRIP_TRAILING_WHITESPACE
)
if(CPU_INFO MATCHES "GenuineIntel" AND
CPU_INFO MATCHES "avx512_fp16" AND
CPU_INFO MATCHES "avx512_bf16" AND
CPU_INFO MATCHES "avx512_vpopcntdq")
set(AVX512_SPR_ENABLED true)
endif()
endif()
endif()
endif()
|
||
if(NOT DEFINED AVX512_SPR_ENABLED) | ||
# Check if the system is Intel(R) Sapphire Rapids or a newer-generation processor | ||
execute_process(COMMAND bash -c "lscpu | grep -q 'GenuineIntel' && lscpu | grep -i 'avx512_fp16' | grep -i 'avx512_bf16' | grep -i 'avx512_vpopcntdq'" OUTPUT_VARIABLE SPR_FLAGS OUTPUT_STRIP_TRAILING_WHITESPACE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean RESULT_VARIBALE?
include(CheckCXXSourceCompiles) | ||
|
||
# Allow user overrides | ||
if(NOT DEFINED AVX2_ENABLED) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we are setting default AVX2, AVX512 and SPR to true, Should it to default false?
|
||
if(NOT DEFINED AVX512_SPR_ENABLED) | ||
# Check if the system is Intel(R) Sapphire Rapids or a newer-generation processor | ||
execute_process(COMMAND bash -c "lscpu | grep -q 'GenuineIntel' && lscpu | grep -i 'avx512_fp16' | grep -i 'avx512_bf16' | grep -i 'avx512_vpopcntdq'" OUTPUT_VARIABLE SPR_FLAGS OUTPUT_STRIP_TRAILING_WHITESPACE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we gauard "COMMAND bash -c "lscpu " WITH below if ?
if(CMAKE_HOST_SYSTEM_NAME STREQUAL "Linux" and CMAKE_CROSS_COMPILINGH)
set(SIMD_OPT_LEVEL "") | ||
set(SIMD_FLAGS "") | ||
|
||
if(${CMAKE_SYSTEM_NAME} STREQUAL "Windows" OR (NOT AVX2_ENABLED AND NOT AVX512_ENABLED AND NOT AVX512_SPR_ENABLED)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MSVC now supports AVX2 AND 512 , do we plan to enable for that?
} | ||
|
||
// Transform score values if it needs to | ||
return BulkScoreTransformFunc(scores, numVectors); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we are returning with void?
#include <stdint.h> | ||
#include <cmath> | ||
|
||
#include "simd_similarity_function_common.cpp" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are pulling cpp file into another CPP files , Will it create build chrun? Can we split them into header?
// FP16 | ||
// | ||
// 1. Max IP | ||
DefaultFP16SimilarityFunction<FaissScoreToLuceneScoreTransform::ipToMaxIpTransformBulk, FaissScoreToLuceneScoreTransform::ipToMaxIpTransform> DEFAULT_FP16_MAX_INNER_PRODUCT_SIMIL_FUNC; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be static and other one also?
|
||
float calculateSimilarity(SimdVectorSearchContext* srchContext, const int32_t internalVectorId) final { | ||
// Prepare distance calculation | ||
auto vector = reinterpret_cast<uint8_t*>(srchContext->getVectorPointer(internalVectorId)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add null check if vector is null for InternalVectorId?
const int32_t numVectors) final { | ||
|
||
// Prepare similarity calculation | ||
auto func = dynamic_cast<faiss::ScalarQuantizer::SQDistanceComputer*>(srchContext->faissFunction.get()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are doinng this dynamic_cast here as well as in below function as well , this would be slow and unnecessary if context always carries SQDistanceComputer, either can we store a typed pointer in the context once or check one at entry and then static cast inside?
Description
RFC : #2875
This PR introduces a new build structure in k-NN for SIMD-based computation, implementing the V1 design described in the RFC linked above.
V1 relies on Faiss distance calculation functions for similarity scoring. The structure introduced here is designed to make it easy to add bulk SIMD operations in subsequent versions. Starting with V1 helps reduce the initial review complexity for maintainers.
The core concept is to extract mapped memory pointers from MemorySegmentIndexInput and leverage native SIMD acceleration to enhance search performance.
Related Issues
Resolves #[Issue number to be closed when this PR is merged]
Check List
--signoff
.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.