Adding basic generic vector profiler implementation and tests. #2624
base: feature/vector-profile
Conversation
Signed-off-by: Arun Ganesh <[email protected]>
 * SPDX-License-Identifier: Apache-2.0
 */

package org.opensearch.knn.index.codec.KNN990Codec;
Let's not put this into the codec folder. We should move everything to our own separate folder within knn.
Should I create a separate "sampler" or "profiler" folder within "org/opensearch/knn/index" for example?
/**
 * Enum for statistical operators used in KNN vector computation.
 */
public enum StatisticalOperators implements Computation {
Can we add a test for each operator in a separate test class?
Good start overall. I think our previous conversations may have led to some confusion. The purpose of this PR is to exclude any of the Lucene persistence details (SegmentWriteState, segments, directories, etc.) and just expose an entry point that we can build on top of in the indexing and query paths. Not saying that those are not important, but I believe that the persistence/output of the sampling is another conversation on its own and deserves its own PR.
private static final int MAX_VECTOR_ELEMENTS_TO_PRINT = 10;

// Maps to store segment context information
private static final ConcurrentHashMap<String, String> SEGMENT_BASE_NAMES = new ConcurrentHashMap<>();
Can you give more context on why we need these in-memory data structures? The generic form of the VectorProfiler class should take in some Collection/Iterator of vectors and perform a set of computations. Introducing Lucene concepts here should be done, in my opinion, in the PRs where we integrate them into the k-NN paths (read, write, etc.).
 * @return float array representing the calculated result vector
 * @throws IllegalArgumentException if vectors is null, empty, or contains vectors of different dimensions
 */
public static <T extends Computation> float[] calculateVector(Collection<float[]> vectors, T computation) {
AFAIK this is an internal class that takes in an already sampled collection of vectors and performs computations on it. Let's not expose this as public, as other classes don't need to know about this.
Arrays.fill(result, 0);

for (float[] vec : vectors) {
    if (vec.length != dim) {
I don't think this case will happen as the indexing will fail - can you double check?
int dim = firstVector.length;

float[] result = new float[dim];
Arrays.fill(result, 0);
Do we need this?
    throw new IllegalArgumentException("All vectors must have same dimension");
}
for (int i = 0; i < dim; i++) {
    result[i] = computation.apply(result[i], vec[i])[0];
So my understanding here is that since we want to calculate the computation on each dimension, we'll need some way to basically rotate the matrix. So if we have 10 vectors with 5 dimensions, we'll need 5 vectors (each corresponding to one dimension) with 10 values each. Is there a way to represent the sampled vectors in a matrix like that and execute the computation on each row?
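A minimal sketch of that per-dimension view, assuming the sampled vectors arrive as a `Collection<float[]>`; the class and method names are illustrative, not part of this PR:

```java
import java.util.Collection;

final class DimensionMatrix {
    /**
     * Rotates N vectors of dimension D into D rows of N values, so a
     * computation can run over one dimension (one row) at a time.
     */
    static float[][] transposeToDimensions(final Collection<float[]> vectors, final int dim) {
        final float[][] byDimension = new float[dim][vectors.size()];
        int column = 0;
        for (final float[] vector : vectors) {
            for (int d = 0; d < dim; d++) {
                byDimension[d][column] = vector[d];
            }
            column++;
        }
        return byDimension; // byDimension[d] holds every sampled value for dimension d
    }
}
```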

    return computation.apply(result, vectors.size());
    // return result;
} catch (Exception e) {
Try to avoid catching generic exceptions if we can. Can we try to figure out what exceptions can be thrown here and what we want to handle?
 * @param vectors Collection of vectors to record
 * @param computation The computation to be performed on the vectors
 */
public static <T extends Computation> void recordReadTimeVectors(
So from what I understand, the sampling of the vectors between the ingestion (indexing) flow and the query (read) flow will be slightly different. On the ingestion flow, with our current implementation, all the vectors for that particular segment are sent to be sampled during flush, while for the query side we'll probably need some sort of dynamic-window type of persistence.
Both paths will need a slightly different approach, but since our plan is to have them in separate PRs, can we leave the segment-level implementation of this to those PRs? It will make it easier for us to look at each case separately instead of having to deduce what we're doing with the segments here.
 * @param vectors Collection of vectors to save statistics for
 * @throws IOException If an error occurs during file writing
 */
public static void saveVectorStats(SegmentWriteState segmentWriteState, Collection<float[]> vectors) throws IOException {
Can we leave the persistence of the vectors to a directory, or anything directory related, to another PR? This PR should just expose a single entry point that takes in a collection of vectors, samples it, and then executes an arbitrary number of computations.
That being said, I see this as the main entry point of the VectorProfiler. Given an unfiltered collection of vectors as input (let's not worry about segments for now):
- Sample the vectors
- Execute computations.
NIT: sampleVector may be a good name for a generic entry method for now, as we're not saving/persisting anything currently.
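A rough sketch of that kind of entry point, assuming a computation can be expressed as a function from one dimension's sampled values to a single statistic; the interface shape and names are placeholders (not the PR's API), and it reuses the transpose helper sketched above plus a sampler like the reservoir sketch further down this thread:

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

final class ProfilerEntryPointSketch {
    /** Placeholder: one dimension's sampled values -> one statistic. */
    interface DimensionComputation {
        float compute(float[] dimensionValues);
    }

    /** Single entry point: sample first, then run every registered computation over the sample. */
    static Map<DimensionComputation, float[]> sampleAndCompute(
        final Collection<float[]> vectors,
        final int dimension,
        final List<DimensionComputation> computations,
        final int sampleSize
    ) {
        // Stand-ins for the plugin's sampler and the per-dimension rotation sketched above.
        final Collection<float[]> sampled = ReservoirSampler.sample(vectors, sampleSize, new Random());
        final float[][] byDimension = DimensionMatrix.transposeToDimensions(sampled, dimension);

        final Map<DimensionComputation, float[]> results = new LinkedHashMap<>();
        for (final DimensionComputation computation : computations) {
            final float[] perDimension = new float[dimension];
            for (int d = 0; d < dimension; d++) {
                perDimension[d] = computation.compute(byDimension[d]);
            }
            results.put(computation, perDimension);
        }
        return results;
    }
}
```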

try {
    // Calculate all statistics
    float[] meanVector = calculateVector(vectors, StatisticalOperators.MEAN);
We don't want to explicitly include these operations in the method. Imagine a scenario where we have 100+ metrics to compute - this would congest what we're trying to do by a large factor and make refactoring difficult. At the top level of the vector profiler we can keep some sort of dynamic collection of computations to execute, configurable via a setting, and read from there.
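For illustration, something along these lines keeps the set of computations out of the method body; the registry and whatever setting feeds it are assumptions, while `Computation` and `calculateVector` are the helpers already in this PR:

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CopyOnWriteArrayList;

final class ComputationRegistrySketch {
    // Populated once (e.g. from a cluster setting or plugin initialization); adding a
    // new metric then never touches the compute loop below.
    private static final List<Computation> REGISTERED_COMPUTATIONS = new CopyOnWriteArrayList<>();

    static void register(final Computation computation) {
        REGISTERED_COMPUTATIONS.add(computation);
    }

    static Map<Computation, float[]> computeAll(final Collection<float[]> sampledVectors) {
        final Map<Computation, float[]> results = new LinkedHashMap<>();
        for (final Computation computation : REGISTERED_COMPUTATIONS) {
            results.put(computation, VectorProfiler.calculateVector(sampledVectors, computation));
        }
        return results;
    }
}
```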
// Calculate all statistics
float[] meanVector = calculateVector(vectors, StatisticalOperators.MEAN);
float[] varianceVector = calculateVector(vectors, StatisticalOperators.VARIANCE);
float[] stdDevVector = calculateVector(vectors, StatisticalOperators.STANDARD_DEVIATION);
calculateVector doesn't include any sampling? In the case that this is the entry point, we would want to sample the vectors before we compute. We can reuse the sampling implementation in the plugin to do this.
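The plugin's existing sampling utilities would be the natural choice; purely as a stand-in, a single-pass reservoir sampler over the incoming vectors could look like this (class and method names invented here):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Random;

final class ReservoirSampler {
    /** Uniformly samples up to sampleSize vectors from the input in a single pass. */
    static List<float[]> sample(final Collection<float[]> vectors, final int sampleSize, final Random random) {
        final List<float[]> reservoir = new ArrayList<>(sampleSize);
        int seen = 0;
        for (final float[] vector : vectors) {
            if (reservoir.size() < sampleSize) {
                reservoir.add(vector);
            } else {
                final int slot = random.nextInt(seen + 1);
                if (slot < sampleSize) {
                    reservoir.set(slot, vector);
                }
            }
            seen++;
        }
        return reservoir;
    }
}
```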
float[] varianceVector = calculateVector(vectors, StatisticalOperators.VARIANCE);
float[] stdDevVector = calculateVector(vectors, StatisticalOperators.STANDARD_DEVIATION);

String statsFileName = IndexFileNames.segmentFileName(
Let's move anything related to writing the results to a file/persisting it to the PRs related to indexing and querying, unless you feel strongly about including them together. This would improve readability and keep the PRs manageable in terms of turnaround time.
    return new ArrayList<>(registeredComputations);
}

public Map<Computation, float[]> sampleAndCompute(String fieldName, Collection<float[]> vectors, int... sampleSize) {
NIT: Final whenever possible
private static VectorProfiler INSTANCE;
private static final int DEFAULT_SAMPLE_SIZE = 1000;
private final Map<String, List<DimensionStatisticAggregator>> fieldToDimensionStats;
private List<Computation> registeredComputations;
We may not need this anymore because we will compute everything using the DimensionStatisticAggregator
private final Map<String, List<DimensionStatisticAggregator>> fieldToDimensionStats;
private List<Computation> registeredComputations;

public VectorProfiler() {
Singletons have private constructors.
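For reference, a minimal lazy-initialized singleton shape (fields elided):

```java
public final class VectorProfiler {
    private static VectorProfiler INSTANCE;

    // Private constructor: getInstance() is the only way to obtain the shared instance.
    private VectorProfiler() {}

    public static synchronized VectorProfiler getInstance() {
        if (INSTANCE == null) {
            INSTANCE = new VectorProfiler();
        }
        return INSTANCE;
    }
}
```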
    return aggregateStats;
}

public Map<String, SummaryStatistics> getSegmentStatistics() {
Can we also add a method to retrieve a SummaryStatistic for a particular segment?
Created a getSegmentStatistic() method
}

public void addSegmentStatistics(Collection<Float> values) {
    String segmentId = UUID.randomUUID().toString();
Can we add a TODO saying that we'll attach the segmentId in another PR and why we're using a uuid for now?
private final Map<String, SummaryStatistics> segmentToSummaryMapping;
private final int dimension;

public DimensionStatisticAggregator(int dimension) {
NIT: final variables whenever possible
public class DimensionStatisticAggregator {
    private final AggregateSummaryStatistics aggregateStats;
    private final Map<String, SummaryStatistics> segmentToSummaryMapping;
    private final int dimension;
Does this correspond to the # of dimensions or the id of the particular dimension?
The ID of the particular dimension; added comments and renamed for better visibility
import java.util.Map;
import java.util.UUID;

public class DimensionStatisticAggregator {
Can we add Javadoc comments here?
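For example, based on the fields visible in this diff and the Commons Math types they use, the Javadoc could look something like this (a sketch of the shape, not the exact class in the PR):

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.math3.stat.descriptive.AggregateSummaryStatistics;
import org.apache.commons.math3.stat.descriptive.StatisticalSummary;
import org.apache.commons.math3.stat.descriptive.SummaryStatistics;

/**
 * Aggregates statistics for a single dimension across multiple segments.
 * Each segment contributes its own {@link SummaryStatistics}; the combined
 * view across all segments is exposed as a {@link StatisticalSummary}.
 */
public class DimensionStatisticAggregator {
    private final int dimensionId;
    private final AggregateSummaryStatistics aggregateStats = new AggregateSummaryStatistics();
    private final Map<String, SummaryStatistics> segmentToSummaryMapping = new HashMap<>();

    /** @param dimensionId the id of the dimension this aggregator tracks */
    public DimensionStatisticAggregator(final int dimensionId) {
        this.dimensionId = dimensionId;
    }

    /** Records one value of this dimension under the given segment id. */
    public void addValue(final String segmentId, final float value) {
        segmentToSummaryMapping.computeIfAbsent(segmentId, id -> aggregateStats.createContributingStatistics()).addValue(value);
    }

    /** @return the aggregate statistics across all contributing segments */
    public StatisticalSummary getAggregateStatistics() {
        return aggregateStats.getSummary();
    }
}
```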
 * used in KNN algorithm implementations.
 */

public interface Computation {
We don't need the Computation interface here. We can leave the selection of computations at the API level since we now have the StatisticalSummaries.
Some of the tests are failing; can we check?
Signed-off-by: Arun Ganesh <[email protected]>
import java.util.UUID;

@Log4j2
public class VectorProfiler {
Can we have javadocs for this class?
    return INSTANCE;
}

public static synchronized void setInstance(VectorProfiler instance) {
Let's not have a setInstance here for singletons - for testing we can provide setters for the content within the class
}

public void setFieldToDimensionStats(Map<String, List<DimensionStatisticAggregator>> stats) {
    fieldToDimensionStats.clear();
NIT: No need to clear here - if we want to set this just for testing we can call get() and then clear it ourselves in the test.
public void processVectors(final String fieldName, final Collection<float[]> vectors) {
    validateVectors(vectors);

    float[] firstVector = vectors.iterator().next();
Calling .next() on the iterator would return the first element, but it wouldn't be processed in the sampling as it's just a pointer.
We don't have to determine the dimension here - there should be a way to retrieve it from the field metadata during our write/read paths. We can probably add it as a parameter and leverage this.

private void validateVectors(final Collection<float[]> vectors) {
    if (vectors == null || vectors.isEmpty()) {
        throw new IllegalArgumentException("Vectors collection cannot be null or empty");
No need to throw an exception here - we can just log and exit out of the profiler.
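That is, something like the following at the top of the profiling entry point (the log message wording is illustrative; `log` comes from the class's @Log4j2 annotation):

```java
if (vectors == null || vectors.isEmpty()) {
    // Profiling is best-effort: skip quietly rather than failing the caller.
    log.warn("Skipping vector profiling for field [{}]: no vectors supplied", fieldName);
    return;
}
```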
Signed-off-by: Arun Ganesh <[email protected]>

fieldToDimensionStats.remove(fieldName);

initializeDimensionAggregators(fieldName, dimensions);
If we've seen the field before, there's no need to create a new list of DimensionStatisticAggregators. Reuse the old one and add a segment entry to it.
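One possible shape for that reuse, assuming the initializer returns the freshly built list (the current method may not, so treat this as a sketch):

```java
// Reuse the aggregators for a field we've already seen; only build them the first time.
final List<DimensionStatisticAggregator> aggregators = fieldToDimensionStats.computeIfAbsent(
    fieldName,
    field -> initializeDimensionAggregators(dimensions)   // assumed to return the new list
);
// then record the current batch of vectors as a new segment entry on each aggregator
```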
    return;
}

fieldToDimensionStats.remove(fieldName);
We shouldn't have this here -> this will remove the stats for previous segments added.
public void processVectors(final String fieldName, final Collection<float[]> vectors, final int dimensions) {
    if (vectors == null || vectors.isEmpty()) {
        log.warn("No vectors to process for field: {}", fieldName);
        fieldToDimensionStats.remove(fieldName);
We shouldn't have this here -> this will remove the stats for previous segments added.
private static final int TEST_DIMENSIONS = 3;

@Before
public void setUp() throws Exception {
If we want to clear the map of fieldToDimensionStats, we can provide an accessor to VectorProfiler and clear the map within the test here: getFieldToDimensionStats().clear()
Signed-off-by: Arun Ganesh <[email protected]>
Thanks for the revisions. LGTM - @jmazanec15 can you take a look?
Overall I think it's a good start! I think we might want to make a few changes based on how it's going to be used.
Should a vector profiler be tied to the lifecycle of the NativeEngineWriter instance instead of being a global singleton? See https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/codec/KNN990Codec/NativeEngines990KnnVectorsWriter.java#L109. This is where per-segment values are added and quantization is sampled/trained. If we are going to serialize the information in the segment, we will need access here.
Then, in order to aggregate at the segment, shard, and index level, we follow the path of warmup/clear cache in the KNNIndexShard. See https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java#L86-L87 - we would need to access this info somehow through a custom KnnCollector.
@@ -219,7 +219,6 @@ public Collection<Object> createComponents(
KNNQueryBuilder.initialize(ModelDao.OpenSearchKNNModelDao.getInstance());
KNNWeight.initialize(ModelDao.OpenSearchKNNModelDao.getInstance());
TrainingModelRequest.initialize(ModelDao.OpenSearchKNNModelDao.getInstance(), clusterService);
nit: revert this so it doesn't show up in the diff.
CHANGELOG.md
@@ -20,6 +20,8 @@ Add filter function to KNNQueryBuilder with unit tests and integration tests [#2

## [Unreleased 2.x](https://github.com/opensearch-project/k-NN/compare/2.19...2.x)
### Features
* [Vector Profiler] Adding generic vector profiler implementation to sample vector statistics and output towards the API (#2622)
nit: check the format in change log on PR link
/**
 * Aggregates statistics for a specific dimension across multiple segments.
 * This class is used to collect and analyze statistical data for a particular dimension
 * across multiple segments of a vector.
nit: segments of an index?
Do you mean to include the wording segments of an index? Each segment contains a subset of the indexed vectors, and this class helps aggregate statistics across these different portions of the index.
Meant this:
* This class is used to collect and analyze statistical data for a particular dimension
* across multiple segments of an index.
/**
 * @return the aggregate statistics across all segments
 */
public StatisticalSummary getAggregateStatistics() {
nit: use lombok getter for retrieving member variables
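For the plain member variables, the Lombok pattern the nit refers to looks like this (sketch only):

```java
import lombok.Getter;

public class DimensionStatisticAggregator {
    // Lombok generates getDimensionId() at compile time; no hand-written accessor needed.
    @Getter
    private final int dimensionId;

    public DimensionStatisticAggregator(final int dimensionId) {
        this.dimensionId = dimensionId;
    }
}
```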
}

/**
 * Process a collection of vectors for a specific field and dimension size.
nit: should this be profile? What does process mean in this context?
True, I have changed it to profileVectors() for consistency.
 * Retrieve the entire map of field to dimension statistics.
 * @return fieldToDimensionStats
 */
Map<String, List<DimensionStatisticAggregator>> getFieldToDimensionStats() {
lombok getter
import java.util.Collections;
import java.util.List;

public class VectorProfilerTests extends OpenSearchTestCase {
Does the profiler need to be a singleton? This is one of the problems with going the singleton route: it can complicate the testing. Instead, we can create it via KNNPlugin createComponents and pass it into the codec/required transport actions from there.
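A rough sketch of that wiring, with the createComponents signature and the consuming classes elided; the hand-off points are placeholders, not existing APIs:

```java
// Inside KNNPlugin#createComponents (arguments and other components elided):
final VectorProfiler vectorProfiler = new VectorProfiler();   // plain instance, no global singleton
components.add(vectorProfiler);   // returned alongside the other plugin components
// ...then pass vectorProfiler into the codec service / transport actions that need it
// (the exact hand-off points are left to the integration PRs).
```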
Signed-off-by: Arun Ganesh <[email protected]>
…ats. Signed-off-by: Arun Ganesh <[email protected]>
Description
Basic generic implementation of a vector profiler that will produce a list of sampling statistics such as mean, standard deviation, and variance. It is a utility class for performing vector calculations and profiling operations on collections of float arrays representing vectors. The class uses the singleton pattern to maintain consistent state across operations within OpenSearch.
It uses the Apache Commons Mathematics Library to generate vector statistics per dimension.
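As a small illustration (not the PR's code) of the per-dimension statistics this yields, using Commons Math directly:

```java
import org.apache.commons.math3.stat.descriptive.SummaryStatistics;

public class PerDimensionStatsExample {
    public static void main(String[] args) {
        float[][] vectors = { { 1f, 2f, 3f }, { 3f, 4f, 5f } };   // two 3-dimensional vectors
        for (int d = 0; d < 3; d++) {
            SummaryStatistics stats = new SummaryStatistics();
            for (float[] vector : vectors) {
                stats.addValue(vector[d]);   // collect every value of dimension d
            }
            System.out.printf("dim %d: mean=%.1f variance=%.1f stdDev=%.2f%n",
                d, stats.getMean(), stats.getVariance(), stats.getStandardDeviation());
        }
    }
}
```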
VectorProfiler provides functionality to:
- sample a collection of input vectors,
- run registered computations (mean, variance, standard deviation) over the sample per dimension,
- aggregate the resulting statistics per field across segments.
Example input:
Observed example output:
Related Issues
Implements #2622
Check List
- Commits are signed per the DCO using --signoff.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following the Developer Certificate of Origin and signing off your commits, please check here.