Adding basic generic vector profiler implementation and tests. #2624

Open · wants to merge 12 commits into base: feature/vector-profile

Conversation

@oaganesh (Contributor) commented Mar 22, 2025

Description

Basic generic implementation of a vector profiler that produces a list of sampling statistics such as mean, standard deviation, and variance. VectorProfiler is a utility class for performing vector calculations and profiling operations on collections of float arrays representing vectors. The class uses the singleton pattern to maintain consistent state across operations within OpenSearch.

The Apache Commons Mathematics Library is used to generate vector statistics per dimension.

VectorProfiler provides functionality to:

  • Calculate statistical measures (mean, variance, standard deviation) on vector collections
  • Sample vectors from larger collections for efficient processing
  • Track and record vectors during read and write operations
  • Save vector statistics to disk for later analysis
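
A minimal sketch of the per-dimension statistics idea using Apache Commons Math (the helper class and method below are illustrative, not the exact API introduced in this PR):

import org.apache.commons.math3.stat.descriptive.SummaryStatistics;

import java.util.Collection;

// Illustrative helper: one SummaryStatistics accumulator per dimension,
// fed with every vector's value for that dimension.
final class PerDimensionStats {
    static SummaryStatistics[] compute(final Collection<float[]> vectors, final int dimension) {
        final SummaryStatistics[] stats = new SummaryStatistics[dimension];
        for (int d = 0; d < dimension; d++) {
            stats[d] = new SummaryStatistics();
        }
        for (final float[] vector : vectors) {
            for (int d = 0; d < dimension; d++) {
                stats[d].addValue(vector[d]);
            }
        }
        // stats[d].getMean(), getStandardDeviation(), getVariance(), getMin(), getMax()
        return stats;
    }
}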

Example input:

  1. Creating the index
curl -X PUT "http://localhost:9200/my-index" -H 'Content-Type: application/json' -d '     
{
  "settings": {                          
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "my_vector_field": {
        "type": "knn_vector",
        "dimension": 4
      }
    }
  }
}'
  2. Adding vectors to the index
curl -X POST "http://localhost:9200/my-index/_doc" -H 'Content-Type: application/json' -d'
{
  "my_vector_field": [0.1, 2.2, 3.3, 0.4]
}'

Observed example output:

[2025-03-28T16:10:27,867][INFO ][o.o.k.p.SegmentProfilerState] [integTest-0] Starting vector profiling with dimension: 4
[2025-03-28T16:10:27,887][INFO ][o.o.k.p.SegmentProfilerState] [integTest-0] Vector profiling completed - processed 1 vectors with 4 dimensions
[2025-03-28T16:10:27,887][INFO ][o.o.k.p.SegmentProfilerState] [integTest-0] Dimension 0 stats: mean=0.10000000149011612, std=0.0, min=0.10000000149011612, max=0.10000000149011612
[2025-03-28T16:10:27,888][INFO ][o.o.k.p.SegmentProfilerState] [integTest-0] Dimension 1 stats: mean=2.200000047683716, std=0.0, min=2.200000047683716, max=2.200000047683716
[2025-03-28T16:10:27,888][INFO ][o.o.k.p.SegmentProfilerState] [integTest-0] Dimension 2 stats: mean=3.299999952316284, std=0.0, min=3.299999952316284, max=3.299999952316284
[2025-03-28T16:10:27,888][INFO ][o.o.k.p.SegmentProfilerState] [integTest-0] Dimension 3 stats: mean=0.4000000059604645, std=0.0, min=0.4000000059604645, max=0.4000000059604645

Related Issues

Implements #2622

Check List

  • [✔️] New functionality includes testing.
  • [✔️] New functionality has been documented.
  • [✔️] API changes companion pull request created.
  • [✔️] Commits are signed per the DCO using --signoff.
  • [✔️] Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following the Developer Certificate of Origin and signing off your commits, please refer to the DCO documentation.

* SPDX-License-Identifier: Apache-2.0
*/

package org.opensearch.knn.index.codec.KNN990Codec;
Contributor:

Let's not put this into the codec folder. We should move everything to our own separate folder within knn.

Contributor Author:

Should I create a separate "sampler" or "profiler" folder within "org/opensearch/knn/index" for example?

@oaganesh requested a review from markwu-sde on March 24, 2025
/**
* Enum for statistical operators used in KNN vector computation.
*/
public enum StatisticalOperators implements Computation {
Contributor:

Can we add a test for each operator in a separate test class?

Contributor:

Good start overall. I think our previous conversations may have led to some confusion. I think the purpose of this PR is to exclude any of the Lucene persistence details (SegmentWriteState, segments, directories, etc.) and just expose an entry point that we can build on top of in the indexing and query paths. Not saying that those are not important, but I believe that the persistence/output of the sampling is another conversation on its own and deserves its own PR.

private static final int MAX_VECTOR_ELEMENTS_TO_PRINT = 10;

// Maps to store segment context information
private static final ConcurrentHashMap<String, String> SEGMENT_BASE_NAMES = new ConcurrentHashMap<>();
Contributor:

Can you give more context on why we need these in-memory data structures? The generic form of the VectorProfiler class should take in some Collection/Iterator of vectors and perform a set of computations. Introducing Lucene concepts here should, in my opinion, be done in the PRs where we integrate them into k-NN paths (read, write, etc.).

* @return float array representing the calculated result vector
* @throws IllegalArgumentException if vectors is null, empty, or contains vectors of different dimensions
*/
public static <T extends Computation> float[] calculateVector(Collection<float[]> vectors, T computation) {
Contributor:

AFAIK this is an internal class that takes in an already sampled collection of vectors and performs computations on it. Let's not expose this as public, as other classes don't need to know about it.

Arrays.fill(result, 0);

for (float[] vec : vectors) {
if (vec.length != dim) {
Contributor:

I don't think this case will happen as the indexing will fail - can you double check?

int dim = firstVector.length;

float[] result = new float[dim];
Arrays.fill(result, 0);
Contributor:

Do we need this?

throw new IllegalArgumentException("All vectors must have same dimension");
}
for (int i = 0; i < dim; i++) {
result[i] = computation.apply(result[i], vec[i])[0];
Contributor:

So my understanding here is that since we want to calculate the computation on each dimension, we'll need some way to basically rotate the matrix. So if we have 10 vectors with 5 dimensions, we'll need 5 vectors (each corresponding to 1 dimension) with 10 values each. Is there a way to represent the sampled vectors in a matrix like that and execute the computation on each row?
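
A hedged sketch of the transposition the reviewer describes: N sampled vectors of dimension D become D rows of N values each, and a computation runs over each row. Class and method names are illustrative only:

import org.apache.commons.math3.stat.descriptive.moment.Mean;

import java.util.List;

// Illustrative only: one row per dimension, then a Commons Math computation per row.
final class DimensionMatrix {
    static double[][] transpose(final List<float[]> sampledVectors, final int dimension) {
        final double[][] perDimension = new double[dimension][sampledVectors.size()];
        for (int v = 0; v < sampledVectors.size(); v++) {
            for (int d = 0; d < dimension; d++) {
                perDimension[d][v] = sampledVectors.get(v)[d];
            }
        }
        return perDimension;
    }

    static double[] meanPerDimension(final double[][] perDimension) {
        final Mean mean = new Mean();
        final double[] means = new double[perDimension.length];
        for (int d = 0; d < perDimension.length; d++) {
            means[d] = mean.evaluate(perDimension[d]);
        }
        return means;
    }
}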


return computation.apply(result, vectors.size());
// return result;
} catch (Exception e) {
@markwu-sde (Contributor), Mar 24, 2025:

Try to avoid catching generic exceptions if we can. Can we try to figure out what exceptions can be thrown here and what we want to handle?

* @param vectors Collection of vectors to record
* @param computation The computation to be performed on the vectors
*/
public static <T extends Computation> void recordReadTimeVectors(
Contributor:

So from what I understand, the sampling of the vectors between the ingestion (indexing) flow and the query (read) flow will be slightly different. On the ingestion flow, with our current implementation, all the vectors for that particular segment are sent to be sampled during flush, while on the query side we'll probably need some sort of dynamic-window type persistence.

Both paths will need a slightly different approach, but since our plan is to have them in separate PRs, can we leave the segment-level implementation of this to those PRs? It will make it easier for us to look at each case separately instead of having to deduce what we're doing with the segments here.

* @param vectors Collection of vectors to save statistics for
* @throws IOException If an error occurs during file writing
*/
public static void saveVectorStats(SegmentWriteState segmentWriteState, Collection<float[]> vectors) throws IOException {
Contributor:

Can we leave the persistence of the vectors to a directory, or anything directory related, to another PR? This PR should just expose a single entry point that takes in a collection of vectors, samples it, and then executes an arbitrary number of computations.

That being said, I see this as the main entry point of the VectorProfiler. Given an unfiltered collection of vectors as input (let's not worry about segments for now):

  1. Sample the vectors.
  2. Execute computations.

NIT: sampleVector may be good for now as a generic entry method, as we're not saving/persisting anything currently.
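
A hedged sketch of such an entry point (sample first, then run each registered computation on the sample). It assumes the PR's Computation interface and calculateVector method are available; the shuffle-based sampling below is only a stand-in for the plugin's sampling implementation:

import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative method that would live alongside calculateVector in the profiler.
static Map<Computation, float[]> sampleAndCompute(
    final Collection<float[]> vectors,
    final List<Computation> computations,
    final int sampleSize
) {
    final List<float[]> shuffled = new ArrayList<>(vectors);
    Collections.shuffle(shuffled); // placeholder for the plugin's sampler
    final List<float[]> sample = shuffled.subList(0, Math.min(sampleSize, shuffled.size()));

    final Map<Computation, float[]> results = new HashMap<>();
    for (final Computation computation : computations) {
        results.put(computation, calculateVector(sample, computation));
    }
    return results;
}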


try {
// Calculate all statistics
float[] meanVector = calculateVector(vectors, StatisticalOperators.MEAN);
Contributor:

We don't want to explicitly include these operations in the method. Imagine a scenario where we have 100+ metrics that we need to compute - this would congest what we're trying to do by a large factor and make refactoring difficult. At the top level of the vector profiler we can include some sort of dynamic collection of computations to execute, configurable via a setting and read from there.

// Calculate all statistics
float[] meanVector = calculateVector(vectors, StatisticalOperators.MEAN);
float[] varianceVector = calculateVector(vectors, StatisticalOperators.VARIANCE);
float[] stdDevVector = calculateVector(vectors, StatisticalOperators.STANDARD_DEVIATION);
Contributor:

calculateVector doesn't include any sampling? In the case that this is the entry point we would want to sample the vectors before we compute. We can reuse the sampling implementation in the plugin to do this.

float[] varianceVector = calculateVector(vectors, StatisticalOperators.VARIANCE);
float[] stdDevVector = calculateVector(vectors, StatisticalOperators.STANDARD_DEVIATION);

String statsFileName = IndexFileNames.segmentFileName(
Contributor:

Let's move anything related to writing the results to a file/persisting it to the PRs related to indexing and querying, unless you feel strongly about including them together. This would improve readability and keep the PRs manageable in terms of turnaround time.

return new ArrayList<>(registeredComputations);
}

public Map<Computation, float[]> sampleAndCompute(String fieldName, Collection<float[]> vectors, int... sampleSize) {
Contributor:

NIT: Final whenever possible

private static VectorProfiler INSTANCE;
private static final int DEFAULT_SAMPLE_SIZE = 1000;
private final Map<String, List<DimensionStatisticAggregator>> fieldToDimensionStats;
private List<Computation> registeredComputations;
Contributor:

We may not need this anymore because we will compute everything using the DimensionStatisticAggregator

private final Map<String, List<DimensionStatisticAggregator>> fieldToDimensionStats;
private List<Computation> registeredComputations;

public VectorProfiler() {
Contributor:

Singletons have private constructors.
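
For illustration, the typical shape being suggested (a lazily initialized singleton with a private constructor; this is a sketch, not the PR's final code):

// Private constructor prevents external instantiation; access goes through getInstance().
public final class VectorProfiler {
    private static VectorProfiler INSTANCE;

    private VectorProfiler() {
    }

    public static synchronized VectorProfiler getInstance() {
        if (INSTANCE == null) {
            INSTANCE = new VectorProfiler();
        }
        return INSTANCE;
    }
}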

return aggregateStats;
}

public Map<String, SummaryStatistics> getSegmentStatistics() {
Contributor:

Can we also add a method to retrieve a SummaryStatistic for a particular segment?

Contributor Author:

Created a getSegmentStatistic() method

}

public void addSegmentStatistics(Collection<Float> values) {
String segmentId = UUID.randomUUID().toString();
Contributor:

Can we add a TODO saying that we'll attach the segmentId in another PR and why we're using a uuid for now?

private final Map<String, SummaryStatistics> segmentToSummaryMapping;
private final int dimension;

public DimensionStatisticAggregator(int dimension) {
Contributor:

NIT: final variables whenever possible

public class DimensionStatisticAggregator {
private final AggregateSummaryStatistics aggregateStats;
private final Map<String, SummaryStatistics> segmentToSummaryMapping;
private final int dimension;
Contributor:

Does this correspond to the number of dimensions or the ID of the particular dimension?

Contributor Author:

The ID of the particular dimension; added comments and renamed for better visibility

import java.util.Map;
import java.util.UUID;

public class DimensionStatisticAggregator {
Contributor:

Can we add Javadoc comments here?

* used in KNN algorithm implementations.
*/

public interface Computation {
Contributor:

We don't need the computation interface here. We can leave the selection of computations at the API level since we now have the StatisticalSummaries.

@markwu-sde (Contributor):

Tests with failures: org.opensearch.knn.profiler.VectorProfilerTests.testDifferentDimensions

Some of the tests are failing; can we check?

@oaganesh requested a review from markwu-sde on March 26, 2025
import java.util.UUID;

@Log4j2
public class VectorProfiler {
Contributor:

Can we have javadocs for this class?

return INSTANCE;
}

public static synchronized void setInstance(VectorProfiler instance) {
Contributor:

Let's not have a setInstance here for singletons - for testing we can provide setters for the content within the class

}

public void setFieldToDimensionStats(Map<String, List<DimensionStatisticAggregator>> stats) {
fieldToDimensionStats.clear();
Contributor:

NIT: No need to clear here - if we want to set this just for testing we can call get() and then clear it ourselves in the test.

public void processVectors(final String fieldName, final Collection<float[]> vectors) {
validateVectors(vectors);

float[] firstVector = vectors.iterator().next();
Contributor:

Calling .next() in the iterator implementation would return the first element but it wouldn't be processed in the sampling as it's just a pointer.

We don't have to determine the dimension here - there should be a way to retrieve it from the field metadata during our write/read paths. We can probably add it as a parameter and leverage this.


private void validateVectors(final Collection<float[]> vectors) {
if (vectors == null || vectors.isEmpty()) {
throw new IllegalArgumentException("Vectors collection cannot be null or empty");
Contributor:

No need to throw an exception here - we can just log and exit out of the profiler.


fieldToDimensionStats.remove(fieldName);

initializeDimensionAggregators(fieldName, dimensions);
Contributor:

If we've seen the field before, there's no need to create a new list of DimensionStatisticAggregators. Reuse the old one and add a segment entry to it.
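
One way to express that reuse, sketched with computeIfAbsent (createDimensionAggregators is a hypothetical helper returning one aggregator per dimension, not a method in this PR):

// Illustrative only: build the aggregator list the first time a field is seen,
// then keep adding segment statistics to the same list on later calls.
final List<DimensionStatisticAggregator> aggregators = fieldToDimensionStats.computeIfAbsent(
    fieldName,
    name -> createDimensionAggregators(dimensions) // hypothetical helper
);
// ...then add this segment's per-dimension values to each existing aggregator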

return;
}

fieldToDimensionStats.remove(fieldName);
Contributor:

We shouldn't have this here -> this will remove the stats for previous segments added.

public void processVectors(final String fieldName, final Collection<float[]> vectors, final int dimensions) {
if (vectors == null || vectors.isEmpty()) {
log.warn("No vectors to process for field: {}", fieldName);
fieldToDimensionStats.remove(fieldName);
Contributor:

We shouldn't have this here -> this will remove the stats for previous segments added.

private static final int TEST_DIMENSIONS = 3;

@Before
public void setUp() throws Exception {
Contributor:

If we want to clear the map of fieldToDimensionStats, we can provide an accessor to VectorProfiler and clear the map within the test here: getFieldToDimensionStats().clear()

@markwu-sde (Contributor):

Thanks for the revisions. LGTM - @jmazanec15 can you take a look?

@jmazanec15 (Member) left a comment:

Overall I think it's a good start! I think we might want to make a few changes based on how it's going to be used.

Should a vector profiler be tied to the lifecycle of the NativeEngineWriter instance instead of being a global singleton? See https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/codec/KNN990Codec/NativeEngines990KnnVectorsWriter.java#L109. This is where per-segment values are added and quantization is sampled/trained. If we are going to serialize the information in the segment, we will need access here.

Then, in order to aggregate at the segment, shard, and index level, we can follow the path of warmup/clear cache in the KNNIndexShard. See https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java#L86-L87 - we would need to access this info somehow through a custom KnnCollector.

@@ -219,7 +219,6 @@ public Collection<Object> createComponents(
KNNQueryBuilder.initialize(ModelDao.OpenSearchKNNModelDao.getInstance());
KNNWeight.initialize(ModelDao.OpenSearchKNNModelDao.getInstance());
TrainingModelRequest.initialize(ModelDao.OpenSearchKNNModelDao.getInstance(), clusterService);

Member:

nit: revert this so it doesn't show up in the diff.

CHANGELOG.md Outdated
@@ -20,6 +20,8 @@ Add filter function to KNNQueryBuilder with unit tests and integration tests [#2

## [Unreleased 2.x](https://github.com/opensearch-project/k-NN/compare/2.19...2.x)
### Features
* [Vector Profiler] Adding generic vector profiler implementation to sample vector statistics and output towards the API (#2622)
Member:

nit: check the format in change log on PR link

/**
* Aggregates statistics for a specific dimension across multiple segments.
* This class is used to collect and analyze statistical data for a particular dimension
* across multiple segments of a vector.
Member:

nit: segments of an index?

Contributor Author:

Do you mean to include the wording segments of an index? Each segment contains a subset of the indexed vectors, and this class helps aggregate statistics across these different portions of the index.

Member:

Meant this:

* This class is used to collect and analyze statistical data for a particular dimension
* across multiple segments of a vector index.

/**
* @return the aggregate statistics across all segments
*/
public StatisticalSummary getAggregateStatistics() {
Member:

nit: use lombok getter for retrieving member variables
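
A generic illustration of the Lombok approach (field set reduced for brevity; AggregateSummaryStatistics is the Commons Math aggregator already used by this class):

import lombok.Getter;
import org.apache.commons.math3.stat.descriptive.AggregateSummaryStatistics;

// @Getter generates getAggregateStats() and getDimension(), replacing hand-written accessors.
@Getter
public class DimensionStatisticAggregator {
    private final AggregateSummaryStatistics aggregateStats = new AggregateSummaryStatistics();
    private final int dimension;

    public DimensionStatisticAggregator(final int dimension) {
        this.dimension = dimension;
    }
}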

}

/**
* Process a collection of vectors for a specific field and dimension size.
Member:

nit: should this be profile? What does process mean in this context?

Contributor Author:

True, I have changed it to profileVectors() for consistency.

* Retrieve the entire map of field to dimension statistics.
* @return fieldToDimensionStats
*/
Map<String, List<DimensionStatisticAggregator>> getFieldToDimensionStats() {
Member:

lombok getter

import java.util.Collections;
import java.util.List;

public class VectorProfilerTests extends OpenSearchTestCase {
Member:

Does the profiler need to be a singleton? This is one of the problems with going the singleton route: it can complicate testing. Instead, we can create it via KNNPlugin createComponents and pass it into the codec/required transport actions from there.

@oaganesh requested a review from markwu-sde on March 28, 2025