Conversation

oaganesh
Contributor
Description

Adds an implementation for profiling and statistical analysis of KNN vector segments in OpenSearch, including compressed segments, for use when benchmarking. Provides functionality to collect, process, and store statistical information about vector dimensions across different shards and segments.

Related Issues

Implements #2622

Check List

  • [x] New functionality includes testing.
  • [x] New functionality has been documented.
  • [x] API changes companion pull request created.
  • [x] Commits are signed per the DCO using --signoff.
  • [x] Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following the Developer Certificate of Origin and signing off your commits, please check here.

Example input:

curl -XPUT "http://localhost:9200/target_index" -H 'Content-Type: application/json' -d'{      
  "settings": {
    "index": {
      "knn": true,
      "number_of_shards": 1,
      "number_of_replicas": 0
    }
  },                                                                                            
  "mappings": {
    "properties": {
      "target_field": {
        "type": "knn_vector",
        "dimension": 128,
        "space_type": "l2",
        "compression_level": "16x",
        "method": {
          "name": "hnsw",
          "space_type": "l2",
          "engine": "faiss",
          "parameters": {
            "ef_construction": 100,
            "m": 16
          }
        }
      }
    }
  }
}'

curl -XPOST "http://localhost:9200/target_index/_doc" -H 'Content-Type: application/json' -d'{
  "target_field": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0]
}'
curl -X GET "localhost:9200/_plugins/_knn/warmup/target_index?pretty" 
curl -X GET "localhost:9200/_plugins/_knn/profile/target_index?pretty"

Example output:

{
  "total_shards" : 1,
  "successful_shards" : 1,
  "failed_shards" : 0,
  "profile_results" : {
    "target_index" : {
      "dimensions" : [
        {
          "min" : 1.3,
          "max" : 1.9,
          "variance" : 0.0,
          "mean" : 3.8,
          "count" : 5,
          "std_deviation" : 0.0,
          "sum" : 1.9,
          "dimension" : 0
        },
        {
          "min" : 0.2,
          "max" : 1.89,
          "variance" : 0.2,
          "mean" : 3.78,
          "count" : 5,
          "std_deviation" : 0.3,
          "sum" : 1.89,
          "dimension" : 1
        }
      ]
    }
  }
}

CHANGELOG.md Outdated
## [Unreleased 2.x](https://github.com/opensearch-project/k-NN/compare/2.19...2.x)
### Features
* [Vector Profiler] Adding basic generic vector profiler implementation and tests. [#2624](https://github.com/opensearch-project/k-NN/pull/2624)
* [Vector Profiler] Updating serializer, compression, and api. [#2687](https://github.com/opensearch-project/k-NN/pull/2687)
Contributor

Should we be a little more specific on what we're updating?

Also AFAIK we're not updating anything related to how compression is done or any API (but we're introducing one). Should we reword this?

* @param fieldName The name of the vector field to profile
* @return List of statistical summaries for each dimension
*/
// TODO: Write unit tests to ensure that the segment statistic aggregation is correct.
Contributor

Should we finish this TODO and write unit tests for the aggregation?

* @return List of statistical summaries for each dimension
*/
// TODO: Write unit tests to ensure that the segment statistic aggregation is correct.
public List<StatisticalSummaryValues> profile(String fieldName) {
Contributor

NIT: final variables whenever possible


// Get dimension and validate all segments have the same dimension
int dimension = segmentLevelProfilerStates.get(0).getDimension();
boolean dimensionsMatch = segmentLevelProfilerStates.stream().allMatch(state -> state.getDimension() == dimension);
Contributor

Would there ever be a case where the dimensions won't match accordingly here? Do we need to check for this?

Contributor Author

Reverting back to the previous way of implementing KNNIndexShard, as the current way is less robust.


if (!dimensionsMatch) {
log.error("[KNN] Inconsistent dimensions found across segments");
return shardVectorProfile; // Return empty list
Contributor

Nit: you can return List.of()


for (SegmentProfilerState state : segmentLevelProfilerStates) {
List<SummaryStatistics> stateStats = state.getStatistics();
if (dimensionId < stateStats.size()) {
Contributor

Again, if the dimension is constant here, do we really need this check? It will always be within bounds.

Contributor Author

Understood, reverting back to the previous way of implementing KNNIndexShard, as the current way is less robust.

List<SummaryStatistics> stateStats = state.getStatistics();
if (dimensionId < stateStats.size()) {
SummaryStatistics stat = stateStats.get(dimensionId);
if (stat != null) {
Contributor

When can this be possible?

log.info("[KNN] Profiling completed for field: {} in shard: {}", fieldName, indexShard.shardId());
} catch (Exception e) {
log.error(
"[KNN] Critical error during profiling for field: {} in shard: {}: {}",
Contributor

Critical sounds a bit too extreme here given we're not in the hot path.

* @return List of statistical summaries for each dimension
*/
public List<StatisticalSummaryValues> profile() {
return profile("target_field");
Contributor

We shouldn't hard code a field name here. I understand we're doing this from a benchmarking OSB perspective but if we want customers to be able to leverage this feature we need to provide the ability to take in a field/index name as input.

Contributor Author

Removed specific field input

int fieldNumber = segmentReadState.fieldInfos.fieldInfo(field).getFieldNumber();

try (IndexInput input = segmentReadState.directory.openInput(quantizationStateFileName, IOContext.READONCE)) {
try (IndexInput input = segmentReadState.directory.openInput(quantizationStateFileName, IOContext.DEFAULT)) {
Contributor

Could you give some context around this? Why are we changing from READONCE to DEFAULT?

Contributor Author

Mostly did it for consistency when accessing quantization state files multiple times.
https://lucene.apache.org/core/9_12_1/core/org/apache/lucene/store/IOContext.html?is-external=true

// should skip graph building only for non quantization use case and if threshold is met
if (quantizationState == null && shouldSkipBuildingVectorDataStructure(totalLiveDocs)) {
log.debug(
log.info(
Contributor

NIT: use log.debug wherever possible


SegmentProfilerState segmentProfilerState = null;
if (totalLiveDocs > 0) {
// TODO:Refactor to another init
Contributor
@markwu-sde Apr 30, 2025

This seems important to the PR. Are we going to act on this TODO?

Contributor Author

Refactored the NativeEnginesVectorsWriter class to encompass segmentStateWriter and initSegmentStateWriterIfNecessary().

/**
* A utility function to get {@link SegmentProfilerState} for a given segment and field.
* This needs to be public as we are accessing it from a transport action
* TODO: move this out of this Util class and into another one.
Contributor
@markwu-sde Apr 30, 2025

Same here. This seems important to the PR. Should we complete this TODO?

Contributor Author

The method has already been refactored into its own class under org.opensearch.knn.index.query. Need to remove the existing code.

Contributor
@markwu-sde left a comment

One over-arching comment here is that we should have unit/integration tests for all our code. Make sure to include them in the PR so that we can test our functionality, such as making sure the aggregation at the shard level is what we're expecting.

* @return {@link SegmentProfilerState}
* @throws IOException exception during reading the {@link SegmentProfilerState}
*/
public static SegmentProfilerState getSegmentProfileState(final LeafReader leafReader, String fieldName) throws IOException {
Contributor

We have the util function below that is an exact replica of this method. Do we still need this?

Contributor Author

No, the method has already been refactored into its own class under org.opensearch.knn.index.query

/**
* Get aggregated dimension statistics by index
*/
private Map<String, Map<Integer, Map<String, Object>>> getAggregatedStats() {
Contributor

I'm assuming you want to get the aggregation at the field level here across all shards.

  1. Shouldn't this be by field/index instead of just index?
  2. NIT: getAggregatedStats can be more specific because there are now multiple layers of aggregation

continue;
}

Map<Integer, Map<String, Object>> dimensions = indexDimensions.computeIfAbsent(indexName, k -> new HashMap<>());
Contributor

So the mapping that you're trying to do is:

Map<Index, Map<Dimension ID, Map<String,Object>>.

In lucene and specifically within k-nn a index can have multiple fields. Not all fields are of type knn_vector

Map<Integer, Map<String, Object>> dimensions = indexDimensions.computeIfAbsent(indexName, k -> new HashMap<>());

for (int i = 0; i < stats.size(); i++) {
StatisticalSummaryValues stat = stats.get(i);
Contributor

Please correct me if I'm wrong, but I think the crux of the situation here, looking at StatisticalSummaryValues, is that the library we're using doesn't provide us with an aggregation technique for that object.

If that is the case, how about we collect our SegmentProfileStates and put them in the shard response? That way we can craft both the shard/field level aggregation as we see fit since we have access to those values.

int finalI = i;
Map<String, Object> dimension = dimensions.computeIfAbsent(i, k -> {
Map<String, Object> newDim = new HashMap<>();
newDim.put("dimension", finalI);
Contributor

NIT: static imports when possible

double min = in.readDouble();
double max = in.readDouble();
double sum = in.readDouble();
this.dimensionStats.add(new StatisticalSummaryValues(mean, variance, n, min, max, sum));
Contributor

If we're deserializing it here again, would it be worthwhile to just save the deserialization at the profile API layer?

What kind of responses do we need to deserialize here?
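On the mirrored read/write ordering: whichever fields are written out, the stream constructor must read them back in exactly the same order, or every later field is silently corrupted. A minimal stdlib sketch of that invariant (the field layout here is illustrative, not the PR's actual stream format):

```java
import java.io.*;

public class StatsRoundTrip {
    // Write the summary fields in a fixed order.
    static byte[] write(double mean, double variance, long n,
                        double min, double max, double sum) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeDouble(mean);
            out.writeDouble(variance);
            out.writeLong(n);
            out.writeDouble(min);
            out.writeDouble(max);
            out.writeDouble(sum);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Read them back in exactly the same order as they were written.
    static double[] read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            double mean = in.readDouble();
            double variance = in.readDouble();
            long n = in.readLong();
            double min = in.readDouble();
            double max = in.readDouble();
            double sum = in.readDouble();
            return new double[] { mean, variance, n, min, max, sum };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        double[] fields = read(write(3.78, 0.2, 5, 0.2, 1.89, 18.9));
        System.out.println(fields[0]); // mean was written first, so it is read first
    }
}
```

Swapping any two reads relative to the writes would reassign values to the wrong statistics without any runtime error, which is why round-trip tests on this layer are worth having.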

);

knnIndexShard.warmup();
knnIndexShard.profile();
Contributor

Do we need this in WarmupTransportAction?

out.writeVInt(dimension);
out.writeVInt(statistics.size());
for (SummaryStatistics stat : statistics) {
out.writeDouble(stat.getMean());
Contributor

We shouldn't be serializing/deserializing the values but rather the object itself.

SummaryStatistics already implements the Serializable interface, so we should be able to convert it directly to bytes without needing to retrieve its properties.
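The pattern being suggested is standard Java object serialization: write the whole Serializable object graph rather than individual fields. A sketch with a stand-in Serializable class (commons-math's SummaryStatistics isn't on the classpath here, so the Stats class below is a hypothetical placeholder):

```java
import java.io.*;

public class ObjectRoundTrip {
    // Stand-in for a Serializable statistics object such as SummaryStatistics.
    static class Stats implements Serializable {
        private static final long serialVersionUID = 1L;
        final long n;
        final double mean;
        Stats(long n, double mean) { this.n = n; this.mean = mean; }
    }

    // Serialize the object graph in one call, not field by field.
    static byte[] toBytes(Serializable obj) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static Object fromBytes(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Stats original = new Stats(5, 3.78);
        Stats copy = (Stats) fromBytes(toBytes(original));
        System.out.println(copy.n + " " + copy.mean);
    }
}
```

The trade-off is that Java object serialization ties the byte layout to the class's serialVersionUID, whereas OpenSearch's StreamOutput/StreamInput convention gives explicit control over the wire format; either way, the write and read sides must agree.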


for (int i = 0; i < statsSize; i++) {
SummaryStatistics stat = new SummaryStatistics();
stat.addValue(input.readDouble());
Contributor

I think we've previously had a discussion on the deserialization aspect of this - the same issue is present here as well.

Let's make sure to modify this as this would yield incorrect results for our profiler.

Contributor Author

Yes, this was just used to help test for compression by creating a new instance. Reverting back to pre-existing method for serializing and deserializing.

* Response for KNN profile request
*/
@Log4j2
public class KNNProfileResponse extends BroadcastResponse implements ToXContentObject {
Contributor

Have not looked too closely into these since I do have some comments on the serialization/deserialization component.

@oaganesh changed the title from "Updating serializer, compression, and api." to "Adding serializer and api implementation for segment profiler state" on Apr 30, 2025
builder.endObject();
}
builder.endObject();

Contributor

What about at the cluster level with every segment aggregated together? This output currently does this per shard right? Can we combine every SummaryStatistics in the cluster together?

Contributor Author
@oaganesh May 1, 2025

Yes, the stats were currently per shard; added cluster-level stats as well.

if (dim < state.getStatistics().size()) {
SummaryStatistics stats = state.getStatistics().get(dim);

totalCount += stats.getN();
Contributor

Should we leverage a existing library for aggregation?

  1. This doesn't give a correct aggregation outside of min, max, and sum.
  2. https://commons.apache.org/proper/commons-math/commons-math-docs/apidocs/src-html/org/apache/commons/math4/legacy/stat/descriptive/AggregateSummaryStatistics.html#line.305 - there's a library provided to us that aggregates a SummaryStatistics object for us. Can we use that instead of trying to combine it on our own?

Contributor Author

Updated to use AggregateSummaryStatistics
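For context on why naive summing is wrong for variance: pooling two segments' statistics requires combining the sums of squared deviations with a cross-term, which is what AggregateSummaryStatistics does internally. A plain-Java sketch of that pooling math (an illustration of the technique, not the library's or the PR's code):

```java
public class PooledStats {
    // A summary is (n, mean, m2), where m2 is the sum of squared deviations from the mean.
    static double[] combine(double[] a, double[] b) {
        double n = a[0] + b[0];
        double delta = b[1] - a[1];
        double mean = a[1] + delta * (b[0] / n);
        // Cross-term accounts for the two segments having different means.
        double m2 = a[2] + b[2] + delta * delta * (a[0] * b[0] / n);
        return new double[] { n, mean, m2 };
    }

    // Bias-corrected sample variance, matching SummaryStatistics' default.
    static double sampleVariance(double[] s) {
        return s[0] > 1 ? s[2] / (s[0] - 1) : 0.0;
    }

    // Build a summary from raw values by folding in one value at a time.
    static double[] summarize(double... values) {
        double[] s = { 0, 0, 0 };
        for (double v : values) {
            s = combine(s, new double[] { 1, v, 0 });
        }
        return s;
    }

    public static void main(String[] args) {
        // Two "segments" of values for one dimension.
        double[] seg1 = summarize(0.1, 0.2, 0.3);
        double[] seg2 = summarize(0.4, 0.5);
        double[] pooled = combine(seg1, seg2);
        // Pooling matches summarizing all five values at once.
        double[] direct = summarize(0.1, 0.2, 0.3, 0.4, 0.5);
        System.out.println(pooled[1] + " vs " + direct[1]);
    }
}
```

Simply averaging per-segment variances (or means, with unequal segment sizes) would drop the cross-term and understate the spread, which is why delegating to the library's aggregator is the right call.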

builder.startObject(shardProfileResult.shardId);

// Individual segment statistics
builder.startArray("segments");
Contributor

Can you provide an example of output in the Javadoc of what this would look like?

Contributor Author

Updated javadocs


@Override
public void writeTo(StreamOutput streamOutput) throws IOException {
super.writeTo(streamOutput);
Contributor

Let's see if we use this override for anything. If we don't, we can throw an exception here to make sure we don't stray off the intended path.

* Constructor for reading from StreamInput
*/
public KNNIndexShardProfileResult(StreamInput streamInput) throws IOException {
this.shardId = streamInput.readString();
Contributor

Do we ever use this constructor?

Contributor Author

No, it was used previously in KNNProfileResponse, but when removing the logic and the above constructor I was able to validate the output was still working.


## [Unreleased 2.x](https://github.com/opensearch-project/k-NN/compare/2.19...2.x)
### Features
* [Vector Profiler] Adding basic generic vector profiler implementation and tests. [#2624](https://github.com/opensearch-project/k-NN/pull/2624)
Contributor

Why are we adding these changes? Weren't they already merged?

Contributor Author

Yes, updated the branch to align with changes already implemented in the feature branch.

*/
@Log4j2
@AllArgsConstructor
public class SegmentProfilerState implements Serializable {
Contributor

Why is this file being shown as new? Isn't this already in the branch?

Contributor Author

Updated the branch to align with changes already implemented in the feature branch.

Contributor
@markwu-sde left a comment

Please add unit/integration tests to validate shard aggregation functionality. You'll need to create multiple segments/shards to do this most likely.

Member
@jmazanec15 left a comment

A few comments. I think the overall structure is looking good. I think it'd be good to enhance tests to ensure that we are validating functionality. Seems you are already working on that!

@Getter
public class KNNProfileRequest extends BroadcastRequest<KNNProfileRequest> {

private String index;
Member

public class KNNProfileRequest extends BroadcastRequest<KNNProfileRequest> {

private String index;
private String field;
Member

nit: let's make these final.


@Override
public ActionRequestValidationException validate() {
return null;
Member

nit: no need to override

* "failures": []
* }
*/
public class KNNProfileResponse extends BroadcastResponse implements ToXContentObject {
Member

*/
public class KNNProfileResponse extends BroadcastResponse implements ToXContentObject {

List<KNNIndexShardProfileResult> shardProfileResults;
Member

nit: private final?

KNNProfileResponse,
KNNIndexShardProfileResult> {

public static Logger logger = LogManager.getLogger(KNNProfileTransportAction.class);
Member

replace this with @Log4j2 annotation

KNNIndexShard knnIndexShard = new KNNIndexShard(
indicesService.indexServiceSafe(shardRouting.shardId().getIndex()).getShard(shardRouting.shardId().id())
);

Member

nit: remove for git history


@Setter
@Getter
public class SegmentProfileKNNCollector implements KnnCollector {
Member

Add comment on top of this class.

return response;
}

private void validateProfileResponse(Response response, int dimension) throws IOException, ParseException {
Member

It'd be good to update this to ensure expected statistics are being computed. For instance, you can generate the vectors from a normalized distribution (i.e. 0-1) and confirm that the mean value is computed to be close to 0.5.

Contributor

+1 agree with this.

Make sure to validate that the aggregation is what we'd expect. Since merge is out of scope for this PR, this can probably be done by explicitly creating the 2 segments and calling the profile API to make sure the aggregate is what we expect.

Contributor Author

Updated tests
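The statistical core of the suggested test can be sketched standalone: draw vector components uniformly from [0, 1), compute the per-dimension mean the way the profiler would, and assert it lands near 0.5. This is a hypothetical helper illustrating the check, not the PR's actual test harness:

```java
import java.util.Random;

public class ProfilerMeanCheck {
    // Generate 'count' vectors of 'dimension' uniform [0,1) components and
    // return the per-dimension mean, as a profiler validation test would.
    static double[] perDimensionMeans(int count, int dimension, long seed) {
        Random random = new Random(seed);
        double[] sums = new double[dimension];
        for (int i = 0; i < count; i++) {
            for (int d = 0; d < dimension; d++) {
                sums[d] += random.nextDouble();
            }
        }
        for (int d = 0; d < dimension; d++) {
            sums[d] /= count;
        }
        return sums;
    }

    public static void main(String[] args) {
        double[] means = perDimensionMeans(10_000, 4, 42L);
        for (double mean : means) {
            // With 10k uniform samples per dimension, each mean should land near 0.5.
            System.out.printf("%.3f%n", mean);
        }
    }
}
```

In the real integration test the vectors would be indexed and the means read back through the profile API, but the tolerance reasoning is the same: with n samples the standard error of the mean is about sqrt(1/12)/sqrt(n), so a loose bound like 0.02 at n = 10,000 is comfortably safe.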

import java.util.List;

@AllArgsConstructor
public class KNNIndexShardProfileResult implements Writeable {
Member

Do we also need deserialization logic with this?

Contributor Author

I think it may be useful especially when compressing vectors.

@oaganesh closed this Jun 17, 2025