@oaganesh oaganesh commented Apr 8, 2025

Description

Adds an implementation for profiling and statistical analysis of KNN vector segments in OpenSearch. It collects, processes, and stores per-dimension statistics for vectors across shards and segments.

Example input:

Creating the index

curl -X PUT "http://localhost:9200/my-index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "my_vector_field": {
        "type": "knn_vector",
        "dimension": 4
      }
    }
  }
}'

Adding vectors to the index

curl -X POST "http://localhost:9200/my-index/_doc" -H 'Content-Type: application/json' -d'
{
  "my_vector_field": [1.0, 2.0, 3.0, 4.0]
}'
curl -X POST "http://localhost:9200/my-index/_doc" -H 'Content-Type: application/json' -d'
{
  "my_vector_field": [1.5, 2.0, 3.0, 4.0]
}'

Flushing the index

curl -X POST "http://localhost:9200/my-index/_flush"

Observed example output:

curl -X GET "localhost:9200/_plugins/_knn/sampling/my-index/stats?pretty"
{
  "index_summary" : {
    "doc_count" : 2,
    "size_in_bytes" : 5763,
    "timestamp" : "2025-04-07T21:38:31.814948Z"
  },
  "vector_stats" : {
    "sample_size" : 2,
    "summary_stats" : {
      "dimensions" : [
        {
          "dimension" : 0,
          "count" : 2,
          "min" : 1.0,
          "max" : 1.5,
          "sum" : 2.5,
          "mean" : 1.25,
          "standardDeviation" : 0.3536,
          "variance" : 0.125
        },
        {
          "dimension" : 1,
          "count" : 2,
          "min" : 2.0,
          "max" : 2.0,
          "sum" : 4.0,
          "mean" : 2.0,
          "standardDeviation" : 0.0,
          "variance" : 0.0
        },
        {
          "dimension" : 2,
          "count" : 2,
          "min" : 3.0,
          "max" : 3.0,
          "sum" : 6.0,
          "mean" : 3.0,
          "standardDeviation" : 0.0,
          "variance" : 0.0
        },
        {
          "dimension" : 3,
          "count" : 2,
          "min" : 4.0,
          "max" : 4.0,
          "sum" : 8.0,
          "mean" : 4.0,
          "standardDeviation" : 0.0,
          "variance" : 0.0
        }
      ]
    }
  }
}
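
As a cross-check on the numbers above, here is a minimal sketch that recomputes per-dimension statistics in plain Java. It is not the PR's implementation: the class and method names (`DimensionStatsSketch`, `summarize`) are hypothetical, and it assumes the reported variance is the bias-corrected (n - 1) sample variance. That assumption is consistent with dimension 0, where the values 1.0 and 1.5 give squared deviations summing to 0.125, divided by (2 - 1).

```java
import java.util.List;

/** Hypothetical helper, not taken from the PR: per-dimension summary statistics. */
final class DimensionStatsSketch {

    /** Summary of one vector dimension; field names mirror the JSON keys above. */
    static final class Summary {
        final int dimension;
        final long count;
        final double min, max, sum, mean, variance, standardDeviation;

        Summary(int dimension, long count, double min, double max, double sum, double mean, double variance) {
            this.dimension = dimension;
            this.count = count;
            this.min = min;
            this.max = max;
            this.sum = sum;
            this.mean = mean;
            this.variance = variance;
            this.standardDeviation = Math.sqrt(variance);
        }
    }

    /** Two-pass computation; assumes a non-empty list of vectors with at least {@code dimensions} entries each. */
    static Summary[] summarize(List<float[]> vectors, int dimensions) {
        Summary[] result = new Summary[dimensions];
        long n = vectors.size();
        for (int d = 0; d < dimensions; d++) {
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            double sum = 0.0;
            for (float[] v : vectors) {
                min = Math.min(min, v[d]);
                max = Math.max(max, v[d]);
                sum += v[d];
            }
            double mean = sum / n;
            double squaredDeviations = 0.0;
            for (float[] v : vectors) {
                double diff = v[d] - mean;
                squaredDeviations += diff * diff;
            }
            // Bias-corrected (n - 1) variance; defined as 0 for a single sample in this sketch.
            double variance = n > 1 ? squaredDeviations / (n - 1) : 0.0;
            result[d] = new Summary(d, n, min, max, sum, mean, variance);
        }
        return result;
    }

    public static void main(String[] args) {
        // The two vectors indexed in the example above.
        List<float[]> vectors = List.of(
            new float[] { 1.0f, 2.0f, 3.0f, 4.0f },
            new float[] { 1.5f, 2.0f, 3.0f, 4.0f }
        );
        for (Summary s : summarize(vectors, 4)) {
            System.out.printf("dimension=%d mean=%.4f variance=%.4f stddev=%.4f%n",
                s.dimension, s.mean, s.variance, s.standardDeviation);
        }
    }
}
```

A population (divide by n) variance would give 0.0625 for dimension 0 instead, so the choice of estimator is visible even in this two-document example.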

Related Issues

Implements #2687

Check List

  • [✔️] New functionality includes testing.
  • [✔️] New functionality has been documented.
  • [✔️] API changes companion pull request created.
  • [✔️] Commits are signed per the DCO using --signoff.
  • [✔️] Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check ✔️.

log.info("Starting vector profiling for field: {}", fieldInfo.getName());
SegmentProfilerState.profileVectors(knnVectorValuesSupplier, segmentWriteState, fieldInfo.getName());
log.info("Completed vector profiling for field: {}", fieldInfo.getName());
} catch (Exception e) {

Contributor

Should we catch the exception in our SegmentProfilerState class instead? We want to keep the business logic here relatively untouched.

Contributor Author

Changed
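
One way to read the suggestion above is that the profiler itself absorbs failures, so the calling code needs no try/catch. The sketch below illustrates that shape under stated assumptions: `ProfilerGuardSketch`, `ProfileTask`, and `profileSafely` are hypothetical names, and the logging is reduced to `System.err` to keep the example self-contained. It is not the change that was actually made in the PR.

```java
import java.util.Optional;

/** Hypothetical sketch: keep exception handling inside the profiler, not in the caller. */
final class ProfilerGuardSketch {

    /** Stand-in for the real profiling work, which may throw checked exceptions. */
    interface ProfileTask<T> {
        T run() throws Exception;
    }

    /**
     * Runs the task and returns its result, or an empty Optional if it fails.
     * The caller's business logic therefore stays free of try/catch blocks.
     */
    static <T> Optional<T> profileSafely(String fieldName, ProfileTask<T> task) {
        try {
            return Optional.ofNullable(task.run());
        } catch (Exception e) {
            // A real implementation would use the plugin's logger instead of System.err.
            System.err.println("Vector profiling failed for field " + fieldName + ": " + e.getMessage());
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        Optional<Integer> ok = profileSafely("my_vector_field", () -> 2);
        Optional<Integer> failed = profileSafely("my_vector_field", () -> {
            throw new java.io.IOException("segment not readable");
        });
        System.out.println(ok.isPresent() + " " + failed.isPresent());
    }
}
```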


private KNNStats knnStats;
private ClusterService clusterService;
// private IndicesService indicesService;

Contributor

Uncomment or remove this line. Why do we need IndicesService here? It's a pretty heavy object to be using inside KnnPlugin.

Contributor Author

Changed

private KNNStats knnStats;
private ClusterService clusterService;
// private IndicesService indicesService;
private Environment environment;

Contributor

Do we need Environment here? What information are we trying to get here?

Contributor Author

It's used to access the data directory paths and node-specific items. It's needed to pass to the RestKNNSamplingStatsHandler to locate and read statistics files stored on disk. The environment provides access to the node's data paths where vector statistics files are stored under the indices directory structure.

Contributor

We don't need this; we can directly follow what QuantizationState does.

Supplier<RepositoriesService> repositoriesServiceSupplier
) {
this.clusterService = clusterService;
// this.indicesService = client.getInstanceFromNode(IndicesService.class);

Contributor

Let's not leave commented lines in the PR.

Contributor Author

Changed

this.clusterService = clusterService;
// this.indicesService = client.getInstanceFromNode(IndicesService.class);
this.repositoriesServiceSupplier = repositoriesServiceSupplier;
this.environment = environment;

Contributor

Why do we need the environment here?

Contributor Author

It's used to access the data directory paths and node-specific items. It's needed to pass to the RestKNNSamplingStatsHandler to locate and read statistics files stored on disk. The environment provides access to the node's data paths where vector statistics files are stored under the indices directory structure.

Contributor

We don't need this; we can directly follow what QuantizationState does.

clusterService,
indexNameExpressionResolver,
this.environment
// indicesService

Contributor

Remove comment.

Contributor Author

Changed

* Rest handler for sampling stats endpoint
*/
public class RestKNNSamplingStatsHandler extends BaseRestHandler {
// private final IndicesService indicesService;

Contributor

Remove comments.

Contributor Author

Changed

* @param indexNameExpressionResolver Resolver for index names
* @param environment OpenSearch environment configuration
*/
public RestKNNSamplingStatsHandler(

Contributor

NIT: use Lombok's @AllArgsConstructor.

Contributor Author

Changed

Environment environment
// IndicesService indicesService
) {
// this.indicesService = indicesService;

Contributor

Remove comments.

Contributor Author

Changed

*/
@Override
protected RestChannelConsumer prepareRequest(RestRequest request, NodeClient client) throws IOException {
String indexName = request.param("index");

Contributor

  1. Let's not hardcode parameter strings here.

  2. An OpenSearch index can contain multiple vector fields, so querying by index alone could return statistics for several fields. Do we need to query by field?

Contributor Author

Changed
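
The two review points above (named constants instead of string literals, and an optional per-field filter) could be combined roughly as follows. This is only a sketch: `ProfileRequestParamsSketch` and the `field` parameter are hypothetical, and a plain `Map` stands in for the `RestRequest` so the example runs without OpenSearch on the classpath. Only the `index` parameter name comes from the snippet under review.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of request parameter parsing with named constants. */
final class ProfileRequestParamsSketch {
    static final String INDEX_PARAM = "index"; // parameter name used in the snippet under review
    static final String FIELD_PARAM = "field"; // assumption: optional filter for one vector field

    final String index;
    final Optional<String> field;

    private ProfileRequestParamsSketch(String index, Optional<String> field) {
        this.index = index;
        this.field = field;
    }

    /** Parses the parameters; the index is required, the field filter is optional. */
    static ProfileRequestParamsSketch parse(Map<String, String> params) {
        String index = params.get(INDEX_PARAM);
        if (index == null || index.isEmpty()) {
            throw new IllegalArgumentException("Missing required parameter: " + INDEX_PARAM);
        }
        String field = params.get(FIELD_PARAM);
        Optional<String> fieldFilter = (field == null || field.isEmpty())
            ? Optional.empty()
            : Optional.of(field);
        return new ProfileRequestParamsSketch(index, fieldFilter);
    }

    public static void main(String[] args) {
        ProfileRequestParamsSketch p = parse(Map.of(INDEX_PARAM, "my-index", FIELD_PARAM, "my_vector_field"));
        System.out.println(p.index + " " + p.field.orElse("<all vector fields>"));
    }
}
```

With an empty `field`, a handler built this way would report every vector field in the index, and with a value it would report only that field.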

*/
public static SegmentProfilerState profileVectors(final Supplier<KNNVectorValues<?>> knnVectorValuesSupplier) throws IOException {
KNNVectorValues<?> vectorValues = knnVectorValuesSupplier.get();
private static void writeStatsToFile(Path outputFile, List<SummaryStatistics> statistics, String fieldName, int vectorCount)

Contributor

NIT: final whenever possible

Contributor Author

Changed

private static void writeStatsToFile(Path outputFile, List<SummaryStatistics> statistics, String fieldName, int vectorCount)
throws IOException {
// Create parent directories if they don't exist
Files.createDirectories(outputFile.getParent());

Contributor

Based on our conversation with @jmazanec15 and what we've discussed, we may want to revisit how we're writing to the segment directory. Can we reuse the implementation used by the QuantizationStateWriter to persist our state?

Contributor Author

Added changes

// Create parent directories if they don't exist
Files.createDirectories(outputFile.getParent());

try (XContentBuilder jsonBuilder = XContentFactory.jsonBuilder()) {

Contributor

I'm a little confused here... can we leverage some sort of ObjectMapper to simply write the SummaryStatistics object to the segment? That way we can avoid the XContentBuilder since we already have the object we need to serialize.

If we need to serialize additional metadata not included in SummaryStatistics, we can also wrap it in a wrapper object and serialize that.

Contributor Author

I ran into some difficulty serializing with ObjectMapper, so I'm keeping the XContentBuilder approach for now to accommodate bigger changes. I can change more if necessary.

for (int i = 0; i < statistics.size(); i++) {
SummaryStatistics stats = statistics.get(i);
jsonBuilder.startObject()
.field("dimension", i)

Contributor

We shouldn't hardcode strings if possible in business logic.

Contributor Author

Added variables for string values

* @param fieldName Name of the field being processed
* @return SegmentProfilerState containing collected statistics
*/
public static SegmentProfilerState profileVectors(

Contributor

profileVectors here returns a SegmentProfilerState but I'm not seeing it used anywhere in the PR. Are we using the method signature anywhere? Should we change it?

Contributor Author

We are using the profilerState to gather the statistics. I believe it should be fine to use for now, as it is also used within the RestKNNSamplingStatsHandler.


// Initialize vector values
// Initialize new profiler state and vector values
SegmentProfilerState profilerState = new SegmentProfilerState(new ArrayList<>());

Contributor

Do we need to create the SegmentProfilerState here? Can we just return it if possible (assuming we need to return it)?

owenhalpert and others added 17 commits April 18, 2025 14:52
…ensearch-project#2647)

* Add multi-vector-support faiss patch to IndexHNSW::search_level_0

Signed-off-by: AnnTian Shao <[email protected]>

* Add tests to JNI and KNN

Signed-off-by: AnnTian Shao <[email protected]>

* Update tests by adding hnsw cagra index binary and remove JNI layer method updateIndexSettings

Signed-off-by: AnnTian Shao <[email protected]>

* test fixes

Signed-off-by: AnnTian Shao <[email protected]>

---------

Signed-off-by: AnnTian Shao <[email protected]>
Co-authored-by: AnnTian Shao <[email protected]>
…oject#2646)

* Combine method and lucene mappers to EngineFieldMapper

Signed-off-by: Kunal Kotwani <[email protected]>

* Change the default doc values to false, retain old value for flat field

Signed-off-by: Kunal Kotwani <[email protected]>

* Update flat field mapper checks

Signed-off-by: Kunal Kotwani <[email protected]>

* Fix the default doc value logic

Signed-off-by: Kunal Kotwani <[email protected]>

---------

Signed-off-by: Kunal Kotwani <[email protected]>
…ject#2652)

Fixes a bug that was already fixed in
opensearch-project#2494 but was then
reverted by accident in a refactor. It makes it so that instead of
opening up readers for each transform request, it opens up once per
reader.

Signed-off-by: John Mazanec <[email protected]>
Signed-off-by: Dooyong Kim <[email protected]>
Co-authored-by: Dooyong Kim <[email protected]>
* RestHandler for k-NN index profile API. API provides the ability for a user to get statistical information
* about vector dimensions in specific indices.
*/
public class RestKNNProfileHandler extends BaseRestHandler {

Contributor

Will need more time to look into the API implementation for this PR since it's not too closely coupled with our profiling implementation

/**
* Action for profiling KNN vectors in an index
*/
public class KNNProfileAction extends ActionType<KNNProfileResponse> {

Contributor

I know we synced offline on this but is it possible to move the API implementation to another PR? I don't think there's

public static SegmentProfilerState getSegmentProfileState(final LeafReader leafReader, String fieldName) throws IOException {
final SegmentProfileKNNCollector tempCollector = new SegmentProfileKNNCollector();
leafReader.searchNearestVectors(fieldName, new float[0], tempCollector, null);
if (tempCollector.getSegmentProfilerState() == null) {

Contributor

Do we want to throw an exception if we're unable to get the SegmentProfilerState? What if the user has the feature disabled?

// Log the results for each dimension
for (int i = 0; i < shardVectorProfile.size(); i++) {
StatisticalSummaryValues stats = shardVectorProfile.get(i);
log.info(

Contributor

I've mentioned this before but would it just be easier to use toString here given we're not really focused on a specific output?

// For each leaf, collect the profile
searcher.getIndexReader().leaves().forEach(leaf -> {
try {
log.info("[KNN] Processing leaf reader for segment: {}", leaf.reader());

Contributor

Should we avoid clutter in the logs by reducing these info lines? This doesn't add any value here when debugging as we already have exception handling

* @return List of statistical summaries for each dimension
*/
// TODO: Write unit tests to ensure that the segment statistic aggregation is correct.
public List<StatisticalSummaryValues> profile(String fieldName) {

Contributor

NIT: final

* @param fieldName The name of the vector field to profile
* @return List of statistical summaries for each dimension
*/
// TODO: Write unit tests to ensure that the segment statistic aggregation is correct.

Contributor

We should also write an integration test that covers multiple segments. Let's leave this towards the end.
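
For context on what such a test would need to verify: per-segment statistics can be combined without revisiting the vectors by using the standard pairwise combination formula for the mean and the sum of squared deviations (often attributed to Chan et al.). The sketch below shows that technique in plain Java. The names `SegmentStatsMergeSketch`, `Stats`, `of`, and `merge` are hypothetical, and nothing here is claimed to match how the PR aggregates its `StatisticalSummaryValues`.

```java
/** Hypothetical sketch: merging per-segment summary statistics for one dimension. */
final class SegmentStatsMergeSketch {

    static final class Stats {
        final long count;
        final double min, max, sum, mean;
        final double m2; // sum of squared deviations from the mean

        Stats(long count, double min, double max, double sum, double mean, double m2) {
            this.count = count;
            this.min = min;
            this.max = max;
            this.sum = sum;
            this.mean = mean;
            this.m2 = m2;
        }

        /** Bias-corrected (n - 1) variance; 0 when fewer than two values were seen. */
        double variance() {
            return count > 1 ? m2 / (count - 1) : 0.0;
        }
    }

    /** Builds statistics for one segment's values using a simple two-pass computation. */
    static Stats of(double... values) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum = 0.0;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
            sum += v;
        }
        double mean = values.length == 0 ? 0.0 : sum / values.length;
        double m2 = 0.0;
        for (double v : values) {
            m2 += (v - mean) * (v - mean);
        }
        return new Stats(values.length, min, max, sum, mean, m2);
    }

    /** Combines two segments' statistics as if all values had been seen together. */
    static Stats merge(Stats a, Stats b) {
        if (a.count == 0) return b;
        if (b.count == 0) return a;
        long n = a.count + b.count;
        double delta = b.mean - a.mean;
        double mean = a.mean + delta * b.count / n;
        double m2 = a.m2 + b.m2 + delta * delta * a.count * b.count / n;
        return new Stats(n, Math.min(a.min, b.min), Math.max(a.max, b.max), a.sum + b.sum, mean, m2);
    }

    public static void main(String[] args) {
        Stats merged = merge(of(1.0, 1.5), of(2.0, 4.0, 6.0));
        Stats whole = of(1.0, 1.5, 2.0, 4.0, 6.0);
        System.out.printf("merged variance=%.6f, single-pass variance=%.6f%n", merged.variance(), whole.variance());
    }
}
```

A multi-segment test can then assert that the merged result matches statistics computed over all vectors at once, within a small floating-point tolerance.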

log.info("[KNN] Profiling completed for field: {} in shard: {}", fieldName, indexShard.shardId());
} catch (Exception e) {
log.error(
"[KNN] Critical error during profiling for field: {} in shard: {}: {}",

Contributor

What does Critical mean in this sense? Error messages should be as succinct and objective as possible.

@Vikasht34
Collaborator

I looked at this PR at a high level. Here are my suggestions that we might need to address:
1. SegmentProfilerState currently contains the logic for processing a profile, the logic for retrieving vectors from the Supplier, and additional metadata about the profile; we then convert this class to a byte array and save it to a file. There is nothing unique about how a profile is processed or how vectors are retrieved; it should be the same for all profiles. Please keep the segment state short and crisp, and move all of these methods out of it into a util class.
2. Please add the ability to serialize and deserialize this class with the segment state.
3. To retrieve the state, we are creating a temporary collector; see if we can avoid it. Ideally we should retrieve it from the file itself.
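
Point 2 above could look roughly like the following round-trip sketch, which writes a list of per-dimension summaries to a byte array and reads it back. All names here (`ProfilerStateCodecSketch`, `DimensionSummary`, `toBytes`, `fromBytes`) and the binary layout are assumptions made for illustration. The actual state class, its fields, and the on-disk format used alongside the segment are decisions for the PR.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: a small state object with explicit serialize/deserialize methods. */
final class ProfilerStateCodecSketch {

    static final class DimensionSummary {
        final long count;
        final double min, max, sum, mean, variance;

        DimensionSummary(long count, double min, double max, double sum, double mean, double variance) {
            this.count = count;
            this.min = min;
            this.max = max;
            this.sum = sum;
            this.mean = mean;
            this.variance = variance;
        }
    }

    /** Layout (an assumption): int size, then per dimension one long and five doubles. */
    static byte[] toBytes(List<DimensionSummary> state) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(state.size());
            for (DimensionSummary s : state) {
                out.writeLong(s.count);
                out.writeDouble(s.min);
                out.writeDouble(s.max);
                out.writeDouble(s.sum);
                out.writeDouble(s.mean);
                out.writeDouble(s.variance);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Reads the layout written by {@link #toBytes(List)}. */
    static List<DimensionSummary> fromBytes(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            int size = in.readInt();
            List<DimensionSummary> state = new ArrayList<>(size);
            for (int i = 0; i < size; i++) {
                // Arguments are evaluated left to right, matching the write order.
                state.add(new DimensionSummary(in.readLong(), in.readDouble(), in.readDouble(),
                    in.readDouble(), in.readDouble(), in.readDouble()));
            }
            return state;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<DimensionSummary> state = List.of(new DimensionSummary(2, 1.0, 1.5, 2.5, 1.25, 0.125));
        List<DimensionSummary> copy = fromBytes(toBytes(state));
        System.out.println(copy.get(0).mean + " " + copy.get(0).variance);
    }
}
```

Keeping the state class to plain data plus a codec like this leaves the vector iteration and statistics computation free to live in a separate utility, which is the separation suggested in point 1.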

Signed-off-by: Arun Ganesh <[email protected]>