[FEATURE] Introduce Vector Sampling in K-NN #2622

markwu-sde · 2025-03-20T17:16:11Z

Problem

Today we provide index configurations consisting of index settings and mappings. A wide range of algorithms exists, each offering options such as compression levels, hyperparameters, space types, and encoders. These options let us optimize for better performance, lower cost, etc. However, they may not be applicable to every use case, and in many instances the right choice depends on what the vector space actually looks like (dimensionality, clustering, dense areas, etc.).

As of now we don't have much insight into what the actual vector space in the index looks like, so we lack the information needed to make more accurate and concrete recommendations related to recall issues or optimizations.

Solution

The solution introduces sampling capabilities at both ingestion and query time to provide insights into the vector space. During document indexing, vectors are intercepted at flush time and asynchronously sampled, with statistical computations performed on this subset and stored alongside segment data in the Lucene directory.
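To make the ingestion-side idea concrete, here is a minimal sketch of sampling a stream of vectors and computing per-dimension statistics on the sample. It assumes reservoir sampling (Algorithm R) as the sampling strategy and population variance as the variance definition; the proposal does not specify either, and the function names `reservoir_sample` and `dimension_stats` are hypothetical, not part of the plugin.

```python
import random
from typing import Dict, Iterable, List, Sequence


def reservoir_sample(
    vectors: Iterable[Sequence[float]], k: int, seed: int = 0
) -> List[Sequence[float]]:
    """Keep a uniform random sample of at most k vectors from a stream.

    Assumption: Algorithm R stands in for whatever strategy the plugin uses.
    """
    rng = random.Random(seed)
    sample: List[Sequence[float]] = []
    for i, vec in enumerate(vectors):
        if i < k:
            # Fill the reservoir with the first k vectors.
            sample.append(vec)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = vec
    return sample


def dimension_stats(sample: Sequence[Sequence[float]]) -> List[Dict[str, float]]:
    """Per-dimension min, max, mean and population variance of the sample."""
    stats: List[Dict[str, float]] = []
    for column in zip(*sample):  # one tuple of values per dimension
        n = len(column)
        mean = sum(column) / n
        variance = sum((x - mean) ** 2 for x in column) / n
        stats.append(
            {"min": min(column), "max": max(column), "mean": mean, "variance": variance}
        )
    return stats
```

The reservoir keeps memory bounded by `k` regardless of how many vectors a flush contains, which is one reason it could suit a flush-time hook; a different strategy (for example, fixed-rate sampling) would work with the same `dimension_stats` step.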

Both ingestion and query flows feed into a global level API (like the Stats API) that aggregates sampling data across segments, providing metrics about vector distributions, dimension-level statistics, and search patterns.
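One way the cross-segment aggregation could work, sketched under assumptions: each segment stores a small summary per dimension (count, mean, sum of squared deviations, min, max), and the API merges summaries pairwise using the standard parallel variance formula instead of re-reading vectors. The `DimSummary` type and the `summarize` and `merge` helpers are hypothetical names for illustration only.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Sequence


@dataclass
class DimSummary:
    """Hypothetical per-segment summary for a single vector dimension."""
    count: int
    mean: float
    m2: float  # sum of squared deviations from the mean
    minimum: float
    maximum: float

    @property
    def variance(self) -> float:
        # Population variance; 0.0 for an empty summary.
        return self.m2 / self.count if self.count else 0.0


def summarize(values: Sequence[float]) -> DimSummary:
    """Build a summary from the sampled values of one dimension in one segment."""
    if not values:
        return DimSummary(0, 0.0, 0.0, float("inf"), float("-inf"))
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values)
    return DimSummary(n, mean, m2, min(values), max(values))


def merge(a: DimSummary, b: DimSummary) -> DimSummary:
    """Combine two summaries without access to the underlying values."""
    if a.count == 0:
        return b
    if b.count == 0:
        return a
    n = a.count + b.count
    delta = b.mean - a.mean
    mean = a.mean + delta * b.count / n
    m2 = a.m2 + b.m2 + delta * delta * a.count * b.count / n
    return DimSummary(n, mean, m2, min(a.minimum, b.minimum), max(a.maximum, b.maximum))


def merge_all(summaries: Sequence[DimSummary]) -> DimSummary:
    """Fold any number of segment summaries into one index-level summary."""
    return reduce(merge, summaries, summarize([]))
```

Because `merge` is associative, the same function could roll segments up into shards and shards up into the index. Percentiles are not mergeable this way and would need a sketch structure or a re-sample.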

A high-level sequence diagram of the design is found below:

Users will be able to query their profiled statistics like so:

curl -X GET "localhost:9200/_plugins/_knn/sampling/my_index1/stats?pretty"

Result:

{
  "my_index1": {
    "vector_sample_count": 50000,
    "last_updated": "2025-03-20T10:15:30Z",
    "field_stats": {
      "product_vector": {
        "dimension_stats": [
          {
            "min": -0.9876,
            "max": 0.9954,
            "mean": 0.0234,
            "variance": 0.2567
          },
          // ... (stats for other dimensions)
        ],
        "distance_distribution": {
          "min_distance": 0.0123,
          "max_distance": 1.9876,
          "mean_distance": 0.7654,
          "percentiles": {
            "p50": 0.7123,
            "p75": 1.0234,
            "p90": 1.3456,
            "p99": 1.7890
          }
        },
        "sparsity": 0.1234,
        "clustering_coefficient": 0.7890
      }
    }
  }
}
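The fields `distance_distribution` and `sparsity` in the response above are not defined precisely in this proposal. The sketch below shows one possible interpretation, assuming Euclidean distance over all pairs in the sample, nearest-rank percentiles, and sparsity as the fraction of zero-valued components. All three are assumptions, and `clustering_coefficient` is left out because its intended definition is unclear.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence


def pairwise_distances(sample: Sequence[Sequence[float]]) -> List[float]:
    """Sorted Euclidean distances between every pair of sampled vectors."""
    return sorted(math.dist(a, b) for a, b in combinations(sample, 2))


def percentile(sorted_values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty sequence."""
    if not sorted_values:
        raise ValueError("percentile of empty sequence")
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def distance_distribution(sample: Sequence[Sequence[float]]) -> Dict[str, object]:
    """Summary shaped like the proposed response; needs at least two vectors."""
    d = pairwise_distances(sample)
    if not d:
        raise ValueError("need at least two vectors")
    return {
        "min_distance": d[0],
        "max_distance": d[-1],
        "mean_distance": sum(d) / len(d),
        "percentiles": {f"p{p}": percentile(d, p) for p in (50, 75, 90, 99)},
    }


def sparsity(sample: Sequence[Sequence[float]], eps: float = 0.0) -> float:
    """Fraction of components whose absolute value is at most eps."""
    total = sum(len(v) for v in sample)
    zeros = sum(1 for v in sample for x in v if abs(x) <= eps)
    return zeros / total if total else 0.0
```

All-pairs distance is quadratic in the sample size, so for a sample of 50000 vectors a real implementation would likely sample pairs rather than enumerate them.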

Ideally, users should also be able to trigger the analysis "on-demand" for debugging:

curl -X POST "localhost:9200/_plugins/_knn/sampling/my_index1/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "field": "product_vector",
  "sample_size": 10000,
  "metrics": ["dimension_stats", "distance_distribution", "sparsity"]
}
'

Result:

{
  "acknowledged": true,
  "task_id": "ABC123XYZ",
  "status": "STARTED"
}

curl -X GET "localhost:9200/_tasks/ABC123XYZ?pretty"
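For scripted debugging, the task could be polled until it finishes. The sketch below assumes the task is tracked by the standard Tasks API, whose response carries a top-level `completed` flag. `wait_for_task` and `http_get_json` are hypothetical helper names, and `ABC123XYZ` is the placeholder ID from the example above.

```python
import json
import time
import urllib.request
from typing import Callable, Dict


def http_get_json(url: str) -> Dict:
    """Fetch a URL and decode the JSON body (no auth or TLS handling)."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def wait_for_task(
    task_id: str,
    fetch: Callable[[str], Dict] = http_get_json,
    base_url: str = "http://localhost:9200",
    interval_s: float = 1.0,
    max_polls: int = 60,
) -> Dict:
    """Poll GET /_tasks/<task_id> until the response reports completion.

    Assumption: the response has a top-level boolean "completed" field.
    """
    url = f"{base_url}/_tasks/{task_id}"
    for _ in range(max_polls):
        body = fetch(url)
        if body.get("completed"):
            return body
        time.sleep(interval_s)
    raise TimeoutError(f"task {task_id} did not complete after {max_polls} polls")
```

Against a running cluster this would be called as `wait_for_task("ABC123XYZ")`; the `fetch` parameter exists so the loop can be exercised without a cluster.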

Related Issues

#2243

jmazanec15 · 2025-03-20T17:32:25Z

Thanks @markwu-sde - this looks cool. I like the general idea. I think it would also be good to show a breakdown by shard and segment in the API as well.

markwu-sde · 2025-03-20T17:33:47Z

Thanks. I'll add that to the description as well.

@oaganesh is currently working on the POC.

markwu-sde added enhancement untriaged labels Mar 20, 2025

jmazanec15 removed the untriaged label Mar 20, 2025

oaganesh mentioned this issue Mar 22, 2025

Adding basic generic vector profiler implementation and tests. #2624

Open