[FEATURE] Introduce Vector Sampling in K-NN #2622

Open
markwu-sde opened this issue Mar 20, 2025 · 2 comments

Comments

@markwu-sde
Contributor

markwu-sde commented Mar 20, 2025

Problem

Today we provide index configurations made up of index settings and mappings. A wide range of algorithms is available, each offering its own capabilities such as compression levels, hyperparameters, space types, and encoders. These options let us optimize for better performance, lower cost, and so on. However, they are not applicable to every use case, and in many instances the right choice depends on what the vector space actually looks like (dimensionality, clustering, dense areas, etc.).

As of now we have little visibility into what the vector space in an index actually looks like, so we cannot make accurate, concrete recommendations about recall issues or optimizations.

Solution

The solution introduces sampling capabilities at both ingestion and query time to provide insights into the vector space. During document indexing, vectors are intercepted at flush time and asynchronously sampled, with statistical computations performed on this subset and stored alongside segment data in the Lucene directory.
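As a rough illustration of the ingestion-side idea, the sketch below samples a stream of vectors and computes per-dimension statistics on the subset. The issue does not specify a sampling strategy, so reservoir sampling is an assumption here, and the function names `reservoir_sample` and `dimension_stats` are hypothetical, not part of the k-NN plugin.

```python
import random
from typing import Dict, Iterable, List, Sequence


def reservoir_sample(
    vectors: Iterable[Sequence[float]], k: int, seed: int = 0
) -> List[Sequence[float]]:
    """Keep a uniform random sample of at most k vectors from a stream.

    Assumption: reservoir sampling (Algorithm R) stands in for whatever
    strategy the real flush-time sampler would use.
    """
    rng = random.Random(seed)
    sample: List[Sequence[float]] = []
    for i, vec in enumerate(vectors):
        if i < k:
            sample.append(vec)
        else:
            # randint is inclusive on both ends, so j is uniform in [0, i].
            j = rng.randint(0, i)
            if j < k:
                sample[j] = vec
    return sample


def dimension_stats(sample: List[Sequence[float]]) -> List[Dict[str, float]]:
    """Per-dimension min, max, mean and population variance of the sample."""
    if not sample:
        return []
    n = len(sample)
    stats = []
    # zip(*sample) yields one tuple per dimension.
    for column in zip(*sample):
        mean = sum(column) / n
        variance = sum((x - mean) ** 2 for x in column) / n
        stats.append(
            {"min": min(column), "max": max(column), "mean": mean, "variance": variance}
        )
    return stats
```

The output shape mirrors the `dimension_stats` array in the proposed API response; storing the result next to the segment is outside the scope of this sketch.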

Both ingestion and query flows feed into a global level API (like the Stats API) that aggregates sampling data across segments, providing metrics about vector distributions, dimension-level statistics, and search patterns.
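One way the cross-segment aggregation could work is to store a count, a mean, and a sum of squared deviations per segment and merge them pairwise, which avoids re-reading the sampled vectors. The sketch below uses the standard parallel variance merge for a single dimension; `RunningStats`, `from_values`, and `merge` are hypothetical names, and the issue does not say this is the chosen approach.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RunningStats:
    """Mergeable summary of one dimension within one segment (hypothetical)."""

    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean

    @property
    def variance(self) -> float:
        # Population variance; 0.0 for an empty summary.
        return self.m2 / self.count if self.count else 0.0


def from_values(values: Iterable[float]) -> RunningStats:
    """Build a summary from raw values using Welford's online update."""
    stats = RunningStats()
    for x in values:
        stats.count += 1
        delta = x - stats.mean
        stats.mean += delta / stats.count
        stats.m2 += delta * (x - stats.mean)
    return stats


def merge(a: RunningStats, b: RunningStats) -> RunningStats:
    """Combine two segment summaries without revisiting the raw values."""
    if a.count == 0:
        return b
    if b.count == 0:
        return a
    n = a.count + b.count
    delta = b.mean - a.mean
    mean = a.mean + delta * b.count / n
    m2 = a.m2 + b.m2 + delta * delta * a.count * b.count / n
    return RunningStats(n, mean, m2)
```

Minimum and maximum merge trivially (take the smaller and larger of the two), so they are omitted here. The same pairwise merge would also support a per-shard or per-segment breakdown, since the unmerged summaries stay available.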

A high level sequence diagram of the design is found below:

[Image: high-level sequence diagram of the design]

Users will be able to query their profiled statistics like so:

curl -X GET "localhost:9200/_plugins/_knn/sampling/my_index1/stats?pretty"

Result:

{
  "my_index1": {
    "vector_sample_count": 50000,
    "last_updated": "2025-03-20T10:15:30Z",
    "field_stats": {
      "product_vector": {
        "dimension_stats": [
          {
            "min": -0.9876,
            "max": 0.9954,
            "mean": 0.0234,
            "variance": 0.2567
          },
          // ... (stats for other dimensions)
        ],
        "distance_distribution": {
          "min_distance": 0.0123,
          "max_distance": 1.9876,
          "mean_distance": 0.7654,
          "percentiles": {
            "p50": 0.7123,
            "p75": 1.0234,
            "p90": 1.3456,
            "p99": 1.7890
          }
        },
        "sparsity": 0.1234,
        "clustering_coefficient": 0.7890
      }
    }
  }
}
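To make the `distance_distribution` and `sparsity` fields more concrete, here is a minimal sketch of how they might be derived from a sample. Several things are assumptions not stated in the issue: Euclidean distance as the metric, nearest-rank percentiles, and sparsity defined as the fraction of components equal to zero. `clustering_coefficient` is left out because the issue does not define it.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """L2 distance; the real space type would come from the index mapping."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def percentile(sorted_values: List[float], p: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def distance_distribution(sample: List[Sequence[float]]) -> Dict[str, object]:
    """Summarize all pairwise distances in the sample (O(n^2) pairs)."""
    distances = sorted(euclidean(a, b) for a, b in combinations(sample, 2))
    if not distances:
        raise ValueError("need at least two vectors")
    return {
        "min_distance": distances[0],
        "max_distance": distances[-1],
        "mean_distance": sum(distances) / len(distances),
        "percentiles": {
            f"p{p}": percentile(distances, p) for p in (50, 75, 90, 99)
        },
    }


def sparsity(sample: List[Sequence[float]]) -> float:
    """Fraction of vector components that are exactly zero (assumed definition)."""
    total = sum(len(v) for v in sample)
    zeros = sum(1 for v in sample for x in v if x == 0.0)
    return zeros / total if total else 0.0
```

Because the number of pairs grows quadratically, a real implementation would likely cap the sample size or sample pairs rather than enumerate all of them.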

Users should ideally be able to trigger the analysis "on-demand" for debugging:

curl -X POST "localhost:9200/_plugins/_knn/sampling/my_index1/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "field": "product_vector",
  "sample_size": 10000,
  "metrics": ["dimension_stats", "distance_distribution", "sparsity"]
}
'

Result:

{
  "acknowledged": true,
  "task_id": "ABC123XYZ",
  "status": "STARTED"
}

The status of the returned task can then be checked through the tasks API:

curl -X GET "localhost:9200/_tasks/ABC123XYZ?pretty"
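A client could wrap the two calls above in a small polling loop. The sketch below is illustrative only: it assumes the tasks endpoint returns a JSON body with a boolean `completed` field, and the helper names `fetch_task` and `wait_for_task`, the base URL, and the polling defaults are all hypothetical. The `_analyze` endpoint itself is only proposed in this issue and does not exist yet.

```python
import json
import time
from typing import Callable, Dict
from urllib.request import urlopen


def fetch_task(task_id: str, base_url: str = "http://localhost:9200") -> Dict:
    """GET /_tasks/<task_id> and decode the JSON body."""
    with urlopen(f"{base_url}/_tasks/{task_id}") as response:
        return json.load(response)


def wait_for_task(
    task_id: str,
    fetch: Callable[[str], Dict] = fetch_task,
    interval_s: float = 2.0,
    max_polls: int = 30,
) -> Dict:
    """Poll until the task reports completion or the poll budget runs out.

    Assumption: the response carries a boolean "completed" field.
    """
    for _ in range(max_polls):
        body = fetch(task_id)
        if body.get("completed"):
            return body
        time.sleep(interval_s)
    raise TimeoutError(f"task {task_id} did not complete after {max_polls} polls")
```

Passing `fetch` as a parameter keeps the loop testable without a running cluster, for example `wait_for_task("ABC123XYZ", fetch=my_fake_fetch, interval_s=0)`.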

Related Issues

#2243

@jmazanec15
Member

Thanks @markwu-sde - this looks cool. I like the general idea. I think it would also be good for the API to show a breakdown by shard and segment.

@markwu-sde
Contributor Author

markwu-sde commented Mar 20, 2025

Thanks. I'll add that to the description as well.

@oaganesh is currently working on the POC.
