How pagination depth works with knn and neural query? #1202

Open
martin-gaievski opened this issue Feb 27, 2025 · 1 comment
Labels
question Further information is requested

Comments

@martin-gaievski
Member

I need clarification from the team on one aspect of the hybrid query when it is used with a knn/neural query and the pagination feature.

The recently developed pagination feature for hybrid query (released in 2.19) introduces a new parameter, pagination depth, which sets the maximum number of doc scores that can be collected at the shard level for a single sub-query. Effectively, it works the same way as the existing size.

For knn and neural queries there is one more parameter that works in a similar way and can limit the number of documents retrieved at the shard or even segment level. This parameter is k, and the typical recommendation is to keep size and k equal.

My question is: does the pagination depth in a hybrid query change this behavior, or will knn/neural queries keep working as they do today?
And is there documentation that describes this behavior? This is important to know because a vector query is normally part of hybrid search.

martin-gaievski added the bug, untriaged, and question labels and removed the bug label on Feb 27, 2025
@martin-gaievski
Member Author

martin-gaievski commented Feb 27, 2025

Some more information regarding this question.

I ingest the following set of documents; 8 of them have vectors:

{"index":{}}
{"field1": 2,"vector": [0.4, 0.5, 0.2],"title": "basic", "name": "A West Virginia university women 's basketball team , officials , and a small gathering of fans are in a West Virginia arena .", "category": "novel", "price": 20}
{"index":{}}
{ "name": "I brought home the trophy", "category": "story", "price": 20, "field1": 10,"vector": [0.2, 0.2, 0.3],"title": "java"}
{"index":{}}
{"field1": 50,"vector": [4.2, 5.5, 8.9],"name": "Why would he go to all that effort for a free pack of ranch dressing?", "category": "story", "price": 10 }
{"index":{}}
{"vector": [0.3, 0.12, 3.3],"title": "python","name": "In the next 40-50 years I plan on opening up my own business.","category": "poem","price": 100}
{"index":{}}
{  "field1": 100,"vector": [0.2, 0.2, 0.3],"title": "java", "name": "Does he have a big family?", "category": "biography", "price": 70}
{"index":{}}
{"name": "She is my younger sister","category": "workbook","price": 25}
{"index":{}} 
{"field1": 75, "vector": [0.8, 1.2, 0.9], "title": "scala", "name": "The old lighthouse stood guard over the rocky coastline for centuries.", "category": "novel", "price": 45} 
{"index":{}} 
{"field1": 30, "vector": [2.1, 1.8, 2.5], "title": "ruby", "name": "Fresh cookies filled the kitchen with their wonderful aroma.", "category": "story", "price": 15} 
{"index":{}} 
{"field1": 120, "vector": [0.6, 0.7, 0.4], "title": "swift", "name": "The ancient map revealed a hidden treasure in the mountains.", "category": "adventure", "price": 85}

When I run a knn query where k and the from value do not overlap, I get an empty response:

{
    "query": {
        "knn": {
            "vector": {
                "vector": [
                    5.0,
                    4.0,
                    2.1
                ],
                "k": 4
            }
        }
    },
    "size": 10,
    "from": 4
}
{
    "hits": {
        "total": {
            "value": 4,
            "relation": "eq"
        },
        "max_score": 0.06939625,
        "hits": []
    }
}
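
The empty page can be reproduced with a toy model. This is a minimal sketch, assuming the knn query yields at most k hits on this single-shard index and that from/size are applied to that ranked list afterwards. The `paginate` helper and the doc labels are hypothetical and are not part of OpenSearch.

```python
def paginate(hits, from_, size):
    """Apply from/size to an already ranked list of hits."""
    return hits[from_:from_ + size]


# Labels for the four hits a knn query with k=4 returns in this issue
# (hypothetical labels based on the "title" field of each doc).
knn_hits = ["ruby", "scala", "swift", "basic"]

# from=4 starts after the last of the k=4 hits, so the page is empty
# even though the total hit count is still 4.
page = paginate(knn_hits, from_=4, size=10)
print(len(knn_hits), page)
```

With from set to 0 the same helper returns all four hits, which is why k and from have to overlap for a standalone knn query to return anything.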

Now I use the exact same knn query in a hybrid query, along with a range query:

{
    "query": {
        "hybrid": {
            "queries": [
                {
                    "range": {
                        "field1": {
                            "gte": 20,
                            "lte": 150
                        }
                    }
                },
                {
                    "knn": {
                        "vector": {
                            "vector": [
                                5.0,
                                4.0,
                                2.1
                            ],
                            "k": 4
                        }
                    }
                }
            ],
            "pagination_depth": 10
        }
    },
    "size": 10,
    "from": 4
}

This gives me the following 2 docs:

{
    "hits": {
        "total": {
            "value": 6,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": "index-test",
                "_id": "MaTDSJUBfWBqJA1upVNy",
                "_score": 0.5,
                "_source": {
                    "field1": 100,
                    "vector": [
                        0.2,
                        0.2,
                        0.3
                    ],
                    "title": "java",
                    "name": "Does he have a big family?",
                    "category": "biography",
                    "price": 70
                }
            },
            {
                "_index": "index-test",
                "_id": "LaTDSJUBfWBqJA1upVNw",
                "_score": 5.0E-4,
                "_source": {
                    "field1": 2,
                    "vector": [
                        0.4,
                        0.5,
                        0.2
                    ],
                    "title": "basic",
                    "name": "A West Virginia university women 's basketball team , officials , and a small gathering of fans are in a West Virginia arena .",
                    "category": "novel",
                    "price": 20
                }
            }
        ]
    }
}

Interestingly, the second doc in the response cannot come from the range query because it has "field1": 2, which is out of range. So it must come from knn, but that is a bit counterintuitive because the knn query does not return that doc when executed standalone with the same from value.

I found that this doc is actually at position 4 of the raw knn results. That means we do not apply from/size at the sub-query level, but later in the process:

{
    "query": {
        "knn": {
            "vector": {
                "vector": [
                    5.0,
                    4.0,
                    2.1
                ],
                "k": 4
            }
        }
    },
    "size": 10,
    "from": 0
}
{
    "hits": {
        "total": {
            "value": 4,
            "relation": "eq"
        },
        "max_score": 0.06939625,
        "hits": [
            {
                "_index": "index-test",
                "_id": "NKTDSJUBfWBqJA1upVNy",
                "_score": 0.06939625,
                "_source": {
                    "field1": 30,
                    "vector": [
                        2.1,
                        1.8,
                        2.5
                    ],
                    "title": "ruby",
                    "name": "Fresh cookies filled the kitchen with their wonderful aroma.",
                    "category": "story",
                    "price": 15
                }
            },
            {
                "_index": "index-test",
                "_id": "M6TDSJUBfWBqJA1upVNy",
                "_score": 0.03581662,
                "_source": {
                    "field1": 75,
                    "vector": [
                        0.8,
                        1.2,
                        0.9
                    ],
                    "title": "scala",
                    "name": "The old lighthouse stood guard over the rocky coastline for centuries.",
                    "category": "novel",
                    "price": 45
                }
            },
            {
                "_index": "index-test",
                "_id": "NaTDSJUBfWBqJA1upVNy",
                "_score": 0.029291155,
                "_source": {
                    "field1": 120,
                    "vector": [
                        0.6,
                        0.7,
                        0.4
                    ],
                    "title": "swift",
                    "name": "The ancient map revealed a hidden treasure in the mountains.",
                    "category": "adventure",
                    "price": 85
                }
            },
            {
                "_index": "index-test",
                "_id": "LaTDSJUBfWBqJA1upVNw",
                "_score": 0.026301946,
                "_source": {
                    "field1": 2,
                    "vector": [
                        0.4,
                        0.5,
                        0.2
                    ],
                    "title": "basic",
                    "name": "A West Virginia university women 's basketball team , officials , and a small gathering of fans are in a West Virginia arena .",
                    "category": "novel",
                    "price": 20
                }
            }
        ]
    }
}
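
To illustrate from/size being applied after the sub-query results are merged, here is a minimal sketch. It rests on assumptions that are not stated in this issue: the search pipeline uses min-max normalization with arithmetic-mean combination, the lowest normalized score is floored at 0.001, and the range clause scores every match 1.0. The `min_max` and `hybrid_page` helpers and the `f<field1>` doc labels are hypothetical.

```python
MIN_SCORE = 0.001  # assumed floor for the lowest normalized score


def min_max(scores):
    """Min-max normalize a {doc: score} map; identical scores all map to 1.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: max((s - lo) / (hi - lo), MIN_SCORE) for doc, s in scores.items()}


def hybrid_page(sub_results, from_, size):
    """Normalize each sub-query, average per doc, rank, then apply from/size."""
    normalized = [min_max(r) for r in sub_results]
    # Ordered union of doc ids across all sub-queries.
    doc_ids = list(dict.fromkeys(d for r in normalized for d in r))
    # A doc missing from a sub-query contributes 0.0 to the mean.
    combined = {
        d: sum(r.get(d, 0.0) for r in normalized) / len(normalized)
        for d in doc_ids
    }
    ranked = sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
    return len(ranked), ranked[from_:from_ + size]


# Docs are labelled by their field1 value. Range matches: 20 <= field1 <= 150.
range_hits = {"f50": 1.0, "f100": 1.0, "f75": 1.0, "f30": 1.0, "f120": 1.0}
# Top-4 knn hits with the scores reported in this issue.
knn_hits = {"f30": 0.06939625, "f75": 0.03581662,
            "f120": 0.029291155, "f2": 0.026301946}

total, page = hybrid_page([range_hits, knn_hits], from_=4, size=10)
print(total, page)
```

Under these assumptions the sketch yields six total hits and a page of two docs with scores 0.5 and 0.0005, which is consistent with the hybrid response shown earlier. The order of the two docs tied at 0.5 depends on tie-breaking, which this sketch does not model.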

This is the create index request for my scenario:

{
  "settings": {
    "index": {
      "knn": true,
      "number_of_shards": 1,
      "number_of_replicas": 0
    }
  },
  "mappings": {
    "properties": {
      "vector": {
        "type": "knn_vector",
        "dimension": 3,
        "method": {
          "name": "hnsw",
          "space_type": "l2",
          "engine": "lucene"
        }
      },
      "field1": {
        "type": "integer"
      },
      "name": {
        "type": "text"
      }
    }
  }
}
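
As a cross-check of the raw knn ranking, the scores can be recomputed by brute force. This is a sketch, assuming that with "space_type": "l2" the score is 1 / (1 + squared Euclidean distance) and that the HNSW search is exact on a dataset this small. The `l2_score` helper and the doc labels are hypothetical.

```python
def l2_score(query, vector):
    """Assumed l2 scoring: 1 / (1 + squared Euclidean distance)."""
    dist_sq = sum((q - v) ** 2 for q, v in zip(query, vector))
    return 1.0 / (1.0 + dist_sq)


query = [5.0, 4.0, 2.1]

# The 8 ingested docs that carry a vector, labelled by title or field1.
docs = [
    ("basic", [0.4, 0.5, 0.2]),
    ("java (field1=10)", [0.2, 0.2, 0.3]),
    ("field1=50", [4.2, 5.5, 8.9]),
    ("python", [0.3, 0.12, 3.3]),
    ("java (field1=100)", [0.2, 0.2, 0.3]),
    ("scala", [0.8, 1.2, 0.9]),
    ("ruby", [2.1, 1.8, 2.5]),
    ("swift", [0.6, 0.7, 0.4]),
]

# Rank all docs by score and keep the best k=4.
scored = sorted(
    ((label, l2_score(query, vec)) for label, vec in docs),
    key=lambda item: item[1],
    reverse=True,
)
top4 = scored[:4]
for label, score in top4:
    print(f"{label}: {score:.6f}")
```

The top four come out as ruby, scala, swift, and basic, in the same order and with approximately the same scores as the raw knn response, so the "basic" doc is the last hit within k=4.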
