Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request] Rewrites for BooleanQuery #17586

Open
peteralfonsi opened this issue Mar 13, 2025 · 0 comments
Open

[Feature Request] Rewrites for BooleanQuery #17586

peteralfonsi opened this issue Mar 13, 2025 · 0 comments
Assignees
Labels
enhancement Enhancement or improvement to existing feature or request Search:Performance

Comments

@peteralfonsi
Copy link
Contributor

Is your feature request related to a problem? Please describe

Some boolean queries can be written in multiple equivalent ways, where one is more performant than another. Users may not be aware of which one is better, so I think it would make sense to automatically rewrite them. (See additional context section for specific rewrites and preliminary benchmarks)

Describe the solution you'd like

Probably the simplest way is to just add to doRewrite() in BoolQueryBuilder. It would be ideal for some rewrites to know what kind of field is being used (numeric vs non-numeric), but this info is not available here. It is available later in AbstractQueryBuilder.doToQuery() where the actual Lucene query is built, and queries like MatchQuery make use of this, but it seems unintended to make changes to the structure of the builder here.

Related component

Search:Performance

Describe alternatives you've considered

One proposal was to plug this into query insights, or some sort of rewrite-performance-tracking module as described in #12390. However, all of the rewrites I've found so far should always just be applied in all cases that I've tested, so we don't need any extra signal from QI or previous execution latencies to decide whether to do the rewrites. This is like how the date histogram rewrite is just applied in all cases because it's a strict speedup.

For implementation, I also considered making an OS wrapper around Lucene's BooleanQuery class. This would let us check the field type at query rewrite time. But necessary fields are private so this won't work.

Additional context

I've been testing out a few of these with OSB on http_logs and nyc_taxis to see what speedups are like. I have 3 promising ones so far. Please let me know if I'm incorrect about these being equivalent, or if there's some other context I've missed, or if you have any other rewrite ideas!

Pull out numeric must clauses to filter clauses
If you have a must clause that's purely numeric, like a range, a document either matches it or it doesn't. So all returned docs should get the same score from this clause. This means actually calculating this score is unnecessary. If we move this clause from must to filter, the scores of all matching documents drop by the same amount, which shouldn't change any results.

Example original query (on http_logs. status is an integer field)

"bool": {
  "must": [
    {
      "match": {
        "request": "images"
      }
    },
    {
      "match": {
        "status": "200"
      }
    }
  ]
}

Rewritten query:

"bool": {
  "filter": {
    "match": {
      "status": "200"
    }
  },
  "must": {
    "match": {
      "request": "images"
    }
  }
}

The speedup here varies based on the selectivity of the numeric clause that gets moved. The more documents it matches the greater the speedup is. Even for very few documents though, it never gets slower than the original query.

Moved clause Fraction of documents matching clause Original p50 (ms) Rewritten p50 (ms)
match status 200 84% 2892 27
match status 404 0.5% 111 26
match status 500 0.00025% 8.68 8.69
range timestamp 6/10 - 6/13 49% 1584 48.6

These large speedups make sense when we look at the flamegraphs taken when running the original and rewritten 200 queries. In the original, bulk scoring is the vast majority of the query:

Image

but in the rewritten the huge scoring block is missing:

Image

Note there's no speedup if the other clause is also numeric, like a range on another field.

must_not of a range --> should of complement of range

must_not is slow because it has to gather all documents and then filter out the ones that match the clause. If we can easily transform a must_not into a should on all the parts of its complement that should speed things up since we'd skip this filtering. This is simple for a numeric range.

Original query (on nyc_taxis):

"bool" : { 
  "must_not": [ 
    {"range" : {"dropoff_datetime" : {"gte": "01/07/2015", "lte": "01/09/2015", "format": "dd/MM/yyyy"}}}
  ]
}

Rewritten query:

"bool": { 
  "must":{
    "bool":{
      "should": [
        {"range" : {"dropoff_datetime" : {"lt": "01/07/2015", "format": "dd/MM/yyyy"}}},
        {"range" : {"dropoff_datetime" : {"gt": "01/09/2015", "format": "dd/MM/yyyy"}}}
      ]
    }
  }
}

Note the rewritten should clause is nested in another boolean query which goes in the original's must clause. This is to make sure it doesn't affect preexisting should clauses in the original query.

Imagine if we just put them in the original query's should clause, which previously had clauses A and B. Then we rewrite so should now contains A, B, < July, > September, and we increment minimumShouldMatch to 2 to account for the change. Now, some document that matches A and B but falls in the original excluded range could incorrectly be returned. To avoid this we have to nest the complement of the range into its own boolean query.

Excluded range Original p50 (ms) Rewritten p50 (ms)
Jul 1 - Sep 1 781 441
Jan 1 - Sep 1 1212 125

Note if we nest further by making each of the should clauses into its own boolean query with a filter on the range this sometimes speeds it up even more, but only for Jul 1 - Sep 1 (p50 went to 295 ms). I don't understand why this only happened for the narrower range.

must_not of A AND B --> should of !A, !B

Use the identity !(A AND B) == !A OR !B.

Example original query (on nyc_taxis):

"bool" : {
  "must_not": [
    {"bool":{"must":[
      {"range" : {"dropoff_datetime" : {"gte": "21/01/2015", "lte": "03/04/2015", "format": "dd/MM/yyyy"}}}, 
      {"range" : {"total_amount" : {"gte": 10, "lte": 20}}}]}
    }
  ]
}

Rewritten query:

"bool" : {
  "should": [
    {"bool":{"must_not": {"range" : {"dropoff_datetime" : {"gte": "21/01/2015", "lte": "03/04/2015", "format": "dd/MM/yyyy"}}}}},
    {"bool":{"must_not": {"range" : {"total_amount" : {"gte": 10, "lte": 20}}}}}
  ]
}

I don't fully understand why this one speeds things up. It seems to contradict the previous one, if must_nots are slow and the rewritten one has two must_nots instead of 1. However in all situations I tested it is faster.

Clause A Clause B Original p50 (ms) Rewritten p50 (ms)
dropoff_datetime from 1/21 to 12/3 total_amount from 10 to 20 2118 223
dropoff_datetime from 1/21 to 4/3 total_amount from 10 to 20 1569 265
dropoff_datetime from 1/21 to 1/22 total_amount from 10 to 20 552 173
dropoff_datetime from 1/21 to 4/3 vendor_id matches "1" 1281 105
payment_type matches "2" vendor_id matches "1" 2310 8.49

Also note the version of this rewrite swapping OR and AND, !(A OR B) --> !A AND !B, showed no change in latency at all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancement or improvement to existing feature or request Search:Performance
Projects
Status: 🆕 New
Development

No branches or pull requests

2 participants