[Feature Request] Rewrites for BooleanQuery #17586

peteralfonsi · 2025-03-13T21:45:09Z

Is your feature request related to a problem? Please describe

Some boolean queries can be written in multiple equivalent ways, where one is more performant than another. Users may not be aware of which one is better, so I think it would make sense to automatically rewrite them. (See additional context section for specific rewrites and preliminary benchmarks)

Describe the solution you'd like

Probably the simplest way is to just add to doRewrite() in BoolQueryBuilder. It would be ideal for some rewrites to know what kind of field is being used (numeric vs non-numeric), but this info is not available here. It is available later in AbstractQueryBuilder.doToQuery() where the actual Lucene query is built, and queries like MatchQuery make use of this, but it seems unintended to make changes to the structure of the builder here.

Related component

Search:Performance

Describe alternatives you've considered

One proposal was to plug this into query insights, or some sort of rewrite-performance-tracking module as described in #12390. However, all of the rewrites I've found so far should always just be applied in all cases that I've tested, so we don't need any extra signal from QI or previous execution latencies to decide whether to do the rewrites. This is like how the date histogram rewrite is just applied in all cases because it's a strict speedup.

For implementation, I also considered making an OS wrapper around Lucene's BooleanQuery class. This would let us check the field type at query rewrite time. But necessary fields are private so this won't work.

Additional context

I've been testing out a few of these with OSB on http_logs and nyc_taxis to see what speedups are like. I have 3 promising ones so far. Please let me know if I'm incorrect about these being equivalent, or if there's some other context I've missed, or if you have any other rewrite ideas!

Pull out numeric must clauses to filter clauses
If you have a must clause that's purely numeric, like a range, a document either matches it or it doesn't. So all returned docs should get the same score from this clause. This means actually calculating this score is unnecessary. If we move this clause from must to filter, the scores of all matching documents drop by the same amount, which shouldn't change any results.

Example original query (on http_logs. status is an integer field)

"bool": {
  "must": [
    {
      "match": {
        "request": "images"
      }
    },
    {
      "match": {
        "status": "200"
      }
    }
  ]
}

Rewritten query:

"bool": {
  "filter": {
    "match": {
      "status": "200"
    }
  },
  "must": {
    "match": {
      "request": "images"
    }
  }
}

The speedup here varies based on the selectivity of the numeric clause that gets moved. The more documents it matches the greater the speedup is. Even for very few documents though, it never gets slower than the original query.

Moved clause	Fraction of documents matching clause	Original p50 (ms)	Rewritten p50 (ms)
match status 200	84%	2892	27
match status 404	0.5%	111	26
match status 500	0.00025%	8.68	8.69
range timestamp 6/10 - 6/13	49%	1584	48.6

These large speedups make sense when we look at the flamegraphs taken when running the original and rewritten 200 queries. In the original, bulk scoring is the vast majority of the query:

but in the rewritten the huge scoring block is missing:

Note there's no speedup if the other clause is also numeric, like a range on another field.

must_not of a range --> should of complement of range

must_not is slow because it has to gather all documents and then filter out the ones that match the clause. If we can easily transform a must_not into a should on all the parts of its complement that should speed things up since we'd skip this filtering. This is simple for a numeric range.

Original query (on nyc_taxis):

"bool" : { 
  "must_not": [ 
    {"range" : {"dropoff_datetime" : {"gte": "01/07/2015", "lte": "01/09/2015", "format": "dd/MM/yyyy"}}}
  ]
}

Rewritten query:

"bool": { 
  "must":{
    "bool":{
      "should": [
        {"range" : {"dropoff_datetime" : {"lt": "01/07/2015", "format": "dd/MM/yyyy"}}},
        {"range" : {"dropoff_datetime" : {"gt": "01/09/2015", "format": "dd/MM/yyyy"}}}
      ]
    }
  }
}

Note the rewritten should clause is nested in another boolean query which goes in the original's must clause. This is to make sure it doesn't affect preexisting should clauses in the original query.

Imagine if we just put them in the original query's should clause, which previously had clauses A and B. Then we rewrite so should now contains A, B, < July, > September, and we increment minimumShouldMatch to 2 to account for the change. Now, some document that matches A and B but falls in the original excluded range could incorrectly be returned. To avoid this we have to nest the complement of the range into its own boolean query.

Excluded range	Original p50 (ms)	Rewritten p50 (ms)
Jul 1 - Sep 1	781	441
Jan 1 - Sep 1	1212	125

Note if we nest further by making each of the should clauses into its own boolean query with a filter on the range this sometimes speeds it up even more, but only for Jul 1 - Sep 1 (p50 went to 295 ms). I don't understand why this only happened for the narrower range.

must_not of A AND B --> should of !A, !B

Use the identity !(A AND B) == !A OR !B.

Example original query (on nyc_taxis):

"bool" : {
  "must_not": [
    {"bool":{"must":[
      {"range" : {"dropoff_datetime" : {"gte": "21/01/2015", "lte": "03/04/2015", "format": "dd/MM/yyyy"}}}, 
      {"range" : {"total_amount" : {"gte": 10, "lte": 20}}}]}
    }
  ]
}

Rewritten query:

"bool" : {
  "should": [
    {"bool":{"must_not": {"range" : {"dropoff_datetime" : {"gte": "21/01/2015", "lte": "03/04/2015", "format": "dd/MM/yyyy"}}}}},
    {"bool":{"must_not": {"range" : {"total_amount" : {"gte": 10, "lte": 20}}}}}
  ]
}

I don't fully understand why this one speeds things up. It seems to contradict the previous one, if must_nots are slow and the rewritten one has two must_nots instead of 1. However in all situations I tested it is faster.

Clause A	Clause B	Original p50 (ms)	Rewritten p50 (ms)
dropoff_datetime from 1/21 to 12/3	total_amount from 10 to 20	2118	223
dropoff_datetime from 1/21 to 4/3	total_amount from 10 to 20	1569	265
dropoff_datetime from 1/21 to 1/22	total_amount from 10 to 20	552	173
dropoff_datetime from 1/21 to 4/3	vendor_id matches "1"	1281	105
payment_type matches "2"	vendor_id matches "1"	2310	8.49

Also note the version of this rewrite swapping OR and AND, !(A OR B) --> !A AND !B, showed no change in latency at all.

The text was updated successfully, but these errors were encountered:

peteralfonsi added enhancement Enhancement or improvement to existing feature or request untriaged labels Mar 13, 2025

peteralfonsi self-assigned this Mar 13, 2025

github-actions bot added the Search:Performance label Mar 13, 2025

github-project-automation bot added this to Search Project Board Mar 13, 2025

github-project-automation bot moved this to 🆕 New in Search Project Board Mar 13, 2025

sandeshkr419 removed the untriaged label Mar 19, 2025

peteralfonsi mentioned this issue Mar 21, 2025

BooleanQuery rewrite for must_not RangeQuery clauses #17655

Draft

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature Request] Rewrites for BooleanQuery #17586

[Feature Request] Rewrites for BooleanQuery #17586

peteralfonsi commented Mar 13, 2025

[Feature Request] Rewrites for BooleanQuery #17586

[Feature Request] Rewrites for BooleanQuery #17586

Comments

peteralfonsi commented Mar 13, 2025

Is your feature request related to a problem? Please describe

Describe the solution you'd like

Related component

Describe alternatives you've considered

Additional context