You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Is your feature request related to a problem? Please describe
Some boolean queries can be written in multiple equivalent ways, where one is more performant than another. Users may not be aware of which one is better, so I think it would make sense to automatically rewrite them. (See additional context section for specific rewrites and preliminary benchmarks)
Describe the solution you'd like
Probably the simplest way is to just add to doRewrite() in BoolQueryBuilder. It would be ideal for some rewrites to know what kind of field is being used (numeric vs non-numeric), but this info is not available here. It is available later in AbstractQueryBuilder.doToQuery() where the actual Lucene query is built, and queries like MatchQuery make use of this, but it seems unintended to make changes to the structure of the builder here.
Related component
Search:Performance
Describe alternatives you've considered
One proposal was to plug this into query insights, or some sort of rewrite-performance-tracking module as described in #12390. However, all of the rewrites I've found so far should always just be applied in all cases that I've tested, so we don't need any extra signal from QI or previous execution latencies to decide whether to do the rewrites. This is like how the date histogram rewrite is just applied in all cases because it's a strict speedup.
For implementation, I also considered making an OS wrapper around Lucene's BooleanQuery class. This would let us check the field type at query rewrite time. But necessary fields are private so this won't work.
Additional context
I've been testing out a few of these with OSB on http_logs and nyc_taxis to see what speedups are like. I have 3 promising ones so far. Please let me know if I'm incorrect about these being equivalent, or if there's some other context I've missed, or if you have any other rewrite ideas!
Pull out numeric must clauses to filter clauses
If you have a must clause that's purely numeric, like a range, a document either matches it or it doesn't. So all returned docs should get the same score from this clause. This means actually calculating this score is unnecessary. If we move this clause from must to filter, the scores of all matching documents drop by the same amount, which shouldn't change any results.
Example original query (on http_logs. status is an integer field)
The speedup here varies based on the selectivity of the numeric clause that gets moved. The more documents it matches the greater the speedup is. Even for very few documents though, it never gets slower than the original query.
Moved clause
Fraction of documents matching clause
Original p50 (ms)
Rewritten p50 (ms)
match status 200
84%
2892
27
match status 404
0.5%
111
26
match status 500
0.00025%
8.68
8.69
range timestamp 6/10 - 6/13
49%
1584
48.6
These large speedups make sense when we look at the flamegraphs taken when running the original and rewritten 200 queries. In the original, bulk scoring is the vast majority of the query:
but in the rewritten the huge scoring block is missing:
Note there's no speedup if the other clause is also numeric, like a range on another field.
must_not of a range --> should of complement of range
must_not is slow because it has to gather all documents and then filter out the ones that match the clause. If we can easily transform a must_not into a should on all the parts of its complement that should speed things up since we'd skip this filtering. This is simple for a numeric range.
Note the rewritten should clause is nested in another boolean query which goes in the original's must clause. This is to make sure it doesn't affect preexisting should clauses in the original query.
Imagine if we just put them in the original query's should clause, which previously had clauses A and B. Then we rewrite so should now contains A, B, < July, > September, and we increment minimumShouldMatch to 2 to account for the change. Now, some document that matches A and B but falls in the original excluded range could incorrectly be returned. To avoid this we have to nest the complement of the range into its own boolean query.
Excluded range
Original p50 (ms)
Rewritten p50 (ms)
Jul 1 - Sep 1
781
441
Jan 1 - Sep 1
1212
125
Note if we nest further by making each of the should clauses into its own boolean query with a filter on the range this sometimes speeds it up even more, but only for Jul 1 - Sep 1 (p50 went to 295 ms). I don't understand why this only happened for the narrower range.
I don't fully understand why this one speeds things up. It seems to contradict the previous one, if must_nots are slow and the rewritten one has two must_nots instead of 1. However in all situations I tested it is faster.
Clause A
Clause B
Original p50 (ms)
Rewritten p50 (ms)
dropoff_datetime from 1/21 to 12/3
total_amount from 10 to 20
2118
223
dropoff_datetime from 1/21 to 4/3
total_amount from 10 to 20
1569
265
dropoff_datetime from 1/21 to 1/22
total_amount from 10 to 20
552
173
dropoff_datetime from 1/21 to 4/3
vendor_id matches "1"
1281
105
payment_type matches "2"
vendor_id matches "1"
2310
8.49
Also note the version of this rewrite swapping OR and AND, !(A OR B) --> !A AND !B, showed no change in latency at all.
The text was updated successfully, but these errors were encountered:
Is your feature request related to a problem? Please describe
Some boolean queries can be written in multiple equivalent ways, where one is more performant than another. Users may not be aware of which one is better, so I think it would make sense to automatically rewrite them. (See additional context section for specific rewrites and preliminary benchmarks)
Describe the solution you'd like
Probably the simplest way is to just add to
doRewrite()
inBoolQueryBuilder
. It would be ideal for some rewrites to know what kind of field is being used (numeric vs non-numeric), but this info is not available here. It is available later inAbstractQueryBuilder.doToQuery()
where the actual Lucene query is built, and queries likeMatchQuery
make use of this, but it seems unintended to make changes to the structure of the builder here.Related component
Search:Performance
Describe alternatives you've considered
One proposal was to plug this into query insights, or some sort of rewrite-performance-tracking module as described in #12390. However, all of the rewrites I've found so far should always just be applied in all cases that I've tested, so we don't need any extra signal from QI or previous execution latencies to decide whether to do the rewrites. This is like how the date histogram rewrite is just applied in all cases because it's a strict speedup.
For implementation, I also considered making an OS wrapper around Lucene's BooleanQuery class. This would let us check the field type at query rewrite time. But necessary fields are private so this won't work.
Additional context
I've been testing out a few of these with OSB on
http_logs
andnyc_taxis
to see what speedups are like. I have 3 promising ones so far. Please let me know if I'm incorrect about these being equivalent, or if there's some other context I've missed, or if you have any other rewrite ideas!Pull out numeric
must
clauses tofilter
clausesIf you have a
must
clause that's purely numeric, like a range, a document either matches it or it doesn't. So all returned docs should get the same score from this clause. This means actually calculating this score is unnecessary. If we move this clause frommust
tofilter
, the scores of all matching documents drop by the same amount, which shouldn't change any results.Example original query (on
http_logs
.status
is an integer field)Rewritten query:
The speedup here varies based on the selectivity of the numeric clause that gets moved. The more documents it matches the greater the speedup is. Even for very few documents though, it never gets slower than the original query.
These large speedups make sense when we look at the flamegraphs taken when running the original and rewritten 200 queries. In the original, bulk scoring is the vast majority of the query:
but in the rewritten the huge scoring block is missing:
Note there's no speedup if the other clause is also numeric, like a range on another field.
must_not
of a range -->should
of complement of rangemust_not
is slow because it has to gather all documents and then filter out the ones that match the clause. If we can easily transform amust_not
into ashould
on all the parts of its complement that should speed things up since we'd skip this filtering. This is simple for a numeric range.Original query (on
nyc_taxis
):Rewritten query:
Note the rewritten
should
clause is nested in another boolean query which goes in the original'smust
clause. This is to make sure it doesn't affect preexistingshould
clauses in the original query.Imagine if we just put them in the original query's should clause, which previously had clauses
A
andB
. Then we rewrite soshould
now containsA
,B
,< July
,> September
, and we incrementminimumShouldMatch
to 2 to account for the change. Now, some document that matchesA
andB
but falls in the original excluded range could incorrectly be returned. To avoid this we have to nest the complement of the range into its own boolean query.Note if we nest further by making each of the
should
clauses into its own boolean query with a filter on the range this sometimes speeds it up even more, but only for Jul 1 - Sep 1 (p50 went to 295 ms). I don't understand why this only happened for the narrower range.must_not
ofA AND B
-->should
of!A
,!B
Use the identity
!(A AND B) == !A OR !B
.Example original query (on
nyc_taxis
):Rewritten query:
I don't fully understand why this one speeds things up. It seems to contradict the previous one, if
must_not
s are slow and the rewritten one has twomust_not
s instead of 1. However in all situations I tested it is faster.Also note the version of this rewrite swapping OR and AND,
!(A OR B) --> !A AND !B
, showed no change in latency at all.The text was updated successfully, but these errors were encountered: