Skip to content

Expand the Approximation to Integer and update the BKD traversal logic to improve the performance on skewed data #18342

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -988,23 +988,34 @@ public Query rangeQuery(
--u;
}
}
if (isSearchable && hasDocValues) {
Query query = IntPoint.newRangeQuery(field, l, u);
Query dvQuery = SortedNumericDocValuesField.newSlowRangeQuery(field, l, u);
query = new IndexOrDocValuesQuery(query, dvQuery);
if (context.indexSortedOnField(field)) {
query = new IndexSortSortedNumericDocValuesRangeQuery(field, l, u, query);
Query dvQuery = hasDocValues ? SortedNumericDocValuesField.newSlowRangeQuery(field, l, u) : null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we refactor these rangeQuery methods to cut down on the duplication?

I'm thinking that the type-specific information is:

  1. What are the upper/lower limits for the type?
  2. How do I parse values into the given type? (This is already captured in the parse method.)
  3. How do I create a PointRangeQuery for the type?
  4. How do I represent the type as a long for use in DV queries (both in SortedNumericDocValuesField.newSlowRangeQuery and IndexSortSortedNumericDocValuesRangeQuery)?
  5. How do I pack values of the type into point value bytes? <-- In theory, this can be used to produce a default implementation for number 3.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would lean toward splitting this into two PRs.

The traversal optimization is one thing. Handling all the other types (by doing the above refactoring) is another.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure Froh, since seen regression is with skewed dataset like http_logs, I want to test the traversal optimization on this dataset, but the only way to prove the optimization works in by enabling the approximation to integer numeric type as http_logs does not use long https://github.com/opensearch-project/opensearch-benchmark-workloads/blob/main/http_logs/index.json. But ya once I see performance results with benchmark I will separate the PR's.

Waiting on to test with id_12 CC @rishabh6788

if (isSearchable) {
Query pointRangeQuery = IntPoint.newRangeQuery(field, l, u);
Query query;
if (dvQuery != null) {
query = new IndexOrDocValuesQuery(pointRangeQuery, dvQuery);
if (context.indexSortedOnField(field)) {
query = new IndexSortSortedNumericDocValuesRangeQuery(field, l, u, query);
}
} else {
query = pointRangeQuery;
}
return query;
return new ApproximateScoreQuery(
query,
new ApproximatePointRangeQuery(
field,
IntPoint.pack(new int[] { l }).bytes,
IntPoint.pack(new int[] { u }).bytes,
new int[] { l }.length,
ApproximatePointRangeQuery.INT_FORMAT
)
);

}
if (hasDocValues) {
Query query = SortedNumericDocValuesField.newSlowRangeQuery(field, l, u);
if (context.indexSortedOnField(field)) {
query = new IndexSortSortedNumericDocValuesRangeQuery(field, l, u, query);
}
return query;
if (context.indexSortedOnField(field)) {
dvQuery = new IndexSortSortedNumericDocValuesRangeQuery(field, l, u, dvQuery);
}
return IntPoint.newRangeQuery(field, l, u);
return dvQuery;
}

@Override
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@

package org.opensearch.search.approximate;

import org.apache.lucene.document.IntPoint;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.LeafReaderContext;
Expand Down Expand Up @@ -40,6 +41,7 @@
*/
public class ApproximatePointRangeQuery extends ApproximateQuery {
public static final Function<byte[], String> LONG_FORMAT = bytes -> Long.toString(LongPoint.decodeDimension(bytes, 0));
public static final Function<byte[], String> INT_FORMAT = bytes -> Integer.toString(IntPoint.decodeDimension(bytes, 0));
private int size;

private SortOrder sortOrder;
Expand Down Expand Up @@ -248,14 +250,49 @@ private void intersectRight(PointValues.PointTree pointTree, PointValues.Interse
// custom intersect visitor to walk the left of the tree
public void intersectLeft(PointValues.IntersectVisitor visitor, PointValues.PointTree pointTree, long[] docCount)
throws IOException {
PointValues.Relation r = visitor.compare(pointTree.getMinPackedValue(), pointTree.getMaxPackedValue());
if (docCount[0] >= size) {
return;
}
// Outside query - quick exit
PointValues.Relation r = visitor.compare(pointTree.getMinPackedValue(), pointTree.getMaxPackedValue());
if (r == PointValues.Relation.CELL_OUTSIDE_QUERY) {
return;
}
// Fast path for dense nodes
if (pointTree.size() > size && docCount[0] < size) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the relation is cross? Why we don't need to check that?

boolean processedDeepDive = false;
// deep-dive
if (pointTree.moveToChild()) {
processedDeepDive = true;
do {
intersectLeft(visitor, pointTree, docCount);
if (docCount[0] >= size) {
pointTree.moveToParent();
return;
}
} while (pointTree.size() > size && pointTree.moveToChild());
pointTree.moveToParent();

// If we've collected documents but not enough size, process the siblings
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this for the case where a single document has multiple values? So we may visit the left subtree based on the node size, but the actual number of doc IDs is small.

Maybe we could simplify by only applying this optimization when doc count >= point count?

Copy link
Collaborator

@kkewwei kkewwei May 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, I feel somewhat confused about this implementation—it strikes me as a bit convoluted. The original intent of optimization was to ensure that we do not traverse more than size elements when using asc in BKD. I believe we should leverage docCount to implement this process. just like in #18337

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Froh and kkewwei, doing some more tests will post the updates shortly. I have also responded to your PR #18337 (comment).

I assume leverage docCount would be good but optimizing this BKD traversal path should give us better results. Will do some more tests.

if (docCount[0] > 0 && docCount[0] < size) {
// siblings at current level
if (pointTree.moveToChild()) {
while (pointTree.moveToSibling() && docCount[0] < size) {
intersectLeft(visitor, pointTree, docCount);
}
pointTree.moveToParent();
}
return;
}
}
// If the optimization gave no docs continue with standard traversal
if (processedDeepDive && docCount[0] == 0) {} else if (processedDeepDive) {
return;
}
}

// Standard traversal for regular nodes and fallback for deep dive with zero docs
switch (r) {
case CELL_OUTSIDE_QUERY:
// This cell is fully outside the query shape: stop recursing
break;
case CELL_INSIDE_QUERY:
// If the cell is fully inside, we keep moving to child until we reach a point where we can no longer move or when
// we have sufficient doc count. We first move down and then move to the left child
Expand Down Expand Up @@ -293,7 +330,7 @@ public void intersectLeft(PointValues.IntersectVisitor visitor, PointValues.Poin
}
}

// custom intersect visitor to walk the right of tree
// custom intersect visitor to walk the right of tree
public void intersectRight(PointValues.IntersectVisitor visitor, PointValues.PointTree pointTree, long[] docCount)
throws IOException {
PointValues.Relation r = visitor.compare(pointTree.getMinPackedValue(), pointTree.getMaxPackedValue());
Expand Down
Loading