-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Expand the Approximation to Integer
and update the BKD traversal logic to improve the performance on skewed data
#18342
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Prudhvi Godithi <[email protected]>
Signed-off-by: Prudhvi Godithi <[email protected]>
{"run-benchmark-test": "id_4"} |
The Jenkins job url is https://build.ci.opensearch.org/job/benchmark-pull-request/3179/ . Final results will be published once the job is completed. |
I have created a PR #18343 to onboard the |
❌ Gradle check result for b2fcf5a: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
{"run-benchmark-test": "id_12"} |
The Jenkins job url is https://build.ci.opensearch.org/job/benchmark-pull-request/3180/ . Final results will be published once the job is completed. |
Benchmark ResultsBenchmark Results for Job: https://build.ci.opensearch.org/job/benchmark-pull-request/3179/
|
Benchmark Baseline Comparison ResultsBenchmark Results for Job: https://build.ci.opensearch.org/job/benchmark-compare/97/
|
{"run-benchmark-test": "id_4"} |
The Jenkins job url is https://build.ci.opensearch.org/job/benchmark-pull-request/3181/ . Final results will be published once the job is completed. |
} while (pointTree.size() > size && pointTree.moveToChild()); | ||
pointTree.moveToParent(); | ||
|
||
// If we've collected documents but not enough size, process the siblings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this for the case where a single document has multiple values? So we may visit the left subtree based on the node size, but the actual number of doc IDs is small.
Maybe we could simplify by only applying this optimization when doc count >= point count?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, I feel somewhat confused about this implementation—it strikes me as a bit convoluted. The original intent of optimization was to ensure that we do not traverse more than size elements when using asc
in BKD. I believe we should leverage docCount
to implement this process. just like in #18337
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Froh and kkewwei, doing some more tests will post the updates shortly. I have also responded to your PR #18337 (comment).
I assume leverage docCount would be good but optimizing this BKD traversal path should give us better results. Will do some more tests.
query = new IndexOrDocValuesQuery(query, dvQuery); | ||
if (context.indexSortedOnField(field)) { | ||
query = new IndexSortSortedNumericDocValuesRangeQuery(field, l, u, query); | ||
Query dvQuery = hasDocValues ? SortedNumericDocValuesField.newSlowRangeQuery(field, l, u) : null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we refactor these rangeQuery
methods to cut down on the duplication?
I'm thinking that the type-specific information is:
- What are the upper/lower limits for the type?
- How do I parse values into the given type? (This is already captured in the
parse
method.) - How do I create a
PointRangeQuery
for the type? - How do I represent the type as a
long
for use in DV queries (both inSortedNumericDocValuesField.newSlowRangeQuery
andIndexSortSortedNumericDocValuesRangeQuery
)? - How do I pack values of the type into point value bytes? <-- In theory, this can be used to produce a default implementation for number 3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would lean toward splitting this into two PRs.
The traversal optimization is one thing. Handling all the other types (by doing the above refactoring) is another.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure Froh, since seen regression is with skewed dataset like http_logs
, I want to test the traversal optimization on this dataset, but the only way to prove the optimization works in by enabling the approximation to integer numeric type as http_logs
does not use long https://github.com/opensearch-project/opensearch-benchmark-workloads/blob/main/http_logs/index.json. But ya once I see performance results with benchmark I will separate the PR's.
Waiting on to test with id_12
CC @rishabh6788
Benchmark ResultsBenchmark Results for Job: https://build.ci.opensearch.org/job/benchmark-pull-request/3181/
|
Benchmark Baseline Comparison ResultsBenchmark Results for Job: https://build.ci.opensearch.org/job/benchmark-compare/98/
|
{"run-benchmark-test": "id_12"} |
The Jenkins job url is https://build.ci.opensearch.org/job/benchmark-pull-request/3191/ . Final results will be published once the job is completed. |
Benchmark ResultsBenchmark Results for Job: https://build.ci.opensearch.org/job/benchmark-pull-request/3191/
|
Benchmark Baseline Comparison ResultsBenchmark Results for Job: https://build.ci.opensearch.org/job/benchmark-compare/99/
|
{"run-benchmark-test": "id_12"} |
The Jenkins job url is https://build.ci.opensearch.org/job/benchmark-pull-request/3192/ . Final results will be published once the job is completed. |
Signed-off-by: Prudhvi Godithi <[email protected]>
{"run-benchmark-test": "id_12"} |
The Jenkins job url is https://build.ci.opensearch.org/job/benchmark-pull-request/3193/ . Final results will be published once the job is completed. |
❌ Gradle check result for afc509f: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Benchmark ResultsBenchmark Results for Job: https://build.ci.opensearch.org/job/benchmark-pull-request/3193/
|
Benchmark Baseline Comparison ResultsBenchmark Results for Job: https://build.ci.opensearch.org/job/benchmark-compare/100/
|
return; | ||
} | ||
// Fast path for dense nodes | ||
if (pointTree.size() > size && docCount[0] < size) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the relation is cross? Why we don't need to check that?
Description
Expand the Approximation to
Integer
numeric type.From [Feature Request] Improve
ApproximatePointRangeQuery
Traversal for Skewed Datasets with DFS Strategy #18341 (comment) the drawbacks of existing approximation on skewed data sets, updatedintersectLeft
method to ~DFS strategy.A sample run now showing performance improvement for
http_logs
datasetRelated Issues
Part of #18341 and #14406
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.