Add Balanced Docs Slice Allocator to increase document balance for Concurrent Segment Search #17687

Open · wants to merge 11 commits into base: main
Conversation

@finnroblin commented Mar 25, 2025

Description

Adds a new slicing mechanism, BalancedDocsSliceSupplier, to concurrent segment search. This mechanism assigns leaves to slices greedily based on the number of documents in the leaf and slice. Each leaf is assigned to the slice with the current lowest number of documents. This mechanism achieves a more balanced distribution of work compared to the preexisting MaxTargetSliceSupplier.

MaxTargetSliceSupplier sorts the leaves by document count and then distributes the sorted leaves to slices in round-robin order. This round-robin distribution can produce uneven document counts, with a few very large slices and a few very small ones. The proposed BalancedDocsSliceSupplier creates a more balanced distribution of document counts across search slices.
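
To make the greedy rule concrete, here is a minimal, self-contained sketch of the idea (illustrative only: the class and method names are hypothetical, and this is not the BalancedDocsSliceSupplier code from this PR, which operates on Lucene leaf contexts rather than plain doc counts):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative sketch of greedy, doc-count-balanced slice assignment (hypothetical names).
final class BalancedSliceSketch {

    // A leaf reduced to the one property the heuristic balances on: its document count.
    record Leaf(String name, int docCount) {}

    // A slice accumulates leaves and tracks its running document total.
    static final class Slice {
        final List<Leaf> leaves = new ArrayList<>();
        long totalDocs = 0;
    }

    static List<Slice> assign(List<Leaf> leaves, int targetSliceCount) {
        // Min-heap keyed on each slice's current document total.
        PriorityQueue<Slice> heap = new PriorityQueue<>(Comparator.comparingLong((Slice s) -> s.totalDocs));
        for (int i = 0; i < targetSliceCount; i++) {
            heap.add(new Slice());
        }
        // The PR text specifies only the "assign to the currently smallest slice" rule,
        // so leaves are taken in the order given.
        for (Leaf leaf : leaves) {
            Slice smallest = heap.poll(); // slice with the fewest documents so far
            smallest.leaves.add(leaf);
            smallest.totalDocs += leaf.docCount();
            heap.add(smallest);           // re-insert with its updated total
        }
        return new ArrayList<>(heap);
    }
}
```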

Benchmarks on a vector search workload show that this method improves search throughput and latency compared to other slicing mechanisms. Benchmarks on the big5 workload show that this method is better than maxTargetSliceSupplier in some cases but worse or comparable in others (scroll queries are worse, mixed results for range aggregations). Please see the benchmarking subsections for experiment details and more commentary.

I propose adding this slicing mechanism as a concurrent segment search setting for the following reasons:

  • It shows demonstrated gains for CSS on k-NN indices but mixed results on the big5 workload, so it should be opt-in rather than the default.
  • Isn't another setting for a low-level implementation detail bad? Yes, but a user who opts into custom OpenSearch CSS must already set the max_slice_count parameter to use MaxTargetSliceSupplier, so there is precedent for an additional setting that adjusts CSS behavior relative to the Lucene default (an example of the relevant settings follows this list). We could call out the new slicing option in the documentation, similar to how we already call out the choice between Lucene slicing and round-robin slicing.
  • If this is introduced as a setting then k-NN can enable it with index setting listeners via the getAdditionalIndexSettings() hook.
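
For reference, opting in would look roughly like the following, assuming the settings are applied dynamically through the cluster settings API (search.concurrent_segment_search.mode and search.concurrent.max_slice_count are the existing CSS settings; search.concurrent.experimental_slicing.enable is the key proposed in this PR):

```json
PUT _cluster/settings
{
  "persistent": {
    "search.concurrent_segment_search.mode": "all",
    "search.concurrent.max_slice_count": 8,
    "search.concurrent.experimental_slicing.enable": true
  }
}
```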

Vector Search Benchmarks

Vector search benchmarks on a dataset with 10M 768-dimension vectors show search throughput increases of 7-10% and search latency decreases of 12-16%. The comparison was performed on a 3x r6g.4xl cluster with 8 slices. Full benchmark results available here.

| Metric | Task | Round Robin (MaxTargetSliceSupplier) | Balanced Docs | Percent difference |
|---|---|---|---|---|
| Segment count | | 213 | 213 | |
| Min Throughput | prod-queries | 71.3233333 | 76.6966667 | 7.01% |
| Mean Throughput | prod-queries | 81.5766667 | 90.6933333 | 10.05% |
| Median Throughput | prod-queries | 81.9866667 | 91.56 | 10.46% |
| Max Throughput | prod-queries | 83.3 | 92.7766667 | 10.21% |
| 50th percentile latency | prod-queries | 10.3225333 | 9.13488667 | -13.00% |
| 90th percentile latency | prod-queries | 11.6630333 | 10.3315433 | -12.89% |
| 99th percentile latency | prod-queries | 13.7183667 | 12.0862667 | -13.50% |
| 99.9th percentile latency | prod-queries | 32.4364333 | 27.7666 | -16.82% |
| 99.99th percentile latency | prod-queries | 70.9776667 | 62.4047333 | -13.74% |
| 100th percentile latency | prod-queries | 82.4167 | 73.5012667 | -12.13% |
| 50th percentile service time | prod-queries | 10.3225333 | 9.13488667 | -13.00% |
| 90th percentile service time | prod-queries | 11.6630333 | 10.3315433 | -12.89% |
| 99th percentile service time | prod-queries | 13.7183667 | 12.0862667 | -13.50% |
| 99.9th percentile service time | prod-queries | 32.4364333 | 27.7666 | -16.82% |
| 99.99th percentile service time | prod-queries | 70.9776667 | 62.4047333 | -13.74% |
| 100th percentile service time | prod-queries | 82.4167 | 73.5012667 | -12.13% |
| error rate | prod-queries | 0 | 0 | |
| Mean recall@k | prod-queries | 0.95 | 0.95 | |
| Mean recall@1 | prod-queries | 0.97333333 | 0.97 | |

Big5 Benchmarks

Big5 benchmarks were performed against a single-node cluster with an r5.xl instance (same as in the GH benchmark configs). Only 2 target slices were used due to the small size of the cluster (4 vCPU). I used the big5-100 corpus. A table comparing median throughput, p90 latency, and p90 service time for each operation is attached at the bottom.

I will run the big5 workload with the documents-1000 corpus on a large 3x r6g.4xl cluster with 8 target slices to test more performance characteristics. It will take a few hours to get the results so I am opening the PR now with the more limited big5 results.

The following operations had worse performance with the balanced docs slice supplier. I dug into the queries with performance regressions. Besides the scroll operation, all of the regressed operations returned 0 hits, so the higher latencies are likely a result of the extra overhead of using a priority queue to assign leaves to slices instead of a sort plus round-robin distribution. More benchmarking with documents-1000 should better reveal the performance differences.

Operations with worse performance:

  • range-auto-date-histo
  • term
  • multi_terms-keyword (much worse; although keyword-terms is better with the contender changes)
  • keyword-terms-low-cardinality
  • composite-terms
  • composite_terms-keyword (much worse).
  • range
  • date_histogram_minute_agg
  • scroll (much worse). I added a check in shouldUseMaxTargetSlice() to fall back to MaxTargetSliceSupplier whenever a scroll context is present (a rough sketch of this guard follows this list).
  • query-string-on-message
  • query-string-on-message-filtered
  • sort_numeric_desc
  • range-agg-2
  • cardinality-agg-low
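
As a rough illustration of the scroll guard mentioned above (stand-in types so the sketch is self-contained; in the real code this would consult org.opensearch.search.internal.SearchContext inside shouldUseMaxTargetSlice()):

```java
// Minimal stand-ins; these are not the OpenSearch classes themselves.
interface ScrollContext {}

interface SearchContextView {
    ScrollContext scrollContext(); // assumed to be null when the request is not a scroll
}

final class SliceSupplierGuard {
    // Hypothetical guard: keep the existing MaxTargetSliceSupplier for scroll requests,
    // use balanced-docs slicing otherwise (when the experimental setting is enabled).
    static boolean useBalancedDocsSlicing(SearchContextView context, boolean experimentalSlicingEnabled) {
        return experimentalSlicingEnabled && context.scrollContext() == null;
    }
}
```
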
Brief comparison of all big5 operations (RR = round robin MaxTargetSliceSupplier, BD = BalancedDocsSliceSupplier)

(Full results and comparison available here).

Operation Metric RR BD Pct Diff
desc_sort_timestamp Median Throughput 2 2.01 0.50%
desc_sort_timestamp P90 Latency 24.7062233 23.2023429 -6.48%
desc_sort_timestamp P90 Service Time 22.6351037 20.8071975 -8.78%
asc_sort_timestamp Median Throughput 2.01 2.01 0.00%
asc_sort_timestamp P90 Latency 57.4730759 54.710749 -5.05%
asc_sort_timestamp P90 Service Time 56.303588 53.448578 -5.34%
desc_sort_with_after_timestamp Median Throughput 2.01 2.01 0.00%
desc_sort_with_after_timestamp P90 Latency 9.20452784 8.4762934 -8.59%
desc_sort_with_after_timestamp P90 Service Time 7.82590412 6.95620238 -12.50%
asc_sort_with_after_timestamp Median Throughput 2.01 2.01 0.00%
asc_sort_with_after_timestamp P90 Latency 20.422867 20.048872 -1.87%
asc_sort_with_after_timestamp P90 Service Time 18.9271991 18.5687515 -1.93%
desc_sort_timestamp_can_match_shortcut Median Throughput 2 2 0.00%
desc_sort_timestamp_can_match_shortcut P90 Latency 22.7059584 22.5943716 -0.49%
desc_sort_timestamp_can_match_shortcut P90 Service Time 20.4673676 20.309485 -0.78%
desc_sort_timestamp_no_can_match_shortcut Median Throughput 2.01 2.01 0.00%
desc_sort_timestamp_no_can_match_shortcut P90 Latency 23.231273 22.9051914 -1.42%
desc_sort_timestamp_no_can_match_shortcut P90 Service Time 20.9324122 20.4913126 -2.15%
asc_sort_timestamp_can_match_shortcut Median Throughput 2.01 2.01 0.00%
asc_sort_timestamp_can_match_shortcut P90 Latency 21.4486304 20.679081 -3.72%
asc_sort_timestamp_can_match_shortcut P90 Service Time 19.4117815 18.9196314 -2.60%
asc_sort_timestamp_no_can_match_shortcut Median Throughput 2.01 2.01 0.00%
asc_sort_timestamp_no_can_match_shortcut P90 Latency 22.0456534 20.4257297 -7.93%
asc_sort_timestamp_no_can_match_shortcut P90 Service Time 19.5836582 18.9208364 -3.50%
term Median Throughput 2.01 2.01 0.00%
term P90 Latency 4.98427876 5.02946671 0.90%
term P90 Service Time 3.52703551 3.53937903 0.35%
multi_terms-keyword Median Throughput 1.86 1.82 -2.20%
multi_terms-keyword P90 Latency 10612.5766 13827.4802 23.25%
multi_terms-keyword P90 Service Time 539.614216 554.750199 2.73%
keyword-terms Median Throughput 2 2 0.00%
keyword-terms P90 Latency 47.4260308 46.8147705 -1.31%
keyword-terms P90 Service Time 46.1887779 45.6283599 -1.23%
keyword-terms-low-cardinality Median Throughput 2.01 2.01 0.00%
keyword-terms-low-cardinality P90 Latency 39.4004102 41.209351 4.39%
keyword-terms-low-cardinality P90 Service Time 37.7886517 40.227164 6.06%
composite-terms Median Throughput 2 2 0.00%
composite-terms P90 Latency 174.72662 175.8952 0.66%
composite-terms P90 Service Time 173.260788 174.78259 0.87%
composite_terms-keyword Median Throughput 2 2 0.00%
composite_terms-keyword P90 Latency 262.62395 328.517717 20.06%
composite_terms-keyword P90 Service Time 261.752854 327.55114 20.09%
composite-date_histogram-daily Median Throughput 2.01 2.01 0.00%
composite-date_histogram-daily P90 Latency 6.53840693 6.61628689 1.18%
composite-date_histogram-daily P90 Service Time 5.08705092 5.29582438 3.94%
range Median Throughput 2.01 2.01 0.00%
range P90 Latency 48.2831966 49.2544756 1.97%
range P90 Service Time 47.0378465 48.1631258 2.34%
range-numeric Median Throughput 2.01 2.01 0.00%
range-numeric P90 Latency 4.29281602 4.11282149 -4.38%
range-numeric P90 Service Time 2.72757907 2.55417822 -6.79%
keyword-in-range Median Throughput 2 2 0.00%
keyword-in-range P90 Latency 157.12988 151.965757 -3.40%
keyword-in-range P90 Service Time 155.747702 150.983875 -3.16%
date_histogram_hourly_agg Median Throughput 2.01 2.01 0.00%
date_histogram_hourly_agg P90 Latency 10.4533635 10.1848669 -2.64%
date_histogram_hourly_agg P90 Service Time 8.93273024 8.86720689 -0.74%
date_histogram_minute_agg Median Throughput 2.01 2.01 0.00%
date_histogram_minute_agg P90 Latency 45.1442083 47.7756897 5.51%
date_histogram_minute_agg P90 Service Time 44.0200076 46.4204891 5.17%
scroll Median Throughput 47.27 45.85 -3.10%
scroll P90 Latency 8345.62109 13125.0181 36.41%
scroll P90 Service Time 522.262164 537.771689 2.88%
query-string-on-message Median Throughput 2 2 0.00%
query-string-on-message P90 Latency 87.0531196 88.8102339 1.98%
query-string-on-message P90 Service Time 86.2746482 87.4661455 1.36%
query-string-on-message-filtered Median Throughput 2 2 0.00%
query-string-on-message-filtered P90 Latency 126.427841 136.793502 7.58%
query-string-on-message-filtered P90 Service Time 125.370007 135.31314 7.35%
query-string-on-message-filtered-sorted-num Median Throughput 2.01 2 -0.50%
query-string-on-message-filtered-sorted-num P90 Latency 87.0696074 85.989187 -1.26%
query-string-on-message-filtered-sorted-num P90 Service Time 84.7262213 83.6267326 -1.31%
sort_keyword_can_match_shortcut Median Throughput 2.01 2.01 0.00%
sort_keyword_can_match_shortcut P90 Latency 6.0656352 6.001189 -1.07%
sort_keyword_can_match_shortcut P90 Service Time 4.5905493 4.4543399 -3.06%
sort_keyword_no_can_match_shortcut Median Throughput 2.01 2.01 0.00%
sort_keyword_no_can_match_shortcut P90 Latency 5.93450632 6.15424532 3.57%
sort_keyword_no_can_match_shortcut P90 Service Time 4.40459373 4.58719721 3.98%
sort_numeric_desc Median Throughput 2.01 2.01 0.00%
sort_numeric_desc P90 Latency 8.56299029 9.07405812 5.63%
sort_numeric_desc P90 Service Time 7.41191246 7.58723162 2.31%
sort_numeric_asc Median Throughput 2.01 2.01 0.00%
sort_numeric_asc P90 Latency 10.9239754 10.5399867 -3.64%
sort_numeric_asc P90 Service Time 9.452327 9.1964775 -2.78%
sort_numeric_desc_with_match Median Throughput 2.01 2.01 0.00%
sort_numeric_desc_with_match P90 Latency 4.49225422 4.0823272 -10.04%
sort_numeric_desc_with_match P90 Service Time 2.91509738 2.53588829 -14.95%
sort_numeric_asc_with_match Median Throughput 2.01 2.01 0.00%
sort_numeric_asc_with_match P90 Latency 4.49943779 4.42200442 -1.75%
sort_numeric_asc_with_match P90 Service Time 2.96980766 2.95181983 -0.61%
range_field_conjunction_big_range_big_term_query Median Throughput 2.01 2.01 0.00%
range_field_conjunction_big_range_big_term_query P90 Latency 4.21921783 4.51458221 6.54%
range_field_conjunction_big_range_big_term_query P90 Service Time 2.61447511 2.94693729 11.28%
range_field_disjunction_big_range_small_term_query Median Throughput 2.01 2.01 0.00%
range_field_disjunction_big_range_small_term_query P90 Latency 4.23597348 4.5576627 7.06%
range_field_disjunction_big_range_small_term_query P90 Service Time 2.66216472 3.0180722 11.79%
range_field_conjunction_small_range_small_term_query Median Throughput 2.01 2.01 0.00%
range_field_conjunction_small_range_small_term_query P90 Latency 4.0834622 4.72623301 13.60%
range_field_conjunction_small_range_small_term_query P90 Service Time 2.54099626 3.12980449 18.81%
range_field_conjunction_small_range_big_term_query Median Throughput 2.01 2.01 0.00%
range_field_conjunction_small_range_big_term_query P90 Latency 3.9624502 4.05092265 2.18%
range_field_conjunction_small_range_big_term_query P90 Service Time 2.4877071 2.46797339 -0.80%
range-auto-date-histo Median Throughput 0.25 0.25 0.00%
range-auto-date-histo P90 Latency 1033484.42 1037377.72 0.38%
range-auto-date-histo P90 Service Time 4133.72519 4119.05572 -0.36%
range-auto-date-histo-with-metrics Median Throughput 0.09 0.09 0.00%
range-auto-date-histo-with-metrics P90 Latency 3123086.77 3080568.57 -1.38%
range-auto-date-histo-with-metrics P90 Service Time 11571.4685 11197.4767 -3.34%
range-agg-1 Median Throughput 2.01 2.01 0.00%
range-agg-1 P90 Latency 4.4221091 4.35349254 -1.58%
range-agg-1 P90 Service Time 2.88510949 2.81368511 -2.54%
range-agg-2 Median Throughput 2.01 2.01 0.00%
range-agg-2 P90 Latency 4.3828232 4.8581396 9.78%
range-agg-2 P90 Service Time 2.81660126 3.35686863 16.09%
cardinality-agg-low Median Throughput 2.01 2.01 0.00%
cardinality-agg-low P90 Latency 6.4378883 6.93646969 7.19%
cardinality-agg-low P90 Service Time 5.38352195 5.50142419 2.14%
cardinality-agg-high Median Throughput 0.8 0.8 0.00%
cardinality-agg-high P90 Latency 218262.798 216991.479 -0.59%
cardinality-agg-high P90 Service Time 1268.53249 1261.44235 -0.56%

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

kotwanikunal and others added 9 commits March 12, 2025 11:29
…ghtly to account for non-ordered heap container vs previous ordered arraylist
❌ Gradle check result for 99803c0: FAILURE

Comment on lines +311 to +313
public static final String CONCURRENT_SEGMENT_SEARCH_USE_EXPERIMENTAL_SLICING_KEY = "search.concurrent.experimental_slicing.enable";
public static final boolean CONCURRENT_SEGMENT_SEARCH_USE_EXPERIMENTAL_SLICING_DEFAULT_VALUE = false;
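
For context, a flag like the one quoted above is typically wired up as a dynamic, node-scoped boolean cluster setting via org.opensearch.common.settings.Setting; a hedged sketch (whether this PR registers it exactly this way is an assumption):

```java
// Hypothetical registration of the constants quoted above as a dynamic cluster setting.
public static final Setting<Boolean> CONCURRENT_SEGMENT_SEARCH_USE_EXPERIMENTAL_SLICING_SETTING = Setting.boolSetting(
    CONCURRENT_SEGMENT_SEARCH_USE_EXPERIMENTAL_SLICING_KEY,
    CONCURRENT_SEGMENT_SEARCH_USE_EXPERIMENTAL_SLICING_DEFAULT_VALUE,
    Setting.Property.Dynamic,
    Setting.Property.NodeScope
);
```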

Collaborator

Do we want to gate this behind a setting when this ships? If we can avoid shipping yet another cluster setting, it would be nice.

I suspect that this approach may help (or at least not hurt) the existing concurrent segment search logic. In that case, we should just replace the old slice supplier.

I like having the setting in place during development/testing so that we can compare the old/new implementations easily.

Author

I agree that one fewer setting is desirable and would prefer not to gate it behind a setting when it ships, but I'm worried about performance on non-k-NN use cases compared to the preexisting MaxTargetSliceSupplier. Right now I'm running big5-1000 on a larger cluster with 8 slices to see how performance changes as we scale up cluster size. I'll add a comment when these runs are finished, and hopefully the balanced slicing is clearly better.

For now, I noticed some performance regressions on the big5 workload for scroll and a few other operations when the balanced docs slice supplier is enabled. The operations (besides scroll) that had performance regressions returned 0 hits for their queries (with the documents-100 corpus). This might be due to the added overhead of a priority queue, a consequence of the different slice assignment policies, or something else. As another caveat, the big5 benchmarks were run on a small r5.xl instance with only 2 slices.

We could also, as I believe you suggested before, add branching logic in the shouldUseMaxTargetSupplier method to decide between round robin and balanced based on any patterns in the performance benchmarks (for instance, whether a scrollContext is present). I'm hesitant to make that decision based on just the big5 workload across two clusters, though.

❌ Gradle check result for a07f46d: FAILURE

✅ Gradle check result for c4403f7: SUCCESS

codecov bot commented Mar 26, 2025

Codecov Report

Attention: Patch coverage is 87.80488% with 5 lines in your changes missing coverage. Please review.

Project coverage is 72.37%. Comparing base (2ee8660) to head (c4403f7).
Report is 52 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...va/org/opensearch/search/DefaultSearchContext.java | 71.42% | 1 Missing and 1 partial ⚠️ |
| ...rch/search/internal/BalancedDocsSliceSupplier.java | 92.30% | 1 Missing and 1 partial ⚠️ |
| ...nsearch/search/internal/FilteredSearchContext.java | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #17687      +/-   ##
============================================
- Coverage     72.43%   72.37%   -0.07%     
+ Complexity    65694    65681      -13     
============================================
  Files          5311     5312       +1     
  Lines        304937   304975      +38     
  Branches      44226    44231       +5     
============================================
- Hits         220872   220712     -160     
- Misses        65912    66202     +290     
+ Partials      18153    18061      -92     


@finnroblin (Author)

Here are performance results for the big5 documents-1000 workload. This is the biggest workload that I could find to test OpenSearch's functionality. I ran the test with a 3x r6g.4xl cluster and the following settings: {"search.concurrent_segment_search.mode": "all", "search.concurrent.max_slice_count": 8}. I performed 1 indexing run and 2 big5 runs. The first big5 run used search.concurrent.experimental_slicing.enable: true to test the balanced docs slice supplier, and the second big5 run set it to false to get a baseline for MaxTargetSliceSupplier.

Here are median throughput, p90 latencies, and p90 service times for the operations (RR = round robin MaxTargetSliceSupplier, BD = BalancedDocsSliceSupplier; full results in big5-documents-1000-comparison.xlsx):

Operation Metric RR BD Pct Diff
default Median Throughput 2.01 2.01 0.00%
default P90 Latency 6.40559793 7.83195782 18.21%
default P90 Service Time 4.84854794 6.40473151 24.30%
desc_sort_timestamp Median Throughput 2.01 2 -0.50%
desc_sort_timestamp P90 Latency 7.11152005 9.21213722 22.80%
desc_sort_timestamp P90 Service Time 5.57644439 7.6997695 27.58%
asc_sort_timestamp Median Throughput 1.33 1.39 4.32%
asc_sort_timestamp P90 Latency 73596.4219 62384.2598 -17.97%
asc_sort_timestamp P90 Service Time 749.490204 703.682312 -6.51%
desc_sort_with_after_timestamp Median Throughput 2.01 2 -0.50%
desc_sort_with_after_timestamp P90 Latency 66.8119507 63.5714169 -5.10%
desc_sort_with_after_timestamp P90 Service Time 65.4834175 62.310585 -5.09%
asc_sort_with_after_timestamp Median Throughput 1.43 1.53 6.54%
asc_sort_with_after_timestamp P90 Latency 57899.4551 45278.1914 -27.87%
asc_sort_with_after_timestamp P90 Service Time 696.889435 653.127319 -6.70%
desc_sort_timestamp_can_match_shortcut Median Throughput 2.01 2 -0.50%
desc_sort_timestamp_can_match_shortcut P90 Latency 14.1913858 14.8691702 4.56%
desc_sort_timestamp_can_match_shortcut P90 Service Time 12.6752672 13.4758678 5.94%
desc_sort_timestamp_no_can_match_shortcut Median Throughput 2.01 2.01 0.00%
desc_sort_timestamp_no_can_match_shortcut P90 Latency 14.741756 14.670032 -0.49%
desc_sort_timestamp_no_can_match_shortcut P90 Service Time 13.1524077 13.1667752 0.11%
asc_sort_timestamp_can_match_shortcut Median Throughput 2.01 1.99 -1.01%
asc_sort_timestamp_can_match_shortcut P90 Latency 131.598747 125.168289 -5.14%
asc_sort_timestamp_can_match_shortcut P90 Service Time 130.205177 123.940346 -5.05%
asc_sort_timestamp_no_can_match_shortcut Median Throughput 2.01 2.01 0.00%
asc_sort_timestamp_no_can_match_shortcut P90 Latency 131.024666 124.725517 -5.05%
asc_sort_timestamp_no_can_match_shortcut P90 Service Time 129.664658 123.389187 -5.09%
term Median Throughput 2.01 2.01 0.00%
term P90 Latency 6.71847963 7.1291945 5.76%
term P90 Service Time 5.23884797 5.58661246 6.22%
multi_terms-keyword Median Throughput 1.51 1.28 -17.97%
multi_terms-keyword P90 Latency 48044.2051 81996.4414 41.41%
multi_terms-keyword P90 Service Time 662.659088 781.371338 15.19%
keyword-terms Median Throughput 2 2 0.00%
keyword-terms P90 Latency 202.335304 202.249527 -0.04%
keyword-terms P90 Service Time 201.164589 201.085526 -0.04%
keyword-terms-low-cardinality Median Throughput 2 2 0.00%
keyword-terms-low-cardinality P90 Latency 188.965157 190.029907 0.56%
keyword-terms-low-cardinality P90 Service Time 187.735619 188.83065 0.58%
composite-terms Median Throughput 2.01 2 -0.50%
composite-terms P90 Latency 162.717285 149.724503 -8.68%
composite-terms P90 Service Time 161.589409 148.514496 -8.80%
composite_terms-keyword Median Throughput 2 2 0.00%
composite_terms-keyword P90 Latency 274.286682 256.525284 -6.92%
composite_terms-keyword P90 Service Time 273.028259 255.49453 -6.86%
composite-date_histogram-daily Median Throughput 2.01 2.01 0.00%
composite-date_histogram-daily P90 Latency 6.72325397 7.18232918 6.39%
composite-date_histogram-daily P90 Service Time 5.25644445 5.63916636 6.79%
range Median Throughput 2.01 2.01 0.00%
range P90 Latency 31.6533613 41.1376572 23.06%
range P90 Service Time 30.062602 39.6963749 24.27%
range-numeric Median Throughput 2.01 2.01 0.00%
range-numeric P90 Latency 5.18209457 4.51481891 -14.78%
range-numeric P90 Service Time 3.61531794 2.95159245 -22.49%
keyword-in-range Median Throughput 2.01 2 -0.50%
keyword-in-range P90 Latency 100.460861 120.471775 16.61%
keyword-in-range P90 Service Time 99.1523285 119.213768 16.83%
date_histogram_hourly_agg Median Throughput 2.01 2 -0.50%
date_histogram_hourly_agg P90 Latency 23.4777098 23.7953453 1.33%
date_histogram_hourly_agg P90 Service Time 21.0084438 21.3953438 1.81%
date_histogram_minute_agg Median Throughput 2.01 2.01 0.00%
date_histogram_minute_agg P90 Latency 28.0777187 32.9099083 14.68%
date_histogram_minute_agg P90 Service Time 25.5714541 31.5339174 18.91%
scroll Median Throughput 45.76 45.1 -1.46%
scroll P90 Latency 13851.4175 15793.3145 12.30%
scroll P90 Service Time 541.88501 538.196137 -0.69%
query-string-on-message Median Throughput 2 1.99 -0.50%
query-string-on-message P90 Latency 219.272316 206.608742 -6.13%
query-string-on-message P90 Service Time 217.957298 205.188568 -6.22%
query-string-on-message-filtered Median Throughput 2.01 2 -0.50%
query-string-on-message-filtered P90 Latency 161.926392 178.792679 9.43%
query-string-on-message-filtered P90 Service Time 160.742134 177.700157 9.54%
query-string-on-message-filtered-sorted-num Median Throughput 2.01 2.01 0.00%
query-string-on-message-filtered-sorted-num P90 Latency 53.2001629 53.1595345 -0.08%
query-string-on-message-filtered-sorted-num P90 Service Time 51.7851219 51.7874584 0.00%
sort_keyword_can_match_shortcut Median Throughput 2.01 2.01 0.00%
sort_keyword_can_match_shortcut P90 Latency 6.09069538 7.55181289 19.35%
sort_keyword_can_match_shortcut P90 Service Time 4.56942153 6.08352947 24.89%
sort_keyword_no_can_match_shortcut Median Throughput 2.01 2.01 0.00%
sort_keyword_no_can_match_shortcut P90 Latency 7.90096712 7.78420496 -1.50%
sort_keyword_no_can_match_shortcut P90 Service Time 6.33617401 6.2607739 -1.20%
sort_numeric_desc Median Throughput 2.01 2.01 0.00%
sort_numeric_desc P90 Latency 28.6591969 29.2075043 1.88%
sort_numeric_desc P90 Service Time 26.4250574 28.2246494 6.38%
sort_numeric_asc Median Throughput 2.01 2.01 0.00%
sort_numeric_asc P90 Latency 34.2001858 35.5184078 3.71%
sort_numeric_asc P90 Service Time 32.6845799 34.1878948 4.40%
sort_numeric_desc_with_match Median Throughput 2.01 2.01 0.00%
sort_numeric_desc_with_match P90 Latency 7.41214299 7.47790742 0.88%
sort_numeric_desc_with_match P90 Service Time 5.86182141 6.18104601 5.16%
sort_numeric_asc_with_match Median Throughput 2.01 2.01 0.00%
sort_numeric_asc_with_match P90 Latency 7.0982244 7.32826781 3.14%
sort_numeric_asc_with_match P90 Service Time 5.52601242 5.86133695 5.72%
range_field_conjunction_big_range_big_term_query Median Throughput 2.01 2.01 0.00%
range_field_conjunction_big_range_big_term_query P90 Latency 5.79680395 5.44469404 -6.47%
range_field_conjunction_big_range_big_term_query P90 Service Time 4.18014216 3.92662251 -6.46%
range_field_disjunction_big_range_small_term_query Median Throughput 2.01 2.01 0.00%
range_field_disjunction_big_range_small_term_query P90 Latency 7.11017346 7.63937259 6.93%
range_field_disjunction_big_range_small_term_query P90 Service Time 5.80316663 6.1090374 5.01%
range_field_conjunction_small_range_small_term_query Median Throughput 2.01 2.01 0.00%
range_field_conjunction_small_range_small_term_query P90 Latency 6.73654699 7.45743656 9.67%
range_field_conjunction_small_range_small_term_query P90 Service Time 5.16287255 5.98689055 13.76%
range_field_conjunction_small_range_big_term_query Median Throughput 2.01 2.01 0.00%
range_field_conjunction_small_range_big_term_query P90 Latency 5.312356 5.23332763 -1.51%
range_field_conjunction_small_range_big_term_query P90 Service Time 3.77940905 3.63290751 -4.03%
range-auto-date-histo Median Throughput 0.06 0.07 14.29%
range-auto-date-histo P90 Latency 4735073.5 3814395.38 -24.14%
range-auto-date-histo P90 Service Time 16842.9082 13730.8794 -22.66%
range-auto-date-histo-with-metrics Median Throughput 0.03 0.03 0.00%
range-auto-date-histo-with-metrics P90 Latency 10241987 9293074.5 -10.21%
range-auto-date-histo-with-metrics P90 Service Time 36089.3984 32727.1055 -10.27%
range-agg-1 Median Throughput 2.01 2.01 0.00%
range-agg-1 P90 Latency 5.10684395 5.88857794 13.28%
range-agg-1 P90 Service Time 3.55634403 4.30167484 17.33%
range-agg-2 Median Throughput 2.01 2.01 0.00%
range-agg-2 P90 Latency 5.70763207 5.99885202 4.85%
range-agg-2 P90 Service Time 4.10150647 4.38907647 6.55%
cardinality-agg-low Median Throughput 2.01 2.01 0.00%
cardinality-agg-low P90 Latency 8.77825785 9.02456617 2.73%
cardinality-agg-low P90 Service Time 7.31217814 7.52891541 2.88%
cardinality-agg-high Median Throughput 0.22 0.27 18.52%
cardinality-agg-high P90 Latency 1176064.63 937908.938 -25.39%
cardinality-agg-high P90 Service Time 4568.29663 3744.09534 -22.01%

The following operations are worse with the balanced docs slice supplier. Looking at the queries, most of the queries with regressions contain a term query, a descending sort, a range on timestamps, or a match/match-all query.

  • default
  • desc_sort_timestamp
  • desc_sort_timestamp_can_match_shortcut (slightly worse)
  • term
  • multi_terms-keyword (much worse)
  • keyword-terms (p99, p100 much worse)
  • keyword-terms-low-cardinality (p99, p100 much worse)
  • composite-date_histogram-daily
  • range
  • query-string-on-message-filtered
  • sort_keyword_can_match_shortcut
  • sort_keyword_no_can_match_shortcut
  • sort_numeric_asc

If we don't want to add another setting, we could choose the slice supplier in the shouldUseMaxTargetSlice() method based on information from the searchContext object. This logic feels somewhat brittle to me, though, since we're extrapolating performance characteristics from a single benchmark.
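
A rough sketch of what that kind of branching could look like (purely illustrative; the types and predicates are stand-ins, and this is exactly the brittleness being flagged):

```java
// Stand-ins for the relevant bits of OpenSearch's SearchContext; not the real classes.
interface SearchRequestView {
    boolean hasScroll();        // scroll showed clear regressions, keep round robin there
    boolean hasAggregations();  // bucket aggregations also showed regressions in some runs
}

enum SliceSupplierKind { MAX_TARGET, BALANCED_DOCS }

final class HeuristicSliceChooser {
    // Hypothetical heuristic derived from the big5 observations above; benchmark-specific
    // branching like this is the brittleness concern, which is why a setting may be safer.
    static SliceSupplierKind choose(SearchRequestView request) {
        if (request.hasScroll() || request.hasAggregations()) {
            return SliceSupplierKind.MAX_TARGET;
        }
        return SliceSupplierKind.BALANCED_DOCS;
    }
}
```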

@msfroh , what do you think based on these additional benchmarks?

@finnroblin (Author)

Will address code coverage in next revision.

@jed326 (Collaborator) commented Mar 28, 2025

Thanks @finnroblin, I had a few thoughts on this.

  1. As you've identified, the MaxTargetSliceSupplier suffers when there is a wide variance in segment sizes. In this PR we are trying to address this at search time, but there are also some in-flight efforts to address this on the indexing side. For example, I wonder how your performance changes would look after Increasing Floor Segment Size to 16MB #17699 has been merged?
  2. I also echo the concerns about adding both a new cluster and index setting for this, especially as, from your performance analysis, whether or not we see improvements with this change does seem to at least somewhat depend on the query type. Broadly speaking, I think most bucket aggregations, especially terms and multi-terms aggregations, are susceptible to performance regressions in this case due to the additional reduce work needed if there are more segments in a slice. Have you evaluated what other options there are for rolling this out besides new cluster/index settings? For example, a search request parameter, or even a ConcurrentSearchRequestDecider type of query visitor pattern?
  3. In the cases where we do see regressions, I am wondering if it is because of increased tail latencies or increased average latencies. The minimum latency of a concurrent search request is the longest time it takes to process any given slice, so I'd be curious to see whether, in these cases, a single slice is taking longer or all slices are taking longer on average. I think that could give some hints as to why there actually are regressions.

@finnroblin (Author) commented Mar 28, 2025

Thanks @jed326 for the thoughts!

  1. "For example, I wonder how your performance changes would look after Increasing Floor Segment Size to 16MB #17699 as been merged?"
    @kotwanikunal actually tested this change on a vector search workload with the 16mb segment floor and the derived source changes. The biggest change when the floor segment was added was a ~30% gain in minimum throughput. The latency benefits in the mixed experiment were likely driven by faster concurrent segment search, since both the mixed experiment and this balanced docs benchmark saw ~13% improvements in latency. On the other hand, I benchmarked the floor segment change individually and saw ~5-7% decreases in common case latency. So I think the floor segment change and this slice supplier change benefit performance when combined.

| Metric | Operation | Unit | Default with CSS | With Changes (CSS) |
|---|---|---|---|---|
| Segment count | | | 213 | 125 |
| Min Throughput | prod-queries | ops/s | 64.77 | 85.23333 |
| Mean Throughput | prod-queries | ops/s | 84.17 | 93.97 |
| Median Throughput | prod-queries | ops/s | 85.19333 | 94.42333 |
| Max Throughput | prod-queries | ops/s | 86.25 | 95.38 |
| 50th percentile latency | prod-queries | ms | 9.84013 | 8.76297 |
| 90th percentile latency | prod-queries | ms | 11.25363 | 9.8072 |
| 99th percentile latency | prod-queries | ms | 12.74447 | 11.15153 |
| 99.9th percentile latency | prod-queries | ms | 29.93143 | 21.17413 |
| 99.99th percentile latency | prod-queries | ms | 82.9093 | 56.73783 |
| 100th percentile latency | prod-queries | ms | 126.3996 | 66.8019 |
| 50th percentile service time | prod-queries | ms | 9.84013 | 8.76297 |
| 90th percentile service time | prod-queries | ms | 11.25363 | 9.8072 |
| 99th percentile service time | prod-queries | ms | 12.74447 | 11.15153 |
| 99.9th percentile service time | prod-queries | ms | 29.93143 | 21.17413 |
| 99.99th percentile service time | prod-queries | ms | 82.9093 | 56.73783 |
| 100th percentile service time | prod-queries | ms | 126.3996 | 66.8019 |
| error rate | prod-queries | % | 0 | 0 |
| Mean recall@k | prod-queries | | 0.95667 | 0.95 |
| Mean recall@1 | prod-queries | | 0.98 | 0.98 |

2/3 -- Performance regressions
Good point about the reduce work spanning more segments within a slice; that could well be the reason for the performance regressions. There's not a clear pattern of regression: some ops are worse in the common case, others in the tail case. Regressions in the fast (<10ms) ops could also be driven by the additional overhead of the priority queue.

Here's a summary table:

| Performance Category | Operations | Impact Notes |
|---|---|---|
| Worse in Common Case (p50/p90) | multi_terms-keyword, composite_terms-keyword, scroll, term, query-string-on-message, query-string-on-message-filtered, range-agg-2, cardinality-agg-low, date_histogram_minute_agg | multi_terms-keyword: significant degradation (41.41% worse at p50); scroll: notable impact (13.50% worse at p50); others show moderate degradation (2-10% range) |
| Worse in Tail Latency (p99/p100) | keyword-terms, keyword-terms-low-cardinality, composite-terms, range, sort_numeric_desc, range-auto-date-histo | keyword-terms: ~18% worse at p99/p100; keyword-terms-low-cardinality: ~20% worse at p99/p100; others show varying degrees of tail latency degradation |
| Consistently Worse Across All Percentiles | multi_terms-keyword, scroll, composite_terms-keyword | These operations showed degradation across both common-case and tail latency metrics |

2 -- other ways to roll out the change
I still don't think it's a good idea to enable this slicing mechanism by default due to the performance regressions. We could change the shouldUseMaxTargetSlice() logic to use the max target slice supplier in some cases and the balanced docs supplier in others if there's a pattern (unfortunately, the regressions seem to extend beyond bucket aggregations).

A search request option is an interesting idea, but I'm not sure it sidesteps the apprehension about adding additional configuration options. From an implementation point of view, the option would need to be obtainable from the searchContext.

If we use KNNConcurrentSearchRequestDecider then I think we must still add a hook on the OpenSearch side (either a cluster/index setting or a search request option) to use the balanced docs supplier. If it's an index setting, we could use the getAdditionalIndexSettings method on the k-NN side to enable it and get the performance benefits for k-NN.
