Add Balanced Docs Slice Allocator to increase document balance for Concurrent Segment Search #17687

Open · wants to merge 11 commits into base: main
Conversation

@finnroblin commented Mar 25, 2025

Description

Adds a new slicing mechanism, BalancedDocsSliceSupplier, to concurrent segment search. This mechanism assigns leaves to slices greedily based on the number of documents in the leaf and slice. Each leaf is assigned to the slice with the current lowest number of documents. This mechanism achieves a more balanced distribution of work compared to the preexisting MaxTargetSliceSupplier.

MaxTargetSliceSupplier sorts the leaves by document count and then distributes the sorted leaves to slices in round-robin order. This round-robin distribution can produce uneven document counts, with a few very large slices and a few very small ones. The proposed BalancedDocsSliceSupplier creates a more balanced distribution of document counts across search slices.
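
To make the greedy rule concrete, here is a minimal, self-contained sketch of the idea (illustrative only: the class and method names are hypothetical, and this is not the BalancedDocsSliceSupplier code from this PR, which operates on Lucene leaf contexts rather than plain doc counts):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative sketch of greedy, doc-count-balanced slice assignment (hypothetical names).
final class BalancedSliceSketch {

    // A leaf reduced to the one property the heuristic balances on: its document count.
    record Leaf(String name, int docCount) {}

    // A slice accumulates leaves and tracks its running document total.
    static final class Slice {
        final List<Leaf> leaves = new ArrayList<>();
        long totalDocs = 0;
    }

    static List<Slice> assign(List<Leaf> leaves, int targetSliceCount) {
        // Min-heap keyed on each slice's current document total.
        PriorityQueue<Slice> heap = new PriorityQueue<>(Comparator.comparingLong((Slice s) -> s.totalDocs));
        for (int i = 0; i < targetSliceCount; i++) {
            heap.add(new Slice());
        }
        // The PR text specifies only the "assign to the currently smallest slice" rule,
        // so leaves are taken in the order given.
        for (Leaf leaf : leaves) {
            Slice smallest = heap.poll(); // slice with the fewest documents so far
            smallest.leaves.add(leaf);
            smallest.totalDocs += leaf.docCount();
            heap.add(smallest);           // re-insert with its updated total
        }
        return new ArrayList<>(heap);
    }
}
```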

Benchmarks on a vector search workload show that this method improves search throughput and latency compared to other slicing mechanisms. Benchmarks on the big5 workload show that this method is better than maxTargetSliceSupplier in some cases but worse or comparable in others (scroll queries are worse, mixed results for range aggregations). Please see the benchmarking subsections for experiment details and more commentary.

I propose adding this slicing mechanism as a concurrent segment search setting for the following reasons:

  • It shows demonstrated gains for CSS on k-NN indices but mixed results on the big5 workload, so it should be opt-in rather than the default.
  • Isn't another setting for a low-level implementation detail bad? Yes, but a user who opts into custom OpenSearch CSS must already set the max_slice_count parameter to use MaxTargetSliceSupplier, so there is precedent for an additional setting that adjusts CSS behavior relative to the Lucene default (an example of the relevant settings follows this list). We could call out the new slicing option in the documentation, similar to how we already call out the choice between Lucene slicing and round-robin slicing.
  • If this is introduced as a setting then k-NN can enable it with index setting listeners via the getAdditionalIndexSettings() hook.
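
For reference, opting in would look roughly like the following, assuming the settings are applied dynamically through the cluster settings API (search.concurrent_segment_search.mode and search.concurrent.max_slice_count are the existing CSS settings; search.concurrent.experimental_slicing.enable is the key proposed in this PR):

```json
PUT _cluster/settings
{
  "persistent": {
    "search.concurrent_segment_search.mode": "all",
    "search.concurrent.max_slice_count": 8,
    "search.concurrent.experimental_slicing.enable": true
  }
}
```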

Vector Search Benchmarks

Vector search benchmarks on a dataset with 10M 768-dimension vectors show search throughput increases of 7-10% and search latency decreases of 12-16%. The comparison was performed on a 3x r6g.4xl cluster with 8 slices. Full benchmark results available here.

| Metric | Task | Round Robin (MaxTargetSliceSupplier) | Balanced Docs | Percent difference |
|---|---|---|---|---|
| Segment count | | 213 | 213 | |
| Min Throughput | prod-queries | 71.3233333 | 76.6966667 | 7.01% |
| Mean Throughput | prod-queries | 81.5766667 | 90.6933333 | 10.05% |
| Median Throughput | prod-queries | 81.9866667 | 91.56 | 10.46% |
| Max Throughput | prod-queries | 83.3 | 92.7766667 | 10.21% |
| 50th percentile latency | prod-queries | 10.3225333 | 9.13488667 | -13.00% |
| 90th percentile latency | prod-queries | 11.6630333 | 10.3315433 | -12.89% |
| 99th percentile latency | prod-queries | 13.7183667 | 12.0862667 | -13.50% |
| 99.9th percentile latency | prod-queries | 32.4364333 | 27.7666 | -16.82% |
| 99.99th percentile latency | prod-queries | 70.9776667 | 62.4047333 | -13.74% |
| 100th percentile latency | prod-queries | 82.4167 | 73.5012667 | -12.13% |
| 50th percentile service time | prod-queries | 10.3225333 | 9.13488667 | -13.00% |
| 90th percentile service time | prod-queries | 11.6630333 | 10.3315433 | -12.89% |
| 99th percentile service time | prod-queries | 13.7183667 | 12.0862667 | -13.50% |
| 99.9th percentile service time | prod-queries | 32.4364333 | 27.7666 | -16.82% |
| 99.99th percentile service time | prod-queries | 70.9776667 | 62.4047333 | -13.74% |
| 100th percentile service time | prod-queries | 82.4167 | 73.5012667 | -12.13% |
| error rate | prod-queries | 0 | 0 | |
| Mean recall@k | prod-queries | 0.95 | 0.95 | |
| Mean recall@1 | prod-queries | 0.97333333 | 0.97 | |

Big5 Benchmarks

Big5 benchmarks were performed against a single-node cluster with an r5.xl instance (same as in the GH benchmark configs). Only 2 target slices were used due to the small size of the cluster (4 vCPU). I used the big5-100 corpus. A table comparing median throughput, p90 latency, and p90 service time for each operation is attached at the bottom.

I will run the big5 workload with the documents-1000 corpus on a large 3x r6g.4xl cluster with 8 target slices to test more performance characteristics. It will take a few hours to get the results so I am opening the PR now with the more limited big5 results.

The following operations had worse performance with the balanced docs slice supplier. I dug into the queries with performance regressions. Besides the scroll operation, all of the regressed operations returned 0 hits, so the higher latencies are likely a result of the extra overhead of using a priority queue to assign leaves to slices instead of a sort plus round-robin distribution. More benchmarking with documents-1000 should better reveal the performance differences.

Operations with worse performance:

  • range-auto-date-histo
  • term
  • multi_terms-keyword (much worse; although keyword-terms is better with the contender changes)
  • keyword-terms-low-cardinality
  • composite-terms
  • composite_terms-keyword (much worse).
  • range
  • date_histogram_minute_agg
  • scroll (much worse). I added a check in shouldUseMaxTargetSlice() to fall back to MaxTargetSliceSupplier whenever a scroll context is present (a rough sketch of this guard follows this list).
  • query-string-on-message
  • query-string-on-message-filtered
  • sort_numeric_desc
  • range-agg-2
  • cardinality-agg-low
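
As a rough illustration of the scroll guard mentioned above (stand-in types so the sketch is self-contained; in the real code this would consult org.opensearch.search.internal.SearchContext inside shouldUseMaxTargetSlice()):

```java
// Minimal stand-ins; these are not the OpenSearch classes themselves.
interface ScrollContext {}

interface SearchContextView {
    ScrollContext scrollContext(); // assumed to be null when the request is not a scroll
}

final class SliceSupplierGuard {
    // Hypothetical guard: keep the existing MaxTargetSliceSupplier for scroll requests,
    // use balanced-docs slicing otherwise (when the experimental setting is enabled).
    static boolean useBalancedDocsSlicing(SearchContextView context, boolean experimentalSlicingEnabled) {
        return experimentalSlicingEnabled && context.scrollContext() == null;
    }
}
```
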
Brief comparison of all big5 operations (RR = round robin MaxTargetSliceSupplier, BD = BalancedDocsSliceSupplier)

(Full results and comparison available here).

Operation Metric RR BD Pct Diff
desc_sort_timestamp Median Throughput 2 2.01 0.50%
desc_sort_timestamp P90 Latency 24.7062233 23.2023429 -6.48%
desc_sort_timestamp P90 Service Time 22.6351037 20.8071975 -8.78%
asc_sort_timestamp Median Throughput 2.01 2.01 0.00%
asc_sort_timestamp P90 Latency 57.4730759 54.710749 -5.05%
asc_sort_timestamp P90 Service Time 56.303588 53.448578 -5.34%
desc_sort_with_after_timestamp Median Throughput 2.01 2.01 0.00%
desc_sort_with_after_timestamp P90 Latency 9.20452784 8.4762934 -8.59%
desc_sort_with_after_timestamp P90 Service Time 7.82590412 6.95620238 -12.50%
asc_sort_with_after_timestamp Median Throughput 2.01 2.01 0.00%
asc_sort_with_after_timestamp P90 Latency 20.422867 20.048872 -1.87%
asc_sort_with_after_timestamp P90 Service Time 18.9271991 18.5687515 -1.93%
desc_sort_timestamp_can_match_shortcut Median Throughput 2 2 0.00%
desc_sort_timestamp_can_match_shortcut P90 Latency 22.7059584 22.5943716 -0.49%
desc_sort_timestamp_can_match_shortcut P90 Service Time 20.4673676 20.309485 -0.78%
desc_sort_timestamp_no_can_match_shortcut Median Throughput 2.01 2.01 0.00%
desc_sort_timestamp_no_can_match_shortcut P90 Latency 23.231273 22.9051914 -1.42%
desc_sort_timestamp_no_can_match_shortcut P90 Service Time 20.9324122 20.4913126 -2.15%
asc_sort_timestamp_can_match_shortcut Median Throughput 2.01 2.01 0.00%
asc_sort_timestamp_can_match_shortcut P90 Latency 21.4486304 20.679081 -3.72%
asc_sort_timestamp_can_match_shortcut P90 Service Time 19.4117815 18.9196314 -2.60%
asc_sort_timestamp_no_can_match_shortcut Median Throughput 2.01 2.01 0.00%
asc_sort_timestamp_no_can_match_shortcut P90 Latency 22.0456534 20.4257297 -7.93%
asc_sort_timestamp_no_can_match_shortcut P90 Service Time 19.5836582 18.9208364 -3.50%
term Median Throughput 2.01 2.01 0.00%
term P90 Latency 4.98427876 5.02946671 0.90%
term P90 Service Time 3.52703551 3.53937903 0.35%
multi_terms-keyword Median Throughput 1.86 1.82 -2.20%
multi_terms-keyword P90 Latency 10612.5766 13827.4802 23.25%
multi_terms-keyword P90 Service Time 539.614216 554.750199 2.73%
keyword-terms Median Throughput 2 2 0.00%
keyword-terms P90 Latency 47.4260308 46.8147705 -1.31%
keyword-terms P90 Service Time 46.1887779 45.6283599 -1.23%
keyword-terms-low-cardinality Median Throughput 2.01 2.01 0.00%
keyword-terms-low-cardinality P90 Latency 39.4004102 41.209351 4.39%
keyword-terms-low-cardinality P90 Service Time 37.7886517 40.227164 6.06%
composite-terms Median Throughput 2 2 0.00%
composite-terms P90 Latency 174.72662 175.8952 0.66%
composite-terms P90 Service Time 173.260788 174.78259 0.87%
composite_terms-keyword Median Throughput 2 2 0.00%
composite_terms-keyword P90 Latency 262.62395 328.517717 20.06%
composite_terms-keyword P90 Service Time 261.752854 327.55114 20.09%
composite-date_histogram-daily Median Throughput 2.01 2.01 0.00%
composite-date_histogram-daily P90 Latency 6.53840693 6.61628689 1.18%
composite-date_histogram-daily P90 Service Time 5.08705092 5.29582438 3.94%
range Median Throughput 2.01 2.01 0.00%
range P90 Latency 48.2831966 49.2544756 1.97%
range P90 Service Time 47.0378465 48.1631258 2.34%
range-numeric Median Throughput 2.01 2.01 0.00%
range-numeric P90 Latency 4.29281602 4.11282149 -4.38%
range-numeric P90 Service Time 2.72757907 2.55417822 -6.79%
keyword-in-range Median Throughput 2 2 0.00%
keyword-in-range P90 Latency 157.12988 151.965757 -3.40%
keyword-in-range P90 Service Time 155.747702 150.983875 -3.16%
date_histogram_hourly_agg Median Throughput 2.01 2.01 0.00%
date_histogram_hourly_agg P90 Latency 10.4533635 10.1848669 -2.64%
date_histogram_hourly_agg P90 Service Time 8.93273024 8.86720689 -0.74%
date_histogram_minute_agg Median Throughput 2.01 2.01 0.00%
date_histogram_minute_agg P90 Latency 45.1442083 47.7756897 5.51%
date_histogram_minute_agg P90 Service Time 44.0200076 46.4204891 5.17%
scroll Median Throughput 47.27 45.85 -3.10%
scroll P90 Latency 8345.62109 13125.0181 36.41%
scroll P90 Service Time 522.262164 537.771689 2.88%
query-string-on-message Median Throughput 2 2 0.00%
query-string-on-message P90 Latency 87.0531196 88.8102339 1.98%
query-string-on-message P90 Service Time 86.2746482 87.4661455 1.36%
query-string-on-message-filtered Median Throughput 2 2 0.00%
query-string-on-message-filtered P90 Latency 126.427841 136.793502 7.58%
query-string-on-message-filtered P90 Service Time 125.370007 135.31314 7.35%
query-string-on-message-filtered-sorted-num Median Throughput 2.01 2 -0.50%
query-string-on-message-filtered-sorted-num P90 Latency 87.0696074 85.989187 -1.26%
query-string-on-message-filtered-sorted-num P90 Service Time 84.7262213 83.6267326 -1.31%
sort_keyword_can_match_shortcut Median Throughput 2.01 2.01 0.00%
sort_keyword_can_match_shortcut P90 Latency 6.0656352 6.001189 -1.07%
sort_keyword_can_match_shortcut P90 Service Time 4.5905493 4.4543399 -3.06%
sort_keyword_no_can_match_shortcut Median Throughput 2.01 2.01 0.00%
sort_keyword_no_can_match_shortcut P90 Latency 5.93450632 6.15424532 3.57%
sort_keyword_no_can_match_shortcut P90 Service Time 4.40459373 4.58719721 3.98%
sort_numeric_desc Median Throughput 2.01 2.01 0.00%
sort_numeric_desc P90 Latency 8.56299029 9.07405812 5.63%
sort_numeric_desc P90 Service Time 7.41191246 7.58723162 2.31%
sort_numeric_asc Median Throughput 2.01 2.01 0.00%
sort_numeric_asc P90 Latency 10.9239754 10.5399867 -3.64%
sort_numeric_asc P90 Service Time 9.452327 9.1964775 -2.78%
sort_numeric_desc_with_match Median Throughput 2.01 2.01 0.00%
sort_numeric_desc_with_match P90 Latency 4.49225422 4.0823272 -10.04%
sort_numeric_desc_with_match P90 Service Time 2.91509738 2.53588829 -14.95%
sort_numeric_asc_with_match Median Throughput 2.01 2.01 0.00%
sort_numeric_asc_with_match P90 Latency 4.49943779 4.42200442 -1.75%
sort_numeric_asc_with_match P90 Service Time 2.96980766 2.95181983 -0.61%
range_field_conjunction_big_range_big_term_query Median Throughput 2.01 2.01 0.00%
range_field_conjunction_big_range_big_term_query P90 Latency 4.21921783 4.51458221 6.54%
range_field_conjunction_big_range_big_term_query P90 Service Time 2.61447511 2.94693729 11.28%
range_field_disjunction_big_range_small_term_query Median Throughput 2.01 2.01 0.00%
range_field_disjunction_big_range_small_term_query P90 Latency 4.23597348 4.5576627 7.06%
range_field_disjunction_big_range_small_term_query P90 Service Time 2.66216472 3.0180722 11.79%
range_field_conjunction_small_range_small_term_query Median Throughput 2.01 2.01 0.00%
range_field_conjunction_small_range_small_term_query P90 Latency 4.0834622 4.72623301 13.60%
range_field_conjunction_small_range_small_term_query P90 Service Time 2.54099626 3.12980449 18.81%
range_field_conjunction_small_range_big_term_query Median Throughput 2.01 2.01 0.00%
range_field_conjunction_small_range_big_term_query P90 Latency 3.9624502 4.05092265 2.18%
range_field_conjunction_small_range_big_term_query P90 Service Time 2.4877071 2.46797339 -0.80%
range-auto-date-histo Median Throughput 0.25 0.25 0.00%
range-auto-date-histo P90 Latency 1033484.42 1037377.72 0.38%
range-auto-date-histo P90 Service Time 4133.72519 4119.05572 -0.36%
range-auto-date-histo-with-metrics Median Throughput 0.09 0.09 0.00%
range-auto-date-histo-with-metrics P90 Latency 3123086.77 3080568.57 -1.38%
range-auto-date-histo-with-metrics P90 Service Time 11571.4685 11197.4767 -3.34%
range-agg-1 Median Throughput 2.01 2.01 0.00%
range-agg-1 P90 Latency 4.4221091 4.35349254 -1.58%
range-agg-1 P90 Service Time 2.88510949 2.81368511 -2.54%
range-agg-2 Median Throughput 2.01 2.01 0.00%
range-agg-2 P90 Latency 4.3828232 4.8581396 9.78%
range-agg-2 P90 Service Time 2.81660126 3.35686863 16.09%
cardinality-agg-low Median Throughput 2.01 2.01 0.00%
cardinality-agg-low P90 Latency 6.4378883 6.93646969 7.19%
cardinality-agg-low P90 Service Time 5.38352195 5.50142419 2.14%
cardinality-agg-high Median Throughput 0.8 0.8 0.00%
cardinality-agg-high P90 Latency 218262.798 216991.479 -0.59%
cardinality-agg-high P90 Service Time 1268.53249 1261.44235 -0.56%

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

kotwanikunal and others added 9 commits March 12, 2025 11:29
…ghtly to account for non-ordered heap container vs previous ordered arraylist
❌ Gradle check result for 99803c0: FAILURE

Comment on lines +311 to +313
public static final String CONCURRENT_SEGMENT_SEARCH_USE_EXPERIMENTAL_SLICING_KEY = "search.concurrent.experimental_slicing.enable";
public static final boolean CONCURRENT_SEGMENT_SEARCH_USE_EXPERIMENTAL_SLICING_DEFAULT_VALUE = false;
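
For context, a flag like the one quoted above is typically wired up as a dynamic, node-scoped boolean cluster setting via org.opensearch.common.settings.Setting; a hedged sketch (whether this PR registers it exactly this way is an assumption):

```java
// Hypothetical registration of the constants quoted above as a dynamic cluster setting.
public static final Setting<Boolean> CONCURRENT_SEGMENT_SEARCH_USE_EXPERIMENTAL_SLICING_SETTING = Setting.boolSetting(
    CONCURRENT_SEGMENT_SEARCH_USE_EXPERIMENTAL_SLICING_KEY,
    CONCURRENT_SEGMENT_SEARCH_USE_EXPERIMENTAL_SLICING_DEFAULT_VALUE,
    Setting.Property.Dynamic,
    Setting.Property.NodeScope
);
```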

Collaborator

Do we want to gate this behind a setting when this ships? If we can avoid shipping yet another cluster setting, it would be nice.

I suspect that this approach may help (or at least not hurt) the existing concurrent segment search logic. In that case, we should just replace the old slice supplier.

I like having the setting in place during development/testing so that we can compare the old/new implementations easily.

Author

I agree that one fewer setting is desirable and would prefer not to gate it behind a setting when it ships, but I'm worried about performance on non-k-NN use cases compared to the preexisting MaxTargetSliceSupplier. Right now I'm running big5-1000 on a larger cluster with 8 slices to see how performance changes as we scale up cluster size. I'll add a comment when these runs are finished, and hopefully the balanced slicing is clearly better.

For now, I noticed some performance regressions on the big5 workload for scroll and a few other operations when the balanced docs slice supplier is enabled. The operations (besides scroll) that had performance regressions returned 0 hits for their queries (with the documents-100 corpus). This might be due to the added overhead of a priority queue, a consequence of the different slice assignment policies, or something else. As another caveat, the big5 benchmarks were run on a small r5.xl instance with only 2 slices.

We could also, as I believe you suggested before, add branching logic in the shouldUseMaxTargetSupplier method to decide between round robin and balanced based on any patterns in the performance benchmarks (for instance, whether a scrollContext is present). I'm hesitant to make that decision based on just the big5 workload across two clusters, though.

❌ Gradle check result for a07f46d: FAILURE

✅ Gradle check result for c4403f7: SUCCESS

codecov bot commented Mar 26, 2025

Codecov Report

Attention: Patch coverage is 87.80488% with 5 lines in your changes missing coverage. Please review.

Project coverage is 72.37%. Comparing base (2ee8660) to head (c4403f7).
Report is 52 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...va/org/opensearch/search/DefaultSearchContext.java | 71.42% | 1 Missing and 1 partial ⚠️ |
| ...rch/search/internal/BalancedDocsSliceSupplier.java | 92.30% | 1 Missing and 1 partial ⚠️ |
| ...nsearch/search/internal/FilteredSearchContext.java | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #17687      +/-   ##
============================================
- Coverage     72.43%   72.37%   -0.07%     
+ Complexity    65694    65681      -13     
============================================
  Files          5311     5312       +1     
  Lines        304937   304975      +38     
  Branches      44226    44231       +5     
============================================
- Hits         220872   220712     -160     
- Misses        65912    66202     +290     
+ Partials      18153    18061      -92     


@finnroblin (Author)

Here are performance results for the big5 documents-1000 workload. This is the biggest workload that I could find to test OpenSearch's functionality. I ran the test with a 3x r6g.4xl cluster and the following settings: {"search.concurrent_segment_search.mode": "all", "search.concurrent.max_slice_count": 8}. I performed 1 indexing run and 2 big5 runs. The first big5 run used search.concurrent.experimental_slicing.enable: true to test the balanced docs slice supplier, and the second big5 run set it to false to get a baseline for MaxTargetSliceSupplier.

Here are median throughput, p90 latencies, and p90 service times for the operations (RR = round robin MaxTargetSliceSupplier, BD = BalancedDocsSliceSupplier; full results in big5-documents-1000-comparison.xlsx):

Operation Metric RR BD Pct Diff
default Median Throughput 2.01 2.01 0.00%
default P90 Latency 6.40559793 7.83195782 18.21%
default P90 Service Time 4.84854794 6.40473151 24.30%
desc_sort_timestamp Median Throughput 2.01 2 -0.50%
desc_sort_timestamp P90 Latency 7.11152005 9.21213722 22.80%
desc_sort_timestamp P90 Service Time 5.57644439 7.6997695 27.58%
asc_sort_timestamp Median Throughput 1.33 1.39 4.32%
asc_sort_timestamp P90 Latency 73596.4219 62384.2598 -17.97%
asc_sort_timestamp P90 Service Time 749.490204 703.682312 -6.51%
desc_sort_with_after_timestamp Median Throughput 2.01 2 -0.50%
desc_sort_with_after_timestamp P90 Latency 66.8119507 63.5714169 -5.10%
desc_sort_with_after_timestamp P90 Service Time 65.4834175 62.310585 -5.09%
asc_sort_with_after_timestamp Median Throughput 1.43 1.53 6.54%
asc_sort_with_after_timestamp P90 Latency 57899.4551 45278.1914 -27.87%
asc_sort_with_after_timestamp P90 Service Time 696.889435 653.127319 -6.70%
desc_sort_timestamp_can_match_shortcut Median Throughput 2.01 2 -0.50%
desc_sort_timestamp_can_match_shortcut P90 Latency 14.1913858 14.8691702 4.56%
desc_sort_timestamp_can_match_shortcut P90 Service Time 12.6752672 13.4758678 5.94%
desc_sort_timestamp_no_can_match_shortcut Median Throughput 2.01 2.01 0.00%
desc_sort_timestamp_no_can_match_shortcut P90 Latency 14.741756 14.670032 -0.49%
desc_sort_timestamp_no_can_match_shortcut P90 Service Time 13.1524077 13.1667752 0.11%
asc_sort_timestamp_can_match_shortcut Median Throughput 2.01 1.99 -1.01%
asc_sort_timestamp_can_match_shortcut P90 Latency 131.598747 125.168289 -5.14%
asc_sort_timestamp_can_match_shortcut P90 Service Time 130.205177 123.940346 -5.05%
asc_sort_timestamp_no_can_match_shortcut Median Throughput 2.01 2.01 0.00%
asc_sort_timestamp_no_can_match_shortcut P90 Latency 131.024666 124.725517 -5.05%
asc_sort_timestamp_no_can_match_shortcut P90 Service Time 129.664658 123.389187 -5.09%
term Median Throughput 2.01 2.01 0.00%
term P90 Latency 6.71847963 7.1291945 5.76%
term P90 Service Time 5.23884797 5.58661246 6.22%
multi_terms-keyword Median Throughput 1.51 1.28 -17.97%
multi_terms-keyword P90 Latency 48044.2051 81996.4414 41.41%
multi_terms-keyword P90 Service Time 662.659088 781.371338 15.19%
keyword-terms Median Throughput 2 2 0.00%
keyword-terms P90 Latency 202.335304 202.249527 -0.04%
keyword-terms P90 Service Time 201.164589 201.085526 -0.04%
keyword-terms-low-cardinality Median Throughput 2 2 0.00%
keyword-terms-low-cardinality P90 Latency 188.965157 190.029907 0.56%
keyword-terms-low-cardinality P90 Service Time 187.735619 188.83065 0.58%
composite-terms Median Throughput 2.01 2 -0.50%
composite-terms P90 Latency 162.717285 149.724503 -8.68%
composite-terms P90 Service Time 161.589409 148.514496 -8.80%
composite_terms-keyword Median Throughput 2 2 0.00%
composite_terms-keyword P90 Latency 274.286682 256.525284 -6.92%
composite_terms-keyword P90 Service Time 273.028259 255.49453 -6.86%
composite-date_histogram-daily Median Throughput 2.01 2.01 0.00%
composite-date_histogram-daily P90 Latency 6.72325397 7.18232918 6.39%
composite-date_histogram-daily P90 Service Time 5.25644445 5.63916636 6.79%
range Median Throughput 2.01 2.01 0.00%
range P90 Latency 31.6533613 41.1376572 23.06%
range P90 Service Time 30.062602 39.6963749 24.27%
range-numeric Median Throughput 2.01 2.01 0.00%
range-numeric P90 Latency 5.18209457 4.51481891 -14.78%
range-numeric P90 Service Time 3.61531794 2.95159245 -22.49%
keyword-in-range Median Throughput 2.01 2 -0.50%
keyword-in-range P90 Latency 100.460861 120.471775 16.61%
keyword-in-range P90 Service Time 99.1523285 119.213768 16.83%
date_histogram_hourly_agg Median Throughput 2.01 2 -0.50%
date_histogram_hourly_agg P90 Latency 23.4777098 23.7953453 1.33%
date_histogram_hourly_agg P90 Service Time 21.0084438 21.3953438 1.81%
date_histogram_minute_agg Median Throughput 2.01 2.01 0.00%
date_histogram_minute_agg P90 Latency 28.0777187 32.9099083 14.68%
date_histogram_minute_agg P90 Service Time 25.5714541 31.5339174 18.91%
scroll Median Throughput 45.76 45.1 -1.46%
scroll P90 Latency 13851.4175 15793.3145 12.30%
scroll P90 Service Time 541.88501 538.196137 -0.69%
query-string-on-message Median Throughput 2 1.99 -0.50%
query-string-on-message P90 Latency 219.272316 206.608742 -6.13%
query-string-on-message P90 Service Time 217.957298 205.188568 -6.22%
query-string-on-message-filtered Median Throughput 2.01 2 -0.50%
query-string-on-message-filtered P90 Latency 161.926392 178.792679 9.43%
query-string-on-message-filtered P90 Service Time 160.742134 177.700157 9.54%
query-string-on-message-filtered-sorted-num Median Throughput 2.01 2.01 0.00%
query-string-on-message-filtered-sorted-num P90 Latency 53.2001629 53.1595345 -0.08%
query-string-on-message-filtered-sorted-num P90 Service Time 51.7851219 51.7874584 0.00%
sort_keyword_can_match_shortcut Median Throughput 2.01 2.01 0.00%
sort_keyword_can_match_shortcut P90 Latency 6.09069538 7.55181289 19.35%
sort_keyword_can_match_shortcut P90 Service Time 4.56942153 6.08352947 24.89%
sort_keyword_no_can_match_shortcut Median Throughput 2.01 2.01 0.00%
sort_keyword_no_can_match_shortcut P90 Latency 7.90096712 7.78420496 -1.50%
sort_keyword_no_can_match_shortcut P90 Service Time 6.33617401 6.2607739 -1.20%
sort_numeric_desc Median Throughput 2.01 2.01 0.00%
sort_numeric_desc P90 Latency 28.6591969 29.2075043 1.88%
sort_numeric_desc P90 Service Time 26.4250574 28.2246494 6.38%
sort_numeric_asc Median Throughput 2.01 2.01 0.00%
sort_numeric_asc P90 Latency 34.2001858 35.5184078 3.71%
sort_numeric_asc P90 Service Time 32.6845799 34.1878948 4.40%
sort_numeric_desc_with_match Median Throughput 2.01 2.01 0.00%
sort_numeric_desc_with_match P90 Latency 7.41214299 7.47790742 0.88%
sort_numeric_desc_with_match P90 Service Time 5.86182141 6.18104601 5.16%
sort_numeric_asc_with_match Median Throughput 2.01 2.01 0.00%
sort_numeric_asc_with_match P90 Latency 7.0982244 7.32826781 3.14%
sort_numeric_asc_with_match P90 Service Time 5.52601242 5.86133695 5.72%
range_field_conjunction_big_range_big_term_query Median Throughput 2.01 2.01 0.00%
range_field_conjunction_big_range_big_term_query P90 Latency 5.79680395 5.44469404 -6.47%
range_field_conjunction_big_range_big_term_query P90 Service Time 4.18014216 3.92662251 -6.46%
range_field_disjunction_big_range_small_term_query Median Throughput 2.01 2.01 0.00%
range_field_disjunction_big_range_small_term_query P90 Latency 7.11017346 7.63937259 6.93%
range_field_disjunction_big_range_small_term_query P90 Service Time 5.80316663 6.1090374 5.01%
range_field_conjunction_small_range_small_term_query Median Throughput 2.01 2.01 0.00%
range_field_conjunction_small_range_small_term_query P90 Latency 6.73654699 7.45743656 9.67%
range_field_conjunction_small_range_small_term_query P90 Service Time 5.16287255 5.98689055 13.76%
range_field_conjunction_small_range_big_term_query Median Throughput 2.01 2.01 0.00%
range_field_conjunction_small_range_big_term_query P90 Latency 5.312356 5.23332763 -1.51%
range_field_conjunction_small_range_big_term_query P90 Service Time 3.77940905 3.63290751 -4.03%
range-auto-date-histo Median Throughput 0.06 0.07 14.29%
range-auto-date-histo P90 Latency 4735073.5 3814395.38 -24.14%
range-auto-date-histo P90 Service Time 16842.9082 13730.8794 -22.66%
range-auto-date-histo-with-metrics Median Throughput 0.03 0.03 0.00%
range-auto-date-histo-with-metrics P90 Latency 10241987 9293074.5 -10.21%
range-auto-date-histo-with-metrics P90 Service Time 36089.3984 32727.1055 -10.27%
range-agg-1 Median Throughput 2.01 2.01 0.00%
range-agg-1 P90 Latency 5.10684395 5.88857794 13.28%
range-agg-1 P90 Service Time 3.55634403 4.30167484 17.33%
range-agg-2 Median Throughput 2.01 2.01 0.00%
range-agg-2 P90 Latency 5.70763207 5.99885202 4.85%
range-agg-2 P90 Service Time 4.10150647 4.38907647 6.55%
cardinality-agg-low Median Throughput 2.01 2.01 0.00%
cardinality-agg-low P90 Latency 8.77825785 9.02456617 2.73%
cardinality-agg-low P90 Service Time 7.31217814 7.52891541 2.88%
cardinality-agg-high Median Throughput 0.22 0.27 18.52%
cardinality-agg-high P90 Latency 1176064.63 937908.938 -25.39%
cardinality-agg-high P90 Service Time 4568.29663 3744.09534 -22.01%

The following operations are worse with the balanced docs slice supplier. Looking at the queries, most of the queries with regressions contain a term query, a descending sort, a range on timestamps, or a match/match-all query.

  • default
  • desc_sort_timestamp
  • desc_sort_timestamp_can_match_shortcut (slightly worse)
  • term
  • multi_terms-keyword (much worse)
  • keyword-terms (p99, p100 much worse)
  • keyword-terms-low-cardinality (p99, p100 much worse)
  • composite-date_histogram-daily
  • range
  • query-string-on-message-filtered
  • sort_keyword_can_match_shortcut
  • sort_keyword_no_can_match_shortcut
  • sort_numeric_asc

If we don't want to add another setting, we could choose the slice supplier in the shouldUseMaxTargetSlice() method based on information from the searchContext object. This logic feels somewhat brittle to me, though, since we're extrapolating performance characteristics from a single benchmark.
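
A rough sketch of what that kind of branching could look like (purely illustrative; the types and predicates are stand-ins, and this is exactly the brittleness being flagged):

```java
// Stand-ins for the relevant bits of OpenSearch's SearchContext; not the real classes.
interface SearchRequestView {
    boolean hasScroll();        // scroll showed clear regressions, keep round robin there
    boolean hasAggregations();  // bucket aggregations also showed regressions in some runs
}

enum SliceSupplierKind { MAX_TARGET, BALANCED_DOCS }

final class HeuristicSliceChooser {
    // Hypothetical heuristic derived from the big5 observations above; benchmark-specific
    // branching like this is the brittleness concern, which is why a setting may be safer.
    static SliceSupplierKind choose(SearchRequestView request) {
        if (request.hasScroll() || request.hasAggregations()) {
            return SliceSupplierKind.MAX_TARGET;
        }
        return SliceSupplierKind.BALANCED_DOCS;
    }
}
```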

@msfroh , what do you think based on these additional benchmarks?

@finnroblin (Author)

Will address code coverage in next revision.

@jed326 (Collaborator) commented Mar 28, 2025

Thanks @finnroblin, I had a few thoughts on this.

  1. As you've identified, the MaxTargetSliceSupplier suffers when there is a wide variance in segment sizes. In this PR we are trying to address this at search time, but there are also some in-flight efforts to address this on the indexing side. For example, I wonder how your performance changes would look after Increasing Floor Segment Size to 16MB #17699 has been merged?
  2. I also echo the concerns about adding both a new cluster and index setting for this, especially as, from your performance analysis, whether or not we see improvements with this change does seem to at least somewhat depend on the query type. Broadly speaking, I think most bucket aggregations, especially terms and multi-terms aggregations, are susceptible to performance regressions in this case due to the additional reduce work needed if there are more segments in a slice. Have you evaluated what other options there are for rolling this out besides new cluster/index settings? For example, a search request parameter, or even a ConcurrentSearchRequestDecider type of query visitor pattern?
  3. In the cases where we do see regressions, I am wondering if it is because of increased tail latencies or increased average latencies. The minimum latency of a concurrent search request is the longest time it takes to process any given slice, so I'd be curious to see whether, in these cases, a single slice is taking longer or all slices are taking longer on average. I think that could give some hints as to why there actually are regressions.

@finnroblin (Author) commented Mar 28, 2025

Thanks @jed326 for the thoughts!

  1. "For example, I wonder how your performance changes would look after Increasing Floor Segment Size to 16MB #17699 as been merged?"
    @kotwanikunal actually tested this change on a vector search workload with the 16mb segment floor and the derived source changes. The biggest change when the floor segment was added was a ~30% gain in minimum throughput. The latency benefits in the mixed experiment were likely driven by faster concurrent segment search, since both the mixed experiment and this balanced docs benchmark saw ~13% improvements in latency. On the other hand, I benchmarked the floor segment change individually and saw ~5-7% decreases in common case latency. So I think the floor segment change and this slice supplier change benefit performance when combined.

| Metric | Operation | Unit | Default with CSS | With Changes (CSS) |
|---|---|---|---|---|
| Segment count | | | 213 | 125 |
| Min Throughput | prod-queries | ops/s | 64.77 | 85.23333 |
| Mean Throughput | prod-queries | ops/s | 84.17 | 93.97 |
| Median Throughput | prod-queries | ops/s | 85.19333 | 94.42333 |
| Max Throughput | prod-queries | ops/s | 86.25 | 95.38 |
| 50th percentile latency | prod-queries | ms | 9.84013 | 8.76297 |
| 90th percentile latency | prod-queries | ms | 11.25363 | 9.8072 |
| 99th percentile latency | prod-queries | ms | 12.74447 | 11.15153 |
| 99.9th percentile latency | prod-queries | ms | 29.93143 | 21.17413 |
| 99.99th percentile latency | prod-queries | ms | 82.9093 | 56.73783 |
| 100th percentile latency | prod-queries | ms | 126.3996 | 66.8019 |
| 50th percentile service time | prod-queries | ms | 9.84013 | 8.76297 |
| 90th percentile service time | prod-queries | ms | 11.25363 | 9.8072 |
| 99th percentile service time | prod-queries | ms | 12.74447 | 11.15153 |
| 99.9th percentile service time | prod-queries | ms | 29.93143 | 21.17413 |
| 99.99th percentile service time | prod-queries | ms | 82.9093 | 56.73783 |
| 100th percentile service time | prod-queries | ms | 126.3996 | 66.8019 |
| error rate | prod-queries | % | 0 | 0 |
| Mean recall@k | prod-queries | | 0.95667 | 0.95 |
| Mean recall@1 | prod-queries | | 0.98 | 0.98 |

2/3 -- Performance regressions
Good point about the reduce work spanning more segments within a slice; that could well be the reason for the performance regressions. There's not a clear pattern of regression: some ops are worse in the common case, others in the tail case. Regressions in the fast (<10ms) ops could also be driven by the additional overhead of the priority queue.

Here's a summary table:

| Performance Category | Operations | Impact Notes |
|---|---|---|
| Worse in Common Case (p50/p90) | multi_terms-keyword, composite_terms-keyword, scroll, term, query-string-on-message, query-string-on-message-filtered, range-agg-2, cardinality-agg-low, date_histogram_minute_agg | multi_terms-keyword: significant degradation (41.41% worse at p50); scroll: notable impact (13.50% worse at p50); others show moderate degradation (2-10% range) |
| Worse in Tail Latency (p99/p100) | keyword-terms, keyword-terms-low-cardinality, composite-terms, range, sort_numeric_desc, range-auto-date-histo | keyword-terms: ~18% worse at p99/p100; keyword-terms-low-cardinality: ~20% worse at p99/p100; others show varying degrees of tail latency degradation |
| Consistently Worse Across All Percentiles | multi_terms-keyword, scroll, composite_terms-keyword | These operations showed degradation across both common-case and tail latency metrics |

2 -- other ways to roll out the change
I still don't think it's a good idea to enable this slicing mechanism by default due to the performance regressions. We could change the shouldUseMaxTargetSlice() logic to use the max target slice supplier in some cases and the balanced docs supplier in others if there's a pattern (unfortunately, the regressions seem to extend beyond bucket aggregations).

A search request option is an interesting idea, but I'm not sure it sidesteps the apprehension about adding additional configuration options. From an implementation point of view, the option would need to be obtainable from the searchContext.

If we use KNNConcurrentSearchRequestDecider then I think we must still add a hook on the OpenSearch side (either a cluster/index setting or a search request option) to use the balanced docs supplier. If it's an index setting, we could use the getAdditionalIndexSettings method on the k-NN side to enable it and get the performance benefits for k-NN.
