[RFC] Improve segment replication resiliency through Lucene's Adaptive Refresh

### Is your feature request related to a problem? Please describe

For e-commerce search on Amazon, we have a custom search engine built on Lucene that leverages Lucene's powerful segment-based replication. This has proven to be an excellent design choice for the high-QPS search use case of our service. Documents are indexed once and simply replicated across multiple searchers (replicas), so we can physically isolate indexing and merging from search, support quick failovers and point-in-time restores, and so on. Basically, all the benefits that OpenSearch also gets with segment replication.

For our replication setup, our indexers publish replication checkpoints to S3 every `N` seconds. Each checkpoint contains the new segments created since the last checkpoint, published in a single commit. Replicas periodically fetch the new checkpoints from S3 and refresh their searchers (using Lucene's `SearcherManager`).

Like any distributed system, replication is prone to a number of failure modes, from network issues to misbehaving nodes. One such issue we've observed over the past few months happens when we end up with very large checkpoints. These arise if a network glitch causes us to accumulate segments for `>N` seconds before publishing a checkpoint, or if a burst in indexing traffic suddenly leaves us with a lot of docs indexed within the checkpoint window. When replicas refresh on these large checkpoints, they observe a big surge in page faults, which causes thrashing on the in-flight hot queries and degrades search performance.

To add resiliency against such incidents, we recently introduced *"Adaptive Refresh"* for searcher managers in Lucene. Instead of refreshing on the entire checkpoint in one fell swoop, this change allows searchers to intelligently step through "safe to refresh" commit points, and absorb the large checkpoint without excessive page faults.
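
To make the idea concrete, here is a minimal, self-contained sketch of one possible "safe commit" policy. It is not the Lucene implementation. It assumes that intermediate commit points are available to the replica, and that "safe" means the bytes of segments the searcher has not yet loaded stay under a budget. `CommitPoint`, `AdaptiveRefreshSketch`, and the byte budget are all hypothetical names and choices made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of a commit point: a generation plus its segments and their sizes. */
final class CommitPoint {
    final long generation;
    final Map<String, Long> segmentBytes; // segment name -> size in bytes

    CommitPoint(long generation, Map<String, Long> segmentBytes) {
        this.generation = generation;
        this.segmentBytes = segmentBytes;
    }
}

final class AdaptiveRefreshSketch {

    /** Bytes a searcher on {@code current} has not loaded yet and would need for {@code target}. */
    static long newBytes(CommitPoint current, CommitPoint target) {
        long total = 0;
        for (Map.Entry<String, Long> e : target.segmentBytes.entrySet()) {
            if (!current.segmentBytes.containsKey(e.getKey())) {
                total += e.getValue();
            }
        }
        return total;
    }

    /**
     * Picks the newest commit after {@code current} whose new bytes fit the budget (assumed policy).
     * If none fits, falls back to the oldest newer commit so the searcher still makes progress.
     * Returns null when nothing is newer. {@code commits} must be sorted by ascending generation.
     */
    static CommitPoint pickRefreshCommit(CommitPoint current, List<CommitPoint> commits, long budgetBytes) {
        CommitPoint next = null; // oldest commit newer than current
        CommitPoint best = null; // newest commit that fits the budget
        for (CommitPoint c : commits) {
            if (c.generation <= current.generation) {
                continue;
            }
            if (next == null) {
                next = c;
            }
            if (newBytes(current, c) <= budgetBytes) {
                best = c;
            }
        }
        return best != null ? best : next;
    }

    /** Generations the searcher refreshes on while catching up to the latest commit. */
    static List<Long> refreshPath(CommitPoint current, List<CommitPoint> commits, long budgetBytes) {
        List<Long> path = new ArrayList<>();
        CommitPoint c;
        while ((c = pickRefreshCommit(current, commits, budgetBytes)) != null) {
            path.add(c.generation);
            current = c;
        }
        return path;
    }

    /** Demo data: generation k contains the first k of the segments _0.._3. */
    static List<CommitPoint> demoCommits() {
        String[] names = {"_0", "_1", "_2", "_3"};
        long[] sizes = {100, 40, 50, 60};
        List<CommitPoint> commits = new ArrayList<>();
        for (int k = 1; k <= names.length; k++) {
            Map<String, Long> segments = new HashMap<>();
            for (int i = 0; i < k; i++) {
                segments.put(names[i], sizes[i]);
            }
            commits.add(new CommitPoint(k, segments));
        }
        return commits;
    }

    public static void main(String[] args) {
        List<CommitPoint> commits = demoCommits();
        // Searcher starts on generation 1 with a 100-byte budget per refresh.
        System.out.println(refreshPath(commits.get(0), commits, 100)); // prints [3, 4]
    }
}
```

With a generous budget the searcher jumps straight to the latest commit, and with a tight one it steps through every intermediate commit. The fallback to the next commit is a design choice in this sketch so that a single oversized commit cannot stall the replica.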
 
It seems to me that segment replicated clusters in OpenSearch could also be made more resilient by integrating with Lucene's Adaptive Refresh. This RFC is to explore the path forward for adding this support to OpenSearch.

**Links to Lucene Issue and PR:**
- https://github.com/apache/lucene/issues/14219 
- https://github.com/apache/lucene/pull/14443

### Describe the solution you'd like

We've already merged changes in Lucene to support adaptive refresh in searcher managers (see https://github.com/apache/lucene/pull/14443). It now allows us to define a `RefreshCommitSupplier` that can select the best "safe" commit for searchers to refresh on. We would define this supplier in OpenSearch and use it within the searcher managers.
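
As a rough illustration of the wiring, the sketch below shows a reader manager that consults a pluggable supplier on every refresh and falls back to the latest commit when the supplier has no preference. Everything here is hypothetical: `RefreshTargetSupplier` and `SketchReaderManager` are stand-ins written for this example, they do not reproduce the signatures of Lucene's `RefreshCommitSupplier` or of `OpenSearchReaderManager`, and the "advance at most `maxStep` generations" rule is just a placeholder for a real safety check.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a commit supplier; not Lucene's actual interface. */
interface RefreshTargetSupplier {
    /** Returns the generation to refresh on, or -1 to defer to the latest available commit. */
    long selectGeneration(long currentGeneration, List<Long> availableGenerations);
}

/** Hypothetical reader manager that delegates the choice of refresh target to a supplier. */
final class SketchReaderManager {
    private final RefreshTargetSupplier supplier;
    private long currentGeneration;

    SketchReaderManager(long startGeneration, RefreshTargetSupplier supplier) {
        this.currentGeneration = startGeneration;
        this.supplier = supplier;
    }

    long currentGeneration() {
        return currentGeneration;
    }

    /**
     * One refresh attempt. {@code available} must be sorted ascending.
     * Returns true if the reader moved to a newer generation.
     */
    boolean maybeRefresh(List<Long> available) {
        if (available.isEmpty()) {
            return false;
        }
        long target = supplier.selectGeneration(currentGeneration, available);
        if (target < 0) {
            // Supplier has no preference: refresh on the latest commit, as without a supplier.
            target = available.get(available.size() - 1);
        }
        if (target <= currentGeneration) {
            return false; // already up to date
        }
        currentGeneration = target;
        return true;
    }

    /** Placeholder policy (an assumption): advance by at most {@code maxStep} generations. */
    static RefreshTargetSupplier maxStep(final long maxStep) {
        return (current, available) -> {
            long chosen = -1;
            for (long g : available) {
                if (g > current && g <= current + maxStep) {
                    chosen = Math.max(chosen, g);
                }
            }
            return chosen;
        };
    }

    public static void main(String[] args) {
        SketchReaderManager manager = new SketchReaderManager(1, maxStep(2));
        List<Long> generations = Arrays.asList(1L, 2L, 3L, 4L, 5L, 6L);
        while (manager.maybeRefresh(generations)) {
            System.out.println("refreshed on generation " + manager.currentGeneration());
        }
    }
}
```

Keeping the selection behind an interface means the refresh path itself stays unchanged when no supplier is configured, which may make it easier to introduce the behavior behind a setting.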

I believe we would need to modify `OpenSearchReaderManager` to support adaptive refresh. We might need to add similar support to OpenSearch from scratch, since it is a final class? Also, segment replication in OpenSearch likely has its own set of nuances that we need to handle. I would like to hear from the community on whether this change makes sense, and which OpenSearch-specific segment replication details need attention.

### Related component

Indexing:Replication

### Describe alternatives you've considered

_No response_

### Additional context

_No response_

[RFC] Improve segment replication resiliency through Lucene's Adaptive Refresh #18700
