
[RFC] Improve segment replication resiliency through Lucene's Adaptive Refresh #18700

@vigyasharma


Is your feature request related to a problem? Please describe

For e-commerce search on Amazon, we have a custom search engine built on Lucene that leverages Lucene's powerful segment-based replication. This has proven to be an excellent design choice for the high-QPS search use case of our service. Documents are indexed once and simply replicated across multiple searchers (replicas). We can physically isolate indexing and merging from search, and support quick failovers, point-in-time restores, etc. Basically, all the benefits that OpenSearch also gets with segment replication.

For our replication setup, our indexers publish periodic replication checkpoints to S3 every N seconds. These checkpoints contain the new segments created since the last checkpoint, published in a single commit. Replicas periodically fetch the new checkpoints from S3 and refresh their searchers (using Lucene's SearcherManager).

Like any distributed system, replication is prone to a number of failure modes, from network issues to misbehaving nodes. One such issue we've observed over the past few months happens when we end up with very large checkpoints. These arise if there is a network glitch and we accumulate segments for more than N seconds before publishing a checkpoint, or if a burst in indexing traffic suddenly leaves us with a lot of docs indexed within the checkpoint window. When replicas refresh on these large checkpoints, they observe a big surge in page faults, which causes thrashing on the in-flight hot queries and degrades search performance.

To add resiliency against such incidents, we recently introduced "Adaptive Refresh" for searcher managers in Lucene. Instead of refreshing on the entire checkpoint in one fell swoop, this change allows searchers to intelligently step through "safe to refresh" commit points, and absorb the large checkpoint without excessive page faults.

It seems to me that segment replicated clusters in OpenSearch could also be made more resilient by integrating with Lucene's Adaptive Refresh. This RFC is to explore the path forward for adding this support to OpenSearch.

Links to Lucene Issue and PR:

Describe the solution you'd like

We've already merged changes in Lucene to support adaptive refresh in searcher managers (see apache/lucene#14443). It now allows us to define a RefreshCommitSupplier that can select the best "safe" commit for searchers to refresh on. We would define this supplier in OpenSearch and use it within the searcher managers.
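To make the idea of selecting a "safe" commit concrete, here is a minimal, self-contained sketch. It does not use Lucene's actual `RefreshCommitSupplier` API: the `Commit` record, the `selectRefreshCommit` helper, and the byte-budget rule are all hypothetical, chosen only to illustrate one possible selection policy. The assumption is that the cost of a refresh is roughly proportional to the bytes of segment files the searcher has not yet opened.

```java
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; these types are NOT part of Lucene or OpenSearch.
final class AdaptiveRefreshSketch {

    /** Stand-in for a commit point: its generation and its segment files (name -> size in bytes). */
    record Commit(long generation, Map<String, Long> segmentBytes) {}

    /** Bytes of segment files in {@code candidate} that {@code current} does not already reference. */
    static long newBytes(Commit current, Commit candidate) {
        long total = 0;
        for (Map.Entry<String, Long> e : candidate.segmentBytes().entrySet()) {
            if (!current.segmentBytes().containsKey(e.getKey())) {
                total += e.getValue();
            }
        }
        return total;
    }

    /**
     * Assumed policy: pick the newest commit whose new bytes fit within {@code maxNewBytes}.
     * If no newer commit fits, fall back to the oldest newer commit so the searcher still
     * makes progress. Returns {@code current} when there is nothing newer.
     */
    static Commit selectRefreshCommit(Commit current, List<Commit> commits, long maxNewBytes) {
        Commit best = null;
        Commit oldestNewer = null;
        for (Commit c : commits) {
            if (c.generation() <= current.generation()) {
                continue; // already searchable
            }
            if (oldestNewer == null || c.generation() < oldestNewer.generation()) {
                oldestNewer = c;
            }
            boolean fits = newBytes(current, c) <= maxNewBytes;
            if (fits && (best == null || c.generation() > best.generation())) {
                best = c;
            }
        }
        if (best != null) {
            return best;
        }
        return oldestNewer != null ? oldestNewer : current;
    }

    public static void main(String[] args) {
        Commit c1 = new Commit(1, Map.of("_0", 100L));
        Commit c2 = new Commit(2, Map.of("_0", 100L, "_1", 50L));
        Commit c3 = new Commit(3, Map.of("_0", 100L, "_1", 50L, "_2", 60L));
        Commit c4 = new Commit(4, Map.of("_0", 100L, "_1", 50L, "_2", 60L, "_3", 500L));
        List<Commit> all = List.of(c1, c2, c3, c4);

        // First refresh stops at generation 3 (110 new bytes fit a 120-byte budget).
        Commit first = selectRefreshCommit(c1, all, 120);
        System.out.println(first.generation());
        // Second refresh takes generation 4 as the fallback, since it alone exceeds the budget.
        Commit second = selectRefreshCommit(first, all, 120);
        System.out.println(second.generation());
    }
}
```

In this sketch a large backlog is absorbed over several smaller refreshes rather than one big one. A real supplier would need a policy suited to the deployment; the budget here is just one assumed heuristic.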

I believe we would need to modify OpenSearchReaderManager to start supporting adaptive refresh. We might need to add similar support to OpenSearch from scratch, since it is a final class? Also, segment replication in OpenSearch likely has its own set of nuances that we need to handle. I would like to hear from the community on whether this change makes sense, and which OpenSearch-specific segment replication details need attention.

Related component

Indexing:Replication

Describe alternatives you've considered

No response

Additional context

No response


    Labels

    Indexing:Replication (Issues and PRs related to core replication framework, e.g. segrep), enhancement (Enhancement or improvement to existing feature or request), lucene
