[BUG] segment replication stops when publish checkpoint fails #17595
Labels
bug: Something isn't working
Indexing:Replication: Issues and PRs related to core replication framework eg segrep
untriaged
Describe the bug
Segment replication currently works in pull mode, but it depends on the replica first receiving the checkpoint published by the primary shard.
If the replica does not receive the primary's checkpoint for some reason (for example, network issues cause the limit of TransportReplicationAction#REPLICATION_RETRY_TIMEOUT to be exceeded) and the primary shard receives no further writes, the replica shard can never synchronize with the primary shard.
Users who hit this problem usually have to write a document to make the primary shard publish a new checkpoint, or work around it by reducing the number of replicas and then increasing it again.
Related component
Indexing:Replication
To Reproduce
The problem can be reproduced reliably in an integration test:
RemoteTransportException exception before processing request indices:admin/publishCheckpoint[r].
indices:admin/publishCheckpoint[r] normally.
Expected behavior
I suggest adding a scheduled asynchronous task: when the primary shard detects that a replica has been behind for longer than a configurable time threshold, it publishes the checkpoint again.
Please help me evaluate whether this is an acceptable way to fix the problem. If so, I will submit a Pull Request with the relevant test cases.
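To make the proposal concrete, here is a minimal, self-contained sketch of the idea in plain Java. It is only an illustration under stated assumptions, not OpenSearch code: the class `LaggingReplicaCheckpointPublisher`, its method names, the version-based lag tracking, and the `publishCheckpoint` callback are all hypothetical and stand in for whatever the primary shard actually uses to track replicas and publish checkpoints.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

/** Hypothetical sketch: none of these names are real OpenSearch APIs. */
final class LaggingReplicaCheckpointPublisher {
    private final long lagThresholdMillis;
    private final LongSupplier clock;          // injectable so the logic can be tested without sleeping
    private final Runnable publishCheckpoint;  // stands in for the primary's real publish call
    private final Map<String, Long> replicaVersions = new HashMap<>();
    private final Map<String, Long> behindSince = new HashMap<>();
    private long primaryVersion;

    LaggingReplicaCheckpointPublisher(long lagThresholdMillis, LongSupplier clock, Runnable publishCheckpoint) {
        this.lagThresholdMillis = lagThresholdMillis;
        this.clock = clock;
        this.publishCheckpoint = publishCheckpoint;
    }

    /** Called when the primary produces a new checkpoint version. */
    synchronized void primaryAdvancedTo(long version) {
        primaryVersion = version;
        long now = clock.getAsLong();
        for (Map.Entry<String, Long> e : replicaVersions.entrySet()) {
            if (e.getValue() < version) {
                behindSince.putIfAbsent(e.getKey(), now); // remember when the replica first fell behind
            }
        }
    }

    /** Called when a replica reports the checkpoint version it has synced to. */
    synchronized void replicaSyncedTo(String replicaId, long version) {
        replicaVersions.put(replicaId, version);
        if (version >= primaryVersion) {
            behindSince.remove(replicaId);
        } else {
            behindSince.putIfAbsent(replicaId, clock.getAsLong());
        }
    }

    /** One pass of the periodic check; returns true if the checkpoint was republished. */
    synchronized boolean checkOnce() {
        long now = clock.getAsLong();
        boolean overdue = false;
        for (Map.Entry<String, Long> e : behindSince.entrySet()) {
            if (now - e.getValue() >= lagThresholdMillis) {
                overdue = true;
                e.setValue(now); // restart the timer so the next retry waits a full threshold
            }
        }
        if (overdue) {
            publishCheckpoint.run();
        }
        return overdue;
    }

    /** Runs checkOnce() periodically on a background thread; the caller shuts the scheduler down. */
    ScheduledExecutorService start(long intervalMillis) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                checkOnce();
            } catch (RuntimeException e) {
                // Swallow so one failed publish does not cancel all later runs.
            }
        }, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
        return scheduler;
    }
}
```

In this sketch a replica that missed a checkpoint gets it again after at most one threshold plus one check interval, even if the primary receives no further writes. Resetting the timer after each republish keeps the retry rate bounded when a replica stays unreachable.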