[BUG] segment replication stops when publish checkpoint fails #17595

guojialiang92 · 2025-03-15T14:14:53Z

Describe the bug

Segment replication currently works in pull mode, but it relies on the replica receiving the checkpoint published by the primary shard.
If the replica does not receive a checkpoint from the primary (for example, network issues cause the TransportReplicationAction#REPLICATION_RETRY_TIMEOUT limit to be exceeded) and no further writes reach the primary shard, the replica shard can never synchronize with the primary shard.
When this happens, users usually have to either write a document to trigger the primary shard to publish a new checkpoint, or reduce the number of replicas and then increase it again.
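
To make the failure mode concrete, here is a minimal sketch using a toy in-memory model, not OpenSearch code. The `Primary` and `Replica` classes, the `publish` helper, and its `delivered` flag are all hypothetical names invented for illustration. It shows a replica that only pulls when notified, so one lost notification leaves it stale until the next write publishes a newer checkpoint.

```python
class Primary:
    def __init__(self):
        self.docs = []        # stands in for the primary's segments
        self.checkpoint = 0   # bumped on every refresh

    def index_and_refresh(self, doc):
        self.docs.append(doc)
        self.checkpoint += 1
        return self.checkpoint  # the checkpoint that should be published


class Replica:
    def __init__(self, primary):
        self.primary = primary
        self.docs = []
        self.checkpoint = 0

    def on_checkpoint(self, checkpoint):
        # Pull mode: data is copied only after a newer checkpoint is announced.
        if checkpoint > self.checkpoint:
            self.docs = list(self.primary.docs)
            self.checkpoint = self.primary.checkpoint


def publish(replica, checkpoint, delivered=True):
    # delivered=False models a publish that failed and was never retried.
    if delivered:
        replica.on_checkpoint(checkpoint)


primary = Primary()
replica = Replica(primary)

# The publish for the first write is lost, so the replica stays empty.
publish(replica, primary.index_and_refresh("doc-1"), delivered=False)
stale_count = len(replica.docs)

# Workaround from the description: another write publishes a newer checkpoint.
publish(replica, primary.index_and_refresh("doc-2"))
caught_up_count = len(replica.docs)
```

In this model the replica only recovers because a second write happened; with an idle primary it would stay at `stale_count` indefinitely.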

Related component

Indexing:Replication

To Reproduce

The problem can be reproduced reliably in an integration test:

1. Create an index with segment replication enabled, with 1 primary shard and 1 replica shard.
2. Make the replica shard throw a RemoteTransportException before it processes the request indices:admin/publishCheckpoint[r].
3. Write a doc and wait for the refresh to finish.
4. Let the replica shard process the request indices:admin/publishCheckpoint[r] normally again.
5. Wait for a while, then check whether both the primary shard and the replica shard have 1 doc.
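
The fault-injection part of these steps can be sketched as follows. This is a toy transport written for illustration, not the OpenSearch test framework: `FaultInjectingTransport`, `RemoteTransportError`, and the two state dictionaries are hypothetical stand-ins. Only the action name is taken from the steps above.

```python
class RemoteTransportError(Exception):
    """Stand-in for the RemoteTransportException named in the steps."""


class FaultInjectingTransport:
    def __init__(self):
        self.handlers = {}
        self.failing_actions = set()

    def register(self, action, handler):
        self.handlers[action] = handler

    def send(self, action, payload):
        # Fail before the handler runs, as the reproduction requires.
        if action in self.failing_actions:
            raise RemoteTransportError(action)
        return self.handlers[action](payload)


PUBLISH = "indices:admin/publishCheckpoint[r]"
primary_state = {"checkpoint": 0}
replica_state = {"checkpoint": 0}


def handle_publish(checkpoint):
    replica_state["checkpoint"] = max(replica_state["checkpoint"], checkpoint)


transport = FaultInjectingTransport()
transport.register(PUBLISH, handle_publish)

# Make the replica reject publish-checkpoint requests.
transport.failing_actions.add(PUBLISH)

# One write plus refresh produces checkpoint 1; its publish fails.
primary_state["checkpoint"] = 1
try:
    transport.send(PUBLISH, primary_state["checkpoint"])
    publish_failed = False
except RemoteTransportError:
    publish_failed = True

# Restore normal handling. Nothing re-sends checkpoint 1.
transport.failing_actions.discard(PUBLISH)

# After waiting, the replica is still behind the primary.
replica_is_behind = replica_state["checkpoint"] < primary_state["checkpoint"]
```

Restoring the handler does not help on its own, because no component sends the missed checkpoint again.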

Expected behavior

I suggest adding a scheduled asynchronous task on the primary shard: when it detects that a replica has been behind for longer than a given time threshold, it publishes the checkpoint again.
Please help me evaluate whether this approach is acceptable. If so, I will submit a pull request with the corresponding test cases.
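
One possible shape for the proposed task, as a rough sketch only and not the implementation in the linked pull request: `CheckpointLagMonitor`, its method names, and the 30-second threshold are assumptions made for illustration. Time is passed in explicitly so the logic can be checked without a real scheduler; in practice a timer would call `run_once` periodically.

```python
class CheckpointLagMonitor:
    def __init__(self, lag_threshold_seconds, publish):
        self.lag_threshold = lag_threshold_seconds
        self.publish = publish          # callable(replica_id, checkpoint)
        self.latest_checkpoint = 0
        self.replica_checkpoints = {}   # replica_id -> last checkpoint reached
        self.behind_since = {}          # replica_id -> time it first fell behind

    def add_replica(self, replica_id):
        self.replica_checkpoints[replica_id] = 0

    def on_primary_refresh(self, checkpoint, now):
        self.latest_checkpoint = checkpoint
        for replica_id, reached in self.replica_checkpoints.items():
            if reached < checkpoint:
                # Keep the earliest time so the lag is measured from the start.
                self.behind_since.setdefault(replica_id, now)

    def on_replica_synced(self, replica_id, checkpoint):
        self.replica_checkpoints[replica_id] = checkpoint
        if checkpoint >= self.latest_checkpoint:
            self.behind_since.pop(replica_id, None)

    def run_once(self, now):
        """Body of the scheduled task: republish to replicas behind too long."""
        republished = []
        # Copy the items because publish() may update behind_since.
        for replica_id, since in list(self.behind_since.items()):
            if now - since > self.lag_threshold:
                self.publish(replica_id, self.latest_checkpoint)
                republished.append(replica_id)
        return republished


sent = []
monitor = CheckpointLagMonitor(
    lag_threshold_seconds=30,
    publish=lambda replica_id, checkpoint: sent.append((replica_id, checkpoint)),
)
monitor.add_replica("replica-0")
monitor.on_primary_refresh(checkpoint=1, now=0)  # the original publish was lost

early = monitor.run_once(now=10)   # still within the threshold
late = monitor.run_once(now=45)    # past the threshold, so republish
monitor.on_replica_synced("replica-0", 1)
after_sync = monitor.run_once(now=90)
```

Measuring lag from the time the replica first fell behind, rather than from the last publish attempt, keeps the task quiet while replication is healthy and idle.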

guojialiang92 added the bug (Something isn't working) and untriaged labels on Mar 15, 2025

github-actions bot added the Indexing:Replication (Issues and PRs related to core replication framework, e.g. segrep) label on Mar 15, 2025

guojialiang92 linked a pull request Mar 18, 2025 that will close this issue

[segment replication] Add async publish checkpoint task #17619

Open

3 tasks
