[BUG] segment replication stops when publish checkpoint fails #17595

Open
guojialiang92 opened this issue Mar 15, 2025 · 0 comments · May be fixed by #17619
Labels
bug Something isn't working Indexing:Replication Issues and PRs related to core replication framework eg segrep untriaged

Comments

@guojialiang92

Describe the bug

Segment replication currently works in pull mode, but a replica only starts pulling after it receives a checkpoint published by the primary shard.
If the replica does not receive that checkpoint for some reason (for example, network issues cause the limit of TransportReplicationAction#REPLICATION_RETRY_TIMEOUT to be exceeded) and the primary shard receives no further writes, the replica shard can never synchronize with the primary shard.
When this happens, users usually have to write a document to make the primary shard publish a new checkpoint, or reduce the number of replicas and then increase it again.

Related component

Indexing:Replication

To Reproduce

The problem can be reproduced reliably in an integration test:

  1. Create an index with segment replication enabled, 1 primary shard, and 1 replica shard.
  2. Make the replica shard throw a RemoteTransportException before it processes the request indices:admin/publishCheckpoint[r].
  3. Write a doc and wait for the refresh to finish.
  4. Let the replica shard process the request indices:admin/publishCheckpoint[r] normally again.
  5. Wait for a while, then check whether both the primary shard and the replica shard have 1 doc.
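
To make the failure mode concrete, here is a toy in-memory model of the steps above. It is only a sketch: the `Primary` and `Replica` classes, the `failPublish` flag, and the `IllegalStateException` standing in for RemoteTransportException are all hypothetical and are not OpenSearch APIs. It only illustrates that a single lost publish leaves the replica behind when no later write arrives.

```java
// Toy model of the reproduction steps. All names here are hypothetical
// and do not correspond to real OpenSearch classes.
class PublishCheckpointRepro {

    /** Replica that only pulls from the primary after it receives a checkpoint. */
    static class Replica {
        boolean failPublish = true; // step 2: simulate the failing request handler
        long checkpoint = 0;
        int docCount = 0;

        void onPublishCheckpoint(long version, Primary primary) {
            if (failPublish) {
                // stands in for the RemoteTransportException in the real test
                throw new IllegalStateException("simulated publish checkpoint failure");
            }
            if (version > checkpoint) {
                // pull mode: the replica copies the primary's state
                docCount = primary.docCount;
                checkpoint = version;
            }
        }
    }

    /** Primary that publishes a checkpoint only as part of a refresh. */
    static class Primary {
        long checkpoint = 0;
        int docCount = 0;

        void indexAndRefresh(Replica replica) {
            docCount++;
            checkpoint++;
            try {
                replica.onPublishCheckpoint(checkpoint, this);
            } catch (IllegalStateException e) {
                // the publish is lost and nothing sends it again
            }
        }
    }

    /** Runs steps 2 to 5 and returns {primary doc count, replica doc count}. */
    static int[] run() {
        Primary primary = new Primary();
        Replica replica = new Replica();   // step 2: publish handling fails
        primary.indexAndRefresh(replica);  // step 3: write a doc and refresh
        replica.failPublish = false;       // step 4: replica is healthy again
        // step 5: no further write happens, so no checkpoint is published
        return new int[] { primary.docCount, replica.docCount };
    }

    public static void main(String[] args) {
        int[] counts = run();
        System.out.println("primary docs=" + counts[0] + ", replica docs=" + counts[1]);
        // prints "primary docs=1, replica docs=0"
    }
}
```

In this model the replica stays at 0 docs even though it is healthy again after step 4, which matches the behavior described in this issue.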

Expected behavior

I suggest adding a scheduled asynchronous task: when the primary shard detects that a replica has been behind for longer than a certain time threshold, it publishes the checkpoint again.
Please help me evaluate whether this approach is acceptable. If it is, I will submit a pull request and add the relevant test cases.
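
A rough sketch of the idea, assuming the primary can track how long a replica has been behind. Everything here is hypothetical and written only for illustration: the `LaggingReplicaRepublisher` class, the `CheckpointPublisher` interface, and the method names are not OpenSearch APIs, and a real change would have to use the existing checkpoint publisher and thread pool.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the proposal, not an OpenSearch implementation.
class LaggingReplicaRepublisher {

    /** Hypothetical hook that re-sends the latest checkpoint to replicas. */
    interface CheckpointPublisher {
        void publish(long checkpointVersion);
    }

    private final long lagThresholdMillis;
    private final CheckpointPublisher publisher;
    private long primaryCheckpoint = 0;
    private long replicaCheckpoint = 0;
    private long behindSinceMillis = -1; // -1 means the replica is not behind

    LaggingReplicaRepublisher(long lagThresholdMillis, CheckpointPublisher publisher) {
        this.lagThresholdMillis = lagThresholdMillis;
        this.publisher = publisher;
    }

    /** Called when the primary refreshes and produces a new checkpoint. */
    synchronized void onPrimaryRefresh(long version, long nowMillis) {
        primaryCheckpoint = version;
        if (behindSinceMillis < 0 && replicaCheckpoint < primaryCheckpoint) {
            behindSinceMillis = nowMillis;
        }
    }

    /** Called when the replica reports the checkpoint it has synced to. */
    synchronized void onReplicaCaughtUp(long version) {
        replicaCheckpoint = Math.max(replicaCheckpoint, version);
        if (replicaCheckpoint >= primaryCheckpoint) {
            behindSinceMillis = -1;
        }
    }

    /** One pass of the scheduled check; returns true if it re-published. */
    synchronized boolean checkOnce(long nowMillis) {
        if (behindSinceMillis < 0 || nowMillis - behindSinceMillis < lagThresholdMillis) {
            return false;
        }
        publisher.publish(primaryCheckpoint);
        behindSinceMillis = nowMillis; // restart the timer so we do not publish on every tick
        return true;
    }

    /** Starts the periodic check; the caller shuts the scheduler down. */
    ScheduledExecutorService start(long intervalMillis) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                checkOnce(System.currentTimeMillis());
            } catch (RuntimeException e) {
                // swallow so that one failed publish does not cancel later runs
            }
        }, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
        return scheduler;
    }

    public static void main(String[] args) {
        LaggingReplicaRepublisher r = new LaggingReplicaRepublisher(
            1000, v -> System.out.println("re-publishing checkpoint " + v));
        r.onPrimaryRefresh(1, 0);               // replica falls behind at t=0
        System.out.println(r.checkOnce(500));   // below the threshold
        System.out.println(r.checkOnce(1500));  // threshold exceeded, publishes again
        r.onReplicaCaughtUp(1);
        System.out.println(r.checkOnce(5000));  // replica caught up, nothing to do
    }
}
```

The check takes the current time as a parameter so that it can be tested without waiting on a real clock; only `start` touches `System.currentTimeMillis()`.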


@guojialiang92 guojialiang92 added bug Something isn't working untriaged labels Mar 15, 2025
@github-actions github-actions bot added the Indexing:Replication Issues and PRs related to core replication framework eg segrep label Mar 15, 2025
@guojialiang92 guojialiang92 linked a pull request Mar 18, 2025 that will close this issue
3 tasks