[segment replication] Add async publish checkpoint task #17619

Open: guojialiang92 wants to merge 6 commits into main from dev/add_async_publish_checkpoint_task

Conversation

guojialiang92

Description

Added a test that reproduces the current behavior: if the primary shard fails to publish a checkpoint, the replica shard stops synchronizing with the primary shard.
Added an asynchronous task: when the primary shard detects that a replica has been behind for longer than a certain time threshold, it triggers a publish checkpoint. With this task in place, the test above passes.
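
The idea behind the asynchronous task can be illustrated with a minimal sketch. This is not the code in this PR: the class `LaggingReplicaCheckpointPublisher`, its methods, and the millisecond threshold are hypothetical names chosen for illustration, and the real change lives in OpenSearch classes such as `IndexService` and `ReplicationTracker`. The sketch only shows the decision rule: republish when some replica has been behind for longer than the threshold.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch of a lag-triggered checkpoint publisher.
 * None of these names come from the OpenSearch code base.
 */
class LaggingReplicaCheckpointPublisher {
    private final long lagThresholdMillis;
    private final Runnable publishCheckpoint;
    // replica id -> time (ms) at which the replica was first seen behind
    private final Map<String, Long> behindSince = new ConcurrentHashMap<>();

    LaggingReplicaCheckpointPublisher(long lagThresholdMillis, Runnable publishCheckpoint) {
        this.lagThresholdMillis = lagThresholdMillis;
        this.publishCheckpoint = publishCheckpoint;
    }

    /** Record that a replica is behind; keeps the earliest timestamp. */
    void markBehind(String replicaId, long nowMillis) {
        behindSince.putIfAbsent(replicaId, nowMillis);
    }

    /** Record that a replica has caught up with the primary. */
    void markCaughtUp(String replicaId) {
        behindSince.remove(replicaId);
    }

    /**
     * One run of the periodic task. Publishes at most once per run and
     * returns true if a publish was triggered.
     */
    boolean runOnce(long nowMillis) {
        for (long since : behindSince.values()) {
            if (nowMillis - since > lagThresholdMillis) {
                publishCheckpoint.run();
                return true;
            }
        }
        return false;
    }
}
```

In a running node, something like `runOnce` would be invoked on a fixed interval, for example through `ScheduledExecutorService.scheduleWithFixedDelay`; passing the clock value in as a parameter keeps the sketch deterministic and easy to test.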

Related Issues

Resolves #17595

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions bot added the bug and Indexing:Replication labels Mar 18, 2025
@guojialiang92 changed the title from "Dev/add async publish checkpoint task" to "[segment replication] Add async publish checkpoint task" Mar 18, 2025
Contributor

❌ Gradle check result for 5b2a4e7: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

❕ Gradle check result for 19b97fa: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.


codecov bot commented Mar 21, 2025

Codecov Report

Attention: Patch coverage is 53.84615% with 30 lines in your changes missing coverage. Please review.

Project coverage is 72.50%. Comparing base (9b9abf7) to head (e067c6b).
Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...org/opensearch/index/seqno/ReplicationTracker.java | 38.88% | 8 Missing and 3 partials ⚠️ |
| ...c/main/java/org/opensearch/index/IndexService.java | 64.00% | 6 Missing and 3 partials ⚠️ |
| .../main/java/org/opensearch/index/IndexSettings.java | 68.75% | 5 Missing ⚠️ |
| ...in/java/org/opensearch/index/shard/IndexShard.java | 16.66% | 2 Missing and 3 partials ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #17619      +/-   ##
============================================
+ Coverage     72.48%   72.50%   +0.02%     
- Complexity    65835    65874      +39     
============================================
  Files          5316     5316              
  Lines        305281   305345      +64     
  Branches      44287    44302      +15     
============================================
+ Hits         221274   221397     +123     
+ Misses        65851    65796      -55     
+ Partials      18156    18152       -4     

☔ View full report in Codecov by Sentry.

Signed-off-by: guojialiang <[email protected]>
Contributor

❕ Gradle check result for 3e146dc: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Signed-off-by: guojialiang <[email protected]>
Contributor

❕ Gradle check result for e067c6b: UNSTABLE

  • TEST FAILURES:
      2 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
      1 org.opensearch.remotestore.RemoteStoreIT.testNoSearchIdleForAnyReplicaCount
      1 org.opensearch.index.reindex.DeleteByQueryBasicTests.testDeleteByQueryWithMultipleIndices

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Contributor

❌ Gradle check result for 2a272aa: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

❌ Gradle check result for 54945b2: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@guojialiang92 force-pushed the dev/add_async_publish_checkpoint_task branch from 54945b2 to 23c1b87 March 22, 2025 03:08
Contributor

❌ Gradle check result for 23c1b87: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: guojialiang <[email protected]>
@guojialiang92 force-pushed the dev/add_async_publish_checkpoint_task branch from 23c1b87 to 4394239 March 22, 2025 12:57
Development

Successfully merging this pull request may close these issues.

[BUG] segment replication stops when publish checkpoint fails