
[AUTOCUT] Gradle Check Flaky Test Report for SearchReplicaRestoreIT #17648

Open
opensearch-ci-bot opened this issue Mar 21, 2025 · 3 comments
Labels: autocut, flaky-test (Random test failure that succeeds on second run), >test-failure (Test failure from CI, local build, etc.), untriaged

@opensearch-ci-bot (Collaborator)

Flaky Test Report for SearchReplicaRestoreIT

Noticed that SearchReplicaRestoreIT has flaky tests that failed during post-merge actions.

Details

Git Reference: 9d4414b
Merged Pull Request: 17604
Build Details: 54813
Test Name: org.opensearch.indices.replication.SearchReplicaRestoreIT.testSearchReplicaRestore_WhenSnapshotOnSegRepWithSearchReplica_RestoreOnDocRep

The other pull requests, besides those involved in post-merge actions, that contain failing tests with the SearchReplicaRestoreIT class are:

For more details on the failed tests, refer to the OpenSearch Gradle Check Metrics dashboard.

opensearch-ci-bot added the >test-failure, autocut, flaky-test, and untriaged labels on Mar 21, 2025
@andrross (Member)

@mch2 @vinaykpud This looks like a fairly new test that is now flaky. Can we fix/remove/mute this so as not to add more flakiness?

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.indices.replication.SearchReplicaRestoreIT.testSearchReplicaRestore_WhenSnapshotOnSegRepWithSearchReplica_RestoreOnDocRep" -Dtests.seed=BC2DF12FF548A44D -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=ann-Latn-NG -Dtests.timezone=US/Eastern -Druntime.java=21

SearchReplicaRestoreIT > testSearchReplicaRestore_WhenSnapshotOnSegRepWithSearchReplica_RestoreOnDocRep FAILED
    java.lang.AssertionError: Remote metadata file can't be null if shard is active STARTED
        at __randomizedtesting.SeedInfo.seed([BC2DF12FF548A44D]:0)
        at org.opensearch.indices.replication.RemoteStoreReplicationSource.getCheckpointMetadata(RemoteStoreReplicationSource.java:76)
        at org.opensearch.indices.replication.SegmentReplicationTarget.startReplication(SegmentReplicationTarget.java:179)
        at org.opensearch.indices.replication.SegmentReplicator.start(SegmentReplicator.java:275)
        at org.opensearch.indices.replication.SegmentReplicator$ReplicationRunner.doRun(SegmentReplicator.java:261)
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:994)
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583)

@vinaykpud (Contributor) commented Mar 21, 2025

@andrross Yes, I will take a look at this.

@vinaykpud (Contributor) commented Mar 21, 2025

@andrross, @mch2
It looks like the flakiness was not introduced by this test; it is caused by this existing code:

// During initial recovery flow, the remote store might not
// have metadata as primary hasn't uploaded anything yet.
if (mdFile == null && indexShard.state().equals(IndexShardState.STARTED) == false) {
    listener.onResponse(new CheckpointInfoResponse(checkpoint, Collections.emptyMap(), null));
    return;
}
assert mdFile != null : "Remote metadata file can't be null if shard is active " + indexShard.state();
metadataMap = mdFile.getMetadata()

In the above block, the flakiness happens when indexShard.state() is not equal to IndexShardState.STARTED at the check on line 72, but is STARTED by line 76, so the assertion throws (line numbers refer to RemoteStoreReplicationSource.java, as in the stack trace above).

We need to check if we can handle this gracefully to avoid flakiness.
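
A minimal, self-contained sketch of one graceful-handling option (the ShardState enum, MetadataFile class, and method shape below are simplified stand-ins for the real OpenSearch types, not the actual RemoteStoreReplicationSource code): read the shard state once and respond with an empty metadata map whenever the remote metadata file is missing, instead of asserting, even if the shard has already raced to STARTED.

import java.util.Collections;
import java.util.Map;

// Simplified stand-ins for the real OpenSearch types; hypothetical, for illustration only.
enum ShardState { RECOVERING, STARTED }

class MetadataFile {
    Map<String, String> getMetadata() {
        return Map.of("segments_1", "checksum");
    }
}

public class CheckpointMetadataSketch {

    // Capture the shard state once, and fall back to an empty metadata map whenever the
    // remote metadata file is missing, rather than asserting on the shard state.
    static Map<String, String> checkpointMetadata(MetadataFile mdFile, ShardState observedState) {
        if (mdFile == null) {
            // Primary hasn't uploaded metadata yet. This covers the normal recovery case
            // and the racy case where the shard reports STARTED slightly too early.
            System.out.println("No remote metadata yet; observed shard state: " + observedState);
            return Collections.emptyMap();
        }
        return mdFile.getMetadata();
    }

    public static void main(String[] args) {
        // The flaky scenario from the assertion failure: shard STARTED, metadata still null.
        System.out.println(checkpointMetadata(null, ShardState.STARTED));
        // The normal scenario: metadata has been uploaded.
        System.out.println(checkpointMetadata(new MetadataFile(), ShardState.STARTED));
    }
}

Whether the actual fix should relax the assertion, retry, or tighten the state check needs to be decided against the replication flow; this only illustrates the shape of handling the null case without throwing.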
