
Fix DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode that is failing with reproducible seed #17389


Status: Open · wants to merge 7 commits into base: main

Conversation

@cwperks (Member) commented Feb 19, 2025

Description

Fails consistently with:

./gradlew ':server:internalClusterTest' --tests "org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode" -Dtests.seed=565D3205AE4EEC96 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=uz-Cyrl -Dtests.timezone=Australia/Victoria -Druntime.java=23 -i

This PR ensures that the generation in the ShardSnapshotStatus is not null when a snapshot is aborted. Without that generation, repository cleanup fails to clean up the files of an aborted snapshot. A minimal sketch of the pattern follows.
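The sketch below is self-contained and uses hypothetical stand-in types that mirror the shape of SnapshotsInProgress.ShardSnapshotStatus; these are not the real OpenSearch classes, and the actual constructor may differ:

```java
// Hypothetical stand-ins for the OpenSearch types, for illustration only.
enum ShardState { INIT, SUCCESS, FAILED, ABORTED }

record ShardSnapshotStatus(String nodeId, ShardState state, String reason, String generation) {}

class AbortExample {
    // Before: the abort path built a FAILED status with a null generation,
    // so repository cleanup could not locate the aborted shard's files.
    static ShardSnapshotStatus abortBefore(ShardSnapshotStatus current) {
        return new ShardSnapshotStatus(current.nodeId(), ShardState.FAILED, "aborted", null);
    }

    // After: the last known shard generation is carried forward on abort,
    // which is the invariant this PR restores.
    static ShardSnapshotStatus abortAfter(ShardSnapshotStatus current) {
        return new ShardSnapshotStatus(current.nodeId(), ShardState.FAILED, "aborted", current.generation());
    }
}
```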

Related Issues

Resolves #15806

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Fix DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode that is failing with reproducible seed

Signed-off-by: Craig Perkins <[email protected]>
github-actions bot added labels >test-failure (Test failure from CI, local build, etc.), autocut, and flaky-test (Random test failure that succeeds on second run) on Feb 19, 2025
❌ Gradle check result for de2f7a2: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

✅ Gradle check result for de2f7a2: SUCCESS

codecov bot commented Feb 19, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72.47%. Comparing base (e27cf58) to head (30f2b90).
Report is 2 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #17389      +/-   ##
============================================
- Coverage     72.53%   72.47%   -0.06%     
+ Complexity    65814    65722      -92     
============================================
  Files          5311     5311              
  Lines        304945   304948       +3     
  Branches      44226    44226              
============================================
- Hits         221193   221017     -176     
- Misses        65619    65815     +196     
+ Partials      18133    18116      -17     


@cwperks (Member, Author) commented Feb 28, 2025

One thing I've noticed is that the shardCount is wrong for this seed: shardInfo.getShardCount() returns 1 when it should return 2 here.

status.generation() is returning null here, whereas it returns the index name when the test is successful.

Status shows the following: ShardSnapshotStatus[state=FAILED, nodeId=6ROcJ0L_RFehM4-ZvkwNFg, reason=aborted, generation=null]

When the test is successful it returns: ShardSnapshotStatus[state=SUCCESS, nodeId=LlXDDnl8SOqPEsivPOCFbw, reason=null, generation=OdpXP3dbReumUYQ9l2zdWg]
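A hypothetical guard that captures the invariant at stake, reusing the sketch types from the description above (this is not part of the actual test):

```java
// Invariant: aborting a shard snapshot must not drop its generation;
// a null generation here is exactly the failure mode seen with this seed.
static void assertGenerationPreserved(ShardSnapshotStatus before, ShardSnapshotStatus after) {
    assert after.generation() != null : "aborted shard lost its generation: " + after;
    assert after.generation().equals(before.generation());
}
```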

❌ Gradle check result for fe9d1e6: FAILURE

❌ Gradle check result for 96d866c: FAILURE (reported three times for this commit)

❌ Gradle check result for d123f8c: FAILURE

✅ Gradle check result for d123f8c: SUCCESS

❌ Gradle check result for 30f2b90: FAILURE

❕ Gradle check result for 30f2b90: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
      1 org.opensearch.action.admin.cluster.node.tasks.ResourceAwareTasksTests.testTaskResourceTrackingDuringTaskCancellation

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@msfroh (Collaborator) commented Mar 19, 2025

@cwperks -- I just got hit by this flaky test and was hoping that your PR would fix it. But it looks like one of your runs yesterday failed with org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode.

Do you know if that's another issue unrelated to your fix? Does your fix make the test less flaky? I'm definitely willing to accept less flaky.

@cwperks (Member, Author) commented Mar 19, 2025

> @cwperks -- I just got hit by this flaky test and was hoping that your PR would fix it. But it looks like one of your runs yesterday failed with org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode.
>
> Do you know if that's another issue unrelated to your fix? Does your fix make the test less flaky? I'm definitely willing to accept less flaky.

The changes in this PR fix an issue with a reproducible seed, which should certainly cut down on flakiness in this test case. I just looked at the failure in #17389 (comment), but I can't reproduce it with the same seed from that run when running the test locally. I think the changes in this PR will reduce the flakiness, but it looks like the test is still susceptible to failing if the snapshot repository created in the test isn't cleaned up in time.

@cwperks (Member, Author) commented Mar 25, 2025

@msfroh any concerns with this flaky test fix?

Labels: >test-failure (Test failure from CI, local build, etc.) · autocut · flaky-test (Random test failure that succeeds on second run)

Successfully merging this pull request may close: [AUTOCUT] Gradle Check Flaky Test Report for DedicatedClusterSnapshotRestoreIT

3 participants