
[AUTOCUT] Gradle Check Flaky Test Report for S3BlobContainerRetriesTests #17551


Open
opensearch-ci-bot opened this issue Mar 7, 2025 · 3 comments · Fixed by #18144 or #18201
Labels
autocut, flaky-test (Random test failure that succeeds on second run), >test-failure (Test failure from CI, local build, etc.)

Comments

@opensearch-ci-bot
Collaborator

opensearch-ci-bot commented Mar 7, 2025

Flaky Test Report for S3BlobContainerRetriesTests

The S3BlobContainerRetriesTests class has flaky tests that failed during post-merge actions.

Details

| Git Reference | Merged Pull Request | Build Details | Test Name |
|---|---|---|---|
| 1294506 | 18061 | 57060 | org.opensearch.repositories.s3.S3BlobContainerRetriesTests.classMethod |
| 2b62528 | 18308 | 58144 | org.opensearch.repositories.s3.S3BlobContainerRetriesTests.classMethod |
| 47f9bdb | 18165 | 57479 | org.opensearch.repositories.s3.S3BlobContainerRetriesTests.classMethod |
| 60aa7a5 | 18212 | 57844 | org.opensearch.repositories.s3.S3BlobContainerRetriesTests.classMethod |
| 6719b1f | 18057 | 57116 | org.opensearch.repositories.s3.S3BlobContainerRetriesTests.classMethod |
| 816787d | 18044 | 57010 | org.opensearch.repositories.s3.S3BlobContainerRetriesTests.classMethod |
| 8964f63 | 17873 | 56048 | org.opensearch.repositories.s3.S3BlobContainerRetriesTests.classMethod |
| cc82be9 | 17515 | 54208 | org.opensearch.repositories.s3.S3BlobContainerRetriesTests.classMethod |
| e0a67fd | 17674 | 55181 | org.opensearch.repositories.s3.S3BlobContainerRetriesTests.classMethod |
| fc78200 | 17732 | 55229 | org.opensearch.repositories.s3.S3BlobContainerRetriesTests.classMethod |
| fcb0649 | 17668 | 54996 | org.opensearch.repositories.s3.S3BlobContainerRetriesTests.classMethod |

The other pull requests, besides those involved in post-merge actions, that contain failing tests with the S3BlobContainerRetriesTests class are:

For more details on the failed tests refer to OpenSearch Gradle Check Metrics dashboard.

opensearch-ci-bot added the >test-failure, autocut, flaky-test, and untriaged labels on Mar 7, 2025
@krisfreedain
Member

Catch All Triage - 1 2

@andrross
Member

andrross commented Apr 29, 2025

Reopening as I saw the failure again even with this fix:

REPRODUCE WITH: ./gradlew ':plugins:repository-s3:test' --tests "org.opensearch.repositories.s3.S3BlobContainerRetriesTests" -Dtests.seed=7C03C4361DD3B7D5 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=en -Dtests.timezone=Etc/UTC -Druntime.java=21

S3BlobContainerRetriesTests > classMethod FAILED
    com.carrotsearch.randomizedtesting.ThreadLeakError: There are still zombie threads that couldn't be terminated:
       1) Thread[id=185, name=sdk-ScheduledExecutor-12-0, state=TIMED_WAITING, group=TGRP-S3BlobContainerRetriesTests]
            at java.****/jdk.internal.misc.Unsafe.park(Native Method)
            at java.****/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:269)
            at java.****/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1763)
            at java.****/java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1182)
            at java.****/java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:899)
            at java.****/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1070)
            at java.****/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
            at java.****/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
            at java.****/java.lang.Thread.run(Thread.java:1583)
        at __randomizedtesting.SeedInfo.seed([7C03C4361DD3B7D5]:0)

https://build.ci.opensearch.org/job/gradle-check/57400/consoleText

@andrross andrross reopened this Apr 29, 2025
@andrross
Member

  2> Apr 29, 2025 6:38:16 PM com.carrotsearch.randomizedtesting.ThreadLeakControl checkThreadLeaks
  2> WARNING: Will linger awaiting termination of 1 leaked thread(s).
  2> Apr 29, 2025 6:38:21 PM com.carrotsearch.randomizedtesting.ThreadLeakControl checkThreadLeaks
  2> SEVERE: 1 thread leaked from SUITE scope at org.opensearch.repositories.s3.S3BlobContainerRetriesTests: 
  2>    1) Thread[id=185, name=sdk-ScheduledExecutor-12-0, state=TIMED_WAITING, group=TGRP-S3BlobContainerRetriesTests]
...
  2> Apr 29, 2025 6:38:21 PM com.carrotsearch.randomizedtesting.ThreadLeakControl tryToInterruptAll
  2> INFO: Starting to interrupt leaked threads:
  2>    1) Thread[id=185, name=sdk-ScheduledExecutor-12-0, state=TIMED_WAITING, group=TGRP-S3BlobContainerRetriesTests]
  2> Apr 29, 2025 6:38:24 PM com.carrotsearch.randomizedtesting.ThreadLeakControl tryToInterruptAll
  2> SEVERE: There are still zombie threads that couldn't be terminated:
  2>    1) Thread[id=185, name=sdk-ScheduledExecutor-12-0, state=TIMED_WAITING, group=TGRP-S3BlobContainerRetriesTests]
...
  2> com.carrotsearch.randomizedtesting.ThreadLeakError: 1 thread leaked from SUITE scope at org.opensearch.repositories.s3.S3BlobContainerRetriesTests: 
       1) Thread[id=185, name=sdk-ScheduledExecutor-12-0, state=TIMED_WAITING, group=TGRP-S3BlobContainerRetriesTests]
...

The name of the leaked thread starts with sdk-ScheduledExecutor, meaning it comes from the scheduled executor service in the client. The test framework finds the leaked thread, waits 5 seconds for it to stop, interrupts it, waits 3 more seconds, then fails the test. The fact that the thread doesn't die on interrupt means (I think) that the scheduled executor service is not being shut down. I can't get a reliable repro, and my reading of the code shows that this executor should be shut down when the client closes. I'm not sure what's going wrong.
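
For anyone trying to reason about the interrupt behavior described above, here is a minimal sketch using only the JDK's `ScheduledThreadPoolExecutor`. It does not touch the AWS SDK or the S3 plugin, and it is not a reproduction of this failure. The class name `ExecutorInterruptDemo`, the thread name `demo-ScheduledExecutor-0`, and the one-hour delayed task are all made up for illustration. The point it tries to show is that interrupting a pool worker from outside does not end the thread while the executor is still running, whereas shutting the executor down does.

```java
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

class ExecutorInterruptDemo {

    /** Returns { aliveAfterInterrupt, aliveAfterShutdownNow } for the single worker thread. */
    static boolean[] run() throws InterruptedException {
        AtomicReference<Thread> worker = new AtomicReference<>();

        // Single-threaded scheduled executor with a recognizable (hypothetical) thread name.
        ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(1, r -> {
            Thread t = new Thread(r, "demo-ScheduledExecutor-0");
            t.setDaemon(true); // do not block JVM exit if something goes wrong
            worker.set(t);
            return t;
        });

        // A far-future task starts the worker and leaves it parked waiting on the delay queue.
        executor.schedule(() -> {}, 1, TimeUnit.HOURS);
        Thread t = worker.get();

        // Interrupt the worker directly, as a leak checker would, and give it time to react.
        t.interrupt();
        t.join(1000);
        boolean aliveAfterInterrupt = t.isAlive();

        // Now actually stop the executor: pending tasks are dropped and workers are told to exit.
        executor.shutdownNow();
        t.join(5000);
        boolean aliveAfterShutdownNow = t.isAlive();

        return new boolean[] { aliveAfterInterrupt, aliveAfterShutdownNow };
    }

    public static void main(String[] args) throws InterruptedException {
        boolean[] result = run();
        System.out.println("alive after interrupt:   " + result[0]);
        System.out.println("alive after shutdownNow: " + result[1]);
    }
}
```

On a stock JDK this should report the worker as still alive after the interrupt and gone after `shutdownNow()`, which is consistent with the reading that a surviving `sdk-ScheduledExecutor` thread points at an executor that was never shut down. Whether that is what happens inside the SDK client in this test is still an open question.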
