Flink: Set forkEvery=1 on integration test to prevent worker JVM hang#15645
Flink: Set forkEvery=1 on integration test to prevent worker JVM hang#15645jbonofre wants to merge 1 commit intoapache:mainfrom
Conversation
Flink 2.0/2.1 mini-cluster shutdown leaves non-daemon threads (TaskExecutorFileMergingManager) alive, preventing the Gradle test worker JVM from exiting. This causes the integration test task to hang indefinitely until the CI timeout kills it. Setting forkEvery=1 forces Gradle to terminate the worker process after each test class, avoiding the hang.
pvary
left a comment
There was a problem hiding this comment.
+1 to prevent flaky tests.
@jbonofre: Could you please link a Flink issue for this where this is discussed, mentioned and hopefully fixed later, so we can remove this workaround when it is fixed in Flink?
@mxm, @Guosmilesmile: Any thoughts?
| task integrationTest(type: Test) { | ||
| description = "Test Flink Runtime Jar against Flink ${flinkMajorVersion}" | ||
| group = "verification" | ||
| forkEvery = 1 |
There was a problem hiding this comment.
There is only one test which uses this task: TestIcebergConnectorSmoke (located in the flink-runtime module). The majority of the tests are located in the flink module.
We could add this flag under the test task (line 125), but we should keep an eye on the build time. I'm thinking it could go up by quite a bit (currently around 20 minutes).
There was a problem hiding this comment.
That's the one in particular which is problematic
There was a problem hiding this comment.
The build time is the same today
There was a problem hiding this comment.
If that one is the relevant test, then +1 for the change. How did we debug this? I'm curious because it is a bit hard to reproduce this. I've never run into the issue locally.
There was a problem hiding this comment.
I’m also curious how to reproduce this scenario—I haven’t hit this issue locally. If we can reproduce it, we can fix it more effectively.
If this addition doesn’t have a significant impact on the overall runtime, I think it’s fine to +1 for the change.
Flink 2.0/2.1 mini-cluster shutdown leaves non-daemon threads (TaskExecutorFileMergingManager) alive, preventing the Gradle test worker JVM from exiting. This causes the integration test task to hang indefinitely until the CI timeout kills it.
Setting forkEvery=1 forces Gradle to terminate the worker process after each test class, avoiding the hang.
This is the tentative to fix the flakiness timeout on the Flink CI.