Skip to content

Flink: Set forkEvery=1 on integration test to prevent worker JVM hang#15645

Open
jbonofre wants to merge 1 commit intoapache:mainfrom
jbonofre:fix/flink-integration-test-fork
Open

Flink: Set forkEvery=1 on integration test to prevent worker JVM hang#15645
jbonofre wants to merge 1 commit intoapache:mainfrom
jbonofre:fix/flink-integration-test-fork

Conversation

@jbonofre
Copy link
Member

Flink 2.0/2.1 mini-cluster shutdown leaves non-daemon threads (TaskExecutorFileMergingManager) alive, preventing the Gradle test worker JVM from exiting. This causes the integration test task to hang indefinitely until the CI timeout kills it.

Setting forkEvery=1 forces Gradle to terminate the worker process after each test class, avoiding the hang.

This is the tentative to fix the flakiness timeout on the Flink CI.

Flink 2.0/2.1 mini-cluster shutdown leaves non-daemon threads
(TaskExecutorFileMergingManager) alive, preventing the Gradle test
worker JVM from exiting. This causes the integration test task to
hang indefinitely until the CI timeout kills it.

Setting forkEvery=1 forces Gradle to terminate the worker process
after each test class, avoiding the hang.
Copy link
Contributor

@pvary pvary left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to prevent flaky tests.

@jbonofre: Could you please link a Flink issue for this where this is discussed, mentioned and hopefully fixed later, so we can remove this workaround when it is fixed in Flink?

@mxm, @Guosmilesmile: Any thoughts?

task integrationTest(type: Test) {
description = "Test Flink Runtime Jar against Flink ${flinkMajorVersion}"
group = "verification"
forkEvery = 1
Copy link
Contributor

@mxm mxm Mar 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is only one test which uses this task: TestIcebergConnectorSmoke (located in the flink-runtime module). The majority of the tests are located in the flink module.

We could add this flag under the test task (line 125), but we should keep an eye on the build time. I'm thinking it could go up by quite a bit (currently around 20 minutes).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the one in particular which is problematic

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The build time is the same today

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that one is the relevant test, then +1 for the change. How did we debug this? I'm curious because it is a bit hard to reproduce this. I've never run into the issue locally.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m also curious how to reproduce this scenario—I haven’t hit this issue locally. If we can reproduce it, we can fix it more effectively.

If this addition doesn’t have a significant impact on the overall runtime, I think it’s fine to +1 for the change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants