KAFKA-17904: Flaky testMultiConsumerSessionTimeoutOnClose #17789
Hello @xijiu, thanks for taking a look at this! Very interesting finding indeed. I expect this same issue is behind the flakiness on -- update: I created https://issues.apache.org/jira/browse/KAFKA-18008 for the other test just for visibility. You can take it if you want, and maybe this PR serves both. If it needs more work we can tackle that separately.
@lianetm Thanks for the reply. I think this PR will serve both.
And the failure of method But to be honest, this flaky test is hard to reproduce on my Mac; I ran it many, many times but only reproduced it once 😁.
@@ -32,6 +34,12 @@ import scala.jdk.CollectionConverters._
@Timeout(600)
class PlaintextConsumerPollTest extends AbstractConsumerTest {

  override protected def brokerPropertyOverrides(properties: Properties): Unit = {
    super.brokerPropertyOverrides(properties)
    properties.setProperty(GroupCoordinatorConfig.CONSUMER_GROUP_HEARTBEAT_INTERVAL_MS_CONFIG, "1000")
I notice there are several tests in this same file overriding the equivalent of this prop for the classic consumer (ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG), but to 500. Should we consider that value instead? With that, we would have those tests running for both consumers with the same config. What do you think?
@lianetm Yeah, you are right, I will fix it.
Thanks! LGTM.
4 unrelated test failures (failures exist in trunk). 3 already tracked. I filed https://issues.apache.org/jira/browse/KAFKA-18025.
Reviewers: Lianet Magrans <[email protected]>
Here are some of my conclusions about this flaky test.
First of all, the test fails because of a TIMEOUT: the method AbstractConsumerTest#validateGroupAssignment times out after waiting for 10 seconds. I was able to reproduce this on my computer. AbstractConsumerTest#validateGroupAssignment is used to check that all the consumers' assignments meet expectations. The exception is as below:

I ran this JUnit test many times on my local computer after adding some logs, and found that the timeout case occurs in GroupProtocol.CONSUMER mode. In CONSUMER mode, the consumer may interact with the GroupCoordinator multiple times before reconciliation completes.

The frequency of interaction is controlled by the configuration group.consumer.heartbeat.interval.ms, whose default value is 5000ms. The successful runs of this test take at least 5 seconds to complete, so maybe we can reduce the heartbeat interval. After I set group.consumer.heartbeat.interval.ms to 1000ms, this problem has not occurred again on my computer, and the unit test also runs faster.
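
To make the timing argument concrete, here is a rough back-of-the-envelope sketch in Scala. It is not Kafka code: the object and method names are hypothetical, and it assumes a simplified model in which every round trip to the GroupCoordinator costs one full heartbeat interval. The 10-second limit and the 5000ms default come from the description above; the number of round trips is an illustrative assumption.

```scala
object HeartbeatBudget {
  // Wait time of AbstractConsumerTest#validateGroupAssignment, as stated in the PR description.
  val ValidationTimeoutMs: Long = 10000L

  // Simplified model (an assumption): each coordinator round trip
  // costs one full heartbeat interval.
  def estimatedReconcileMs(heartbeatIntervalMs: Long, roundTrips: Int): Long =
    heartbeatIntervalMs * roundTrips

  // True when the estimated reconciliation time stays strictly under the timeout.
  def fitsInTimeout(heartbeatIntervalMs: Long, roundTrips: Int): Boolean =
    estimatedReconcileMs(heartbeatIntervalMs, roundTrips) < ValidationTimeoutMs
}
```

Under this model, two round trips at the 5000ms default already consume the whole 10-second budget, while the same two round trips at 1000ms or 500ms leave plenty of headroom, which is consistent with the flakiness disappearing once the interval is lowered.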