KAFKA-17904: Flaky testMultiConsumerSessionTimeoutOnClose #17789

Open
wants to merge 2 commits into
base: trunk

Conversation

xijiu
Contributor

@xijiu xijiu commented Nov 13, 2024

Here are some of my conclusions about this flaky test.

First of all, this test fails because of a timeout: the method AbstractConsumerTest#validateGroupAssignment times out after waiting for 10 seconds. I was able to reproduce this on my computer.

AbstractConsumerTest#validateGroupAssignment checks that all the consumers' assignments meet expectations. The exception is as follows:

org.opentest4j.AssertionFailedError: Did not get valid assignment for partitions HashSet(topic1-2, topic1-4, topic-1, topic-0, topic1-5, topic1-1, topic1-0, topic1-3). Instead, got ArrayBuffer(Set(topic1-1, topic1-0, topic1-2), Set(topic1-5, topic1-4), Set())

I added some logs and ran this JUnit test many times on my local computer. I found that the run that timed out was in GroupProtocol.CONSUMER mode. In CONSUMER mode, a consumer may interact with the GroupCoordinator multiple times before reconciliation completes.
image

The frequency of these interactions is controlled by the configuration group.consumer.heartbeat.interval.ms, whose default value is 5000 ms. The successful runs of this test take at least 5 seconds to complete, so maybe we can reduce the heartbeat interval.
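To make the timing concrete, here is a rough back-of-the-envelope sketch (not code from this PR). It assumes one coordinator round trip per heartbeat interval and ignores network latency and any heartbeat sent ahead of schedule. HeartbeatBudget and roundsWithin are hypothetical names used only for illustration.

```scala
object HeartbeatBudget {
  // Rough number of full heartbeat intervals that fit into the wait window.
  // Simplifying assumption: one GroupCoordinator interaction per interval.
  def roundsWithin(waitMs: Long, heartbeatIntervalMs: Long): Long =
    waitMs / heartbeatIntervalMs

  def main(args: Array[String]): Unit = {
    val waitMs = 10000L // the 10-second wait in validateGroupAssignment
    println(roundsWithin(waitMs, 5000L)) // default interval: prints 2
    println(roundsWithin(waitMs, 1000L)) // interval tried here: prints 10
  }
}
```

Under these assumptions, the default interval leaves room for only about two rounds before the wait expires, which would explain why a reconciliation that needs several interactions occasionally runs out of time.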

After I set group.consumer.heartbeat.interval.ms to 1000 ms, the problem has not occurred again on my computer, and the test also runs faster.
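As a minimal sketch of the override itself: the snippet below only builds a Properties object carrying the shortened interval. How the value is wired into the test cluster (for example, through the test's broker configuration) is an assumption and is not shown in this description. HeartbeatIntervalOverride and brokerOverrides are hypothetical names.

```scala
import java.util.Properties

object HeartbeatIntervalOverride {
  // Config key quoted in the description above.
  val HeartbeatIntervalConfig = "group.consumer.heartbeat.interval.ms"

  // Hypothetical helper: returns overrides with the shortened heartbeat
  // interval. The 1000 ms default here mirrors the value tried locally.
  def brokerOverrides(intervalMs: Int = 1000): Properties = {
    val props = new Properties()
    props.setProperty(HeartbeatIntervalConfig, intervalMs.toString)
    props
  }

  def main(args: Array[String]): Unit = {
    val props = brokerOverrides()
    println(s"$HeartbeatIntervalConfig=${props.getProperty(HeartbeatIntervalConfig)}")
  }
}
```

Keeping the interval as a parameter makes it easy to try other values if 1000 ms turns out to be too aggressive or not aggressive enough on CI.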

image

@github-actions bot added the labels core (Kafka Broker), tests (Test fixes, including flaky tests), and small (Small PRs) on Nov 13, 2024
@xijiu
Contributor Author

xijiu commented Nov 13, 2024

@lianetm @chia7712 PTAL

@lianetm
Contributor

lianetm commented Nov 13, 2024

Hello @xijiu, thanks for taking a look at this! Very interesting finding indeed. I expect this same issue is behind the flakiness of testMultiConsumerSessionTimeoutOnStopPolling too, right? (Could you maybe validate that one locally as well?)

https://ge.apache.org/scans/tests?search.buildOutcome=failure&search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.tags=trunk&search.timeZoneId=America%2FToronto&tests.container=kafka.api.PlaintextConsumerPollTest&tests.test=testMultiConsumerSessionTimeoutOnStopPolling(String%2C%20String)%5B2%5D

-- update

I created https://issues.apache.org/jira/browse/KAFKA-18008 for the other test just for visibility. You can take it if you want and maybe this PR serves both. If it needs more work we can tackle that separately.

@xijiu
Contributor Author

xijiu commented Nov 14, 2024

@lianetm Thanks for the reply.

I think this PR will serve both.
Both testMultiConsumerSessionTimeoutOnClose() and testMultiConsumerSessionTimeoutOnStopPolling() call the method runMultiConsumerSessionTimeoutTest():

  @ParameterizedTest(name = TestInfoUtils.TestWithParameterizedQuorumAndGroupProtocolNames)
  @MethodSource(Array("getTestQuorumAndGroupProtocolParametersAll"))
  def testMultiConsumerSessionTimeoutOnStopPolling(quorum: String, groupProtocol: String): Unit = {
    runMultiConsumerSessionTimeoutTest(false)
  }

  @ParameterizedTest(name = TestInfoUtils.TestWithParameterizedQuorumAndGroupProtocolNames)
  @MethodSource(Array("getTestQuorumAndGroupProtocolParametersAll"))
  def testMultiConsumerSessionTimeoutOnClose(quorum: String, groupProtocol: String): Unit = {
    runMultiConsumerSessionTimeoutTest(true)
  }

And the failure of the method runMultiConsumerSessionTimeoutTest(boolean) is unrelated to its input parameter.

But to be honest, this flaky test is hard to reproduce on my Mac. I ran it many, many times but only reproduced it once 😁.

2 participants