KAFKA-19092: Fix flaky testBalancePartitionLeaders by reducing check interval and increasing timeout #22621

Open
suryakantade wants to merge 1 commit into apache:trunk from suryakantade:fix/KAFKA-19092-flaky-testBalancePartitionLeaders

@suryakantade

Summary

Fixes the flaky testBalancePartitionLeaders test, which was failing in ~3% of CI runs.

The root cause is a timing issue: the leader imbalance check interval (1s) combined with a 10× timeout multiplier left only a 10s window for the periodic electPreferred task to fire and rebalance partitions. Under CI thread-scheduling jitter (KafkaEventQueue uses real wall-clock cond.awaitNanos()), this was insufficient.
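
The dependency described above can be sketched in isolation. This is a minimal, hypothetical illustration and not the Kafka test code: `PeriodicTaskWindow`, `waitFor`, and `firedWithin` are made-up names, and a plain `ScheduledExecutorService` stands in for the controller's periodic task. It only shows that a wall-clock wait succeeds if the periodic task gets to run at least once before the deadline.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

public class PeriodicTaskWindow {

    /** Polls every 10 ms until the condition holds or timeoutMs of wall-clock time elapses. */
    public static boolean waitFor(BooleanSupplier condition, long timeoutMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (System.nanoTime() - deadline < 0) {
            if (condition.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        // One last check at the deadline.
        return condition.getAsBoolean();
    }

    /**
     * Schedules a stand-in periodic task every intervalMs (hypothetical; it just
     * flips a flag) and reports whether it fired before timeoutMs elapsed.
     */
    public static boolean firedWithin(long intervalMs, long timeoutMs) {
        AtomicBoolean balanced = new AtomicBoolean(false);
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        try {
            scheduler.scheduleWithFixedDelay(
                () -> balanced.set(true), intervalMs, intervalMs, TimeUnit.MILLISECONDS);
            return waitFor(balanced::get, timeoutMs);
        } finally {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Short interval relative to the window: many chances to fire.
        System.out.println(firedWithin(100, 10_000));
    }
}
```

The shorter the interval relative to the window, the more scheduling delay the wait can absorb before the deadline passes.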

Changes

  • Reduce leaderImbalanceCheckIntervalNs from 1s to 100ms — the periodic task fires more frequently, reducing sensitivity to thread-scheduling delays.
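  • Increase the waitForCondition timeout multiplier from 10× to 100×. With the 100ms interval the total timeout is still 10s, but the window now spans about 100 check intervals instead of 10, which leaves more headroom for slow CI environments.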
  • Add Javadoc explaining the test lifecycle and why the low check interval is necessary.
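
The arithmetic behind the first two changes can be checked with a small sketch. It assumes, as the summary implies, that the wait timeout is the check interval multiplied by the multiplier. `TimeoutWindow`, `timeoutNs`, and `firingOpportunities` are hypothetical names used only for illustration and do not appear in the test.

```java
import java.util.concurrent.TimeUnit;

public class TimeoutWindow {

    /** Total wait window, assuming timeout = check interval x multiplier. */
    public static long timeoutNs(long checkIntervalNs, int multiplier) {
        return Math.multiplyExact(checkIntervalNs, (long) multiplier);
    }

    /** How many whole check intervals fit into the wait window. */
    public static long firingOpportunities(long timeoutNs, long checkIntervalNs) {
        return timeoutNs / checkIntervalNs;
    }

    public static void main(String[] args) {
        long oldIntervalNs = TimeUnit.SECONDS.toNanos(1);        // 1s, before the change
        long newIntervalNs = TimeUnit.MILLISECONDS.toNanos(100); // 100ms, after the change

        long oldTimeoutNs = timeoutNs(oldIntervalNs, 10);
        long newTimeoutNs = timeoutNs(newIntervalNs, 100);

        // Both windows are 10 seconds long.
        System.out.println(TimeUnit.NANOSECONDS.toSeconds(oldTimeoutNs)); // prints 10
        System.out.println(TimeUnit.NANOSECONDS.toSeconds(newTimeoutNs)); // prints 10

        // The number of chances for the periodic task to fire differs.
        System.out.println(firingOpportunities(oldTimeoutNs, oldIntervalNs)); // prints 10
        System.out.println(firingOpportunities(newTimeoutNs, newIntervalNs)); // prints 100
    }
}
```

Under that assumption the wall-clock window is unchanged at 10s; what grows is the number of intervals the window covers.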

Jira: KAFKA-19092 (https://issues.apache.org/jira/browse/KAFKA-19092)

…interval and increasing timeout

The test was failing in ~3% of CI runs because the leader imbalance
check interval (1s) combined with a 10x timeout multiplier left only a
10s window for the periodic "electPreferred" task to fire and rebalance.
Under CI thread-scheduling jitter, this was insufficient.

Changes:
- Reduce leaderImbalanceCheckIntervalNs from 1s to 100ms so the
  periodic task fires more frequently, reducing sensitivity to
  thread-scheduling delays.
- Increase the waitForCondition timeout multiplier from 10x to 100x.
  The total timeout is still 10s, but it now spans about 100 check
  intervals instead of 10, giving more headroom for slow CI
  environments.
- Add documentation explaining the test lifecycle and why the low
  check interval is necessary.
@github-actions (bot) added the following labels on Jun 19, 2026: triage (PRs from the community), tests (Test fixes (including flaky tests)), kraft, small (Small PRs)