Skip to content

KAFKA-20710: Share coordinator - Fence stale periodic jobs#22603

Open
Shekharrajak wants to merge 2 commits into
apache:trunkfrom
Shekharrajak:KAFKA-20710-share-coordinator-periodic-job-generation
Open

KAFKA-20710: Share coordinator - Fence stale periodic jobs#22603
Shekharrajak wants to merge 2 commits into
apache:trunkfrom
Shekharrajak:KAFKA-20710-share-coordinator-periodic-job-generation

Conversation

@Shekharrajak

Copy link
Copy Markdown
Contributor

Ref https://issues.apache.org/jira/browse/KAFKA-20710

ShareCoordinatorService with a periodic-job generation guard so stale timer tasks and stale async completions cannot reschedule after disable/re-enable.

@github-actions github-actions Bot added KIP-932 Queues for Kafka triage PRs from the community labels Jun 17, 2026
*/
private volatile boolean shouldRunPeriodicJob;

private final AtomicLong periodicJobGeneration = new AtomicLong();

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

protects duplicate timer chains - similar to epoch concept.

}

private boolean shouldRunPeriodicJob(long generation) {
return shouldRunPeriodicJob && periodicJobGeneration.get() == generation;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It must not start work if it belongs to old generation.


service.onMetadataUpdate(mock(MetadataDelta.class), enabledImage);
assertTrue(service.shouldRunPeriodicJob());
verify(timer, times(4)).add(any());

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the fix, this became 6 because old prune and old snapshot completions each rescheduled one stale job.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2(old) + 4(new)

.thenReturn(List.of(firstSnapshotFuture))
.thenReturn(List.of(secondSnapshotFuture));

service.startup(() -> 1);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two periodic jobs:
- record prune job: write-state-record-prune
- cold partition snapshot job: snapshot-cold-partitions

service.onMetadataUpdate(mock(MetadataDelta.class), enabledImage);

verify(timer, times(2)).add(any());
timer.advanceClock(30001L);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

manually advances mock time so the scheduled timer tasks fire


verify(timer, times(2)).add(any());
timer.advanceClock(30001L);
verify(runtime, times(1)).scheduleWriteOperation(

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both delayed tasks run once when we advance the timer

@github-actions github-actions Bot removed the triage PRs from the community label Jun 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

KIP-932 Queues for Kafka

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant