KAFKA-19800: Compute share partition lag in GroupCoordinatorService #20839

chirag-wadhwa5 · 2025-11-06T11:18:03Z

This PR is part of
KIP-1226.

This PR computes the share partition lag in GroupCoordinatorService
using deliveryCompleteCount received from readSummary, and partition
end offsets received from adminClient.listOffsets. The computed lag is
returned to the end user in DescribeShareGroupOffsetsResponse.

NOTE: The GroupCoordinator is built with a no-op implementation of
PartitionMetadataClient, which returns -1 as the partition end offset
for any requested topic partition. This will later be replaced with an
actual implementation that uses InterBrokerSendThread to retrieve
partition end offsets via ListOffsets RPC.

Reviewers: Apoorv Mittal, Andrew Schofield

apoorvmittal10

Thanks for the PR, trying to understand the approach. Some basic doubts.

apoorvmittal10 · 2025-11-07T09:34:07Z

...inator/src/main/java/org/apache/kafka/coordinator/group/metrics/PartitionMetadataClient.java

+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.kafka.coordinator.group.metrics;


Why is it in metrics package?

The class should be in org.apache.kafka.coordinator.group.

apoorvmittal10 · 2025-11-07T09:37:21Z

core/src/main/scala/kafka/server/BrokerServer.scala

+        topicPartitions.asScala
+          .map { tp =>
+            tp -> CompletableFuture.completedFuture(java.lang.Long.valueOf(-1L))
+          }
+          .toMap
+          .asJava


This implementation will always return -1, am I reading that right?

Thanks for the review. Actually, this is a placeholder. I am working in parallel on a separate PR that will create an implementation class of PartitionMetadataClient using the InterBrokerSendThread to fetch the partition end offsets. Once that PR is completed, I can simply plug in the instance of the new impl class here.

Makes sense, but this should be written as a comment in BrokerServer.scala and in the PR description. Can you please do that?
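For readers following along, here is a rough Java sketch of what the placeholder discussed above amounts to. It is only an illustration: `TopicPartition` is a hypothetical stand-in record so the snippet compiles without Kafka on the classpath, the interface shows only the `listLatestOffsets` signature quoted later in this review, and the class name `NoOpPartitionMetadataClient` and the constant name are assumptions, not names from the PR.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for org.apache.kafka.common.TopicPartition.
record TopicPartition(String topic, int partition) { }

// Only the method quoted in this review; the real interface may declare more.
interface PartitionMetadataClient {
    Map<TopicPartition, CompletableFuture<Long>> listLatestOffsets(Set<TopicPartition> topicPartitions);
}

// Placeholder client (name is an assumption): every requested partition
// resolves immediately to -1, meaning "end offset unknown".
final class NoOpPartitionMetadataClient implements PartitionMetadataClient {
    static final long UNKNOWN_END_OFFSET = -1L;

    @Override
    public Map<TopicPartition, CompletableFuture<Long>> listLatestOffsets(Set<TopicPartition> topicPartitions) {
        Map<TopicPartition, CompletableFuture<Long>> result = new HashMap<>();
        for (TopicPartition tp : topicPartitions) {
            // Already-completed future, so callers never block on the placeholder.
            result.put(tp, CompletableFuture.completedFuture(UNKNOWN_END_OFFSET));
        }
        return result;
    }
}
```

Because the futures are already completed, swapping in a real client later should not change how callers consume the map, only the values they see.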

apoorvmittal10

We can simplify the GroupCoordinatorService code. Also, please write comments in the code as a general practice; otherwise it takes time to understand what the code is meant to do.

apoorvmittal10 · 2025-11-07T10:26:00Z

group-coordinator/src/main/java/org/apache/kafka/coordinator/group/GroupCoordinatorService.java

+        return future;
+    }
+
+    private void computeLagAndBuildResponse(


Suggested change

private void computeLagAndBuildResponse(

private void computeShareGroupLagAndBuildResponse(

apoorvmittal10 · 2025-11-07T10:34:40Z

group-coordinator/src/main/java/org/apache/kafka/coordinator/group/GroupCoordinatorService.java

+                        tp,
+                        new DescribeShareGroupOffsetsResponseData.DescribeShareGroupOffsetsResponsePartition()
+                            .setPartitionIndex(partitionData.partition())
+                            .setStartOffset(PartitionFactory.UNINITIALIZED_START_OFFSET)


If there is an error in partitionData for any partition then we won't get startOffset, hence it's safe to put UNINITIALIZED_START_OFFSET here, correct? Can you please write this as a comment? I am asking for the comment because there are two OR conditions earlier.

Thanks for the review. I think the comment above explains why we set UNINITIALIZED_START_OFFSET in the case where the persister returns an error. I will extend the comment to include an explanation for the other OR condition though.

apoorvmittal10 · 2025-11-07T15:05:24Z

group-coordinator/src/main/java/org/apache/kafka/coordinator/group/GroupCoordinatorService.java

-                                .setLeaderEpoch(partitionData.errorCode() == Errors.NONE.code() ? partitionData.leaderEpoch() : PartitionFactory.DEFAULT_LEADER_EPOCH)
-                        ).toList())
-                    ));
+                if (partitionData.errorCode() != Errors.NONE.code() || partitionData.startOffset() == PartitionFactory.UNINITIALIZED_START_OFFSET) {


And for groups whose startOffset is not yet initialized, the lag will not be calculated. Is that intended?

The persister returns startOffset as -1 (uninitialized offset) for share partitions for which consumption hasn't begun yet. Lag computation is not needed in these situations, since the persister does not yet know where consumption will begin, so -1 (uninitialized lag) is returned here.

Yeah, makes sense. Please add that as a comment.
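To make the rule above concrete, here is a small self-contained sketch of the decision. It uses the lag formula exactly as it appears in the diff quoted in this thread; note that a later commit in this PR is titled "added comments and corrected the lag formula", so the merged formula may differ. The class name, method name and constant names are hypothetical, and `NO_ERROR = 0` is assumed to match `Errors.NONE.code()`.

```java
final class ShareLagSketch {
    // Sentinels as described in this thread: -1 marks both an uninitialized
    // start offset and an uninitialized lag.
    static final long UNINITIALIZED_START_OFFSET = -1L;
    static final long UNINITIALIZED_LAG = -1L;
    static final short NO_ERROR = 0; // assumed to match Errors.NONE.code()

    static long computeLag(short errorCode, long startOffset, long latestOffset, long deliveryCompleteCount) {
        // No lag if the persister returned an error, or if consumption has not
        // begun yet and the start offset is therefore still unknown.
        if (errorCode != NO_ERROR || startOffset == UNINITIALIZED_START_OFFSET) {
            return UNINITIALIZED_LAG;
        }
        // Formula as quoted in the reviewed diff (may differ from the merged code).
        return latestOffset - startOffset + 1 - deliveryCompleteCount;
    }
}
```

For example, with a start offset of 10, a latest offset of 20 and 3 records whose delivery is complete, this version of the formula gives 20 - 10 + 1 - 3 = 8.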

apoorvmittal10 · 2025-11-07T15:50:42Z

group-coordinator/src/main/java/org/apache/kafka/coordinator/group/GroupCoordinatorService.java

+                    CompletableFuture<Void> lagComputationFuture = partitionLatestOffsets.get(tp)
+                        .handle((latestOffset, throwable) -> {
+                            if (throwable != null) {
+                                partitionsResponses.put(
+                                    tip,
+                                    new DescribeShareGroupOffsetsResponseData.DescribeShareGroupOffsetsResponsePartition()
+                                        .setPartitionIndex(partitionData.partition())
+                                        .setErrorCode(Errors.forException(throwable).code())
+                                        .setErrorMessage(throwable.getMessage())
+                                );
+                            } else {
+                                // Compute lag: lag = partitionLatestOffset - startOffset + 1 - deliveryCompleteCount
+                                long lag = latestOffset - partitionData.startOffset() + 1 - partitionData.deliveryCompleteCount();
+                                partitionsResponses.put(
+                                    tip,
+                                    new DescribeShareGroupOffsetsResponseData.DescribeShareGroupOffsetsResponsePartition()
+                                        .setPartitionIndex(partitionData.partition())
+                                        .setStartOffset(partitionData.startOffset())
+                                        .setLeaderEpoch(partitionData.leaderEpoch())
+                                        .setLag(lag)
+                                );
+                            }
+                            return null;
+                        });
+
+                    lagComputationFutures.add(lagComputationFuture);


There is handling code for the individual futures, but they are then also added to a list where handling exists again. Why? Can't we just wait for the futures to complete and then iterate over the original map?

Something like below:

CompletableFuture.allOf(partitionLatestOffsets.values().toArray(new CompletableFuture<?>[0]))
    .whenComplete((result, error) -> {
        ....
        ....
        readSummaryResult.topicsData().forEach(topicData -> {
            topicData.partitions().forEach(partitionData -> {
                TopicPartition tp = new TopicPartition(requestTopicIdToNameMapping.get(topicData.topicId()), partitionData.partition());
                TopicIdPartition tip = new TopicIdPartition(topicData.topicId(), tp);
                if (partitionData.errorCode() == Errors.NONE.code() && partitionData.startOffset() != PartitionFactory.UNINITIALIZED_START_OFFSET) {
                    // The call to join() is safe here because of the allOf above i.e. the futures
                    // have already completed.
                    Long lag = partitionLatestOffsets.get(tp).join();
                    ...
    });

apoorvmittal10 · 2025-11-07T15:52:23Z

group-coordinator/src/main/java/org/apache/kafka/coordinator/group/GroupCoordinatorService.java

+        ReadShareGroupStateSummaryResult readSummaryResult,
+        Map<Uuid, String> requestTopicIdToNameMapping,
+        List<DescribeShareGroupOffsetsResponseData.DescribeShareGroupOffsetsResponseTopic> describeShareGroupOffsetsResponseTopicList,
+        CompletableFuture<DescribeShareGroupOffsetsResponseData.DescribeShareGroupOffsetsResponseGroup> responseFuture,
+        String groupId
+    ) {
+        Set<TopicPartition> partitionsToComputeLag = new HashSet<>();
+        Map<TopicIdPartition, DescribeShareGroupOffsetsResponseData.DescribeShareGroupOffsetsResponsePartition> partitionsResponses = new HashMap<>();
+
+        readSummaryResult.topicsData().forEach(topicData -> {
+            topicData.partitions().forEach(partitionData -> {
+                TopicIdPartition tp = new TopicIdPartition(
+                    topicData.topicId(),
+                    new TopicPartition(requestTopicIdToNameMapping.get(topicData.topicId()), partitionData.partition())
+                );


The method is overly complicated. Why can't it be a simple one: first get the partitions for which lag is to be computed, and then, in a single pass once all the lag computation futures have completed, fill the result?

readSummaryResult.topicsData().forEach(topicData -> topicData.partitions().forEach(partitionData -> {
    if (partitionData.errorCode() == Errors.NONE.code()) {
        partitionsToComputeLag.add(new TopicPartition(requestTopicIdToNameMapping.get(topicData.topicId()), partitionData.partition()));
    }
}));
.....
.....

apoorvmittal10

Thanks for the changes, we can still simplify the processing in GroupCoordinatorService.

apoorvmittal10 · 2025-11-10T09:40:03Z

group-coordinator/src/main/java/org/apache/kafka/coordinator/group/GroupCoordinatorService.java

+        CompletableFuture<DescribeShareGroupOffsetsResponseData.DescribeShareGroupOffsetsResponseGroup> responseFuture,
+        String groupId
+    ) {
+


nit: line break not needed.

apoorvmittal10 · 2025-11-10T09:44:04Z

group-coordinator/src/main/java/org/apache/kafka/coordinator/group/GroupCoordinatorService.java

+        });
+
+        // Fetch latest offsets for all partitions that need lag computation.
+        Map<TopicPartition, CompletableFuture<Long>> partitionLatestOffsets = partitionsToComputeLag.isEmpty() ? new HashMap<>() :


Suggested change

Map<TopicPartition, CompletableFuture<Long>> partitionLatestOffsets = partitionsToComputeLag.isEmpty() ? new HashMap<>() :

Map<TopicPartition, CompletableFuture<Long>> partitionLatestOffsets = partitionsToComputeLag.isEmpty() ? Map.of() :
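One thing worth keeping in mind with this suggestion: `Map.of()` returns an immutable map, whereas `new HashMap<>()` is mutable. That is fine as long as nothing later tries to add entries to `partitionLatestOffsets`. A tiny sketch of the difference follows; the class and method names are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class EmptyMapSketch {
    // Returns true if the map accepts a put, false if it rejects modification.
    static boolean acceptsPut(Map<String, Long> map) {
        try {
            map.put("topic-0", 0L);
            return true;
        } catch (UnsupportedOperationException e) {
            // Immutable maps such as Map.of() reject every mutation.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("HashMap accepts put: " + acceptsPut(new HashMap<>()));
        System.out.println("Map.of() accepts put: " + acceptsPut(Map.of()));
    }
}
```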

apoorvmittal10 · 2025-11-10T09:45:57Z

group-coordinator/src/main/java/org/apache/kafka/coordinator/group/GroupCoordinatorService.java

-                        .setGroupId(readSummaryRequestData.groupId())
-                        .setTopics(describeShareGroupOffsetsResponseTopicList));
+        CompletableFuture.allOf(partitionLatestOffsets.values().toArray(new CompletableFuture<?>[0]))
+            .whenComplete((result, error) -> {


Why is error not checked before parsing the result?

allOf will complete exceptionally and thus throw an error only if a subset of the original futures fails. As I understand it, the exact error is a CompletionException whose cause is the exception thrown by one of the failing futures. That wouldn't help us here, because whether the future returned by allOf() fails or succeeds, I have a try-catch block around the statement where I join the individual original futures, which handles both cases.

allOf will complete exceptionally and thus throw an error

It will not throw an exception; you need to check that error is not null. You can write a test to verify this as well.
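A quick standalone check of the behaviour being discussed, using only `java.util.concurrent`. The sketch shows that the failure is handed to the `whenComplete` callback as its `error` argument rather than being thrown at the call site, and that `join()` on an individual failed future throws a `CompletionException`. The class and method names are hypothetical and not from the PR.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.atomic.AtomicReference;

final class AllOfSketch {
    // Returns whatever whenComplete received as its error argument
    // (null if every input future succeeded). The inputs are expected to be
    // already completed, so the callback runs before this method returns.
    static Throwable errorSeenByWhenComplete(List<CompletableFuture<Long>> futures) {
        AtomicReference<Throwable> seen = new AtomicReference<>();
        CompletableFuture.allOf(futures.toArray(new CompletableFuture<?>[0]))
            .whenComplete((result, error) -> seen.set(error));
        return seen.get();
    }

    // join() on a failed future throws CompletionException, so per-future
    // handling still needs a try/catch even after allOf has completed.
    static long joinOrFallback(CompletableFuture<Long> future, long fallback) {
        try {
            return future.join();
        } catch (CompletionException e) {
            return fallback;
        }
    }
}
```

So both checks have a place: `error != null` tells you that at least one lookup failed, while the per-future `join()` tells you which partition it was.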

apoorvmittal10 · 2025-11-10T09:48:32Z

group-coordinator/src/main/java/org/apache/kafka/coordinator/group/GroupCoordinatorService.java

-        return future;
+    }
+
+    private DescribeShareGroupOffsetsResponseData.DescribeShareGroupOffsetsResponseGroup BuildDescribeShareGroupOffsetsResponse(


Why does the method name start with a capital B?

My mistake, will replace this in the next commit

apoorvmittal10 · 2025-11-10T10:28:40Z

group-coordinator/src/main/java/org/apache/kafka/coordinator/group/GroupCoordinatorService.java

+        Set<TopicPartition> partitionsToComputeLag = new HashSet<>();
+
+        // This map stores the final DescribeShareGroupOffsetsResponsePartition, including the lag, for all the partitions.
+        Map<TopicIdPartition, DescribeShareGroupOffsetsResponseData.DescribeShareGroupOffsetsResponsePartition> partitionsResponses = new HashMap<>();


What purpose does this map serve? I don't see any need of having it.

Thanks for the review. I have changed the logic to make it more efficient.

apoorvmittal10 · 2025-11-10T10:30:57Z

group-coordinator/src/main/java/org/apache/kafka/coordinator/group/GroupCoordinatorService.java

+                TopicIdPartition tp = new TopicIdPartition(
+                    topicData.topicId(),
+                    new TopicPartition(requestTopicIdToNameMapping.get(topicData.topicId()), partitionData.partition())
+                );


Rather than filling a partial response here and then the rest in another pass, you should fill the responses together for better readability, i.e.

First only get the partitions to compute lag:

readSummaryResult.topicsData().forEach(topicData -> topicData.partitions().forEach(partitionData -> {
    // Lag is computed only when the persister returned no error and the start offset is initialized.
    if (partitionData.errorCode() == Errors.NONE.code() && partitionData.startOffset() != PartitionFactory.UNINITIALIZED_START_OFFSET) {
        partitionsToComputeLag.add(new TopicPartition(requestTopicIdToNameMapping.get(topicData.topicId()), partitionData.partition()));
    }
}));

Then fill the response in another pass over readSummaryResult.
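The two-pass shape suggested here could look roughly like the sketch below. It is a simplified illustration, not the PR's code: partitions are keyed by a plain string, `PartitionSummary` and `describeLags` are hypothetical names, the offset lookup is passed in as a function, and a failed lookup simply yields -1 where the diff quoted earlier in this thread sets an error code on the partition. The lag formula is the one from that diff, which a later commit reportedly corrected.

```java
import java.util.*;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.function.Function;

final class TwoPassLagSketch {
    // Hypothetical, trimmed-down view of one partition in the read summary result.
    record PartitionSummary(String topicPartition, short errorCode, long startOffset, long deliveryCompleteCount) { }

    static final long UNINITIALIZED = -1L;

    static boolean needsLag(PartitionSummary p) {
        return p.errorCode() == 0 && p.startOffset() != UNINITIALIZED;
    }

    static CompletableFuture<Map<String, Long>> describeLags(
        List<PartitionSummary> summaries,
        Function<Set<String>, Map<String, CompletableFuture<Long>>> offsetLookup
    ) {
        // Pass 1: only collect the partitions whose lag has to be computed.
        Set<String> partitionsToComputeLag = new HashSet<>();
        for (PartitionSummary p : summaries) {
            if (needsLag(p)) partitionsToComputeLag.add(p.topicPartition());
        }
        Map<String, CompletableFuture<Long>> latestOffsets =
            partitionsToComputeLag.isEmpty() ? Map.of() : offsetLookup.apply(partitionsToComputeLag);

        // Pass 2: once every lookup has finished, fill all results together.
        return CompletableFuture.allOf(latestOffsets.values().toArray(new CompletableFuture<?>[0]))
            .handle((ignored, error) -> {
                Map<String, Long> lags = new LinkedHashMap<>();
                for (PartitionSummary p : summaries) {
                    long lag = UNINITIALIZED;
                    if (needsLag(p)) {
                        try {
                            // join() does not block here: allOf has already completed.
                            long latest = latestOffsets.get(p.topicPartition()).join();
                            lag = latest - p.startOffset() + 1 - p.deliveryCompleteCount();
                        } catch (CompletionException e) {
                            lag = UNINITIALIZED; // simplification of per-partition error reporting
                        }
                    }
                    lags.put(p.topicPartition(), lag);
                }
                return lags;
            });
    }
}
```

The same predicate drives both passes, which keeps the selection rule in one place.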

AndrewJSchofield

Thanks for the PR. I'll continue with my review.

AndrewJSchofield · 2025-11-10T18:29:51Z

group-coordinator/src/main/java/org/apache/kafka/coordinator/group/GroupCoordinatorService.java

        private GroupConfigManager groupConfigManager;
        private Persister persister;
        private Optional<Plugin<Authorizer>> authorizerPlugin;
+        private PartitionMetadataClient partitionMetadataClient;


Shouldn't we do a null check in the build() method too?
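As an illustration of the kind of check being asked for, here is a generic builder sketch. All names (`ServiceBuilderSketch`, `OffsetClient`, `withOffsetClient`) are hypothetical, and throwing `IllegalArgumentException` is an assumption about style, not a quote from the PR.

```java
final class ServiceBuilderSketch {
    // Hypothetical dependency standing in for PartitionMetadataClient.
    interface OffsetClient { }

    static final class Service {
        final OffsetClient offsetClient;

        Service(OffsetClient offsetClient) {
            this.offsetClient = offsetClient;
        }
    }

    static final class Builder {
        private OffsetClient offsetClient;

        Builder withOffsetClient(OffsetClient offsetClient) {
            this.offsetClient = offsetClient;
            return this;
        }

        Service build() {
            // Fail fast at construction time instead of hitting a
            // NullPointerException on the first request that needs the client.
            if (offsetClient == null) {
                throw new IllegalArgumentException("OffsetClient must be set.");
            }
            return new Service(offsetClient);
        }
    }
}
```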

apoorvmittal10

Thanks for the changes, it's coming along well. Some more comments.

apoorvmittal10 · 2025-11-10T19:05:06Z

group-coordinator/src/main/java/org/apache/kafka/coordinator/group/GroupCoordinatorService.java

-                        .setGroupId(readSummaryRequestData.groupId())
-                        .setTopics(describeShareGroupOffsetsResponseTopicList));
+        CompletableFuture.allOf(partitionLatestOffsets.values().toArray(new CompletableFuture<?>[0]))
+            .whenComplete((result, error) -> {


allOf will complete exceptionally and thus throw an error

It will not throw an exception; you need to check that error is not null. You can write a test to verify this as well.

apoorvmittal10 · 2025-11-10T19:10:27Z

group-coordinator/src/main/java/org/apache/kafka/coordinator/group/GroupCoordinatorService.java

+        List<DescribeShareGroupOffsetsResponseData.DescribeShareGroupOffsetsResponseTopic> responseTopics = new ArrayList<>();
+        for (Map.Entry<Uuid, List<DescribeShareGroupOffsetsResponseData.DescribeShareGroupOffsetsResponsePartition>> entry : topicToPartitionResults.entrySet()) {
+            DescribeShareGroupOffsetsResponseData.DescribeShareGroupOffsetsResponseTopic topic =
+                new DescribeShareGroupOffsetsResponseData.DescribeShareGroupOffsetsResponseTopic()
+                    .setTopicId(entry.getKey())
+                    .setTopicName(requestTopicIdToNameMapping.get(entry.getKey()))
+                    .setPartitions(entry.getValue());
+            responseTopics.add(topic);
+        }


Why do you need another pass to fill this? Can't it be done in the previous iteration over readSummaryResult?

apoorvmittal10 · 2025-11-10T19:13:52Z

group-coordinator/src/main/java/org/apache/kafka/coordinator/group/PartitionMetadataClient.java

+     * @param topicPartitions A set of topic partitions.
+     * @return A map of topic partitions to the completableFuture of their latest offsets
+     */
+    Map<TopicPartition, CompletableFuture<Long>> listLatestOffsets(


Shouldn't we have Map over TopicIdPartition i.e.

Suggested change

Map<TopicPartition, CompletableFuture<Long>> listLatestOffsets(

Map<TopicIdPartition, CompletableFuture<Long>> listLatestOffsets(

The topic ID is actually not required at all. The listLatestOffsets method is responsible for two important things:

1. Find the destination node for a particular topic partition, which is the leader broker for that partition. This information is retrieved using metadataCache, and the specific method for it requires only the topic name, not the ID.

2. Build the ListOffsetsRequest and send it to the previously calculated destination node. The requestData object also requires only the name, not the ID.

The PR for an implementation of PartitionMetadataClient is already created. #20852

Maybe that can provide a better picture for this.

In the future, I think we'll want to use topic ID, but the underlying RPC doesn't support topic ID yet. I'm happy for this to be based on topic name for now.

apoorvmittal10 · 2025-11-10T19:15:48Z

group-coordinator/src/main/java/org/apache/kafka/coordinator/group/PartitionMetadataClient.java

+     * @return A map of topic partitions to the completableFuture of their latest offsets
+     */
+    Map<TopicPartition, CompletableFuture<Long>> listLatestOffsets(
+        Set<TopicPartition> topicPartitions


Suggested change

Set<TopicPartition> topicPartitions

Set<TopicIdPartition> topicIdPartitions

Same as above

apoorvmittal10

Thanks for the PR, LGTM!

AndrewJSchofield

Looks good to me.

AndrewJSchofield · 2025-11-11T15:09:57Z

group-coordinator/src/main/java/org/apache/kafka/coordinator/group/PartitionMetadataClient.java

+     * @param topicPartitions A set of topic partitions.
+     * @return A map of topic partitions to the completableFuture of their latest offsets
+     */
+    Map<TopicPartition, CompletableFuture<Long>> listLatestOffsets(


In the future, I think we'll want to use topic ID, but the underlying RPC doesn't support topic ID yet. I'm happy for this to be based on topic name for now.

chia7712 · 2025-11-11T17:17:38Z

group-coordinator/src/main/java/org/apache/kafka/coordinator/group/GroupCoordinatorService.java

+                        // For the partitions where lag computation is not needed, a partitionResponse is built directly.
+                        // The lag is set to -1 (uninitialized lag) in these cases. If the persister returned an error for a
+                        // partition, the startOffset is set to -1 (uninitialized offset) and the leaderEpoch is set to 0
+                        // (default epoch). This is consistent with OffsetFetch for situations in which there is no offset


Out of curiosity, why is zero being used instead of -1? I assume zero is a valid epoch, right?

KAFKA-19800: Compute share partition lag in GroupCoordinatorService

9e66586

github-actions bot added triage PRs from the community core Kafka Broker clients group-coordinator labels Nov 6, 2025

chirag-wadhwa5 marked this pull request as draft November 6, 2025 11:18

KAFKA-19800: Removed the implementation class of PartitionMetadataClient

2515cd3

chirag-wadhwa5 marked this pull request as ready for review November 6, 2025 13:22

chirag-wadhwa5 added 2 commits November 6, 2025 18:54

Merge remote-tracking branch 'upstream/trunk' into KAFKA-19800

f1aeb2f

KAFKA-19800: minor formatting error

3f1f2ad

AndrewJSchofield added KIP-932 Queues for Kafka ci-approved and removed triage PRs from the community labels Nov 6, 2025

KAFKA-19800: Minor changes for test failure resolution

7f79763

apoorvmittal10 reviewed Nov 7, 2025

chirag-wadhwa5 added 2 commits November 7, 2025 16:14

KAFKA-19800: Added a minor comment for the no-op implementation of PartitionMetadataClient in BrokerServer

9d126b7

KAFKA-19800: pass the lag retrieved from DescribeShareGroupOffsetsResponse in SharePartitionOffsetInfo

616d4b8

apoorvmittal10 requested changes Nov 7, 2025

chirag-wadhwa5 added 2 commits November 8, 2025 01:32

KAFKA-19800: Changed the logic to make the code less complex

370671f

KAFKA-19800: moved PartitionMetadataClient to org.apache.kafka.coordinator.group

d019b3c

chirag-wadhwa5 requested a review from apoorvmittal10 November 10, 2025 07:59

apoorvmittal10 reviewed Nov 10, 2025

chirag-wadhwa5 added 2 commits November 10, 2025 23:19

KAFKA-19800: minor changes for better readability

5497b6f

KAFKA-19800: Changed the logic to remove an extra map usage

c773cac

AndrewJSchofield requested changes Nov 10, 2025

apoorvmittal10 reviewed Nov 10, 2025

KAFKA-19800: Minor changes for better readability

fc75082

chirag-wadhwa5 requested a review from apoorvmittal10 November 11, 2025 08:09

chirag-wadhwa5 requested a review from AndrewJSchofield November 11, 2025 08:09

chirag-wadhwa5 added 2 commits November 11, 2025 17:55

KAFKA-19800: added comments and corrected the lag formula

790ec21

KAFKA-19800: corrected a few tests

9200195

apoorvmittal10 approved these changes Nov 11, 2025

AndrewJSchofield approved these changes Nov 11, 2025

AndrewJSchofield merged commit 1146f97 into apache:trunk Nov 11, 2025
25 checks passed

chia7712 reviewed Nov 11, 2025
