Conversation

@DL1231
Contributor

@DL1231 DL1231 commented Nov 6, 2025

For records with a delivery count exceeding 2, the delivery failures are
suspected to stem from underlying issues rather than natural retry
scenarios, so the batching of such records should be reduced. (A rough
sketch of the resulting behavior follows the example below.)

Implementation Approach:

  • When the bad record is the first pending record: Only the current
    record is delivered.
  • When there are n pending records ahead of the bad record: The first n
    pending records are delivered, and the current bad record is skipped.

Example Scenario

Record offsets and delivery counts:

Offset:          10,    11,    12,    13,    14,    15
Delivery Count:  1,     2,     3,     2,     2,     1

(Assuming BAD_RECORD_DELIVERY_THRESHOLD = 3 - offset 12 is a bad
record)

Before Changes:

  • Case 1: fetchStartOffset = 10, maxFetchRecords = 6
    • Returns records from offset 10-15 (including the bad record at
      offset 12)
  • Case 2: fetchStartOffset = 12, maxFetchRecords = 6
    • Returns records from offset 12-15 (starting with the bad record)

After Changes:

  • Case 1: fetchStartOffset = 10, maxFetchRecords = 6
    • Returns records from offset 10-11 only (stops at the first bad
      record)
  • Case 2: fetchStartOffset = 12, maxFetchRecords = 6
    • Returns only offset 12 (the bad record itself, since there are no
      preceding good records)
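
To make the intended boundary behavior concrete, here is a rough, self-contained sketch (the method and parameter names are made up for illustration; the actual change lives in SharePartition's acquire path):

// Hypothetical illustration of the boundary described above: given the delivery
// counts of the pending records starting at fetchStartOffset, how many records
// should this fetch acquire?
static int acquirableCount(int[] deliveryCounts, int maxFetchRecords, int badRecordDeliveryThreshold) {
    int limit = Math.min(deliveryCounts.length, maxFetchRecords);
    for (int i = 0; i < limit; i++) {
        if (deliveryCounts[i] >= badRecordDeliveryThreshold) {
            // Bad record is the first pending record: deliver it on its own.
            // Otherwise: deliver only the good records ahead of it and skip it for now.
            return i == 0 ? 1 : i;
        }
    }
    return limit;
}

// With the example above (threshold = 3, delivery counts of offsets 10..15):
// acquirableCount(new int[]{1, 2, 3, 2, 2, 1}, 6, 3) -> 2   (offsets 10-11)
// acquirableCount(new int[]{3, 2, 2, 1}, 6, 3)       -> 1   (offset 12 alone)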

@github-actions github-actions bot added the triage (PRs from the community), core (Kafka Broker) and KIP-932 (Queues for Kafka) labels Nov 6, 2025
@apoorvmittal10 apoorvmittal10 added the ci-approved label and removed the triage (PRs from the community) label Nov 7, 2025
Contributor

@apoorvmittal10 apoorvmittal10 left a comment

Thanks for the PR, I have doubts about the approach. Can you please help explain.

int numRecordsRemaining = maxRecordsToAcquire - acquiredCount;
boolean recordLimitSubsetMatch = isRecordLimitMode && checkForRecordLimitSubsetMatch(inFlightBatch, maxRecordsToAcquire, acquiredCount);
-if (!fullMatch || inFlightBatch.offsetState() != null || recordLimitSubsetMatch) {
+if (!fullMatch || inFlightBatch.offsetState() != null || recordLimitSubsetMatch || inFlightBatch.batchDeliveryCount() >= 2) {
Contributor

We have defined the constant above but used 2 directly here.

Contributor

Also, can you help explain the reasoning for the condition added here?

Contributor Author

I've made the updates, PTAL.

For batches or records with a deliveryCount >= 3, we assume that bad records exist within the batch. Therefore, depending on the specific situation (such as whether there are pending records to be sent), we determine the delivery behavior for each record.

* Records whose delivery count exceeds this are deemed abnormal,
* and the batching of these records should be reduced.
*/
private static final int BAD_RECORD_DELIVERY_THRESHOLD = 2;
Contributor

Maybe 3 as default.

Comment on lines 1902 to 1906
// If the record has any pending deliveries, return immediately and do not deliver the current bad record.
if (offsetState.getValue().deliveryCount() >= BAD_RECORD_DELIVERY_THRESHOLD && (hasBeenAcquired > 0 || acquiredCount > 0)) {
    return -acquiredCount;
}

Contributor

Sorry I didn't understand what we are verifying here?

Contributor Author

@DL1231 DL1231 Nov 10, 2025

If the deliveryCount of the current record is greater than or equal to the threshold, it indicates that the current record is a bad record and may fail to be delivered.

  • If there are already pending deliveries, we should immediately send those to avoid being affected by the bad record.
  • If there are no pending deliveries, the bad record should be sent individually.

Contributor

Can you provide an example in your PR description, i.e. what exactly we have handled in the PR? The changes seem very confusing, hence I wanted to come to a common understanding.

Contributor Author

@DL1231 DL1231 Nov 10, 2025

Example Scenario

Record offsets and delivery counts:

Offset:          10,    11,    12,    13,    14,    15
Delivery Count:  1,     2,     3,     2,     2,     1

(Assuming BAD_RECORD_DELIVERY_THRESHOLD = 3 - offset 12 is a bad record)

Before Changes:

  • Case 1: fetchStartOffset = 10, maxFetchRecords = 6
    • Returns records from offset 10-15 (including the bad record at offset 12)
  • Case 2: fetchStartOffset = 12, maxFetchRecords = 6
    • Returns records from offset 12-15 (starting with the bad record)

After Changes:

  • Case 1: fetchStartOffset = 10, maxFetchRecords = 6
    • Returns records from offset 10-11 only (stops at the first bad record)
  • Case 2: fetchStartOffset = 12, maxFetchRecords = 6
    • Returns only offset 12 (the bad record itself, since there are no preceding good records)

Contributor

@adixitconfluent adixitconfluent left a comment

Thanks for the PR. Overall I didn't get the concept of using a negative number as the acquired count. Maybe add some examples in the comments to show how using a negative number for the acquired count is useful.

// to prevent acquiring any new records afterwards.
if (acquiredSubsetCount < 0) {
    maxRecordsToAcquire = -1;
    acquiredCount += acquiredSubsetCount == Integer.MIN_VALUE ? 0 : -1 * acquiredSubsetCount;
Contributor

This is a little confusing to me. Maybe take an example and explain what we are trying to achieve here.

Contributor Author

@DL1231 DL1231 Nov 10, 2025

The negative return values serve as special flags to handle bad records while maintaining data delivery:

Example Scenarios:

  1. Normal case (no bad records):
    • Returns 3 → acquired 3 records normally
    • Processing continues until maxRecordsToAcquire is reached
  2. Bad record encountered after some good records:
    • Acquired 3 good records, then hits a bad record
    • Returns -3 → signals "deliver the 3 good records AND stop further acquisition due to bad record"
    • Outer code: acquiredCount += -(-3) = +3
  3. First record is bad (edge case):
    • First record is bad, zero records acquired
    • Returns Integer.MIN_VALUE → signals "no records to deliver AND stop due to bad record"
    • Outer code: acquiredCount += 0 (no increment)

The negative return values act as "circuit breakers" - they tell the outer loop to deliver what we have and stop further processing in that batch.
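
A minimal, self-contained sketch of the same signalling (names are simplified; the real SharePartition methods carry more state than this):

// Encode side: negative values flag "stop acquiring in this batch because a bad record was hit".
static int acquireSubset(int[] deliveryCounts, int threshold) {
    int acquired = 0;
    for (int deliveryCount : deliveryCounts) {
        if (deliveryCount >= threshold) {
            // Scenario 3: first record is bad, nothing acquired yet -> Integer.MIN_VALUE.
            // Scenario 2: some good records already acquired -> -acquired.
            return acquired == 0 ? Integer.MIN_VALUE : -acquired;
        }
        acquired++;
    }
    return acquired; // Scenario 1: no bad record, plain positive count.
}

// Decode side, mirroring the quoted snippet.
static int updateAcquiredCount(int acquiredCount, int acquiredSubsetCount) {
    if (acquiredSubsetCount < 0) {
        // Circuit breaker: keep the good records (if any) and acquire nothing further.
        return acquiredCount + (acquiredSubsetCount == Integer.MIN_VALUE ? 0 : -acquiredSubsetCount);
    }
    return acquiredCount + acquiredSubsetCount;
}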

Contributor

@apoorvmittal10 apoorvmittal10 left a comment

Thanks for the changes, can you please add examples in PR description around the fix in the PR.

@apoorvmittal10
Contributor

@DL1231 I am not sure if we are on the same page with the change. I am trying to write the problem statement and probable solution below:

Problem Statement:
There can be a bad record in a batch which can trigger an application crash (if the application doesn't handle bad records correctly), and which will keep bumping the batch's delivery count until the full batch is archived. However, the whole batch will be archived when a single bad record might have caused the crash.

Solution:
Determining which offset is bad is not possible at the broker's end. But the broker can restrict the acquired records to a subset so that only the bad record is skipped. We can do the following (a rough sketch follows the list):

  1. If the delivery count of a batch is >= 3, then only acquire 1/2 of the batch records, i.e. for a batch of 0-499 (500 records) with batch delivery count 3, start offset tracking and acquire 0-249 (250 records).
  2. If the delivery count is bumped again, then keep acquiring 1/2 of the previously acquired offsets until the last delivery attempt, i.e. 0-124 (125 records).
  3. For the last delivery attempt, acquire only 1 offset. Then only the bad record will be skipped.
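
Roughly, as an illustration only (method and parameter names are made up, and boundary handling such as what counts as the last delivery attempt is simplified):

// Illustrative sketch of the halving idea: halve the batch for every delivery
// attempt at or beyond the threshold, and isolate a single record on the last attempt.
static int recordsToAcquire(int batchSize, int batchDeliveryCount, int threshold, boolean finalDeliveryAttempt) {
    if (finalDeliveryAttempt) {
        return 1;                               // last attempt: acquire one offset only
    }
    if (batchDeliveryCount < threshold) {
        return batchSize;                       // healthy batch: acquire everything
    }
    int halvings = batchDeliveryCount - threshold + 1;
    return Math.max(1, batchSize >> halvings);  // e.g. 500 -> 250 -> 125 -> ...
}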

@DL1231
Contributor Author

DL1231 commented Nov 10, 2025

@apoorvmittal10 Thanks for your patient reply. I'd like to follow up with another question:

If offset 30 is the bad record.

  • First acquisition: Get half the records (0-249) → will still fail because it includes offset 30
  • Second acquisition: Get half of that range (0-125) → will still fail because it still includes offset 30
  • Final attempt: Get only 1 record (offset 0) → succeeds, but offset 30 (the actual bad record) remains untouched
  • Next acquisition range becomes 1-499 → will fail again

In this scenario, records 1-125 would have been delivered 5 times and eventually get archived, while the actual bad record at offset 30 continues to cause failures.

So this solution essentially reduces the impact range by about 3/4, but doesn't completely isolate the bad record. Is my understanding correct?

This seems to minimize the collateral damage rather than surgically removing the problematic record.

@apoorvmittal10
Contributor

apoorvmittal10 commented Nov 10, 2025

> Next acquisition range becomes 1-499 → will fail again

No, the next acquisition will be only offset 1, not 1-499. This is determined by looking at the offset delivery count; offset 1 will also be at the limit of the delivery count, with only the last attempt pending.

Continuing with the previous example, this will continue till offset 124, i.e. a single record is acquired each time as all of those offsets are on their final delivery attempt. Then the whole 125-499 range will be acquired, as it is not on the final attempt.
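
In other words, something along these lines at the offset level (purely illustrative, names made up):

// Once offset tracking has started, the delivery count of the next offset decides
// how much to acquire: an offset on its final attempt is acquired on its own.
static int nextAcquireCount(int offsetDeliveryCount, int deliveryCountLimit, int remainingInBatch) {
    if (offsetDeliveryCount >= deliveryCountLimit - 1) {
        return 1;                 // final attempt pending: acquire this single offset
    }
    return remainingInBatch;      // not on the final attempt: acquire the rest of the batch
}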

@DL1231
Contributor Author

DL1231 commented Nov 10, 2025

@apoorvmittal10 Thanks for the explanation. I get it now.

Regarding the scenario with multiple batches:

  • Offset 30 is a bad record
  • We have two batches: 0-100 (delivery count = 3) and 100-200 (delivery count = 0)

After the first acquisition fails and we retry by fetching half of the first batch (0-50), if we haven't reached maxFetchRecords yet, should we:

  • Continue and fetch data from the second batch (100-200), or
  • Return immediately with just the 0-50 data?

@apoorvmittal10
Contributor

> Regarding the scenario with multiple batches:
>
>   • Offset 30 is a bad record
>   • We have two batches: 0-100 (delivery count = 3) and 100-200 (delivery count = 0)
>
> After the first acquisition fails and we retry by fetching half of the first batch (0-50), if we haven't reached maxFetchRecords yet, should we:
>
>   • Continue and fetch data from the second batch (100-200), or
>   • Return immediately with just the 0-50 data?

I think we should only deal with a single batch in these scenarios.

Member

@AndrewJSchofield AndrewJSchofield left a comment

Thanks for the PR. Just a few comments and I see that the approach is being discussed in the comments.

* Records whose delivery count exceeds this are deemed abnormal,
* and the batching of these records should be reduced.
*/
private static final int BAD_RECORD_DELIVERY_THRESHOLD = 3;
Member

The configured delivery count limit can range from 2 to 10 inclusive. If the configured value is 2, we need to set the threshold as 2. If the configured value is larger, maybe half of the configured value would be a good threshold. So, for config=5, use 3 (default). For config=10, use 5. And so on.
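
For instance, something along these lines (illustrative only):

// Derive the threshold from the configured delivery count limit: half of the
// limit, rounded up, but never below 2.
static int badRecordDeliveryThreshold(int deliveryCountLimit) {
    return Math.max(2, (deliveryCountLimit + 1) / 2);
}
// limit=2 -> 2, limit=5 -> 3 (default), limit=10 -> 5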
