Fix broken error handling around Idempotent producer + Ensure strict ordering when Net.MaxOpenRequests = 1 #2943

richardartoul · 2024-07-19T22:26:37Z

There have been numerous issues filed lately that:

Strict ordering no longer works after request pipelining was introduced even when Net.MaxOpenRequests is set to 1.
Requests can still end up failing / mis-sequenced when Idempotent producer is enabled.

I was able to reproduce these issues both locally and in a staging environment and have fixed both of them. P.R has many comments explaining the changes.

Relevant issues:

Signed-off-by: Richard Artoul <[email protected]>

richardartoul · 2024-07-20T00:28:54Z

Hmm... I think it may still be possible for concurrent requests to be issued sendResponse writes to a channel, and once the response is picked up off that channel there could be an inflight request from the next request + an inflight retry actually. I'm not sure how to resolve that other then somehow "waiting" for a response to be processed (either failing as an error or being retried)

puellanivis

I’m somewhat concerned by the number of panic()s in the code. Are these really so serious that they need to potentially crash the whole program? and/or kill off a processing goroutine, without any information up-the-chain that processing has terminated?

puellanivis · 2024-07-21T05:47:22Z

async_producer.go

@@ -249,6 +250,19 @@ func (pe ProducerError) Unwrap() error {
 type ProducerErrors []*ProducerError

 func (pe ProducerErrors) Error() string {
+	if len(pe) > 0 {


Should this ever actually be produced with zero messages?

If it’s unlikely to ever happen with len(pe) == 0 then that should be the guard condition, and then the complex error message should be the unindented path.

puellanivis · 2024-07-21T05:48:39Z

async_producer.go

@@ -695,6 +709,9 @@ func (pp *partitionProducer) dispatch() {
 		// All messages being retried (sent or not) have already had their retry count updated
 		// Also, ignore "special" syn/fin messages used to sync the brokerProducer and the topicProducer.
 		if pp.parent.conf.Producer.Idempotent && msg.retries == 0 && msg.flags == 0 {
+			if msg.hasSequence {
+				panic("assertion failure: reassigning producer epoch and sequence number to message that already has them")


https://go.dev/wiki/CodeReviewComments#dont-panic

Is the condition here so bad that we need to panic? (That is, is it entirely unrecoverable?)

puellanivis · 2024-07-21T05:57:37Z

produce_set.go

+			Logger.Println(
+				"assertion failed: message out of sequence added to batch",
+				"producer_id",
+				ps.producerID,
+				set.recordsToSend.RecordBatch.ProducerID,
+				"producer_epoch",
+				ps.producerEpoch,
+				set.recordsToSend.RecordBatch.ProducerEpoch,
+				"sequence_number",
+				msg.sequenceNumber,
+				set.recordsToSend.RecordBatch.FirstSequence,
+				"buffer_count",
+				ps.bufferCount,
+				"msg_has_sequence",
+				msg.hasSequence)


I would recommend leaving the log message on the same line as the method call, so that it’s easily findable via grep, and otherwise one-line isolates well.

Additionally, if the line is long enough to break up, then the possibility of adding even more fields is high, so each entry should end with a comma and a newline, so adding new fiels to the end of the call don’t produce unnecessary line changes, where the only change is in punctuation due to syntax requirements.

Then I would pair up each log field name with the log field value, all together:

Logger.Println("assertion failed: message out of sequence added to batch", "producer_id", ps.producerID, set.recordsToSend.RecordBatch.ProducerID, "producer_epoch", ps.producerEpoch, set.recordsToSend.RecordBatch.ProducerEpoch, "sequence_number", msg.sequenceNumber, set.recordsToSend.RecordBatch.FirstSequence, "buffer_count", ps.bufferCount, "msg_has_sequence", msg.hasSequence, )

puellanivis · 2024-07-21T06:00:37Z

async_producer.go

+	}
+
+	if !succeeded {
+		Logger.Printf("Failed retrying batch for %v-%d because of %v while looking up for new leader, no more retries\n", topic, partition)


[nitpick] Newlines at the end should be unnecessary for loggers? (I mean, this is generally the case, but I don’t know if that is specifically true here.)

Three % verbs are specified but only two arguments are given.

puellanivis · 2024-07-21T06:02:45Z

async_producer.go

+	// as expected. This retry loop is very important since prematurely (and unnecessarily) failing
+	// an idempotent batch is ~equivalent to data loss.
+	succeeded := false
+	for i := 0; i < p.conf.Producer.Retry.Max; i++ {


Suggest using a different variable name if we’re trying retries/tries rather than indices.

[off-by-one smell] Are we counting retries, or tries? That is, if I’ve asked for 5 retries max, then that’s 6 total tries.

lifepuzzlefun · 2024-09-29T07:28:34Z

any progress on this ？I think this bug is very urgent to fix

jgersdorf · 2024-10-28T15:45:12Z

As we are seeing also a lot of kafka server: The broker received an out of order sequence number errors, we would be interessted in this fix.

jgersdorf · 2024-11-05T07:20:16Z

I guess @richardartoul has not answered to @puellanivis review comments yet?

@richardartoul : Do you think you can continue with this PR, or is this PR abandoned?

ulpcan · 2024-11-05T12:23:37Z

We are waiting for this MR to be continued

richardartoul · 2024-11-06T23:48:47Z

Hey guys, I'm really sorry but I wasn't able to convince myself this P.R was correct or materially improved the situation. There is definitely a bug in the library, but I'm suspicious there is more than one issue and I timeboxed myself to two days on this and don't have more time to continue trying to fix it

github-actions · 2025-02-05T00:13:28Z

Thank you for your contribution! However, this pull request has not had any activity in the past 90 days and will be closed in 30 days if no updates occur.
If you believe the changes are still valid then please verify your branch has no conflicts with main and rebase if needed. If you are awaiting a (re-)review then please let us know.

richardartoul force-pushed the main branch from ecba376 to 3bba76c Compare July 19, 2024 22:27

richardartoul added 11 commits July 19, 2024 17:29

sync produce

0a07760

Signed-off-by: Richard Artoul <[email protected]>

add sequence for retries

78b02d0

Signed-off-by: Richard Artoul <[email protected]>

panic if no sequence

46d19e5

Signed-off-by: Richard Artoul <[email protected]>

maintain producer ID and epoch when retrying batch

e715c7d

Signed-off-by: Richard Artoul <[email protected]>

panic out of order

d22f96f

Signed-off-by: Richard Artoul <[email protected]>

wip

337aab2

Signed-off-by: Richard Artoul <[email protected]>

improve and add comments

b7cc6fc

Signed-off-by: Richard Artoul <[email protected]>

remove unnecessary check

82e8795

Signed-off-by: Richard Artoul <[email protected]>

better comment

401c990

Signed-off-by: Richard Artoul <[email protected]>

Add kafkastorageerror as retriable

2d7cdb6

Signed-off-by: Richard Artoul <[email protected]>

fix typo

1cf850b

Signed-off-by: Richard Artoul <[email protected]>

richardartoul force-pushed the main branch from 3bba76c to 1cf850b Compare July 19, 2024 22:29

This was referenced Jul 19, 2024

AsyncProducer produces messages in out-of-order when retries happen #2619

Open

Sarama Async Producer Encounters 'Out of Order' Error: what are the reasons? #2803

Closed

Does sarama still guarantee message ordering? #2860

Open

richardartoul added 3 commits July 20, 2024 17:03

improve error logging

1125d60

more logging

b73a5f2

dont clear message before logging

0222838

puellanivis reviewed Jul 21, 2024

View reviewed changes

github-actions bot added the stale Issues and pull requests without any recent activity label Feb 5, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix broken error handling around Idempotent producer + Ensure strict ordering when Net.MaxOpenRequests = 1 #2943

Fix broken error handling around Idempotent producer + Ensure strict ordering when Net.MaxOpenRequests = 1 #2943

richardartoul commented Jul 19, 2024

richardartoul commented Jul 20, 2024 •

edited

Loading

puellanivis left a comment

puellanivis Jul 21, 2024

puellanivis Jul 21, 2024

puellanivis Jul 21, 2024

puellanivis Jul 21, 2024

puellanivis Jul 21, 2024

lifepuzzlefun commented Sep 29, 2024

jgersdorf commented Oct 28, 2024

jgersdorf commented Nov 5, 2024 •

edited

Loading

ulpcan commented Nov 5, 2024

richardartoul commented Nov 6, 2024

github-actions bot commented Feb 5, 2025

Fix broken error handling around Idempotent producer + Ensure strict ordering when Net.MaxOpenRequests = 1 #2943

Are you sure you want to change the base?

Fix broken error handling around Idempotent producer + Ensure strict ordering when Net.MaxOpenRequests = 1 #2943

Conversation

richardartoul commented Jul 19, 2024

richardartoul commented Jul 20, 2024 • edited Loading

puellanivis left a comment

Choose a reason for hiding this comment

puellanivis Jul 21, 2024

Choose a reason for hiding this comment

puellanivis Jul 21, 2024

Choose a reason for hiding this comment

puellanivis Jul 21, 2024

Choose a reason for hiding this comment

puellanivis Jul 21, 2024

Choose a reason for hiding this comment

puellanivis Jul 21, 2024

Choose a reason for hiding this comment

lifepuzzlefun commented Sep 29, 2024

jgersdorf commented Oct 28, 2024

jgersdorf commented Nov 5, 2024 • edited Loading

ulpcan commented Nov 5, 2024

richardartoul commented Nov 6, 2024

github-actions bot commented Feb 5, 2025

richardartoul commented Jul 20, 2024 •

edited

Loading

jgersdorf commented Nov 5, 2024 •

edited

Loading