Batch Processing - Why handle full batch failures differently than partial failures? #1785

RaphaelManke · 2023-11-08T15:03:59Z


I was writing tests for the Batch processor and noticed that the behaviour differs between a partial failure and a full batch failure. The docs describe it like this:

Partial failure mechanics
All records in the batch will be passed to this handler for processing, even if exceptions are thrown. Here's the behaviour after completing the batch:
All records successfully processed. We will return an empty list of item failures {'batchItemFailures': []}
Partial success with some exceptions. We will return a list of all item IDs/sequence numbers that failed processing
All records failed to be processed. We will raise BatchProcessingError exception with a list of all exceptions raised when processing

In my project we don't expect big batch sizes, so a batch of size 1 will happen quite often.

So in some cases a failed item results in a thrown error, which is then also counted as a failed invocation in the metrics. In other cases where a failure happened, the Lambda returns a failure array and the invocation is counted as successful.

That behaviour is suboptimal I think. It makes it harder to build alerts for the function.
On top of that I need to write two test cases for each lambda handler because the behaviour could be different.

Is there a particular reason for that behaviour?

My expectation would have been that the partial failure array would be returned in all cases, even if the full batch failed.
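
To make the difference concrete, here is a minimal self-contained sketch of the behaviour quoted from the docs. It does not use the actual Batch utility: `processBatch`, `FullBatchError`, `failOnBad`, and the record shape are hypothetical stand-ins. Only the `batchItemFailures` response shape comes from the quoted docs and the Lambda partial batch response format.

```typescript
// Hypothetical stand-ins that only model the documented behaviour; not the utility's real API.
type BatchRecord = { messageId: string; body: string };
type ItemFailure = { itemIdentifier: string };
type BatchResponse = { batchItemFailures: ItemFailure[] };

class FullBatchError extends Error {
  readonly errors: Error[];
  constructor(errors: Error[]) {
    super(`All ${errors.length} records failed`);
    this.name = "FullBatchError";
    this.errors = errors;
  }
}

function processBatch(
  records: BatchRecord[],
  handler: (record: BatchRecord) => void
): BatchResponse {
  const failures: ItemFailure[] = [];
  const errors: Error[] = [];
  // Every record is passed to the handler, even if earlier ones threw.
  for (const record of records) {
    try {
      handler(record);
    } catch (err) {
      failures.push({ itemIdentifier: record.messageId });
      errors.push(err instanceof Error ? err : new Error(String(err)));
    }
  }
  // Full batch failure: throw instead of returning the failure list.
  if (records.length > 0 && errors.length === records.length) {
    throw new FullBatchError(errors);
  }
  return { batchItemFailures: failures };
}

// Record handler that fails on any record whose body is "bad".
const failOnBad = (record: BatchRecord): void => {
  if (record.body === "bad") {
    throw new Error(`cannot process ${record.messageId}`);
  }
};

// Batch of two with one failure: the invocation succeeds and reports the failed item.
const partial = processBatch(
  [
    { messageId: "1", body: "ok" },
    { messageId: "2", body: "bad" },
  ],
  failOnBad
);

// Batch of one with one failure: the same kind of record error now fails the whole invocation.
let thrown: unknown;
try {
  processBatch([{ messageId: "3", body: "bad" }], failOnBad);
} catch (err) {
  thrown = err;
}
```

In this model the same record-level error produces a successful invocation in the first call and a failed one in the second, which is why each handler ends up needing two test cases.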

Answered by dreamorosi


dreamorosi (Maintainer) · 2023-11-08T17:57:34Z

Hi @RaphaelManke - thanks for the question.

Let me check with the team and I'll get back to you on this tomorrow.


dreamorosi (Maintainer) · 2024-02-21T17:26:21Z

Hi again!

I remember we discussed this either offline (Twitter/Discord DMs) or in person at re:Invent, but for the sake of other people bumping into this I'll write down the answer here as well.

The initial implementation was made this way because, from the producer's perspective (e.g. SQS, Kinesis), a consumer that throws an error is functionally equivalent to one that returns a partial failure list containing all the items. In both cases every item in the batch is eligible to be retried.
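
As a rough illustration of that equivalence, the sketch below models a queue-style source in a deliberately simplified way. `Outcome` and `idsEligibleForRetry` are hypothetical names, and the model ignores details such as retry limits, visibility timeouts, and stream ordering.

```typescript
// Simplified, hypothetical model of how a queue-style source treats a consumer's result.
type Outcome =
  | { kind: "threw" }
  | { kind: "returned"; batchItemFailures: { itemIdentifier: string }[] };

// Returns the message IDs that become eligible for retry after the invocation.
function idsEligibleForRetry(batchIds: string[], outcome: Outcome): string[] {
  if (outcome.kind === "threw") {
    // An invocation error makes the whole batch eligible again.
    return batchIds.slice();
  }
  // Otherwise only the reported items are eligible.
  const failedIds = outcome.batchItemFailures.map((f) => f.itemIdentifier);
  return batchIds.filter((id) => failedIds.indexOf(id) !== -1);
}

const batchIds = ["a", "b", "c"];

// Consumer threw an error.
const afterThrow = idsEligibleForRetry(batchIds, { kind: "threw" });

// Consumer returned a failure list that contains every item.
const afterFullList = idsEligibleForRetry(batchIds, {
  kind: "returned",
  batchItemFailures: batchIds.map((id) => ({ itemIdentifier: id })),
});
```

Under this model both outcomes leave the same items eligible for retry. What differs is only how the invocation itself is counted on the consumer side.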

With this in mind, we decided to throw an error to explicitly reflect the full batch failure in the operational metrics (i.e. function runtime errors). Looking at your use case, however, I can see how this can be problematic and end up skewing your operational metrics, since smaller batches are more likely to fail entirely and thus cause an error to be thrown.

I have opened a feature request (#2122) to add a throwOnFullBatchFailure option to the utility so that customers can opt out of the behavior if, like in your case, they expect frequent full batch failures or simply don't want errors to be thrown. I have linked this discussion but would appreciate your 👍 on it if you can.
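
For readers wondering what such an opt-out could look like, here is a hedged sketch. Only the option name `throwOnFullBatchFailure` comes from the feature request. Where the option is passed, its default of `true`, and the `finalizeBatch` helper are assumptions made for illustration, not the utility's actual API.

```typescript
type ItemFailure = { itemIdentifier: string };
type BatchResponse = { batchItemFailures: ItemFailure[] };

// Option name taken from the feature request; its placement and default are assumptions.
type FinalizeOptions = { throwOnFullBatchFailure?: boolean };

// Hypothetical helper: decides what to do once every record has been processed.
function finalizeBatch(
  totalRecords: number,
  failures: ItemFailure[],
  options: FinalizeOptions = {}
): BatchResponse {
  // Assumed default: keep the existing behaviour unless explicitly disabled.
  const shouldThrow = options.throwOnFullBatchFailure !== false;
  const isFullFailure = totalRecords > 0 && failures.length === totalRecords;
  if (isFullFailure && shouldThrow) {
    throw new Error(`All ${totalRecords} records failed`);
  }
  // Opted out (or not a full failure): always return the failure list.
  return { batchItemFailures: failures };
}

const singleFailure: ItemFailure[] = [{ itemIdentifier: "1" }];

// Opted out: a failed batch of one is reported through the failure list.
const optedOut = finalizeBatch(1, singleFailure, { throwOnFullBatchFailure: false });

// Default: the same failed batch of one throws.
let defaultThrew = false;
try {
  finalizeBatch(1, singleFailure);
} catch (_err) {
  defaultThrew = true;
}
```

With the opt-out, a handler test only needs to assert on the returned `batchItemFailures`, regardless of whether some or all records failed.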
