Batch Processing - Why handle full batch failures differently than partial failures? #1785
I was writing tests for the batch processor and noticed that its behaviour differs between a partial failure and a full batch failure.
In my project we don't expect large batches, so a batch size of 1 will occur quite often. In some cases a failed item results in a thrown error, which is then also counted as a failed invocation in the metrics. In other cases where a failure happened, the Lambda function returns a failureArray and the invocation is counted as successful. I think that behaviour is suboptimal, because it makes it harder to build alerts for the function. Is there a particular reason for it? My expectation would have been that the partial failure array is returned in all cases, even if the full batch failed.
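To make the two outcomes concrete, here is a minimal sketch of the behaviour described above. It does not use the library: `processBatch`, `BatchRecord`, and `failOnBad` are hypothetical names made up for illustration, and only the `batchItemFailures` / `itemIdentifier` response shape comes from Lambda itself.

```typescript
// Partial batch response shape that Lambda accepts from a function.
type BatchResponse = { batchItemFailures: { itemIdentifier: string }[] };

// Hypothetical record and handler types, for illustration only.
type BatchRecord = { id: string; body: string };
type RecordHandler = (record: BatchRecord) => void;

// Hypothetical stand-in for the behaviour described in this thread:
// some records fail -> return their ids; every record fails -> throw.
function processBatch(records: BatchRecord[], handler: RecordHandler): BatchResponse {
  const failedIds: string[] = [];
  for (const record of records) {
    try {
      handler(record);
    } catch {
      failedIds.push(record.id);
    }
  }
  if (records.length > 0 && failedIds.length === records.length) {
    throw new Error(`All ${records.length} records failed processing`);
  }
  return { batchItemFailures: failedIds.map((id) => ({ itemIdentifier: id })) };
}

// Hypothetical handler that rejects records whose body is "bad".
const failOnBad: RecordHandler = (record) => {
  if (record.body === "bad") throw new Error("cannot process record");
};

// Two records, one fails: the invocation succeeds and reports one item.
const partial = processBatch(
  [{ id: "1", body: "ok" }, { id: "2", body: "bad" }],
  failOnBad,
);

// Batch of size 1 whose only record fails: the invocation itself errors.
let fullFailureThrew = false;
try {
  processBatch([{ id: "3", body: "bad" }], failOnBad);
} catch {
  fullFailureThrew = true;
}
```

With a batch size of 1 every failed record takes the second path, so each one shows up as a function error instead of a successful invocation with a failure list.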
Replies: 2 comments
Hi @RaphaelManke - thanks for the question. Let me check with the team and I'll get back to you on this tomorrow.
Hi again!
I remember we discussed this either offline (Twitter/Discord DMs) or in person at re:Invent, but for the sake of other people bumping into this I'll write down the answer here as well.
The initial implementation was made this way because, from a producer perspective (e.g. SQS, Kinesis, etc.), a consumer that throws an error is functionally equivalent to one that returns a partial failure list containing all the items. In both cases, all the items in the batch are eligible to be retried.
With this in mind, we decided to throw an error to explicitly reflect the full batch failure in the operational metrics (i.e. function runtime errors). Looking at your use case, however, I can see how this can be problematic and end up skewing your operational metrics, since smaller batches are more likely to fail entirely and thus cause an error to be thrown. I have opened a feature request (#2122) to add a …
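For other readers, the equivalence mentioned above can be sketched as a small model of what the event source does after an invocation. This is a simplified, SQS-like assumption rather than the service's or the library's actual logic, and `Outcome` and `idsEligibleForRetry` are hypothetical names introduced only for this example.

```typescript
// Partial batch response shape that Lambda accepts from a function.
type BatchResponse = { batchItemFailures: { itemIdentifier: string }[] };

// Outcome of one invocation as seen by the event source: either the
// function errored, or it returned a partial batch response.
type Outcome =
  | { kind: "error" }
  | { kind: "response"; response: BatchResponse };

// Hypothetical, simplified model: which item ids become eligible for retry.
function idsEligibleForRetry(batchIds: string[], outcome: Outcome): string[] {
  if (outcome.kind === "error") {
    // A thrown error makes the whole batch eligible for retry.
    return [...batchIds];
  }
  // Otherwise only the items listed in batchItemFailures are retried.
  const failed = new Set(
    outcome.response.batchItemFailures.map((failure) => failure.itemIdentifier),
  );
  return batchIds.filter((id) => failed.has(id));
}

const batch = ["a", "b"];

// Case 1: the function throws.
const afterThrow = idsEligibleForRetry(batch, { kind: "error" });

// Case 2: the function returns a failure list containing every item.
const afterFullList = idsEligibleForRetry(batch, {
  kind: "response",
  response: {
    batchItemFailures: [{ itemIdentifier: "a" }, { itemIdentifier: "b" }],
  },
});
```

Both cases make the same items eligible for retry. The difference is only on the function side: the first is recorded as a function error, while the second is recorded as a successful invocation.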