Large XCom Payload Causes Task Heartbeat Timeout #64628
Description
Apache Airflow version
3.1.7
What happened and how to reproduce it?
We attempted to push a very large XCom payload (over 300 MB) from a worker task to the XCom table.
We understand this is not the ideal approach for data of this size, and that other mechanisms are generally better suited.
However, the main reason for opening this issue is the behavior we observed: uploading the XCom value through the supervisor took about 18 minutes. During that time, the supervisor was blocked by the XCom push and could not process heartbeats. As a result, the task timed out and was marked as failed.
This may also affect other backends. The default heartbeat timeout is 300 seconds, so any XCom push that takes longer than that causes the scheduler to mark the task as failed.
Has anyone experienced the same issue, and do you have suggestions for how to solve it?
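As one possible workaround (an assumption on our side, not an Airflow recommendation): stage the large payload on storage shared between tasks and push only a small reference through XCom, so the supervisor's XCom push stays fast. The helper names below are hypothetical, and a local temp file stands in for whatever shared storage the deployment actually uses:

```python
import json
import os
import tempfile

# Hypothetical workaround sketch: persist the large payload outside XCom
# and pass only its path (a few bytes) between tasks. In a real deployment
# the temp file would be replaced by shared/object storage reachable by
# all workers.

def stage_large_payload(payload) -> str:
    """Write the payload to storage and return its path.

    The returned path is what would be pushed to XCom instead of the
    payload itself.
    """
    fd, path = tempfile.mkstemp(suffix=".json")
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f)
    return path

def load_large_payload(path: str):
    """Read the staged payload back in the downstream task."""
    with open(path) as f:
        return json.load(f)
```

With this shape, the XCom value is only the path string, so the push completes well within the heartbeat timeout regardless of payload size.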
What you think should happen instead?
Pushing an XCom value should not block task heartbeats.
If the payload is large and the upload takes longer than the heartbeat timeout, the task should continue sending heartbeats (or fail with a clear XCom-size error) instead of being marked failed due to heartbeat timeout.
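The "clear XCom-size error" option could look roughly like the sketch below: check the serialized size of a value before pushing it and fail fast with an explicit message. This is not an existing Airflow API; the function name and the limit are illustrative only:

```python
import json

# Illustrative limit, not an Airflow default.
MAX_XCOM_MB = 10

def check_xcom_size(value, limit_mb: int = MAX_XCOM_MB) -> int:
    """Return the serialized size in bytes, or raise a clear error.

    Hypothetical pre-push guard: serialize the XCom value the same way it
    would be stored and reject it up front instead of letting a huge push
    stall the supervisor until the heartbeat times out.
    """
    size = len(json.dumps(value).encode("utf-8"))
    if size > limit_mb * 1024 * 1024:
        raise ValueError(
            f"XCom payload is {size / (1024 * 1024):.1f} MB, "
            f"which exceeds the {limit_mb} MB limit"
        )
    return size
```

A guard like this would surface the problem as an actionable error at push time rather than as an opaque heartbeat-timeout failure.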
Operating System
No response
Versions of Apache Airflow Providers
No response
Deployment
None
Deployment details
No response
Anything else?
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct