Large XCom Payload Causes Task Heartbeat Timeout #64628

@AutomationDev85

Description

Apache Airflow version

3.1.7

What happened and how to reproduce it?

We attempted to push a very large XCom payload (over 300 MB) from a worker task to the XCom table.
We understand this is not the ideal approach for data of this size, and that other mechanisms are generally better suited.

However, the main reason for opening this issue is the behavior we observed: uploading the XCom value through the supervisor took about 18 minutes. During that time, the supervisor was blocked by the XCom push and could not process heartbeats. As a result, the task timed out and was marked as failed.

This likely affects other metadata database backends as well. The default heartbeat timeout is 300 seconds, so whenever an XCom push takes longer than that, the scheduler marks the task as failed.

Has anyone experienced the same issue, and do you have suggestions for how to solve it?
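For reference, a minimal sketch of the kind of payload involved (pure Python, scaled down from the real ~300 MB; the function name and sizes are illustrative, not taken from our actual DAG):

```python
import json


def build_payload(target_mib: int) -> str:
    """Build a JSON string of roughly target_mib MiB, mimicking the
    large result our task returns and the supervisor then pushes to
    the XCom table."""
    chunk = "x" * 1024  # ~1 KiB of data per list entry
    return json.dumps({"rows": [chunk] * (target_mib * 1024)})


# Scaled down to ~1 MiB here; the real task returned over 300 MB.
payload = build_payload(1)
print(len(payload) > 1024 * 1024)  # True: payload exceeds 1 MiB
```

In the DAG, the task simply returns a value of this shape, so it is pushed as an XCom by the supervisor; at ~300 MB that push took about 18 minutes, during which no heartbeats were processed.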

What you think should happen instead?

Pushing an XCom value should not block task heartbeats.
If the payload is large and the upload takes longer than the heartbeat timeout, the task should keep sending heartbeats (or fail fast with a clear XCom-size error) instead of being marked failed due to a heartbeat timeout.

Operating System

No response

Versions of Apache Airflow Providers

No response

Deployment

None

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

  • Yes, I am willing to submit a PR!


    Labels

    area:Scheduler, area:core, kind:bug, needs-triage, priority:high
