Implement execution_timeout semantics for DbtCloudRunJobOperator in deferrable mode#61472
Conversation
943cabc to
a7f2927
Compare
|
Requesting review for this. |
potiuk
left a comment
There was a problem hiding this comment.
That looks good. Though I would feel more comfortable if I saw some output of running job/ screenshot showing the behaviour. Possible @SameerMesiah97 ?
My free trial has expired but I will see if I can still use the DBT Cloud API. I will update the PR description with the screenshots once I am able to reproduce it again. There is a link to the issue which has comprehensive reproduction steps so I believe it can be easily verified by someone with DBT Cloud API access. For future reference, would you advise including screenshots by default for any bugs? Or are explicit steps sufficient and this is more of a sanity check on a case-by-case basis? Edit: I just attempted to use the DBT Cloud API (which is necessary to reproduce this bug). No accesss as free trial has expired. @potiuk |
in case there is something a bit more complex withich requires access to services |
I was just wondering whether screenshots are strictly required for this PR to be validated? |
|
if possible, could you have a look at this as well? |
a7f2927 to
2fd5254
Compare
Restore execution_timeout semantics in deferrable mode by propagating timeouts through the trigger and explicitly cancelling dbt Cloud jobs when the task exceeds its execution deadline. This preserves behavior parity with non-deferrable execution and avoids leaking dbt jobs.
2fd5254 to
7138f79
Compare
I have used the following DAG to test the behavior of Behavior before the fix Job run status after You can see that the job keeps running even after the timeout period. Behavior after the fix Job run status after The Job run is cancelled via a request by the operator due to my implementation. Note: Only the last 4 digits of the JOB ID and RUN ID have been provided for security reasons but it should be sufficient for you to match the DAG runs to the Job runs. Additional private information has been blacked out as well for the same reason. |
vincbeck
left a comment
There was a problem hiding this comment.
This looks like more a bug/something we missed in Airflow in general rather than one bug in a provider? Am I wrong?
That’s definitely possible but at the moment I cannot see any opportunities for a generalisable abstraction because cancellation semantics may differ between operators. And whilst I suspect other operators might have this class of bug, I don’t have concrete evidence. So based on what I know now, this is the most appropriate solution I can think of for this operator. However, I will definitely check other operators to see if I can reproduce this type of bug for them as well. |




Description
This change implements
execution_timeoutsemantics forDbtCloudRunJobOperatorwhen running in deferrable mode.Previously, when the operator deferred execution,
execution_timeoutwas not enforced because the task process is no longer running and the scheduler cannot terminate it via SIGTERM. As a result, dbt Cloud jobs could continue running after the Airflow task exceeded its execution timeout.This update restores parity with non-deferrable execution by explicitly enforcing execution timeouts through the trigger/operator interaction. The operator now computes an absolute execution deadline derived from
execution_timeoutbefore deferring, the trigger emits a timeout event when that deadline is exceeded, and the operator cancels the dbt Cloud job and fails the task when the event is received.Rationale
In non-deferrable mode,
execution_timeoutis enforced by the scheduler, which terminates the task process and invokeson_kill()to cancel the external dbt Cloud job.In deferrable mode, execution is handed off to a trigger running in the triggerer process, which does not have an associated worker process that can be terminated. However, the absence of a worker process does not remove the requirement to honor task-level execution semantics. From a user perspective,
execution_timeoutrepresents a hard task-level limit that should behave consistently regardless of whether execution is deferrable or not.Without explicit handling in the trigger/operator interaction,
execution_timeoutsilently stops working in deferrable mode, leading to leaked dbt Cloud jobs and inconsistent behavior between deferrable and non-deferrable execution. Deferrable execution should adapt how timeouts are enforced, but not whether they are enforced.This change ensures
execution_timeoutsemantics are preserved in deferrable mode and remain consistent with non-deferrable execution.Notes
timeoutparameter continues to limit only how long the operator waits for job completion and does not imply cancellation.execution_timeoutandtimeoutare set, the earlier deadline takes precedence.Tests
execution_deadlinefield is always serialized.Documentation
DbtCloudRunJobTriggerhas been updated to document the newexecution_deadlineparameter and clarify its behavior.DbtCloudRunJobOperatorhas been updated to clarify the behavior of thetimeoutparameter and distinguish it from task-levelexecution_timeout.Backwards Compatibility
This change does not alter public APIs or method signatures. The runtime behavior changes in that dbt Cloud jobs are now explicitly cancelled when
execution_timeoutis reached during deferrable execution, whereas previously the job could continue running after the task timed out.Closes: #61467