Apache Airflow Provider(s)
google
Versions of Apache Airflow Providers
No response
Apache Airflow version
main
Operating System
ubuntu
Deployment
Astronomer
Deployment details
No response
What happened
When implementing the Ephemeral Dataproc Cluster pattern:
Create Cluster -> Run Jobs -> Delete Cluster (TriggerRule.ALL_DONE)
There is a conflict between the default behavior of DataprocCreateClusterOperator and the downstream DataprocDeleteClusterOperator.
DataprocCreateClusterOperator has delete_on_error=True by default. If the cluster creation fails and ends up in an ERROR state, the operator automatically deletes the cluster.
- The downstream DataprocDeleteClusterOperator triggers (due to TriggerRule.ALL_DONE).
- It attempts to delete the cluster, which no longer exists.
- The DataprocDeleteClusterOperator fails with a NotFound (404) error from the Google Cloud API.
As a result, the cleanup task is marked as failed, which creates noise and can mask the actual upstream failure in monitoring views.
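The sequence above can be modelled without Airflow or Google Cloud. The sketch below is only an illustration: FakeDataproc, create_task, delete_task and run_pipeline are hypothetical stand-ins, not the provider's code, and the local NotFound class stands in for the exception the Google Cloud API raises on a 404. It assumes that with delete_on_error=False the failed cluster stays around long enough for the cleanup task to delete it.

```python
class NotFound(Exception):
    """Stand-in for the 404 NotFound error raised by the Google Cloud API."""


class FakeDataproc:
    """In-memory stand-in for the Dataproc API (hypothetical, not the real client)."""

    def __init__(self):
        self.clusters = {}

    def create_cluster(self, name, fail=False):
        # A failed provisioning leaves the cluster registered in ERROR state.
        self.clusters[name] = "ERROR" if fail else "RUNNING"
        return self.clusters[name]

    def delete_cluster(self, name):
        if name not in self.clusters:
            raise NotFound(f"Cluster {name} not found")
        del self.clusters[name]


def create_task(api, name, fail, delete_on_error=True):
    """Mimics the create step: optionally deletes a cluster that ended in ERROR."""
    state = api.create_cluster(name, fail=fail)
    if state == "ERROR":
        if delete_on_error:
            api.delete_cluster(name)
        raise RuntimeError(f"Cluster {name} entered ERROR state")


def delete_task(api, name):
    """Mimics the cleanup step: a plain delete with no NotFound handling."""
    api.delete_cluster(name)


def run_pipeline(delete_on_error):
    """Runs create -> delete, with delete always running (like ALL_DONE)."""
    api = FakeDataproc()
    states = {}
    try:
        create_task(api, "ephemeral", fail=True, delete_on_error=delete_on_error)
        states["create"] = "success"
    except RuntimeError:
        states["create"] = "failed"
    try:
        delete_task(api, "ephemeral")
        states["delete"] = "success"
    except NotFound:
        states["delete"] = "failed"
    return states


if __name__ == "__main__":
    print(run_pipeline(delete_on_error=True))
    print(run_pipeline(delete_on_error=False))
```

In this model, the default delete_on_error=True leaves both tasks failed, while delete_on_error=False leaves only the create task failed.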
What you think should happen instead
DataprocDeleteClusterOperator should ideally be idempotent. If the cluster is already deleted (returns 404 NotFound), the operator should consider the task successful (or skipped) rather than failed.
Currently, the deferrable mode implementation checks for existence:
try:
    hook.get_cluster(...)
except NotFound:
    self.log.info("Cluster deleted.")
    return
However, the standard synchronous execute path does not seem to catch NotFound exceptions during the delete operation.
How to reproduce
- Create a DAG with DataprocCreateClusterOperator -> DataprocDeleteClusterOperator (with trigger_rule=TriggerRule.ALL_DONE).
- Force the cluster creation to enter an ERROR state (e.g., by providing invalid configuration that passes validation but fails provisioning).
- DataprocCreateClusterOperator will delete the cluster and fail.
- DataprocDeleteClusterOperator will run, attempt to delete the missing cluster, and fail with NotFound.
Anything else
Proposed behaviour:
- Update DataprocDeleteClusterOperator to catch NotFound exceptions during the delete operation and log a message instead of raising an error.
- Alternatively, update documentation to explicitly recommend setting delete_on_error=False in DataprocCreateClusterOperator when an explicit delete task is used.
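A minimal sketch of the first option, written against stand-ins so it runs on its own. FakeHook and delete_cluster_idempotent are hypothetical names, and the local NotFound class stands in for the Google Cloud API's 404 exception; the real operator's execute method and hook signature may differ.

```python
import logging

log = logging.getLogger(__name__)


class NotFound(Exception):
    """Stand-in for the 404 NotFound error raised by the Google Cloud API."""


class FakeHook:
    """Hypothetical hook that only tracks which cluster names exist."""

    def __init__(self, existing):
        self.existing = set(existing)

    def delete_cluster(self, cluster_name):
        if cluster_name not in self.existing:
            raise NotFound(f"Cluster {cluster_name} not found")
        self.existing.remove(cluster_name)


def delete_cluster_idempotent(hook, cluster_name):
    """Delete a cluster, treating 'already gone' as success.

    Returns True if a delete was performed, False if the cluster
    was already absent.
    """
    try:
        hook.delete_cluster(cluster_name=cluster_name)
    except NotFound:
        # The desired end state (no cluster) already holds, so do not fail.
        log.info("Cluster %s already deleted; nothing to do.", cluster_name)
        return False
    return True


if __name__ == "__main__":
    hook = FakeHook({"ephemeral"})
    print(delete_cluster_idempotent(hook, "ephemeral"))  # first call deletes
    print(delete_cluster_idempotent(hook, "ephemeral"))  # second call is a no-op
```

Only NotFound is caught here, so other API errors (permissions, quota, timeouts) would still fail the task as they do today.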
Are you willing to submit PR?
Code of Conduct