Isolate per-dag-run failures in _schedule_all_dag_runs() to prevent single DagRun crashing the Scheduler#62893

Open
XD-DENG wants to merge 2 commits into apache:main from
XD-DENG:isolate-dag-run-scheduling-failures

@XD-DENG XD-DENG commented Mar 4, 2026

What's the issue

Previously, _schedule_all_dag_runs() used a list comprehension to process all DagRuns. If _schedule_dag_run() raised an exception for any single DagRun, the comprehension aborted, the remaining DagRuns were not processed, and the exception propagated up and crashed the scheduler process, stopping scheduling for ALL DAGs.
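To illustrate the failure mode in isolation, here is a minimal sketch in plain Python. It does not use Airflow; `schedule_dag_run`, `processed`, and the run ids are hypothetical stand-ins for the scheduler's per-DagRun call and its inputs.

```python
processed: list[str] = []


def schedule_dag_run(run_id: str) -> str:
    """Hypothetical stand-in for the per-DagRun scheduling call."""
    if run_id == "bad":
        raise TypeError("simulated failure in one DagRun")
    processed.append(run_id)
    return run_id


run_ids = ["a", "bad", "b"]

try:
    # One raising element aborts the whole comprehension.
    results = [schedule_dag_run(r) for r in run_ids]
except TypeError:
    # In the scheduler there is no such handler, so the exception keeps
    # propagating; it is caught here only so the sketch can inspect state.
    results = None
```

After this runs, only the run before the failing one has been handled, and the comprehension produced no list at all; the run after the failing one is never attempted.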

How to reproduce

I ran a DagRun, then edited the database directly to set one TaskInstance's state to up_for_retry and its end_date to None. The scheduler then crashed with the error below:

scheduler  | File "/Users/xd/Downloads/airflow-test-venv/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 1005, in next_retry_datetime
scheduler  | return self.end_date + delay
scheduler  | ~~~~~~~~~~~~~~^~~~~~~
scheduler  | TypeError: unsupported operand type(s) for +: 'NoneType' and 'datetime.timedelta'
scheduler  | 2026-03-04T20:48:14.558116Z [info     ] Shutting down LocalExecutor; waiting for running tasks to finish.  Signal again if you don't want to wait. [airflow.executors.local_executor.LocalExecutor] loc=local_executor.py:252

While the specific scenario used to reproduce this (a TaskInstance with state=UP_FOR_RETRY and end_date=NULL) is nearly impossible under normal operation, the lack of per-dag-run fault isolation means ANY unexpected exception from ANY DagRun would have the same fatal effect.

(tested with Airflow 3.1.7)
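The TypeError in the traceback is ordinary Python arithmetic on a missing value. A minimal sketch, assuming a simplified stand-in for the retry calculation (the function below is hypothetical and is not Airflow's actual next_retry_datetime implementation):

```python
from datetime import datetime, timedelta, timezone


def next_retry_datetime(end_date, delay):
    """Simplified, hypothetical stand-in: retry time is end_date plus delay."""
    return end_date + delay


delay = timedelta(minutes=5)

# Normal case: a finished attempt has an end_date, so the addition works.
ok = next_retry_datetime(datetime(2026, 3, 4, 20, 0, tzinfo=timezone.utc), delay)

# Corrupted case: end_date is None, so the addition raises TypeError.
try:
    next_retry_datetime(None, delay)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
```

Here `ok` is five minutes after the given end_date, while the second call raises because `None` does not support addition with a `timedelta`.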

What's the fix

Replace the list comprehension with an explicit loop that catches exceptions per DagRun, logs the error with full traceback, and continues with the remaining DagRuns, so the scheduler also stays up to process future DagRuns.
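The shape of such a loop can be sketched as follows. This is an illustration of the pattern under stated assumptions, not the code in this PR: `schedule_all`, `schedule_one`, and the logger name are hypothetical, and the real method's signature and return type may differ.

```python
import logging

log = logging.getLogger("scheduler_sketch")  # hypothetical logger name


def schedule_all(dag_runs, schedule_one):
    """Call schedule_one for each item, isolating failures per item."""
    results = []
    for dag_run in dag_runs:
        try:
            results.append((dag_run, schedule_one(dag_run)))
        except Exception:
            # log.exception records the message plus the full traceback.
            log.exception("Error scheduling DagRun %s; skipping it", dag_run)
    return results


def flaky(run_id):
    """Hypothetical per-DagRun call that fails for one input."""
    if run_id == "bad":
        raise TypeError("simulated failure in one DagRun")
    return run_id.upper()


results = schedule_all(["a", "bad", "b"], flaky)
```

Catching `Exception` rather than `BaseException` keeps `KeyboardInterrupt` and `SystemExit` propagating, so the process can still be shut down normally while a single bad item is only logged and skipped.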


Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)

Generated-by: [Claude Opus 4.6] following the guidelines



@XD-DENG XD-DENG requested a review from ashb as a code owner March 4, 2026 21:34
@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Mar 4, 2026
@XD-DENG XD-DENG added the affected_version:3.1 Issues Reported for 3.1 label Mar 4, 2026
@XD-DENG XD-DENG force-pushed the isolate-dag-run-scheduling-failures branch from cb28e4c to 8ae4c23 Compare March 4, 2026 23:38
XD-DENG added 2 commits March 6, 2026 11:52
# What's the issue
Previously, `_schedule_all_dag_runs()` used a list comprehension to process all DagRuns. If any `_schedule_dag_run()` raised an exception for any single dag run, the entire comprehension would abort, no other DagRun would be processed, and the exception would propagate up to crash the scheduler process — **stopping scheduling for ALL DAGs**.

# How to reproduce
While the specific scenario used to reproduce this (a TaskInstance with `state=UP_FOR_RETRY` and `end_date=NULL`) is nearly impossible under normal operation, the lack of **per-dag-run fault isolation** means ANY unexpected exception from ANY DagRun would have the same fatal effect.

# What's the fix
Replace the list comprehension with an explicit loop that catches exceptions per DagRun, logs the error with full traceback, and continues processing the remaining DagRuns.
@XD-DENG XD-DENG force-pushed the isolate-dag-run-scheduling-failures branch from 7d1a8b2 to c915fd5 Compare March 6, 2026 19:52