DagProcessor pod crashes regularly on OpenShift due to stale file handle #69321

Description

@dabla

Under which category would you file this issue?

Airflow Core

Apache Airflow version

3.2.2

What happened and how to reproduce it?

On OpenShift we see frequent restarts of the DagProcessor pod.

The previous logs of the crashed pod clearly show why it was restarted:

2026-07-03T08:59:25.767332Z [info     ] Process exited                 [supervisor] exit_code=<Negsignal.SIGTERM: -15> loc=supervisor.py:859 pid=11928 signal_sent=SIGTERM
2026-07-03T08:59:25.843704Z [info     ] Waiting up to 5 seconds for processes to exit... [airflow.utils.process_utils] loc=process_utils.py:308
Traceback (most recent call last):
  File "/usr/local/sbin/airflow", line 10, in <module>
    sys.exit(main())
             ~~~~^^
  File "/usr/local/lib/python3.13/site-packages/airflow/__main__.py", line 55, in main
    args.func(args)
    ~~~~~~~~~^^^^^^
  File "/usr/local/lib/python3.13/site-packages/airflow/cli/cli_config.py", line 49, in command
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.13/site-packages/airflow/utils/memray_utils.py", line 60, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.13/site-packages/airflow/utils/cli.py", line 113, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.13/site-packages/airflow/utils/providers_configuration_loader.py", line 54, in wrapped_function
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.13/site-packages/airflow/cli/commands/dag_processor_command.py", line 64, in dag_processor
    run_command_with_daemon_option(
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        args=args,
        ^^^^^^^^^^
    ...<2 lines>...
        should_setup_logging=True,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/usr/local/lib/python3.13/site-packages/airflow/cli/commands/daemon_utils.py", line 86, in run_command_with_daemon_option
    callback()
    ~~~~~~~~^^
  File "/usr/local/lib/python3.13/site-packages/airflow/cli/commands/dag_processor_command.py", line 67, in <lambda>
    callback=lambda: run_job(job=job_runner.job, execute_callable=job_runner._execute),
                     ~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/airflow/utils/session.py", line 100, in wrapper
    return func(*args, session=session, **kwargs)  # type: ignore[arg-type]
  File "/usr/local/lib/python3.13/site-packages/airflow/jobs/job.py", line 355, in run_job
    return execute_job(job, execute_callable=execute_callable)
  File "/usr/local/lib/python3.13/site-packages/airflow/jobs/job.py", line 384, in execute_job
    ret = execute_callable()
  File "/usr/local/lib/python3.13/site-packages/airflow/jobs/dag_processor_job_runner.py", line 61, in _execute
    self.processor.run()
    ~~~~~~~~~~~~~~~~~~^^
  File "/usr/local/lib/python3.13/site-packages/airflow/dag_processing/manager.py", line 334, in run
    return self._run_parsing_loop()
           ~~~~~~~~~~~~~~~~~~~~~~^^
  File "/usr/local/lib/python3.13/site-packages/airflow/dag_processing/manager.py", line 453, in _run_parsing_loop
    self._collect_results()
    ~~~~~~~~~~~~~~~~~~~~~^^
  File "/usr/local/lib/python3.13/site-packages/airflow/utils/session.py", line 100, in wrapper
    return func(*args, session=session, **kwargs)  # type: ignore[arg-type]
  File "/usr/local/lib/python3.13/site-packages/airflow/dag_processing/manager.py", line 979, in _collect_results
    processor.logger_filehandle.close()
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
OSError: [Errno 116] Stale file handle

What you think should happen instead?

The OSError should be caught when the file handle is stale and logged as a warning instead of being propagated. That would keep the pod from crashing and being restarted each time.
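
A minimal sketch of the handling I have in mind, assuming the fix can wrap the `processor.logger_filehandle.close()` call from the traceback. `close_quietly` is a hypothetical helper name used only for illustration, not an existing Airflow function, and the real change in `_collect_results` may look different. Only `errno.ESTALE` is swallowed; any other OSError is still raised.

```python
import errno
import logging

log = logging.getLogger(__name__)


def close_quietly(filehandle, log=log):
    """Close a file handle, tolerating a stale handle (ESTALE).

    Hypothetical helper for illustration only.
    Returns True if close() succeeded, False if ESTALE was swallowed.
    """
    try:
        filehandle.close()
    except OSError as e:
        # Re-raise anything that is not a stale file handle.
        if e.errno != errno.ESTALE:
            raise
        log.warning("Ignoring stale file handle while closing %r: %s", filehandle, e)
        return False
    return True
```

In the manager this would replace the bare `processor.logger_filehandle.close()` with `close_quietly(processor.logger_filehandle)`, so a stale handle on the log volume produces a warning rather than ending the parsing loop.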

Operating System

Red Hat Fedora 5.3

Deployment

Official Apache Airflow Helm Chart

Apache Airflow Provider(s)

No response

Versions of Apache Airflow Providers

No response

Official Helm Chart version

1.22.0 (latest released)

Kubernetes Version

v1.29.14+41c4e9b

Helm Chart configuration

No response

Docker Image customizations

No response

Anything else?

The crash occurs multiple times a day.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Labels

area:core, kind:bug, needs-triage
