
Conversation

@jscheffl
Contributor

@jscheffl jscheffl commented Jan 3, 2026

While we had a patch in Edge for Airflow 2 by @majorosdonat in PR #43927, I noticed while attempting to implement a better event loop in Edge that the CTRL-C handling was not correct in Airflow 3.

In order not to delay a fix, this PR applies the same behavior change to Airflow 3 as the previous PR did:
If an interrupt is sent to the worker in the console via CTRL-C or via kill $EDGE_WORKER_PID, the signal is not forwarded to the child processes (which would kill the task execution); instead the worker shuts down gracefully, waiting for pending jobs to complete without picking up new ones.
This prevents the (current) effect that tasks are simply killed and left in failed state, and it also allows a graceful shutdown in case of a rolling release.

I am not sure why the same method that was applied in Airflow 2 is not working in Airflow 3 (probably because of supervisor handling...), but via os.setpgrp() it is working.
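
As an illustration of the pattern (a minimal sketch with hypothetical names, not the actual Edge worker code): the worker can start each job in its own process group and handle the interrupt itself:

    import os
    import signal
    import subprocess

    shutdown_requested = False

    def request_graceful_shutdown(signum, frame):
        # Stop picking up new jobs; let the running ones finish.
        global shutdown_requested
        shutdown_requested = True

    # The worker reacts to CTRL-C / SIGTERM itself instead of dying.
    signal.signal(signal.SIGINT, request_graceful_shutdown)
    signal.signal(signal.SIGTERM, request_graceful_shutdown)

    def start_job(cmd: list[str]) -> subprocess.Popen:
        # os.setpgrp() runs in the child between fork() and exec() and
        # puts it into its own process group, so a CTRL-C (SIGINT sent
        # to the terminal's foreground process group) reaches only the
        # worker, not the running tasks.
        return subprocess.Popen(cmd, preexec_fn=os.setpgrp)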

Also fixed the way the killing via SIGTERM is handled: it was formerly sending the SIGTERM to the process group (not the process) while using the PID and not the needed PGID... which would have ended in an endless loop of the worker signalling itself. So fixing this as well.
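
A rough sketch of the corrected call (hypothetical helper, for illustration only): sending SIGTERM to a process group needs os.killpg with the PGID, and the worker must make sure it does not signal its own group:

    import os
    import signal

    def terminate_job(child_pid: int) -> None:
        pgid = os.getpgid(child_pid)
        if pgid != os.getpgid(0):
            # Child lives in its own group: terminate the whole group.
            os.killpg(pgid, signal.SIGTERM)
        else:
            # Same group as the worker: signal only the child process,
            # otherwise the worker would endlessly signal itself.
            os.kill(child_pid, signal.SIGTERM)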

@boring-cyborg boring-cyborg bot added area:providers provider:edge Edge Executor / Worker (AIP-69) labels Jan 3, 2026
@jscheffl jscheffl added the type:bug-fix Changelog: Bug Fixes label Jan 3, 2026
Member

@potiuk potiuk left a comment


Nice. Indeed - but I think it needs a few tweaks.

Yep. Starting a new process group when your process is handling signal propagation is the best way to handle both regular kills and the process-group kills from the terminal (with Ctrl-C).

But I think we used to have a process group created when we started processes in Airflow 2 in TaskRunner, and there is a clear indication of it because we have a (currently unused!) set_new_process_group in process_utils - which is better to use, because it also handles the edge case when your process is a session leader.

And we were in fact using it in the past to properly handle signal propagation. This is the 2.11 code of standard_task_runner:

    def _start_by_fork(self):
        pid = os.fork()
        if pid:
            self.log.info("Started process %d to run task", pid)
            return psutil.Process(pid)
        else:
            from airflow.api_internal.internal_api_call import InternalApiConfig
            from airflow.configuration import conf

            if conf.getboolean("core", "database_access_isolation", fallback=False):
                InternalApiConfig.set_use_internal_api("Forked task runner")
            # Start a new process group
            set_new_process_group()
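
For reference, a minimal sketch of what such a helper does (the actual set_new_process_group in airflow.utils.process_utils may differ in detail): it skips the session-leader case, because a session leader cannot move itself to a new process group:

    import os

    def set_new_process_group() -> None:
        if os.getpid() == os.getsid(0):
            # The process is a session leader; it cannot change its
            # process group, so do nothing instead of failing.
            return
        os.setpgid(0, 0)  # make this process the leader of a new group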

Was it a deliberate change (@ashb ? @amoghrajesh @kaxil ?) to stop using it? Because if not, then a possibly better way would be to bring it back and use it in the supervisor - this way we could simply propagate signals from the supervisor to the supervised processes. This is a "classic" and IMHO the best way of handling signals - especially as it makes Ctrl+C behaviour consistent with simply sending a signal.

This would also fix the behaviour of running our processes in our images. We are - deliberately - using dumb-init's default behaviour of propagating signals to the process group (except for the celery worker).

This is described here: https://airflow.apache.org/docs/docker-stack/entrypoint.html#signal-propagation

Previously, we had the new process group creation and signal forwarding happening.

But if we are not doing it today, then it might have unintended side effects due to race conditions - because all the forked processes will immediately get the signal - and in case the supervisor (like the edge main process) wants to do some cleanup / preparation before the child processes get the signals, there is no way to control who gets the signal first.

That's why I think the most reasonable approach is to come back to always creating a new process group for the supervisor's children and simply propagating all signals deliberately in the shutdown handler of the supervisor. (Which does not prevent edge from doing the same - I think it's a good idea to implement it in the edge provider as well; those things stack very well.) I think reviewing this PR is a good opportunity to agree on the right behaviour for the supervisor as well. A sketch of the idea follows below.
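
A minimal sketch of such deliberate propagation (hypothetical supervisor code, for illustration): the supervisor runs its own preparation first and only then forwards the signal to each child's process group, so there is no race over who sees the signal first:

    import os
    import signal

    child_pids: list[int] = []  # supervised PIDs, each in its own group

    def forward_signal(signum, frame):
        # Supervisor-side cleanup / preparation happens here first...
        for pid in child_pids:
            try:
                os.killpg(os.getpgid(pid), signum)
            except ProcessLookupError:
                pass  # child already exited

    signal.signal(signal.SIGTERM, forward_signal)
    signal.signal(signal.SIGINT, forward_signal)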

@jscheffl jscheffl requested review from ashb and kaxil January 6, 2026 21:02
@ashb
Member

ashb commented Jan 8, 2026

Was it a deliberate change (@ashb ? @amoghrajesh @kaxil ?) to stop using it? Because if not, then a possibly better way would be to bring it back and use it in the supervisor - this way we could simply propagate signals from the supervisor to the supervised processes. This is a "classic" and IMHO the best way of handling signals - especially as it makes Ctrl+C behaviour consistent with simply sending a signal.

Oh no, adding the process group back in was a TODO that I forgot to record anywhere 🤦🏻
