
Conversation

@jscheffl
Contributor

@jscheffl jscheffl commented Jan 3, 2026

While we had a patch in Edge for Airflow 2 by @majorosdonat in PR #43927, I noticed while attempting to implement a better event loop in Edge that the CTRL-C handling was not correct in Airflow 3.

In order not to delay a fix, this PR applies the same behavior change to Airflow 3 as the previous PR did:
If an interrupt is sent to the worker in the console via CTRL-C or via kill $EDGE_WORKER_PID, the signal is not forwarded to the child processes (which would kill the task execution); instead the worker shuts down gracefully, waiting for pending jobs to complete without picking up new ones.
This prevents the (current) effect that tasks are simply killed and left in failed state, and it also allows a graceful shutdown in case of a rolling release.

I am not sure why the same method that was applied in Airflow 2 is not working in Airflow 3 (probably because of supervisor handling...), but via os.setpgrp() it is working.
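
As an illustration of the pattern (a minimal sketch with hypothetical names, not the actual Edge worker code): the worker can start each job in its own process group and handle the interrupt itself:

    import os
    import signal
    import subprocess

    shutdown_requested = False

    def request_graceful_shutdown(signum, frame):
        # Stop picking up new jobs; let the running ones finish.
        global shutdown_requested
        shutdown_requested = True

    # The worker reacts to CTRL-C / SIGTERM itself instead of dying.
    signal.signal(signal.SIGINT, request_graceful_shutdown)
    signal.signal(signal.SIGTERM, request_graceful_shutdown)

    def start_job(cmd: list[str]) -> subprocess.Popen:
        # os.setpgrp() runs in the child between fork() and exec() and
        # puts it into its own process group, so a CTRL-C (SIGINT sent
        # to the terminal's foreground process group) reaches only the
        # worker, not the running tasks.
        return subprocess.Popen(cmd, preexec_fn=os.setpgrp)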

Also fixed the way the killing via SIGTERM is handled: it was formerly sending the SIGTERM to the process group (not the process) while using the PID and not the needed PGID... which would have ended in an endless loop of the worker signalling itself. So fixing this as well.
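
A rough sketch of the corrected call (hypothetical helper, for illustration only): sending SIGTERM to a process group needs os.killpg with the PGID, and the worker must make sure it does not signal its own group:

    import os
    import signal

    def terminate_job(child_pid: int) -> None:
        pgid = os.getpgid(child_pid)
        if pgid != os.getpgid(0):
            # Child lives in its own group: terminate the whole group.
            os.killpg(pgid, signal.SIGTERM)
        else:
            # Same group as the worker: signal only the child process,
            # otherwise the worker would endlessly signal itself.
            os.kill(child_pid, signal.SIGTERM)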

@boring-cyborg boring-cyborg bot added area:providers provider:edge Edge Executor / Worker (AIP-69) labels Jan 3, 2026
@jscheffl jscheffl added the type:bug-fix Changelog: Bug Fixes label Jan 3, 2026
Member

@potiuk potiuk left a comment


Nice. Indeed - but I think it needs a few tweaks.

Yep. Starting a new process group when your process is handling signal propagation is the best way to handle both regular kills and the process-group kills from the terminal (with Ctrl-C).

But I think we used to have a process group created when we started processes in Airflow 2 in TaskRunner, and there is a clear indication of it because we have a (currently unused!) set_new_process_group in process_utils - which is better to use, because it also handles the edge case when your process is a session leader.

And we were in fact using it in the past to properly handle signal propagation. This is the 2.11 code of standard_task_runner:

    def _start_by_fork(self):
        pid = os.fork()
        if pid:
            self.log.info("Started process %d to run task", pid)
            return psutil.Process(pid)
        else:
            from airflow.api_internal.internal_api_call import InternalApiConfig
            from airflow.configuration import conf

            if conf.getboolean("core", "database_access_isolation", fallback=False):
                InternalApiConfig.set_use_internal_api("Forked task runner")
            # Start a new process group
            set_new_process_group()
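
For reference, a minimal sketch of what such a helper does (the actual set_new_process_group in airflow.utils.process_utils may differ in detail): it skips the session-leader case, because a session leader cannot move itself to a new process group:

    import os

    def set_new_process_group() -> None:
        if os.getpid() == os.getsid(0):
            # The process is a session leader; it cannot change its
            # process group, so do nothing instead of failing.
            return
        os.setpgid(0, 0)  # make this process the leader of a new group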

Was it a deliberate change (@ashb ? @amoghrajesh @kaxil ?) to stop using it? Because if not, then a possibly better way would be to bring it back and use it in the supervisor - this way we could simply propagate signals from the supervisor to the supervised processes. This is a "classic" and IMHO the best way of handling signals - especially as it makes Ctrl+C behaviour consistent with simply sending a signal.

This would also fix the behaviour of running our processes in our images. We are - deliberately - using dumb-init's default behaviour of propagating signals to the process group (except for the celery worker).

This is described here: https://airflow.apache.org/docs/docker-stack/entrypoint.html#signal-propagation

Previously, we had the new process group creation and signal forwarding happening.

But if we are not doing it today, then it might have unintended side effects due to race conditions - because all the forked processes will immediately get the signal - and in case the supervisor (like the edge main process) wants to do some cleanup / preparation before the child processes get the signals, there is no way to control who gets the signal first.

That's why I think the most reasonable approach is to come back to always creating a new process group for the supervisor's children and simply propagating all signals deliberately in the shutdown handler of the supervisor. (Which does not prevent edge from doing the same - I think it's a good idea to implement it in the edge provider as well; those things stack very well.) I think reviewing this PR is a good opportunity to agree on the right behaviour for the supervisor as well. A sketch of the idea follows below.
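
A minimal sketch of such deliberate propagation (hypothetical supervisor code, for illustration): the supervisor runs its own preparation first and only then forwards the signal to each child's process group, so there is no race over who sees the signal first:

    import os
    import signal

    child_pids: list[int] = []  # supervised PIDs, each in its own group

    def forward_signal(signum, frame):
        # Supervisor-side cleanup / preparation happens here first...
        for pid in child_pids:
            try:
                os.killpg(os.getpgid(pid), signum)
            except ProcessLookupError:
                pass  # child already exited

    signal.signal(signal.SIGTERM, forward_signal)
    signal.signal(signal.SIGINT, forward_signal)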

@jscheffl jscheffl requested review from ashb and kaxil January 6, 2026 21:02
@ashb
Member

ashb commented Jan 8, 2026

Was it a deliberate change (@ashb ? @amoghrajesh @kaxil ?) to stop using it? Because if not, then a possibly better way would be to bring it back and use it in the supervisor - this way we could simply propagate signals from the supervisor to the supervised processes. This is a "classic" and IMHO the best way of handling signals - especially as it makes Ctrl+C behaviour consistent with simply sending a signal.

Oh no, adding the process group back in was a TODO that I forgot to record anywhere 🤦🏻
