-
Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

2.5.3

What happened?

Is this the intended behaviour? It seems that Airflow clones the task group object, which differs from regular Python semantics and can lead to hidden bugs. I got bitten by it and spent a good few hours before realizing. I can try newer Airflow versions too, if requested.

```python
import random

from airflow import DAG
from airflow.decorators import task
from airflow.utils.task_group import TaskGroup


class RunnerTaskGroup(TaskGroup):
    def __init__(self, affinity: str):
        super().__init__(group_id=affinity)  # TaskGroup requires a group_id
        self.rand_attribute = True

        @task(task_group=self)
        def some_task():
            if random.random() < 0.5:
                self.rand_attribute = False  # 50% chance to set this to False

        some_task()  # instantiate the task inside this group


# .... in main dag .....
affinities: list[str] = ["""...some list where each item will create a TaskGroup..."""]

with DAG(dag_id="some_dag"):

    @task
    def some_task(_groups: list[TaskGroup]):
        for group in _groups:
            print(f"{group.rand_attribute=}")
            # group.rand_attribute=True <- always True, regardless of how many
            # times I run the DAG, and for all the inner TaskGroups

    groups = []
    for affinity in affinities:
        runner_tg = RunnerTaskGroup(affinity)
        groups.append(runner_tg)

    groups >> some_task(groups)
```

What you think should happen instead?

The same TaskGroup object should be maintained across task group execution and main DAG execution. If these were regular Python classes, that is how it would work.

How to reproduce

Run the code above.

Operating System

Ubuntu 22.04.3 LTS

Versions of Apache Airflow Providers

No response

Deployment

Docker-Compose

Deployment details

No response

Anything else?

No response
-
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise a PR to address this issue, please do so; there is no need to wait for approval.
-
If I understand your question correctly, the key is Airflow's architecture. Airflow is, by definition, a distributed task execution engine. Every task is the result of parsing the same DAG file in a completely different Python interpreter (potentially forked multiple times), and potentially on a completely different machine. Each interpreter runs separately; they do not communicate with each other or share any memory for the "parsed" DAG objects. So yes, things behave differently than running and parsing this single Python file in a single interpreter, if that is what you expected. You can read more about Airflow's architecture here: https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/overview.html

BTW, if you are not sure whether you have an Airflow issue, please create a discussion instead. Discussions are better suited to asking whether you have understood something wrongly or whether something is a real issue in Airflow, and they let you get feedback before you decide to create an issue. I have converted your issue to a discussion now, but please bear this in mind for the future.
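To make this concrete, here is a minimal plain-Python sketch (no Airflow involved; the names are illustrative) of why a mutation made in a child process never reaches the parent:

```python
import multiprocessing


class Holder:
    def __init__(self):
        self.flag = True


def mutate(holder: Holder) -> None:
    # Runs in a child process, which received a *copy* of the object
    # (pickled under "spawn", copy-on-write under "fork").
    holder.flag = False


if __name__ == "__main__":
    h = Holder()
    p = multiprocessing.Process(target=mutate, args=(h,))
    p.start()
    p.join()
    print(h.flag)  # True -- the child's mutation never reaches the parent
```

Airflow task execution behaves analogously, except that the "child" may be on a different machine entirely and re-parses the DAG file from scratch.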
-
Alright, fair enough about the objects being different due to the distributed nature, with the task group being replicated; that makes total sense (even on a local machine, where multiprocessing is presumably used to run the task groups and tasks). My issue was that the main Airflow process never receives the updated attributes back from the task group after execution, so the object state is never synced.

My proposal is that either some sort of implicit XCom should happen behind the scenes, or a warning/exception should be thrown when accessing such attributes from the main DAG after execution. A third option would be to disallow creating and updating attributes on the task group object.

In a larger codebase this kind of behavior can become a footgun that is easy to miss in reviews (i.e. you assume that the main DAG has access to tg.attribute and that it is updated during execution, when in fact it is not the same attribute at all: it was copied when the process was created and keeps its default value, as there is no sync).

PS: I updated the code a bit to remove some bloat, to better show the potential issue.
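For completeness, the explicit form of that "implicit XCom" already exists: return the value from one task and pass it to the next. A minimal sketch, assuming Airflow 2.x TaskFlow (the dag_id and task names are illustrative, not from the original report):

```python
import random

from airflow import DAG
from airflow.decorators import task

with DAG(dag_id="xcom_flag_example"):

    @task
    def compute_flag() -> bool:
        # Return the value instead of mutating an attribute; Airflow
        # ships the return value between processes via XCom.
        return random.random() >= 0.5

    @task
    def consume_flag(flag: bool) -> None:
        print(f"{flag=}")  # reflects what compute_flag actually computed

    consume_flag(compute_flag())
```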
If you have an idea how to do it, then of course you can try to implement something like that. But remember that this is a distributed system, and in such a system you absolutely cannot rely on everything being done in a single interpreter.
I think what you propose goes against Airflow's basic architectural assumptions. For modifying objects in a way that can be applied consistently across different interpreters, Airflow has cluster policies (look them up in the documentation).
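For reference, a minimal sketch of such a cluster policy, defined in an airflow_local_settings.py that is importable by every Airflow component (the retry rule here is just an illustrative mutation, not part of this thread):

```python
# airflow_local_settings.py
from airflow.models.baseoperator import BaseOperator


def task_policy(task: BaseOperator) -> None:
    # Airflow calls this hook for every task at DAG parse time, in every
    # interpreter that parses the file, so the mutation is applied
    # consistently across all components.
    if task.retries < 1:
        task.retries = 1  # illustrative: enforce at least one retry
```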