Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tasks fail if part of the outputs (but not all) already exist. #105

Open
meliache opened this issue May 2, 2021 · 2 comments
Open

Tasks fail if part of the outputs (but not all) already exist. #105

meliache opened this issue May 2, 2021 · 2 comments
Labels
help wanted Extra attention is needed

Comments

@meliache
Copy link
Collaborator

meliache commented May 2, 2021

When task (with b2luigi.on_temporary_files has outputs A and B, and A is missing but B exists, the moving of the temporary files after the task had been successful fails if one of the outputs already exists.

Due to the nature of on_temporary_files, this doesn't happen naturally when running tasks because the final outputs can only exist if the task had been successful and thus all outputs should be there. However, this can happen when changing output definitions or when deleting some of the outputs but not others manually.

I solved this in my tasks by extending the process method with a clean-up step, but maybe this is something we want to fix in the moving of temporary files.

Or is this a behaviour a feature and we want the users to be explicitwhat to do with existing data to prevent the loss of it? After all, luigi itself does not clean-up of outputs either. Anyway the user uses a temporary wrapper, if they force-writes the output in the process-method, this will not help because he will only force-write the temporary file location, so I think we should fix that...

@meliache
Copy link
Collaborator Author

meliache commented May 2, 2021

Okay I just checked and the copying of outputs after success is not done by b2luigi but by the luigi temporary_path function, in particular the exception that the target exists is called in the function rename_dont_move.

So this seems to be an idea or conscious design-design from luigi. Does anyone experienced with luigi have an idea why they implemented it that way? Just wanted to ask here before opening a luigi issue/PR.

@meliache meliache added the help wanted Extra attention is needed label May 2, 2021
@nils-braun
Copy link
Owner

Luigi Tasks should not have multiple output files in theory. That is against the atomicity and idempotency of tasks (two properties you normally aim for in workflow managers). So yes, it is a design decision.

Typically, you would use a _SUCCESS file if this is you use-case after all files have been written (and only depend on it).

However, in practice these things do occur and fixing it seems like a plausible solution. Just wanted to give some background.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

2 participants