
Fix handling contents added after header creation #283

Open · wants to merge 1 commit into base: main

Conversation

@elbe0046 commented Mar 3, 2022

Addresses #282.

To solve the issue, we limit the amount of data that can be read from the file to whatever was available at the time the Header was created.
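The capping idea can be sketched in isolation with std-only Rust. `read_capped` below is a hypothetical helper (not part of this crate) that uses `io::Read::take` to stop reading at the length witnessed when the header was created, even if the underlying stream has since grown:

```rust
use std::io::{Cursor, Read};

/// Read at most `header_size` bytes from `data`, mirroring the PR's idea of
/// limiting reads to the length recorded at Header-creation time.
fn read_capped(data: &[u8], header_size: u64) -> Vec<u8> {
    let mut buf = Vec::new();
    Cursor::new(data)
        .take(header_size) // caps the number of readable bytes
        .read_to_end(&mut buf)
        .expect("reading from an in-memory cursor cannot fail");
    buf
}

fn main() {
    // The stream grew after the size (11) was recorded in the header.
    let archived = read_capped(b"hello world, plus bytes appended later", 11);
    assert_eq!(archived, b"hello world");
    println!("archived {} bytes", archived.len());
}
```

Note that `take` also returns fewer bytes if the stream shrank below `header_size`, so it guards against over-reads but not truncation, which is exactly the asymmetry discussed later in this thread.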

@alexcrichton (Owner)

Thanks for the PR, but I think this is working as intended. As the documentation mentions, if the I/O stream is not the same size as the header says, then the archive will be corrupted. I don't think it's correct for this crate to use .take(); instead, that seems like something better done at the application layer if the input stream changes size after you originally witness it.

@elbe0046 closed this Mar 3, 2022
@posborne commented Mar 3, 2022

@alexcrichton I think the difficulty with handling this at the application layer (at least easily) is that it makes convenience methods like append_dir_all hard to use. The use case where we encountered these issues is one I expect is not uncommon: providing an endpoint that streams an archive of log files from a directory that may be receiving frequent writes.

Do you see another layer in this crate that would be appropriate to handle this concern, or does the functionality from append_dir_all need to be recreated elsewhere to handle this case? We are OK going that route, but I expect there are a number of users of this crate who may, intermittently, end up with corrupted archives without some change.

@alexcrichton (Owner)

Oh sorry, I can generalize what I'm thinking a bit from "application layer" to "callers of this function". This is almost surely a bug in append_dir_all and it's fine to fix it there; I just don't think it should be fixed in append, which takes a generic input stream.

@elbe0046 (Author) commented Mar 4, 2022

@alexcrichton your point about append being the wrong layer for the fix definitely makes sense to me, thanks for that explanation! You preferred the fix to be applied in append_dir_all, but these changes sit one layer lower than that: the header's size is now enforced within append_path_with_name and append_file. Does this seem reasonable to you? I may be missing a more obvious solution you had in mind.

With the new approach I'm not seeing a good way to reliably get test coverage on this edge case, unfortunately. But I'm of course open to any guidance on this as well.

@alexcrichton (Owner)

Yeah, this is where I roughly expected a fix to go, if any. One thing this makes me wonder about, though, is not only files being extended but also files being truncated, as presumably that would also cause issues when archiving concurrently modified files?

I have little-to-no experience working with a filesystem while it's being concurrently modified, so I don't know what to do in situations like this. One possible fix would be to open the file and then get the length from the open file's metadata, but I don't know whether the file 'snapshots' like that: if the file is appended to, are the changes reflected live on the file descriptor that was just opened?

@elbe0046 force-pushed the fix-contents-added-after-creation branch from ab642af to 9edb4c2 on March 4, 2022
@posborne commented Mar 4, 2022

> Yeah, this is where I roughly expected a fix to go, if any. One thing this makes me wonder about, though, is not only files being extended but also files being truncated, as presumably that would also cause issues when archiving concurrently modified files?

I don't think there is going to be a great way to handle cases like truncation or writes at varying offsets in a file. For typical log rotation I don't believe that move and remove operations will present a problem, as the kernel allows access through already-open fds to continue and only reclaims the file once those fds are closed. Processes that know they may be working on a shared file concurrently might use file locking, but that isn't the situation we are in.

Even cp has no feature to copy a snapshot. In the case of a truncation, you'll just end up with the contents up until the point of truncation. Of course cp isn't writing the size of the dst file ahead of time in the same way as we are here.

I think the proposed change is probably still worthwhile to avoid corruption in a large number of cases, even if it is not able to eliminate the (possibly intractable) problem entirely.

@alexcrichton (Owner)

Personally I'm hesitant to give the impression that everything in this crate works with concurrent filesystem interactions, because basically nothing has been written with that in mind. While things could be restructured in this case to work for the precise situation where files are only appended to, that's hard to explain in the code, where there would theoretically be a comment saying "we handle the file-extension case but the file-truncation case continues to produce a corrupt tarball".

Sorry to change my mind, but now I'm sort of thinking this belongs externally. The goal of this crate is that all the helper functions are layered on top of one another, so it's OK to call whichever layer you need; if you find it difficult to build append_dir_all or similar externally, then this crate could perhaps grow configuration options to make that easier.
