
Add support for unpacking using custom Unpacker API #355

Open — wants to merge 1 commit into `main`
Conversation

alessandrod

Hello!

This PR proposes introducing a new `Unpacker` trait to the API, and new `unpack_with` methods on `Archive` and `Entry`. The aim is to give users a way to use specialized filesystem and I/O operations to speed up unpacking.

Background: I work on a program that unpacks a very large .tar.zst (60G compressed, 188G uncompressed) at launch. Using tar-rs main, which uses regular fs ops and `io::copy`, it takes about 4m30s to unpack. Using io_uring on Linux, with pre-allocated huge page buffers etc., it takes about 1m30s, decompressing at ~1.8GB/s, which is the maximum zstd can do.

This PR introduces the API I need to plumb io_uring, without it being specific to io_uring.

This introduces a new `Unpacker` trait to the API, and new `unpack_with` methods to `Archive` and `Entry`.

Custom unpackers can be used to abstract/optimize IO and filesystem operations, for example by using io_uring on Linux. The current/default unpacking code is moved to `DefaultUnpacker`.
@cgwalters
Collaborator

The high level goal makes total sense. That said though, is your software already doing decompression in a separate thread from the one doing unpacking?

If you just implement this software in the obvious way of putting a zstd decoder reader in front of `tar::Archive`, then a lot of parallelism is lost versus e.g. creating a Unix pipe (or equivalent) and two threads.

I'm curious, is your program also processing multiple entries asynchronously? IOW when you have multiple regular files in a row, you're writing them to disk simultaneously?

This all said...this is a big PR, and it's not clear to me it's really a win versus just directly processing the entries externally.

Or maybe instead of this PR, what you could do is maintain e.g. a tokio-tar-uring crate (that's a whole other thing in this, there's also https://docs.rs/tokio-tar/latest/tokio_tar/ and many people doing stuff like this want to do async too).

@alessandrod
Author

Thanks for taking a look! :)

> The high level goal makes total sense. That said though, is your software already doing decompression in a separate thread from the one doing unpacking?
>
> If you just implement this software in the obvious way of putting a zstd decoder reader in front of tar::Archive, then a lot of parallelism is lost versus e.g. creating a Unix pipe or equivalent and two threads.
>
> I'm curious, is your program also processing multiple entries asynchronously? IOW when you have multiple regular files in a row, you're writing them to disk simultaneously?

Yes I'm writing multiple entries asynchronously, and I'm doing a bunch of other stat/link etc syscalls asynchronously with io-uring too. In my case the tarball is large and contains many entries (~2 million files), so syscall and fs overhead are significant.

I want to do writes and all other syscalls as fast as possible.

I have profiled what I'm doing extensively, wrote a custom profiler for it, and gave a talk about it at a Rust AU meetup a while ago, if you're interested: https://www.youtube.com/watch?v=JX0aVnpHomk.

> This all said...this is a big PR, and it's not clear to me it's really a win versus just directly processing the entries externally.

Not entirely sure I understand what you mean? Externally how?

> Or maybe instead of this PR, what you could do is maintain e.g. a tokio-tar-uring crate (that's a whole other thing in this, there's also https://docs.rs/tokio-tar/latest/tokio_tar/ and many people doing stuff like this want to do async too).

Sure I can fork, but I'd rather not fork if I can avoid it.

@cgwalters
Collaborator

cgwalters commented Jun 2, 2024 via email

@alessandrod
Author

alessandrod commented Jun 2, 2024

> First, bear in mind I have no “official” maintainer role on this repository. I just commented because I have other outstanding PRs here and I try to help.

I figured, all help is welcome :)

> > This all said...this is a big PR, and it's not clear to me it's really a win versus just directly processing the entries externally.
>
> > Not entirely sure I understand what you mean? Externally how?
>
> Maybe I am misunderstanding, but it seems like you are changing the unpacking code here to have it drive a virtual file system API, effectively. But you can just use the existing ability to read the data directly from the tar archive and have your own code do the unpacking, right? The cost here would just be needing to handle some of the “normalization” (especially security related) that the code here does. That said, IMO since you are only targeting Linux you can use openat2 RESOLVE_BENEATH anyways.

It's not "just" normalization though, `unpack()` does many things: https://github.com/alexcrichton/tar-rs/blob/main/src/entry.rs#L462

It preserves mtimes/ownership, handles tar extensions, etc. Sure, I can do it in my code, but I'm trying to avoid having to do exactly that, and to avoid everyone who wants to extract large tarballs efficiently having to do it too.

> > Sure I can fork, but I'd rather not fork if I can avoid it.
>
> Yeah, of course. That said maybe what we really need in the ecosystem is a “sans-io” tar crate.

Yeah - isn't this PR a step in that direction? There's a default IO implementation, or you can bring your own via the Unpacker trait. I did call it unpack_with_io in an earlier revision 😂

My view is this: this crate is very popular and it does do IO today. It works well for reasonably large files. It doesn't work so well for very large files, and there's nothing a consumer of the API can do about it other than forking and becoming a tar expert. We could create another sans-io crate, but that wouldn't fix the issue that unpacking large files with this crate is not great, and that seems like a fixable thing not something that justifies a whole new crate.

If we had something like the `Unpacker` trait in this PR, then I would be happy to maintain the fast io-uring code in a separate crate. In this crate we could add docs to `unpack()` that say something like "The default implementation works for reasonably large files. If you have very large files, consider using `unpack_with()` with your own `Unpacker`, or use the <to-be-created-tar-unpack-uring> crate on Linux". Then a user of the crate could do something about large files without having to learn anything about tar itself.

Lastly, I concede that the PR is big, but it's mostly shuffling existing nested function definitions around. I can probably reduce the diff - I didn't spend too much time trying to make the perfect PR because I was seeking feedback on the general direction. And again, I do appreciate your feedback!

@cgwalters cgwalters mentioned this pull request Jun 4, 2024
@cgwalters
Collaborator

> It preserves mtimes/ownership, handles tar extensions, etc.

The default `Entry` reader handles some tar extensions (at least GNU long links, etc.) unless you set raw mode.

As far as preserving mtimes and ownership - that's the thing, ultimately you as a consumer of this "custom unpacking api" would need to bridge between the mtimes provided and an io-uring or whatever other API anyways, right? It seems to me the main advantage of an "unpacker API" here is that the implementation would effectively be forced to handle all the tar attributes (or at least, explicitly ignore them).

As for other tar extensions like xattrs: I think we should have a nice high-level API to access them that's used by the unpacking code.

(This is very tangential but we have a different problem in that there's no API to build archives with some of those extensions)

@cgwalters
Collaborator

I wanted to update you here with a bit of feedback; see #379 (comment)

At the current time I personally think it makes more sense for us to aim towards a "sans-io" style crate, and leave unpacking/packing from the filesystem like we have today as "sugar".

This is going in a bit of a contrary direction to that, and in the end it's not doing anything that one couldn't do external to this crate today either.

Bridging between tar and io-uring is cool of course, but I think what we want here is another crate that is that bridge.

IOW, nothing stops you from publishing a "tar-event-stream" or similar crate that is basically this "unpacker callback API", right?
