
Read tag data in its entirety from AsyncFileReader. #81

Closed
wants to merge 5 commits

Conversation

@feefladder (Contributor) commented Mar 27, 2025

Why was #76 closed 👀?

Closes #70

This PR allows a caching middleware to directly estimate the size of the remaining TileOffsets and TileByteCounts arrays from the byte size of TileOffsets0; #74 demonstrates this as a proof of concept.

  • without this: first out-of-cache request will be a single u64 byte range
  • with this: first out-of-cache request will be the full TileOffsets0 array

A middleware could grow its cache exponentially, just like a Vec, but if a good estimate of the size can be made (2*range_len + 4*range_len.isqrt()), Vec::with_capacity(good_estimate) is better (see derivation.pdf).

EDIT: Actually, an exponential prefetch (exponent 2) will work quite well with this, since the byte size of all IFDs is roughly ≥ 4*TileOffsets0.isqrt()¹
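To make the idea concrete, here is a minimal sketch (the helper names are made up for illustration, not code from this PR) of how a caching middleware could turn the byte length of that first TileOffsets0 request into a single pre-allocation:

```rust
/// Hypothetical helper, not part of this PR: estimate how many more header
/// bytes will be needed after the first out-of-cache request, using the
/// 2*len + 4*sqrt(len) rule quoted above (square, unmasked bigtiff COG).
/// `u64::isqrt` is stable since Rust 1.84.
fn estimate_remaining_ifd_bytes(range_len: u64) -> u64 {
    2 * range_len + 4 * range_len.isqrt()
}

/// Pre-allocate the cache once instead of growing it exponentially.
fn preallocate_cache(first_range_len: u64) -> Vec<u8> {
    Vec::with_capacity(estimate_remaining_ifd_bytes(first_range_len) as usize)
}
```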

This PR also cleans up the tag-loading logic and cursor handling.

Footnotes

  1. The estimate of $2b_{\mathrm{offset0}}+4\sqrt{b_{\mathrm{offset0}}}$ is only based on the assumption that we are dealing with a square, unmasked, bigtiff COG. Smalltiff (classic TIFF) adds a $\frac{4}{3}$ factor, masks add a $2$ factor, and square-ness only affects the $\sqrt{}$ term, which is small in comparison.

@kylebarron (Member)

> Why was #76 closed 👀?

#76 was auto-closed because the description I wrote for #79 (#79 (comment)) included the words "I think this probably also closes #76".

@feefladder (Contributor, Author)

ah ok! This should be good for review now...

@kylebarron (Member)

> directly estimate the remaining TileOffsets and TileByteCounts arrays from the byte size of TileOffsets0

This is a horribly leaky abstraction because you're trying to infer how much more data you'll need to read in total from individual requests made to the underlying network layer.

On the contrary, it's the entire point that these layers are isolated, so it's intentional that the network layer should not be able to infer where the header parsing is, solely based on requests. The internals of how the metadata is read should be isolated from reading the bytes.

Your estimation might work in this specific case, but it relies on exactly that case, and if the internals of how metadata is parsed change again in the future, that would break your code.

Instead of hacking around this via byte ranges, I think it would be better to expose a little more of the internals of metadata parsing, so that you can swap in your own implementation of reading the metadata if you wish. See #82.
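(For context, a rough sketch of the direction this points at, with made-up trait and type names rather than the actual #82 API: the metadata parser depends only on a small fetch trait, and a caller can plug in its own caching or prefetching implementation without the plain byte reader ever needing to infer parser state.)

```rust
use std::io;
use std::ops::Range;

// Illustrative names only; #82 defines the real interface. The point is the
// layering: the metadata parser asks this trait for byte ranges, and a caller
// can supply an implementation that caches or prefetches, while the plain
// network reader stays oblivious to where header parsing currently is.
trait MetadataFetch {
    async fn fetch(&self, range: Range<u64>) -> io::Result<Vec<u8>>;
}

/// One possible user-supplied implementation: delegate to an inner reader but
/// round every request up to a prefetch block, then serve later parser reads
/// from the cached block without touching the network again.
struct PrefetchingFetch<R> {
    inner: R,
    prefetch_block: u64, // e.g. 32 * 1024
}
```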

@feefladder (Contributor, Author)

Could we voice chat or something?

@feefladder (Contributor, Author)

> On the contrary, it's the entire point that these layers are isolated, so it's intentional that the network layer should not be able to infer where the header parsing is, solely based on requests. The internals of how the metadata is read should be isolated from reading the bytes.

Layers need to be isolated: Yes! Clearly, the exponential prefetch better separates layers.

Still, I think #83 works very well together with this PR (this PR would reduce the additional requests from ~9 to 2).
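To illustrate the exponential prefetch idea mentioned above (a sketch only, not code from #81 or #83): each time the parser asks for bytes beyond what is already cached, the middleware doubles its prefetched window, so an n-byte header is covered in O(log n) requests no matter how the parser slices its reads.

```rust
/// Sketch only: given the end of the range the parser just requested, the
/// length already prefetched, and an initial block size (e.g. 32 * 1024),
/// return how far the middleware should prefetch next. Doubling mirrors Vec's
/// amortized growth; with a good size estimate the middleware could instead
/// jump straight to the estimated total and skip the intermediate requests.
fn next_prefetch_end(requested_end: u64, prefetched_len: u64, initial_block: u64) -> u64 {
    let mut end = prefetched_len.max(initial_block).max(1); // guard against 0
    while end < requested_end {
        end *= 2; // exponent-2 growth
    }
    end
}
```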

> Instead of hacking around this via byte ranges, I think it would be better to expose a little more of the internals of metadata parsing, so that you can swap in your own implementation of reading the metadata if you wish. See #82.

I completely erred on this and am sad I couldn't appreciate all the effort you put into #82.

If you think 4 match blocks that can move the cursor anywhere in the TIFF are more maintainable than 2 match blocks that set the cursor at the end of the tag offset field, then sure, don't merge #81.

Yes, I think it's more maintainable because it's exactly the same as the upstream code. I don't want to have to think about the specifics of tag parsing. The code as it is just works, and even reviewing your changes is really hard because I would have to research the tag structure and validate that it's correct.

I've already made unit tests for tag parsing in tiff2. Would it help if:

  1. I copy the unit tests over from tiff2 in a separate PR, so they run against the current (upstream) implementation,
  2. this PR keeps all of those unit tests passing, and
  3. I show, on a COG you provide (link), that it improves metadata fetching to 3 or 4 requests with the default 32 kB prefetch and with exponential prefetch?

@kylebarron (Member)

Sorry, I've already spent more time on this than I wanted to. I have a bunch of other projects to attend to, and I don't want to spend more time arguing about this. If anything, I'd much rather focus on data parsing and decompression (e.g. decoding predictors).

@feefladder closed this Apr 5, 2025
Development

Successfully merging this pull request may close these issues: Handling large IFD data arrays