aiocogeo design (lessons learned) #7
geospatial-jeff started this conversation in Ideas
I wrote `aiocogeo` ~5 years ago and have since built ~15 other COG reader implementations (Go/Python/Node) for various use cases. This is a non-exhaustive brain dump of lessons learned along the way to help inform the design of `aiocogeo-rs`. It is of course all very opinionated and up for debate.

## Aiocogeo Mistakes
At a macro level, the primary goal of `aiocogeo` was to be a faster alternative to rasterio with a similar interface and similar configuration options: designed to be fast, but also easy to use and familiar to most geospatial Python devs. Herein lies the first mistake: GDAL (and rasterio) is a library intentionally designed to provide an abstraction layer over many file formats, while `aiocogeo` just needs to read one.

While COG may be a single file format, it is almost endlessly flexible. The joke when TIFF was released in 1986 was that it stood for "the infinitely flexible format". This leads to my second mistake: I did not put clear enough boundaries on what the library would or wouldn't support, which led to `aiocogeo` becoming a library that was good at many things but not really great at anything. This eroded the core identity of the library. Adding features becomes an extremely slippery slope: before you know it you are supporting 10 different compressions and 4 different ways to define a `nodata` value, when all you really wanted was a library that quickly fetches tiles from a COG.

I ultimately abandoned the library as it made more and more sense to build specific COG reader implementations tailored to each use case. Building a tile server for JPEG web-optimized COGs? All you need is to read ~4-5 TIFF tags plus a `get_tile` function, and you are basically done. It costs you ~150-200 lines of code and a day of time, but you don't need to figure out how to make a generalized COG library (`aiocogeo`) work and scale for your use case. Moving LZW-compressed COGs onto a GPU for inference? Same story.

I also wanted to call out a few micro-level mistakes I made with the design of the code itself:
- Every COG reader I have built since `aiocogeo` has been functional, with the same components:
  - Structs (`class Header`, `class IFD`, `class Tag`). These should implement the underlying specs as closely as possible; for example, I will use `dataclass` in Python.
  - `open_cog(filepath: str) -> List[IFD]`, which is responsible for creating these structs from a given filepath.
  - `read_tile(ifd: IFD, x: int, y: int) -> bytes`, which operates on these structs to do something, in this case read a tile.
  - `write_cog(ifds: List[IFD], filepath: str)` to write from structs to a COG (if you want to support writing); the reverse of `open_cog`.
- `object_store` definitely helps a ton here.
- Many of the functions in `aiocogeo`, mostly the ones that returned an array of some sort, were a mistake to implement. I wish the library had just implemented the lower-level function calls to read a header, fetch a tile, etc. Everything about COGs (reading the header, fetching tiles, etc.) is pretty standard up until you start loading data into arrays; that is where a lot of use cases begin to differ. This includes the interpretation of TIFF tags like photometric interpretation and `nodata`, which are useful for understanding how to structure a COG into an array. `aiocogeo-rs` should expose the values of these tags to the end user, but shouldn't do anything with them beyond that.
- Follow what `zarr-python` has done here (via `imagecodecs`): support a sensible number of compressions, but allow users to easily implement their own. `aiocogeo-rs` should be batteries included, but we shouldn't bend over backwards to support image compressions that are rarely used. Optimize for the majority.

## Mistakes from other libraries
Finally, I wanted to call out one particular mistake that other libraries make: not giving the user enough control over the underlying bytes. The number one way by far to make COGs more scalable in the cloud is caching. I cannot even begin to emphasize how important caching is (stay tuned for the CNG conference). The key with caching is flexibility, and you can't have any flexibility in caching if you as the caller don't have direct control of the underlying bytes.
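To make the point concrete, here is a minimal sketch of what "control over the bytes" buys you. The `fetch_range`/`fetch_header` names, the in-memory store, and the 16 KB header size are all illustrative assumptions, not any library's actual API:

```python
import functools

# Toy in-memory "object store" standing in for real HTTP range requests.
FAKE_STORE = {"s3://bucket/cog.tif": bytes(range(256))}
CALLS = {"count": 0}

def fetch_range(url: str, offset: int, length: int) -> bytes:
    """Hypothetical low-level byte fetcher a reader would expose."""
    CALLS["count"] += 1
    return FAKE_STORE[url][offset : offset + length]

# Because the caller owns the byte-level call, caching is one decorator:
@functools.lru_cache(maxsize=256)
def fetch_header(url: str, size: int = 16384) -> bytes:
    # Readers typically grab the first chunk of the file, which usually
    # holds the TIFF header and IFDs.
    return fetch_range(url, 0, size)

h1 = fetch_header("s3://bucket/cog.tif")
h2 = fetch_header("s3://bucket/cog.tif")  # served from the in-process cache
assert h1 == h2
assert CALLS["count"] == 1  # the second call never touched the store
```

Swapping `lru_cache` for a Redis write-ahead cache is the same pattern: intercept the byte fetch, because that is the only place the cache can live.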
This gets a bit nuanced, because the "proper" way to cache in distributed systems is via forward/reverse proxies, which doesn't actually require access to the bytes since your application code itself is not responsible for the cache. But that use case is a bit advanced, and there are many people who just want to, for example, slap a `functools.lru_cache` decorator on a function call and move on. Or request a COG header, then cache the bytes in Redis (write-ahead).

## Conclusion
My hope for `aiocogeo-rs` is that it turns into a very simple library with (1) structs for the TIFF spec and (2) functions that operate on those structs. All we really need at this point in time is `open_cog`, `read_tile`, and `decompress_tile`, with support for a few compressions (e.g. JPEG/LZW/DEFLATE covers the majority of COGs) and access to the underlying bytes. "Keep it simple, stupid," as they say.

If done well, I believe this will be a critical piece of software that serves as a foundation to fully realize the benefits of horizontal scaling in the cloud (against COGs) and all the niceties that come with that. Especially if this simple, diminutive COG interface can be bound nicely into Python for downstream users to build off.
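The struct-plus-functions shape described above can be sketched in a few dozen lines. The function names (`read_tile`, `decompress_tile`) come from this post, but the `IFD` fields, the `CODECS` registry, and the fabricated byte layout are toy assumptions; `open_cog` (parsing the actual TIFF directory into these structs) is elided:

```python
import zlib
from dataclasses import dataclass
from typing import Callable, Dict, List

# Struct mirroring the spec. Real IFDs carry many more tags; these fields
# are just the minimum needed for tile reads in this sketch.
@dataclass
class IFD:
    tile_width: int
    tile_height: int
    compression: int          # TIFF Compression tag (1 = none, 8 = DEFLATE)
    tiles_across: int
    tile_offsets: List[int]
    tile_byte_counts: List[int]

# Pluggable codecs keyed by the Compression tag value; users can register
# their own decoder instead of waiting on the library to support it.
CODECS: Dict[int, Callable[[bytes], bytes]] = {
    1: lambda b: b,           # uncompressed
    8: zlib.decompress,       # DEFLATE
}

def read_tile(data: bytes, ifd: IFD, x: int, y: int) -> bytes:
    """Return the raw, still-compressed bytes for tile (x, y)."""
    idx = y * ifd.tiles_across + x
    off, cnt = ifd.tile_offsets[idx], ifd.tile_byte_counts[idx]
    return data[off : off + cnt]

def decompress_tile(ifd: IFD, tile: bytes) -> bytes:
    """Decode a tile with whichever codec the IFD declares."""
    return CODECS[ifd.compression](tile)

# Usage with a fabricated single-tile file: 4 header bytes followed by one
# DEFLATE-compressed tile.
payload = zlib.compress(b"\x01" * 16)
data = b"HDR!" + payload
ifd = IFD(
    tile_width=4, tile_height=4, compression=8, tiles_across=1,
    tile_offsets=[4], tile_byte_counts=[len(payload)],
)
assert decompress_tile(ifd, read_tile(data, ifd, 0, 0)) == b"\x01" * 16
```

Note that `read_tile` returns bytes, not an array: interpreting those bytes (photometric interpretation, nodata, array layout) is left entirely to the caller, which is the boundary argued for above.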