From ecd25a3578ca4104901b50b19da2bd5bb0e46663 Mon Sep 17 00:00:00 2001 From: Kyle Barron Date: Wed, 21 May 2025 18:27:23 -0400 Subject: [PATCH 1/2] Set up intro obspec blog post --- docs/blog/.authors.yml | 5 +++++ docs/blog/index.md | 2 ++ docs/blog/posts/introducing-obspec.md | 22 ++++++++++++++++++++++ mkdocs.yml | 4 ++-- 4 files changed, 31 insertions(+), 2 deletions(-) create mode 100644 docs/blog/.authors.yml create mode 100644 docs/blog/index.md create mode 100644 docs/blog/posts/introducing-obspec.md diff --git a/docs/blog/.authors.yml b/docs/blog/.authors.yml new file mode 100644 index 0000000..c23a2bb --- /dev/null +++ b/docs/blog/.authors.yml @@ -0,0 +1,5 @@ +authors: + kylebarron: + name: Kyle Barron + description: Creator + avatar: https://github.com/kylebarron.png diff --git a/docs/blog/index.md b/docs/blog/index.md new file mode 100644 index 0000000..c58f16c --- /dev/null +++ b/docs/blog/index.md @@ -0,0 +1,2 @@ +# Blog + diff --git a/docs/blog/posts/introducing-obspec.md b/docs/blog/posts/introducing-obspec.md new file mode 100644 index 0000000..cf74f0b --- /dev/null +++ b/docs/blog/posts/introducing-obspec.md @@ -0,0 +1,22 @@ +--- +draft: false +date: 2025-05-25 +categories: + - Release +authors: + - kylebarron +--- + +# Introducing Obspec + + + +Obstore is the simplest, highest-throughput Python interface to Amazon S3, Google Cloud Storage, and Azure Storage, powered by Rust. + +This post gives an overview of what's new in obstore version 0.4. + + + + +## Compare and contrast to fsspec + diff --git a/mkdocs.yml b/mkdocs.yml index 076265a..5a7fee4 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -29,8 +29,8 @@ nav: # - integrations.md # - performance.md # - fsspec.md - # - Blog: - # - blog/index.md + - Blog: + - blog/index.md - API Reference: - api/copy.md - api/delete.md From 811f47a18f2d4d24da984ebf3d3fb4b6cf90432a Mon Sep 17 00:00:00 2001 From: Kyle Barron Date: Wed, 28 May 2025 18:36:05 -0400 Subject: [PATCH 2/2] brain dump --- docs/blog/posts/introducing-obspec.md | 242 +++++++++++++++++++++++++- 1 file changed, 240 insertions(+), 2 deletions(-) diff --git a/docs/blog/posts/introducing-obspec.md b/docs/blog/posts/introducing-obspec.md index cf74f0b..ca955b0 100644 --- a/docs/blog/posts/introducing-obspec.md +++ b/docs/blog/posts/introducing-obspec.md @@ -1,13 +1,13 @@ --- draft: false -date: 2025-05-25 +date: 2025-05-29 categories: - Release authors: - kylebarron --- -# Introducing Obspec +# Introducing Obspec: A Python protocol for interfacing with object storage @@ -17,6 +17,244 @@ This post gives an overview of what's new in obstore version 0.4. +## Why? + +Consistent interface to object storage. + + + +- Obspec grew out of obstore. + +Comparison to obstore: Obstore is a concrete implementation; obspec is an abstract interface using Python protocols. + +Builds on a series of known protocols. Uses the buffer protocol for representing binary data. ## Compare and contrast to fsspec +1. api surface area of obspec vs fsspec. moving away from trying to make a file system layer which is a poor semantic mismatch and causes confusion and overhead. + +2. We don't have any implementation logic inside of obstore. A lot of baked-in fsspec logic is going to go away. If you want to have implementation-specific logic, it can be on top of obspec instead of having to go into obspec and understand what's going on. + +### Abstraction target + +Fsspec: +Access remote data via stateful file objects + + + +```py +from fsspec import AbstractFileSystem + + +def download_file(fs: AbstractFileSystem) -> str: + with fs.open("my-file.txt", "rb") as f: + return f.read().decode() +``` + + +Obstore: HTTP requests + +Access remote data via HTTP-like requests +All operations are atomic (readers cannot observe partial/failed writes) +Allows for functionality not native to filesystems +Operation preconditions (fetch if unmodified) +Atomic multipart uploads + + + +```py +from obspec import Get + + +def download_file(client: Get) -> str: + response = client.get("my-file.txt") + # buffer is only known to implement the Buffer Protocol + buffer = response.bytes() + return bytes(buffer).decode() +``` + +!!! note + + Core point: mismatched abstraction, object stores are not filesystems. + +### Stateful vs Stateless + +Core point: stateful APIs add user uncertainty. +Is the list request cached? +How many requests are made? +What happens if the remote data changes? +Will the second list automatically reflect new data? + + +We want a clear contract between provider (backend) and consumer (user/library) +Is the list request cached? (yes) +How many requests does this make? (1) +What happens if the remote data changes? +Will the second list automatically reflect new data? (no, not by default, but could be implementation dependent) + + +```py +from time import sleep + +from fsspec import AbstractFileSystem + + +def list_files_twice(fs: AbstractFileSystem): + fs.ls("s3://mybucket") + sleep(5) + fs.ls("s3://mybucket") +``` + +```py +from time import sleep + +from obspec import List + + +def list_files_twice(client: List): + list_iter = client.list("prefix") + list_items = list(list_iter) + sleep(5) + list_iter = client.list("prefix") + list_items = list(list_iter) +``` + +### API Surface + +Core point: obstore has a smaller API surface, easier to understand, compose. + +Fsspec: + +AbstractFileSystem: 10 public attributes, 56 public methods, more public async methods +AbstractBufferedFile: 20 public methods +Common to hit NotImplementedError since not all backends support all filesystem concepts (e.g. async) + +Obstore: + +Just 11 methods total: Core operations that object stores support natively +Full clarity of underlying HTTP calls +E.g. opening an fsspec file and then iterating over the responses… unclear how many raw HTTP requests that translates into. +Predictable performance. +No automatic caching (to be provided on top) +Very rare NotImplementedError: Azure suffix requests + + +copy/copy_async: Copy an object +delete/delete_async: Delete an object +get: Download a file +get_range/get_range_async: Get a byte range +get_ranges/get_ranges_async: Get multiple byte ranges +head/head_async: Access file metadata +list: List objects +list_with_delimiter/list_with_delimiter_async: List objects within a specific “directory”, avoiding recursing into further directories. +put/put_async: Upload to file +rename/rename_async: Move an object from one path to another +sign/sign_async: Create a signed URL + +### Streaming + +Core point: obstore has full streaming support. + +Fsspec: + +Streaming download: No support. +Can be emulated with file object, but no way to make one request and have it return as a stream. +Streaming upload: supported synchronously by passing file-like object. +Streaming list: No support: ls will always return all objects within prefix. + + +Both sync and async streaming support. +Streaming download: start working with byte response before entire file has downloaded. +Streaming upload: upload data from any byte source without materializing everything in memory. +Streaming list: automatic pagination behind the scenes + +Streaming download: + +```py +from obspec import Get + + +def streaming_download(client: Get): + response = client.get("file.txt") + for buffer_chunk in response: + # The iteration object is again a Buffer Protocol object + print(len(memoryview(buffer_chunk))) +``` + +Async streaming download. In just a few lines of code we can switch to supporting async. + + +```py +from obspec import GetAsync + + +async def streaming_download(client: GetAsync): + response = await client.get_async("file.txt") + async for buffer_chunk in response: + # The iteration object is again a Buffer Protocol object + print(len(memoryview(buffer_chunk))) +``` + + + + +### Intersecting features + +Not all backends will support all features. + +This is why obspec is defined as a set of independent protocols. Users can intersect the ones they need. + +### Full async API + +### Type hinting + +Fully type hinted + +### Manner of subtyping + +As [described in the Mypy documentation](https://mypy.readthedocs.io/en/stable/protocols.html), the Python type system supports two different manners of subtyping. + +> _Nominal_ subtyping is strictly based on the class hierarchy. If class `Dog` +> inherits class `Animal`, it's a subtype of `Animal`. Instances of `Dog` +> can be used when `Animal` instances are expected. This form of subtyping +> is what Python's type system predominantly uses: it's easy to +> understand and produces clear and concise error messages, and matches how the +> native :py:func:`isinstance ` check works -- based on class +> hierarchy. +> +> _Structural_ subtyping is based on the operations that can be performed with +> an object. Class `Dog` is a structural subtype of class `Animal` if the former +> has all attributes and methods of the latter, and with compatible types. +> +> Structural subtyping can be seen as a static equivalent of duck typing, which +> is well known to Python programmers. + +Fsspec uses nominal subtyping. + +Obspec uses structural subtyping. + + +### Predictability + +We don't have any implementation logic inside of obstore. A lot of baked-in fsspec logic is going to go away. If you want to have implementation-specific logic, it can be on top of obspec instead of having to go into obspec and understand what's going on. + +#### Caching + +Fsspec has caching built in. This can cause unpredictable results. + +With obspec, the idea is to provide the low-level primitives and caching can be implemented _as wrappers_ on top of + +### Dependencies? + +Talk about protocols? + +## Implemented by obstore + +Protocols are worthless if a concrete implementation doesn't exist. + +Obstore is a zero-dependency implementation. + + +## Future work + +### Common exceptions