Kvikio backend entrypoint with Zarr v3 #70
base: main
Conversation
for more information, see https://pre-commit.ci
Allow it to be rendered under the User Guide section.
Will need it for the kvikio.zarr docs later.
Created a new section in the API documentation page for the kvikIO engine. Added more docstrings to the kvikio.py file, and fixed some imports so things render nicely on the API page. Also added an intersphinx link to the kvikio docs at https://docs.rapids.ai/api/kvikio/stable.
Fixes error like `TypeError: ZarrArrayWrapper.__init__() takes 2 positional arguments but 3 were given`.
Fix improper merge handling on d684dad
Fix `TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly` on https://github.com/pydata/xarray/blob/v2024.11.0/xarray/core/indexing.py#L578
Can directly rely on upstream xarray's ZarrStore.open_store_variable method since Zarr v3 compatibility was added in pydata/xarray#9552.
UserWarning: The `compressor` argument is deprecated. Use `compressors` instead.
Only difference is to pass the Zarr store's root filepath to kvikio.zarr.GDSStore instead of xarray.backends.zarr.ZarrStore.
Dev version containing patch at pydata/xarray#10078 that fixes `TypeError: NumpyIndexingAdapter only wraps np.ndarray. Trying to wrap <class 'cupy.ndarray'>`.
Needed to return cupy.ndarray instead of numpy.ndarray objects. Should find a better place to put this `zarr.config.enable_gpu()` call later.
Rearranged some cells so the air-temperature.zarr store is created first. Added a couple of `zarr.config.enable_gpu()` statements in to ensure arrays are read to cupy.ndarray. Removed the flox section at the end.
@pytest.mark.parametrize("indexer", [slice(None), slice(2, 4), 2, [2, 3, 5]])
def test_lazy_indexing(indexer, store):
    with zarr.config.enable_gpu(), xr.open_dataset(store, engine="kvikio") as ds:
Ideally, `xr.open_dataset(store, engine="kvikio")` should have `zarr.config.enable_gpu()` set already, so GPU-backed cupy arrays are returned when a user accesses the arrays. Unless we expect there to be users who want to use the kvikio engine while returning CPU-backed numpy arrays? Default should probably be cupy though.
I'm not sure... The (v3) `kvikio.zarr.GDSStore` is written to accept either. I'm tempted to leave this up to the user (so make them configure Zarr correctly) rather than assuming they want to use GPU memory, but maybe `engine="kvikio"` is sufficient indication that they want stuff on the GPU.
Yeah, `kvikio.zarr.GDSStore` can return numpy (CPU) arrays too, and I guess there might be a case where someone wants nvCOMP decompression on the GPU (once zarr-developers/zarr-python#2863 is implemented) but a numpy array returned in CPU memory? Still, I do feel that `engine="kvikio"` should default to putting things on the GPU as cupy arrays, while having a way to toggle that off as needed.

Looking at `xr.open_dataset`, I'm wondering if it might make sense to pass something into the `backend_kwargs` parameter to indicate the user's preference for GPU or CPU outputs. Are the `from_array_kwargs` or `chunked_array_type` parameters (experimental API) actually the intended mechanism for this?
I am inclined to agree that `kvikio.zarr.GDSStore` should be able to skip enabling the GPU Zarr config even when someone wants data on the GPU, since they could use a different `BufferPrototype` that returns torch tensors instead of cupy arrays, for example.

But for ease of use, it might be better to default to `zarr.config.enable_gpu()` and let that be overridden by passing in a different context manager. So `xr.open_dataset(store, engine="kvikio")` would default to cupy arrays, while `xr.open_dataset(store, engine="kvikio", zarr_context=contextlib.nullcontext())` would return numpy arrays, and `xr.open_dataset(store, engine="kvikio", zarr_context=MyCustomTorchContext())` would return torch tensors?
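A rough, runnable sketch of that idea — `zarr_context` is a hypothetical parameter name, and a stub class stands in for `zarr.config.enable_gpu()` so this runs without zarr or a GPU:

```python
import contextlib

entered = []  # records whether the GPU config was actually entered


class FakeGpuConfig:
    """Stand-in for zarr.config.enable_gpu(); the real one needs zarr and a GPU."""

    def __enter__(self):
        entered.append("gpu")
        return self

    def __exit__(self, *exc):
        return False


def open_dataset(store, zarr_context=None):
    # Hypothetical backend behaviour: enter the GPU config by default,
    # unless the caller supplies their own context manager
    # (e.g. contextlib.nullcontext() for numpy, or a torch context).
    ctx = FakeGpuConfig() if zarr_context is None else zarr_context
    with ctx:
        return f"dataset from {store}"


open_dataset("air.zarr")  # default: GPU config entered
open_dataset("air.zarr", zarr_context=contextlib.nullcontext())  # opted out
print(entered)  # -> ['gpu']
```

The nice property of a context-manager parameter is that the backend never has to enumerate output array types; anything that implements `__enter__`/`__exit__` and configures the buffer machinery would work.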
I think this is similar to the discussion at zarr-developers/zarr-python#2473. That `zarr_context` parameter is exactly what I'm thinking of. I'm trying to find out if there's a way to implement this right now by setting/monkeypatching the default `prototype` in `kvikio.zarr.GDSStore`'s `get` method here:

Currently the `prototype: BufferPrototype` is set dynamically by calling `zarr.core.buffer.core.default_buffer_prototype()`. But I'm wondering if we can have a pre-determined default buffer prototype that is set when the `GDSStore` instance is created?
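One way to experiment with pinning the prototype — a sketch, not the real implementation: it assumes `GDSStore.get` follows the Zarr v3 store signature `get(key, prototype, byte_range=None)`, and uses a dummy class plus a string placeholder so it runs without kvikio or a GPU:

```python
class DummyStore:
    """Stand-in for kvikio.zarr.GDSStore; assumes the Zarr v3 store API
    where get() takes (key, prototype, byte_range=None)."""

    def get(self, key, prototype, byte_range=None):
        return prototype  # echo back which buffer prototype was used


# Placeholder for a GPU prototype such as zarr.core.buffer.gpu.buffer_prototype
GPU_PROTOTYPE = "gpu_buffer_prototype"

store = DummyStore()
_original_get = store.get


def _get_with_pinned_prototype(key, prototype=None, byte_range=None):
    # Ignore the dynamically-resolved prototype and pin the GPU one,
    # so reads fill GPU buffers regardless of the global Zarr config.
    return _original_get(key, GPU_PROTOTYPE, byte_range=byte_range)


store.get = _get_with_pinned_prototype  # monkeypatch this instance only

print(store.get("air", "cpu_buffer_prototype"))  # -> gpu_buffer_prototype
```

Note that a plain `functools.partial(store.get, prototype=GPU_PROTOTYPE)` would raise a "multiple values" `TypeError` whenever a caller passes `prototype` positionally, which may be why the partial-based attempt mentioned in the commits was "not quite working as expected"; a wrapper that swallows the positional argument avoids that.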
cc @jakirkham @TomAugspurger @madsbk @ncclementi who may be interested in this
High-level question: are these two functionally equivalent?

ds = xr.open_dataset(kvikio.zarr.GDSStore("data.zarr"), engine="zarr")
ds = xr.open_dataset("data.zarr", engine="kvikio")

Do people have a preference on what to recommend to users?
cupy_xarray/kvikio.py (outdated)

filename_or_obj = _normalize_path(filename_or_obj)
if not store:
    store = ZarrStore.open_group(
        store=kvikio.zarr.GDSStore(root=filename_or_obj),
> High-level question: are these two functionally equivalent?
> ds = xr.open_dataset(kvikio.zarr.GDSStore("data.zarr"), engine="zarr")
> ds = xr.open_dataset("data.zarr", engine="kvikio")
> Do people have a preference on what to recommend to users?

Yes, those two are functionally the same. The `xr.open_dataset(kvikio.zarr.GDSStore("data.zarr"), engine="zarr")` style works today (with xarray=2025.03.0 and kvikio=25.04.00a), while `xr.open_dataset("data.zarr", engine="kvikio")` is more just a convenience syntax. I was hoping to also sneak the `zarr.config.enable_gpu()` call into the backend somewhere to get `cupy` arrays by default (but we can discuss that in the https://github.com/xarray-contrib/cupy-xarray/pull/70/files#r1988251560 thread).
Bumps [xarray](https://github.com/pydata/xarray) from 2025.1.3.dev22+g0184702f to 2025.03.0.
- [Release notes](https://github.com/pydata/xarray/releases)
- [Changelog](https://github.com/pydata/xarray/blob/main/HOW_TO_RELEASE.md)
- [Commits](pydata/xarray@0184702...v2025.03.0)

Use stable version of xarray, also with patch pydata/xarray#10081
cc @akshaysubr
Using `functools.partial` to override the default buffer prototype to be a GPU buffer instead of a CPU buffer. Not quite working as expected, but hopefully gets the point across.
Branched off the work at #10, this PR registers a `kvikio` backend that allows reading data from Zarr (v3) stores directly into GPU memory using NVIDIA GPU Direct Storage (GDS), without going through host (CPU) memory.

Preview demo at https://cupy-xarray--70.org.readthedocs.build/70/source/kvikio.html
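A minimal usage sketch (assuming a CUDA-capable GPU with kvikio installed and an existing Zarr v3 store; the `air-temperature.zarr` path is illustrative):

```python
import xarray as xr
import zarr

# enable_gpu() makes Zarr allocate GPU buffers, so variables come back
# as cupy.ndarray; the kvikio engine reads the bytes via GDS straight
# into GPU memory, bypassing host memory.
with zarr.config.enable_gpu():
    ds = xr.open_dataset("air-temperature.zarr", engine="kvikio")
```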
Requires some patches from unreleased versions:
- xarray>=2025.1.3.dev22+g0184702f (Generalize lazy backend indexing a little more, pydata/xarray#10078). Now in xarray=2025.03.0.
Also requires:
TODO:
- Sneak `zarr.config.enable_gpu()` into the backend to read into `cupy.ndarray` by default?

Limitations:
References: