Commit 504d293

Add documentation for caching feature (#237)
1 parent dcd09a0 commit 504d293

File tree

6 files changed: +282 −1

.gitignore (+1)

@@ -66,3 +66,4 @@ target/
 # tests
 .pytest_cache/*
 .mypy_cache/
+.ipynb_checkpoints/

ci/requirements/doc.yml (+3)

@@ -20,9 +20,12 @@ dependencies:
   - adlfs
   - ipykernel
   - nbsphinx
+  - netcdf4
+  - pooch
   - zarr
   # Editable xbatcher installation
   - pip
+
   - pip:
       # relative to this file. Needs to be editable to be accepted.
       - -e ../..
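With the two new dependencies (`netcdf4`, `pooch`) added, the docs environment can be rebuilt from this file. A minimal sketch, assuming a conda installation and that the command is run from the repository root (as the `-e ../..` editable install implies):

```shell
# Recreate the docs environment from the updated file.
# Delete any existing environment of the same name first if it collides.
conda env create -f ci/requirements/doc.yml
```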

doc/contributing.rst (+1 −1)

@@ -1,7 +1,7 @@
 .. _contributing:

 ******************
-Contributing guide
+Contributing Guide
 ******************

 .. note::

doc/index.rst (+1)

@@ -88,6 +88,7 @@ or via a built-in `Xarray accessor <http://xarray.pydata.org/en/stable/internals
    :caption: Contents:

    api
+   user-guide/index
    tutorials-and-presentations
    roadmap
    contributing

doc/user-guide/caching.ipynb (+268)
@@ -0,0 +1,268 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Xbatcher Caching Feature\n",
    "\n",
    "This notebook demonstrates the caching feature added to xbatcher's `BatchGenerator`, which lets you cache generated batches for faster repeated access.\n",
    "\n",
    "## Introduction\n",
    "\n",
    "When caching is enabled, xbatcher's `BatchGenerator` stores generated batches in a cache, which can significantly speed up subsequent accesses to the same batches. This is particularly useful when you need to iterate over the same dataset multiple times.\n",
    "\n",
    "The cache is pluggable, meaning you can use any dict-like object to store it. This flexibility allows for various storage backends, including local storage, distributed storage systems, and cloud storage solutions.\n",
    "\n",
    "## Installation\n",
    "\n",
    "To use the caching feature, you'll need xbatcher installed, along with zarr for serialization. If you haven't already, you can install both using pip:\n",
    "\n",
    "```bash\n",
    "python -m pip install xbatcher zarr\n",
    "```\n",
    "\n",
    "or using conda:\n",
    "\n",
    "```bash\n",
    "conda install -c conda-forge xbatcher zarr\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Basic Usage\n",
    "\n",
    "Let's start with a basic example of how to use the caching feature:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tempfile\n",
    "\n",
    "import xarray as xr\n",
    "import zarr\n",
    "\n",
    "import xbatcher"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create a cache using Zarr's DirectoryStore\n",
    "directory = f'{tempfile.mkdtemp()}/xbatcher-cache'\n",
    "print(directory)\n",
    "cache = zarr.storage.DirectoryStore(directory)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this example, we're using a local directory to store the cache, but you could use any zarr-compatible store, such as S3, Redis, etc."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load a sample dataset\n",
    "ds = xr.tutorial.open_dataset('air_temperature', chunks={})\n",
    "ds"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create a BatchGenerator with caching enabled\n",
    "gen = xbatcher.BatchGenerator(ds, input_dims={'lat': 10, 'lon': 10}, cache=cache)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Performance Comparison\n",
    "\n",
    "Let's compare the performance with and without caching:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "\n",
    "\n",
    "def time_iteration(gen):\n",
    "    start = time.time()\n",
    "    for batch in gen:\n",
    "        pass\n",
    "    end = time.time()\n",
    "    return end - start"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "directory = f'{tempfile.mkdtemp()}/xbatcher-cache'\n",
    "cache = zarr.storage.DirectoryStore(directory)\n",
    "\n",
    "# without cache\n",
    "gen_no_cache = xbatcher.BatchGenerator(ds, input_dims={'lat': 10, 'lon': 10})\n",
    "time_no_cache = time_iteration(gen_no_cache)\n",
    "print(f'Time without cache: {time_no_cache:.2f} seconds')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# with cache\n",
    "gen_with_cache = xbatcher.BatchGenerator(\n",
    "    ds, input_dims={'lat': 10, 'lon': 10}, cache=cache\n",
    ")\n",
    "time_first_run = time_iteration(gen_with_cache)\n",
    "print(f'Time with cache (first run): {time_first_run:.2f} seconds')\n",
    "\n",
    "time_second_run = time_iteration(gen_with_cache)\n",
    "print(f'Time with cache (second run): {time_second_run:.2f} seconds')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You should see that the second run with the cache is significantly faster than both the first run and the run without the cache."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Advanced Usage\n",
    "\n",
    "### Custom Cache Preprocessing\n",
    "\n",
    "You can also specify a custom preprocessing function to be applied to each batch before it is cached:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create a fresh cache using Zarr's DirectoryStore\n",
    "directory = f'{tempfile.mkdtemp()}/xbatcher-cache'\n",
    "cache = zarr.storage.DirectoryStore(directory)\n",
    "\n",
    "\n",
    "def preprocess_batch(batch):\n",
    "    # example: add a new variable to each batch\n",
    "    batch['new_var'] = batch['air'] * 2\n",
    "    return batch\n",
    "\n",
    "\n",
    "gen_with_preprocess = xbatcher.BatchGenerator(\n",
    "    ds,\n",
    "    input_dims={'lat': 10, 'lon': 10},\n",
    "    cache=cache,\n",
    "    cache_preprocess=preprocess_batch,\n",
    ")\n",
    "\n",
    "# now, each cached batch will include the 'new_var' variable\n",
    "for batch in gen_with_preprocess:\n",
    "    print(batch)\n",
    "    break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using Different Storage Backends\n",
    "\n",
    "While we've been using a local directory for caching, you can use any dict-like store that is compatible with zarr. For example, you could use an S3 bucket as the cache storage backend:\n",
    "\n",
    "```python\n",
    "import s3fs\n",
    "import zarr\n",
    "\n",
    "# set up the S3 filesystem (you'll need appropriate credentials)\n",
    "s3 = s3fs.S3FileSystem(anon=False)\n",
    "store = s3.get_mapper('s3://my-bucket/my-cache.zarr')\n",
    "\n",
    "# use this store as the cache for a BatchGenerator\n",
    "gen_s3 = xbatcher.BatchGenerator(ds, input_dims={'lat': 10, 'lon': 10}, cache=store)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Considerations and Best Practices\n",
    "\n",
    "- **Storage Space**: Be mindful of the storage space required for your cache, especially when working with large datasets.\n",
    "- **Cache Invalidation**: The current implementation doesn't handle cache invalidation. If your source data changes, you'll need to manually clear or update the cache.\n",
    "- **Performance Tradeoffs**: While caching can significantly speed up repeated access to the same data, the initial caching pass may be slower than processing without a cache. Consider your use case to determine whether caching is beneficial.\n",
    "- **Storage Backend**: Choose a storage backend that's appropriate for your use case. Local storage might be fastest for single-machine applications, while distributed or cloud storage might be necessary for cluster computing or cloud-based workflows."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
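The notebook's best-practices list notes that cache invalidation is manual. A minimal sketch of clearing a directory-backed cache between runs, assuming the tempfile-based `directory` layout used in the notebook (the `stale.chunk` filename below is purely illustrative):

```python
import os
import shutil
import tempfile

# Hypothetical cache directory, mirroring the notebook's layout.
directory = f'{tempfile.mkdtemp()}/xbatcher-cache'
os.makedirs(directory, exist_ok=True)

# Simulate a stale cached chunk left over from a previous run.
with open(os.path.join(directory, 'stale.chunk'), 'w') as f:
    f.write('old data')

# Invalidate the cache by deleting the store's directory wholesale;
# the next cached BatchGenerator run would repopulate it from source data.
shutil.rmtree(directory)
print(os.path.exists(directory))  # → False
```

Deleting the whole directory is the bluntest option; a finer-grained approach would delete only the store keys for batches whose inputs changed.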

doc/user-guide/index.rst (+8)

@@ -0,0 +1,8 @@
User Guide
==========

.. toctree::
   :maxdepth: 2
   :caption: Contents:

   caching
