The recommended way to store xarray data structures is `netCDF`__, which
is a binary file format for self-described datasets that originated
in the geosciences. xarray is based on the netCDF data model, so netCDF files
on disk directly correspond to :py:class:`~xarray.Dataset` objects (more accurately,
a group in a netCDF file directly corresponds to a :py:class:`~xarray.Dataset` object.
See :ref:`io.netcdf_groups` for more.)

NetCDF is supported on almost all platforms, and parsers exist
for the vast majority of scientific programming languages. Recent versions of
netCDF are based on the even more widely used HDF5 file-format.

Reading and writing netCDF files with xarray requires scipy or the
`netCDF4-Python`__ library to be installed (the latter is required to
read/write netCDF V4 files and use the compression options described below).

__ https://github.com/Unidata/netcdf4-python

We can save a Dataset to disk using the
:py:meth:`~xarray.Dataset.to_netcdf` method:

.. ipython:: python

    ds.to_netcdf("saved_on_disk.nc")

Similarly, a ``DataArray`` can be saved to disk using the
:py:meth:`~xarray.DataArray.to_netcdf` method, and loaded from disk using the
:py:func:`~xarray.open_dataarray` function. As netCDF files correspond to
:py:class:`~xarray.Dataset` objects, these functions internally
convert the ``DataArray`` to a ``Dataset`` before saving, and then convert back
when loading, ensuring that the ``DataArray`` that is loaded is always exactly
the same as the one that was saved.
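
As a minimal sketch of this round trip (the array and file name are purely
illustrative, not taken from the examples above)::

    da = xr.DataArray([1, 2, 3], dims="x", name="numbers")
    da.to_netcdf("da_on_disk.nc")                  # converted to a Dataset internally
    da_back = xr.open_dataarray("da_on_disk.nc")   # converted back to a DataArray
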
Data is always loaded lazily from netCDF files. You can manipulate, slice and subset
Dataset and DataArray objects, and no array values are loaded into memory until
you try to perform some sort of actual computation. For an example of how these
lazy arrays work, see the OPeNDAP section below.
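
As a rough sketch of this behaviour (the variable and dimension names are
illustrative assumptions, not part of any particular dataset)::

    ds = xr.open_dataset("saved_on_disk.nc")     # only metadata is read at this point
    subset = ds["numbers"].isel(x=slice(0, 2))   # still lazy, no values loaded yet
    values = subset.values                       # the data for this slice is read now
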

It is possible to append or overwrite netCDF variables using the ``mode='a'``
argument. When using this option, all variables in the dataset will be written
to the original netCDF file, regardless of whether they exist in the original dataset.

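A hedged sketch of appending to the file written above (the extra variable is
made up for illustration)::

    ds_extra = xr.Dataset({"station": ("s", [1, 2, 3])})
    ds_extra.to_netcdf("saved_on_disk.nc", mode="a")   # keeps the existing contents
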

.. _io.netcdf_groups:

Groups
~~~~~~

NetCDF groups are not supported as part of the :py:class:`~xarray.Dataset` data model.
Instead, groups can be loaded individually as Dataset objects.
To do so, pass a ``group`` keyword argument to the
:py:func:`~xarray.open_dataset` function. The group can be specified as a path-like
string, e.g., to access subgroup ``'bar'`` within group ``'foo'`` pass
``'/foo/bar'`` as the ``group`` argument.
In a similar way, the ``group`` keyword argument can be given to the
:py:meth:`~xarray.Dataset.to_netcdf` method to write to a group
in a netCDF file.
When writing multiple groups in one file, pass ``mode='a'`` to
:py:meth:`~xarray.Dataset.to_netcdf` to ensure that each call does not delete the file.
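
A hedged sketch of working with groups (the datasets ``ds_foo`` and ``ds_bar``
and the file name are assumptions for illustration)::

    ds_foo.to_netcdf("groups_example.nc", group="foo")                 # creates the file
    ds_bar.to_netcdf("groups_example.nc", group="/foo/bar", mode="a")  # adds a subgroup
    bar = xr.open_dataset("groups_example.nc", group="/foo/bar")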

.. _io.encoding:

Reading encoded data
~~~~~~~~~~~~~~~~~~~~

NetCDF files follow some conventions for encoding datetime arrays (as numbers
with a "units" attribute) and for packing and unpacking data (as
described by the "scale_factor" and "add_offset" attributes). If the argument
``decode_cf=True`` (default) is given to :py:func:`~xarray.open_dataset`, xarray will attempt
to automatically decode the values in the netCDF objects according to
`CF conventions`_. Sometimes this will fail, for example, if a variable
has an invalid "units" or "calendar" attribute. For these cases, you can
turn this decoding off manually.
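
A minimal sketch of that workflow, assuming a file whose time units cannot be
parsed automatically (the file name and attribute fix are illustrative)::

    ds = xr.open_dataset("problem_file.nc", decode_times=False)
    ds.time.attrs["units"] = "days since 2000-01-01"   # repair the bad attribute
    ds = xr.decode_cf(ds)                              # decoding now succeeds
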

.. ipython:: python
    :suppress:

    import os

    os.remove('saved_on_disk.nc')


.. _combining multiple files:

Reading multi-file datasets
...........................

NetCDF files are often encountered in collections, e.g., with different files
corresponding to different model runs or one file per timestamp.
xarray can straightforwardly combine such files into a single Dataset by making use of
:py:func:`~xarray.concat`, :py:func:`~xarray.merge`, :py:func:`~xarray.combine_nested` and
:py:func:`~xarray.combine_by_coords`. For details on the difference between these
functions see :ref:`combining data`.

xarray includes support for manipulating datasets that don't fit into memory
with dask_. If you have dask installed, you can open multiple files
simultaneously in parallel using :py:func:`~xarray.open_mfdataset`::

    xr.open_mfdataset('my/files/*.nc', parallel=True)

This function automatically concatenates and merges multiple files into a
single xarray dataset.
It is the recommended way to open multiple files with xarray.
For more details on parallel reading, see :ref:`combining.multi`, :ref:`dask.io` and a
`blog post`_ by Stephan Hoyer.
:py:func:`~xarray.open_mfdataset` takes many kwargs that allow you to
control its behaviour (e.g. ``parallel``, ``combine``, ``compat``, ``join``, ``concat_dim``).
See its docstring for more details.
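
For instance (a sketch only; the file pattern is illustrative)::

    # let xarray infer the combined layout from coordinate values in each file
    xr.open_mfdataset('my/files/*.nc', combine='by_coords')

    # or stack the files explicitly along a (possibly new) dimension
    xr.open_mfdataset('my/files/*.nc', combine='nested', concat_dim='time')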


.. note::

    A common use-case involves a dataset distributed across a large number of files with
    each file containing a large number of variables. Commonly a few of these variables
    need to be concatenated along a dimension (say ``"time"``), while the rest are equal
    across the datasets (ignoring floating point differences). The following command
    with suitable modifications (such as ``parallel=True``) works well with such datasets::

        xr.open_mfdataset('my/files/*.nc', concat_dim="time",
                          data_vars='minimal', coords='minimal', compat='override')

    This command concatenates variables along the ``"time"`` dimension, but only those that
    already contain the ``"time"`` dimension (``data_vars='minimal', coords='minimal'``).
    Variables that lack the ``"time"`` dimension are taken from the first dataset
    (``compat='override'``).


.. _dask: http://dask.pydata.org
.. _blog post: http://stephanhoyer.com/2015/06/11/xray-dask-out-of-core-labeled-arrays/

Sometimes multi-file datasets are not conveniently organized for easy use of :py:func:`~xarray.open_mfdataset`.
One can use the ``preprocess`` argument to provide a function that takes a dataset
and returns a modified Dataset.
:py:func:`~xarray.open_mfdataset` will call ``preprocess`` on every dataset
(corresponding to each file) prior to combining them.
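
A hedged sketch of using ``preprocess`` (the variable name kept here is made up)::

    def keep_precip(ds):
        # select only the variables of interest from each file
        return ds[["precip"]]

    ds = xr.open_mfdataset('my/files/*.nc', preprocess=keep_precip)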


If :py:func:`~xarray.open_mfdataset` does not meet your needs, other approaches are possible.
The general pattern for parallel reading of multiple files
using dask, modifying those datasets and then combining into a single ``Dataset`` is::

    import dask
    import xarray as xr

    def modify(ds):
        # modify ds here
        return ds


    # this is basically what open_mfdataset does
    # file_names is a list of paths to your netCDF files
    open_kwargs = dict(decode_cf=True, decode_times=False)
    open_tasks = [dask.delayed(xr.open_dataset)(f, **open_kwargs) for f in file_names]
    tasks = [dask.delayed(modify)(task) for task in open_tasks]
    # dask.compute returns a tuple; its first element is the list of computed datasets
    datasets = dask.compute(tasks)[0]  # get a list of xarray.Datasets
    combined = xr.combine_nested(datasets, concat_dim="time")  # or some combination of concat, merge


As an example, here's how we could approximate ``MFDataset`` from the netCDF4
library::

    from glob import glob
    import xarray as xr

    def read_netcdfs(files, dim):
        # glob expands paths with * to a list of files, like the unix shell
        paths = sorted(glob(files))
        datasets = [xr.open_dataset(p) for p in paths]
        combined = xr.concat(datasets, dim)
        return combined

    combined = read_netcdfs('/all/my/files/*.nc', dim='time')

This function will work in many cases, but it's not very robust. First, it
never closes files, which means it will fail once you need to load more than
a few thousand files. Second, it assumes that you want all the data from each
file and that it can all fit into memory. In many situations, you only need
a small subset or an aggregated summary of the data from each file.

Here's a slightly more sophisticated example of how to remedy these
deficiencies::

    def read_netcdfs(files, dim, transform_func=None):
        def process_one_path(path):
            # use a context manager, to ensure the file gets closed after use
            with xr.open_dataset(path) as ds:
                # transform_func should do some sort of selection or
                # aggregation
                if transform_func is not None:
                    ds = transform_func(ds)
                # load all data from the transformed dataset, to ensure we can
                # use it after closing each original file
                ds.load()
                return ds

        paths = sorted(glob(files))
        datasets = [process_one_path(p) for p in paths]
        combined = xr.concat(datasets, dim)
        return combined

    # here we suppose we only care about the combined mean of each file;
    # you might also use indexing operations like .sel to subset datasets
    combined = read_netcdfs('/all/my/files/*.nc', dim='time',
                            transform_func=lambda ds: ds.mean())

This pattern works well and is very robust. We've used similar code to process
tens of thousands of files constituting 100s of GB of data.

.. _io.netcdf.writing_encoded:

Writing encoded data
~~~~~~~~~~~~~~~~~~~~

.. _xarray_extras: https://xarray-extras.readthedocs.io/en/latest/api/csv.html

.. _IO tools: http://pandas.pydata.org/pandas-docs/stable/io.html