Simultaneously read multiple Datasets into an Xarray-Beam pipeline #68

shoyer · 2022-12-07T01:04:01Z

It is relatively common to need to load multiple xarray.Dataset objects, e.g., to compare two different models.

This can currently be done by loading the data with separate calls to xbeam.DatasetToChunks and joining the results with beam.CoGroupByKey. This works but is rather inefficient, involving an extra write of the data to disk. Ideally we could load the data in a single Beam transform instead, e.g., xbeam.DatasetToChunks([ds1, ds2], chunks) would return a PCollection with elements of type tuple[xbeam.Key, tuple[xarray.Dataset, xarray.Dataset]].
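To make the target element shape concrete, here is a minimal plain-Python sketch of the join semantics. It does not use Beam or Xarray-Beam: keys are modeled as tuples of (dimension, offset) pairs, chunks as string labels, and `cogroup` and `zip_chunks` are hypothetical helpers that only mimic what a co-group join followed by unpacking would produce.

```python
from collections import defaultdict

# Stand-ins (assumptions for illustration): a key is a tuple of
# (dimension, offset) pairs, and a "chunk" is just a string label.
chunks_a = [((("x", 0),), "ds1[x=0:3]"), ((("x", 3),), "ds1[x=3:6]")]
chunks_b = [((("x", 0),), "ds2[x=0:3]"), ((("x", 3),), "ds2[x=3:6]")]


def cogroup(*collections):
    """Group values from several keyed collections by key.

    Returns {key: ([values from 1st], [values from 2nd], ...)}, roughly the
    shape a co-group join yields for a tuple of inputs.
    """
    grouped = defaultdict(lambda: tuple([] for _ in collections))
    for index, collection in enumerate(collections):
        for key, value in collection:
            grouped[key][index].append(value)
    return dict(grouped)


def zip_chunks(*collections):
    """Emit (key, (chunk1, chunk2, ...)), requiring one chunk per input per key."""
    result = []
    for key, groups in sorted(cogroup(*collections).items()):
        if any(len(group) != 1 for group in groups):
            raise ValueError(f"expected exactly one chunk per input for {key}")
        result.append((key, tuple(group[0] for group in groups)))
    return result


joined = zip_chunks(chunks_a, chunks_b)
```

In a real pipeline the grouping step is where the shuffle (and the extra write) happens; a single transform that reads matching chunks from both datasets could emit the `(key, (chunk1, chunk2))` elements directly.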

CC @alxmrs

alxmrs · 2022-12-15T02:45:01Z

I'm looking into this now.

I have a design question, though: what is the best PCollection interface, tuple[xbeam.Key, tuple[xarray.Dataset, xarray.Dataset]] or tuple[xbeam.Key, xarray.Dataset, xarray.Dataset]? I have a slight preference for the latter (and its implementation would not be so bad). It also seems fairly natural for operations like beam.MapTuple() (which can handle n-ary tuples) and beam.GroupByKey() (which will just use the first value in the tuple). It feels more zen to me, too. :)

WDYT?

alxmrs · 2022-12-15T03:29:03Z

A small update: the former does seem to offer a better typing story (python/typing#180), so I am now leaning that way.
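One way to read the typing argument (an interpretation, not something spelled out in this thread): the nested form can describe any number of datasets with a single annotation, while the flat form needs a separate annotation per arity. The sketch below uses stand-in `Key` and `Dataset` classes, which are assumptions for illustration; the real types would be xbeam.Key and xarray.Dataset.

```python
from dataclasses import dataclass


# Stand-in classes (assumptions for illustration only).
@dataclass(frozen=True)
class Key:
    offset: int


@dataclass(frozen=True)
class Dataset:
    name: str


# Nested form: one alias covers any number of datasets.
NestedElement = tuple[Key, tuple[Dataset, ...]]

# Flat form: this alias only describes exactly two datasets.
FlatPair = tuple[Key, Dataset, Dataset]


def dataset_names(element: NestedElement) -> list[str]:
    """Unpack a nested element; works for any number of datasets."""
    key, datasets = element
    return [f"{ds.name}@{key.offset}" for ds in datasets]


names = dataset_names((Key(0), (Dataset("model_a"), Dataset("model_b"))))
```

With the nested form, the same function signature stays valid whether the pipeline reads two datasets or ten.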

alxmrs added a commit to alxmrs/xarray-beam that referenced this issue Dec 15, 2022

Initial support for multiple datasets in DatasetToChunks

fcc8216

Here is an initial implementation of google#68.

alxmrs mentioned this issue Dec 15, 2022

Open multiple datasets at once from DatasetToChunks. #69

Merged

alxmrs added a commit to alxmrs/xarray-beam that referenced this issue Dec 22, 2022

Initial support for multiple datasets in DatasetToChunks

46e8eb9

Here is an initial implementation of google#68.

alxmrs added a commit to alxmrs/xarray-beam that referenced this issue Jan 25, 2023

Initial support for multiple datasets in DatasetToChunks

90aa3db

Here is an initial implementation of google#68.

alxmrs added a commit to alxmrs/xarray-beam that referenced this issue Jan 25, 2023

Initial support for multiple datasets in DatasetToChunks

515a899

Here is an initial implementation of google#68.
