open_mfdataset, function provided to combine_attrs has confusing behavior: multiple calls? separate last group of attributes? #6679

Open
@gg2

What is your issue?

I'm attempting to use the combine_attrs parameter on open_mfdataset with a function to generate and preserve a list of values from specific attributes on specific variables.

I use a library (act-atmos) that uses xarray as a dependency.
I use act-atmos' act.io.armfiles.read_netcdf method to read in the data from a list of files.
( https://github.com/ARM-DOE/ACT/blob/main/act/io/armfiles.py )
I provide to read_netcdf the function to use for the combine_attrs parameter.

I am fairly confident that act-atmos does nothing unexpected with the list of files or parameters.
It sets combine = 'by_coords' and use_cftime = True, and leaves combine_attrs untouched.
These parameters are passed to open_mfdataset via **kwargs, and the list of filenames is passed through untouched as well.

I see unexpected behavior in the combine_attrs function.
To test, I read in 3 netCDF files, each with records starting at 6:00am on one day and ending at 6:00am the next.
The resulting combined data spans 4 calendar days. Really I expect 3 days of data, but since each file runs past midnight to 6:00am the next day, it makes sense that I end up with 4. But the last, 4th day will always have no data, because the data starts during daylight hours.

# Using the combined xarray data returned by open_mfdataset:
print(len(xarray_data['time'].values))  # 1440 * 3
print(xarray_data['time'].values)
print(xarray_data['time'].coords)

4320
['2009-01-01T06:00:00.000000000' '2009-01-01T06:01:00.000000000'
'2009-01-01T06:02:00.000000000' ... '2009-01-04T05:57:00.000000000'
'2009-01-04T05:58:00.000000000' '2009-01-04T05:59:00.000000000']
Coordinates: time (time) datetime64[ns] 2009-01-01T06:00:00 ... 2009-01-04T05:59:00

So, I expect combine_attrs to receive a list of 3-4 sets of attributes to iterate through all at once.
Instead, it apparently gets called twice: once with a list of 3 sets of attributes, and a 2nd time with a list of 1 set of the attributes.

The last lone set does contain the combined attributes from the first call.
But given I was expecting to iterate through a single list of all attributes, I have to implement special logic to watch for that last set. If I don't, then the last set in the 2nd call obliterates the results of the 1st set of 3, and the result ends up looking similar to combine_attrs="override".
Without further experimentation, I don't know yet if this behavior will remain consistent for larger sets of files. I will experiment soon to see if it does remain consistent.
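
For reference, here is a minimal sketch of the kind of "special logic" I mean. It makes the function safe to call more than once by treating an already-combined list the same as a single raw value. The attribute name `comment`, the function name `collect_attr_values`, and the sample dicts are hypothetical, and the two direct calls at the bottom only imitate the 3-then-1 call pattern described above; xarray itself is not involved.

```python
ATTR_NAME = "comment"  # hypothetical attribute whose values should be preserved


def collect_attr_values(attrs_list, context=None):
    """Merge attrs dicts, gathering every ATTR_NAME value into one list.

    Safe to call repeatedly: a value that is already a list (produced by
    an earlier call) is flattened instead of being nested or overwritten.
    """
    merged = {}
    collected = []
    for attrs in attrs_list:
        for key, value in attrs.items():
            if key == ATTR_NAME:
                # A raw value becomes a 1-item list; an earlier result is reused as-is.
                values = value if isinstance(value, list) else [value]
                collected.extend(values)
            else:
                # For every other attribute, keep the first value seen.
                merged.setdefault(key, value)
    if collected:
        merged[ATTR_NAME] = collected
    return merged


# Imitate the observed pattern: one call with 3 dicts, then one call with 1 dict.
first = collect_attr_values(
    [
        {"comment": "a", "units": "K"},
        {"comment": "b", "units": "K"},
        {"comment": "c", "units": "K"},
    ]
)
second = collect_attr_values([first])
```

With this shape, the second call returns the same combined list instead of replacing it. In real use the function would be passed as `combine_attrs=collect_attr_values`.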

Is this behavior of combine_attrs expected? Why would it be set up to make 2 (or multiple) separate calls like that?
Is it somehow a side-effect of the files having times that overlap days? Is it somehow because the last dataset is essentially empty? If I end up with one combined set of data, I would expect one combined list of attributes to iterate through.
If this behavior is unexpected, I will provide more specifics about the data as requested, to keep this initial post shorter.


Also, as a side note, it would be nice to improve the documentation for combine_attrs.
The details about the parameter don't even show up here:
https://docs.xarray.dev/en/stable/generated/xarray.open_mfdataset.html

And the description of the callable signature in the source code could be clearer:

  • What is a "sequence of" attrs dicts? Is "sequence of" the reason I'm seeing multiple calls to combine_attrs?
  • What might the context object possibly contain? When or why might it vary?

With experimentation I can find some of these things out (e.g. context is None, in my case); but it would be nice if these were clearer up front.
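
The experimentation I mention amounts to something like the sketch below: a combine function that records how many attrs dicts it receives on each call and what `context` is. The names `calls` and `logging_combine_attrs` are my own, the "later dict wins" merge is just a placeholder, and I am assuming the second argument is passed under the name `context`. The two calls at the bottom are simulated by hand to mirror what I observed.

```python
calls = []  # one entry per invocation of the combine function


def logging_combine_attrs(attrs_list, context=None):
    """Record what each call receives, then merge with 'later dict wins'."""
    calls.append({"n_dicts": len(attrs_list), "context": context})
    merged = {}
    for attrs in attrs_list:
        merged.update(attrs)  # placeholder merge policy, not a recommendation
    return merged


# Simulated invocations mirroring the 3-then-1 pattern described in this issue.
step1 = logging_combine_attrs([{"a": 1}, {"a": 2}, {"a": 3}])
step2 = logging_combine_attrs([step1])

for i, call in enumerate(calls, start=1):
    print(f"call {i}: {call['n_dicts']} attrs dict(s), context={call['context']!r}")
```

Passing a function like this as `combine_attrs` to open_mfdataset and inspecting `calls` afterwards shows the number of calls and the size of each sequence for a given set of files.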
