
Rethinking fix_file #2129

Open
schlunma opened this issue Jul 11, 2023 · 9 comments · May be fixed by #2579
Labels
enhancement New feature or request

Comments

@schlunma
Contributor

Currently, fix_file has the call signature (filename) -> (filename). Thus, files need to be copied if they need to be modified (overwriting input files is a very bad idea), which is very expensive. Usually, the netCDF4 library is used to perform these kinds of modifications.
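For illustration, a minimal sketch of that copy-then-modify pattern (a hypothetical standalone fix, not the exact ESMValCore Fix API; the copy step is what makes this expensive for large files):

```python
import shutil

import netCDF4


def fix_file(filename, output_dir):
    """Copy the input file and patch the copy in place with netCDF4."""
    fixed_path = output_dir / filename.name  # pathlib.Path inputs assumed
    shutil.copy(filename, fixed_path)
    with netCDF4.Dataset(str(fixed_path), mode="a") as dataset:
        # Example in-place metadata fix; the variable name is a placeholder.
        dataset.variables["tas"].units = "K"
    return fixed_path
```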

A possible alternative that has often come up is xarray. At the moment, there is a method DataArray.to_iris() which allows converting a DataArray into a Cube (no idea how efficient this is, though). In addition, the iris devs are working on an improved xarray-iris bridge.

Thus, it would be very nice to allow efficient usage of xarray in fix_file. My question now is: how would that look in practice? Possible solutions I could think of are:

  • Allow fix_file to return xarray objects (currently only DataArray?) and load to read these kinds of objects.
  • Allow fix_file to return cubes and load to read cubes (see the sketch below).
  • Rewrite fix_file in such a way that it is a callback in load (with one of the call signatures above).
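A rough sketch of what the second option could look like (hypothetical signature and return type, not the current ESMValCore API):

```python
import iris.cube
import xarray as xr


def fix_file(filename):
    """Open the file with xarray and hand back Iris cubes instead of a path."""
    dataset = xr.open_dataset(filename, chunks={})  # keep the data lazy
    return iris.cube.CubeList(
        dataset[name].to_iris() for name in dataset.data_vars
    )
```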

The reason I am asking this now is that I want to get started with reading the ERA5 GRIB files that are available on Levante (see #1991). However, iris-grib cannot open them. On the other hand, xarray (in combination with ECMWF's cfgrib) works perfectly fine. Thus, an option to include xarray in our pipeline would be super helpful for me.
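For reference, this is roughly how such a file opens with xarray + cfgrib (the file name is a placeholder; cfgrib needs to be installed so that xarray can use it as an engine):

```python
import xarray as xr

dataset = xr.open_dataset("era5_example.grib", engine="cfgrib")
print(dataset.data_vars)
```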

@ESMValGroup/technical-lead-development-team opinions?

@bouweandela
Member

bouweandela commented Aug 1, 2023

A possible alternative that has often come up is xarray. At the moment, there is a method DataArray.to_iris() which allows converting a DataArray into a Cube (no idea how efficient this is, though).

The issue with the current implementation of xarray.DataArray.to_iris is that it loses information; if I remember correctly, some of the coordinates got lost the last time I tried to use it.

@bouweandela
Member

bouweandela commented Aug 1, 2023

Rewrite fix_file in such a way that it is a callback in load (with one of the call signatures above).

This seems the cleanest option, but I doubt whether the callback argument is flexible enough for all use cases. A limiting factor of the callback function is that you only have access to a single cube at a time.
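For reference, a minimal sketch of how an iris load callback works today, which illustrates the one-cube-at-a-time limitation (the fix itself is a made-up example):

```python
import iris


def callback(cube, field, filename):
    """Called once per loaded cube; there is no access to the other cubes."""
    if cube.var_name == "tas":  # hypothetical per-cube fix
        cube.units = "K"


cubes = iris.load_raw("example.nc", callback=callback)
```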

@bouweandela
Member

If we want to work with data that is not stored in files at some point in the future, it would be useful to either get rid of fix_file altogether, or make it return whatever input it got if it is not a file.

We could consider making load call the to_iris method of its first argument (file) if it has such a method and fall back to calling iris.load_raw if it doesn't have it? It would be really good to take into account the potential use of intake-esm and STAC to load data from the cloud at some point in the future. Does anyone have experience trying to do this through iris, without using xarray? Maybe by using fsspec?
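A hedged sketch of that duck-typing idea (names and structure are hypothetical, not the current ESMValCore load implementation):

```python
import iris
import iris.cube


def load(file, callback=None):
    if hasattr(file, "to_iris"):
        # E.g. an xarray.DataArray: convert in memory instead of reading a file.
        return iris.cube.CubeList([file.to_iris()])
    return iris.load_raw(file, callback=callback)
```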

@valeriupredoi
Contributor

AFAIK iris can't load file objects, e.g. something stored on an S3 storage, so our main problem is not fix_file but the I/O itself. If we get the I/O to work with object stores, then we can modify fix_file to work on such a thing, I reckon.

@valeriupredoi
Contributor

I have experience with netCDF files on various types of storage, including object stores, and so does @zklaus (am fairly sure of that). Iris is the type of thing you don't want in that case; you want a basic loader into a file format like Zarr or overloaded Zarr. Beware of xarray, since that transfers the data!

@schlunma
Contributor Author

schlunma commented Aug 1, 2023

The issue with the current implementation of xarray.DataArray.to_iris is that it loses information; if I remember correctly, some of the coordinates got lost the last time I tried to use it.

I have tested this a lot lately for the ERA5 GRIB files, and in that case I didn't find any problems with this function. The laziness of the data and all relevant metadata are properly preserved.

This seems the cleanest option, but I doubt whether the callback argument is flexible enough for all use cases. A limiting factor of the callback function is that you only have access to a single cube at a time.

Conceptually I agree, but our current callback argument (which is simply passed to iris.load_raw) is not suited for that. It would rather be something like a load_func option that defaults to iris.load_raw. However, to ensure backwards compatibility, we need to allow fix_file to return paths... this will not really make it cleaner.
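A small sketch of that load_func idea (hypothetical keyword and loader, just to make the suggestion concrete):

```python
import iris
import iris.cube
import xarray as xr


def load_grib(filename):
    """Example alternative loader based on xarray + cfgrib."""
    dataset = xr.open_dataset(filename, engine="cfgrib")
    return iris.cube.CubeList(
        dataset[name].to_iris() for name in dataset.data_vars
    )


def load(file, load_func=iris.load_raw):
    """Read ``file`` into cubes with a configurable loader."""
    return load_func(file)
```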

If we want to work with data that is not stored in files at some point in the future, it would be useful to either get rid of fix_file altogether, or make it return whatever input it got if it is not a file.

I don't think we should get rid of fix_file entirely (we cannot expect Iris to handle every single format natively). fix_file already returns what it gets if no special fix is needed. I added some type hints for that in #2160, but these can be changed to Any.

We could consider making load call the to_iris method of its first argument (file) if it has such a method and fall back to calling iris.load_raw if it doesn't have it? It would be really good to take into account the potential use of intake-esm and STAC to load data from the cloud at some point in the future. Does anyone have experience trying to do this through iris, without using xarray? Maybe by using fsspec?

In #2160, I allowed load to handle CubeList objects (they are simply returned). In my opinion this is more flexible than relying on xarray's to_iris (we are not sure if this works in every case), and the magic xarray -> iris conversion can happen in fix_file. Support for other formats can be added to load in subsequent PRs.
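A hedged sketch of what that CubeList pass-through in load could look like (not the exact code from #2160):

```python
import iris
import iris.cube


def load(file, callback=None):
    if isinstance(file, iris.cube.CubeList):
        return file  # fix_file already produced cubes, nothing left to read
    return iris.load_raw(file, callback=callback)
```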

@schlunma
Contributor Author

Last week I found that reading netCDF files with lots of variables (like we have for many native models like ICON, EMAC, etc.) can be very slow with iris (it can take hours depending on the setup). See more details here: SciTools/iris#6223. I found that ncdata can really help in reducing this time (~40% faster by just using xarray -> iris instead of iris directly, and much more if we extract individual variables with xarray).
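A rough sketch of that xarray -> iris route, assuming ncdata's iris_xarray bridge functions (the file name is a placeholder):

```python
import xarray as xr
from ncdata.iris_xarray import cubes_from_xarray

# Open lazily with xarray, then convert to an iris CubeList via ncdata.
dataset = xr.open_dataset("icon_output.nc", chunks={})
cubes = cubes_from_xarray(dataset)
```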

Thus, I'd really like to implement the solution "Allow fix_file to return cubes and load to read cubes" mentioned above. I saw that this is already done in a draft PR here. @bouweandela what are your plans regarding this PR? If that's okay with you, I could extract the general parts of that PR that affect all datasets and open a new PR based on them.

With the more flexible fix_file, we can then put code that extracts variables into fix_file, which, in my ICON example, will speed up data loading by more than a factor of 10.
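A sketch of that variable-extraction idea for a hypothetical fix_file that returns cubes (the extra argument and the use of ncdata are assumptions; extracting only the needed variable avoids building cubes for everything else in the file):

```python
import xarray as xr
from ncdata.iris_xarray import cubes_from_xarray


def fix_file(filename, short_name):
    """Open lazily, keep only the requested variable, and return cubes."""
    dataset = xr.open_dataset(filename, chunks={})
    subset = dataset[[short_name]]  # coordinates are kept automatically
    return cubes_from_xarray(subset)
```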

@bouweandela
Member

@bouweandela what are your plans regarding #2454?

While working on it, I realized that the separation between fix_file/fix_metadata/fix_data no longer has a purpose now that we're working with ncdata and dask arrays. It would be much simpler to have a single fix function that takes care of all three operations; that's why the pull request is still in draft. Also, loading the data with xarray or ncdata leads to very different Dask chunks, so I was planning to investigate if that adversely affects performance.
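A very rough sketch of what such a single fix interface might look like (purely hypothetical, just to make the idea concrete):

```python
import iris.cube


class SingleFix:
    """One fix covering what fix_file/fix_metadata/fix_data do today."""

    def fix(self, cubes: iris.cube.CubeList) -> iris.cube.CubeList:
        for cube in cubes:
            if cube.var_name == "tas":  # example metadata fix
                cube.units = "K"
        return cubes
```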

@schlunma
Contributor Author

It would be much simpler to have a single fix function that takes care of all three operations

I really like this idea!

However, I don't think this can be easily implemented in practice. Currently, the existing fix functions are not called directly one after the other. The order is rather

fix_file --> load --> fix_metadata --> concatenate --> cmor_check_metadata --> clip_timerange --> fix_data --> cmor_check_data

How and where in this chain would you implement this fix in practice? We could of course re-order these operations, but are we sure that this won't mess up (some of) the existing fixes? If clip_timerange is at the end of the pipeline, is Dask "smart" enough to discard the unused chunks in earlier computations? How do we make this backwards-compatible?
