
Loading a netCDF file with multiple variables is very slow #6223

Open
schlunma opened this issue Nov 8, 2024 · 6 comments

@schlunma
Contributor

schlunma commented Nov 8, 2024

📰 Custom Issue

Hi! While evaluating a large number of files with multiple variables each I noticed that ESMValTool is much slower when files contain a lot of variables. I could trace that back to Iris' load function. Here is an example of loading files with 1 and 61 variables:

import iris

one_path = "data/one_cube.nc"  # file with 1 variable
multi_path = "data/multiple_cubes.nc"  # file with 61 variables

%%timeit
iris.load(one_path)  # 13.2 ms ± 136 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
iris.load(multi_path)  # 673 ms ± 984 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
constraint = iris.Constraint("zonal stress from subgrid scale orographic drag")
iris.load(multi_path, constraint)  # 611 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

As you can see, loading the file with 61 variables takes ~51 times as long as loading the file with 1 variable. Using a constraint does not help.

Doing the same with xarray gives:

import xarray as xr

one_path = "data/one_cube.nc"  # file with 1 variable
multi_path = "data/multiple_cubes.nc"  # file with 61 variables

%%timeit
xr.open_dataset(one_path, chunks='auto')  # 7.75 ms ± 164 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
xr.open_dataset(multi_path, chunks='auto')  # 54.6 ms ± 241 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Here, the difference between 1 and 61 variables is only a factor of ~7.

If only a single file needs to be loaded, this is not a problem, but this quickly adds up to a lot of time if 100s or even 1000s of files need to be read (which can be the case for climate models that write one file with many variables per time step).

Have you ever encountered this problem? Are there any tricks to make loading faster? As mentioned, I tried with a constraint, but that didn't work.

Thanks for your help!

Sample data:

@trexfeathers
Contributor

Have you ever encountered this problem?

Yes. But this is a downside of deliberate choices in Iris' data model. None of the dimensional metadata (coordinates, cell measures, etcetera) is shared between Cubes, so in your example there are 61 copies of each. This makes each Cube an entirely independent entity, allowing different workflows to be written compared to libraries such as Xarray, where each variable belongs to a larger Dataset. But it does make many-variable files difficult to work with.
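
A rough sketch of how to see that independence in practice (assuming the first two cubes loaded from the multi-variable file above share a coordinate of the same name):

import iris

cubes = iris.load(multi_path)  # multi_path: the 61-variable file from above

# Pick any coordinate of the first cube and look it up on the second cube.
# Equal coordinates are still distinct objects: one copy is built per cube.
name = cubes[0].coords()[0].name()
coord_a = cubes[0].coord(name)
coord_b = cubes[1].coord(name)
print(coord_a == coord_b)  # True: same metadata and values
print(coord_a is coord_b)  # False: independent copies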

Are there any tricks to make loading faster?

We have tried. We implemented #5229 for truly absurd cases where tiny files were taking a long time to load. And we have a benchmark to make sure it doesn't get even worse.

There are ongoing discussions about opt-in sharing in some form (e.g. #3172), but we have nothing concrete at the moment.

I tried with a constraint, but that didn't work.

This is presumably because the constraint gets applied after the Cube has been generated. Could you try including ncdata as a pre-loading step?

@schlunma
Contributor Author

Thanks for your insight @trexfeathers, that all makes sense!

Could you try including ncdata as a pre-loading step?

This reduces the runtime by more than 35%!

# Note: this example is run on another machine;
# that's why the numbers differ from those given in the issue description
import ncdata.iris_xarray

%%timeit
iris.load(multi_path)  # 362 ms ± 18.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
ds = xr.open_dataset(multi_path, chunks='auto')
ncdata.iris_xarray.cubes_from_xarray(ds)  # 224 ms ± 4.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@trexfeathers
Contributor

@schlunma good to hear about the speedup. I was actually imagining modifying the NetCDF dataset to remove the variables you are not interested in, rather than going via Xarray. You might get even more speedups that way.

@schlunma
Contributor Author

You're right, extracting the variable in xarray and then using ncdata is almost 10x faster than loading the cube with a constraint:

%%timeit
ds = xr.open_dataset(multi_path, chunks='auto')[["tauu_sso", "clat_bnds", "clon_bnds"]]
ncdata.iris_xarray.cubes_from_xarray(ds)  # 35.6 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

I didn't know how to extract variables from an NcData object.

What I also found is that by bypassing xarray and only using ncdata, the load times are much worse:

import ncdata.iris
import ncdata.netcdf4

%%timeit
ncd = ncdata.netcdf4.from_nc4(multi_path)
ncdata.iris.to_iris(ncd)  # 643 ms ± 23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This is almost twice as long as using iris.load directly.

@pp-mo
Member

pp-mo commented Nov 11, 2024

I didn't know how to extract a variables from an NcData object.

Thanks for looking!
The problem, I think, is that you want to select 3 particular "data variables", which you know the variable names of, but you must also work out what other (non-data) variables are needed.
Unfortunately that isn't easy, because it means interpreting the data in CF terms, which is frustratingly non-trivial.

What you can do fairly easily is to remove unwanted variables, using code like del ncdata.variables[varname] or ncdata.variables.pop(varname), though this obviously relies on you identifying what to remove.
(And I am working on better docs, honest 😉)
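
For what it's worth, a minimal sketch of that pruning approach before converting to cubes, using the variable names from earlier in this thread purely as an example (in general you would also have to keep whatever other variables the data variable references via CF attributes, which is the non-trivial part above):

import ncdata.iris
import ncdata.netcdf4

ncd = ncdata.netcdf4.from_nc4(multi_path)

# Variables to keep: the wanted data variable plus its known bounds variables.
keep = {"tauu_sso", "clat_bnds", "clon_bnds"}
# Dimension-coordinate variables share their dimension's name, so keep those too.
keep |= set(ncd.dimensions)

# Drop everything else before handing the dataset to Iris.
for varname in list(ncd.variables):
    if varname not in keep:
        del ncd.variables[varname]

cubes = ncdata.iris.to_iris(ncd)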


What I also found is that by bypassing xarray and only using ncdata, the load times are much worse:

So, I presume in that case you are back to loading all the variables again?
Even so, I'm not clear why initial loading should be substantially slower than the direct load.
Maybe it is just a "more code layers" thing, or it could be interesting to see if it is possibly due to different chunking: Ncdata doesn't have the more intelligent chunking schemes built into Iris (and Xarray, I think), so for large files the dask "auto" decisions could be more costly. Dask can be slow when managing large task graphs (hundreds of tasks).
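
If it helps to check the chunking idea, a rough sketch along these lines (using the long name from the first post, and assuming to_iris returns a CubeList) would show whether the two routes end up with different dask chunks:

import iris
import ncdata.iris
import ncdata.netcdf4

name = "zonal stress from subgrid scale orographic drag"

# Chunking of the lazy data when loading directly with Iris ...
cube_direct = iris.load_cube(multi_path, name)
print(cube_direct.lazy_data().chunks)

# ... versus when going through ncdata.
ncd = ncdata.netcdf4.from_nc4(multi_path)
cube_via_ncdata = ncdata.iris.to_iris(ncd).extract_cube(name)
print(cube_via_ncdata.lazy_data().chunks)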

So, I think Xarray is helping here because it analyses the file and grabs the 'other' variables as coords, without making a big deal of it.
Whereas Iris proceeds to create complex objects for everything, so I'm imagining that the slow part is maybe the CF interpretation, or more likely just building all the cubes + coords.


FWIW Iris can also skip building cubes for unwanted data variables, but only in the rather limited case where a single NameConstraint is provided, which matches just one data variable. See here, and the call to it here.
Unfortunately this "shortcut" approach is rather obscure + limited, and remains largely undocumented and unused AFAIK. The approach is limited by the opaque nature of Constraint objects, and it doesn't extend to an efficient way of selecting 'N' data-variables, since each requires its own 'load' operation.
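
For reference, that limited case looks something like this (a sketch only; tauu_sso is just the example variable from above, and whether the shortcut actually kicks in may depend on the Iris version):

import iris

# A single NameConstraint matching one data variable by var_name: the case in
# which Iris can skip building cubes for the other data variables.
cube = iris.load_cube(multi_path, iris.NameConstraint(var_name="tauu_sso"))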

However, if this would be of practical use, we could possibly revisit that approach + extend the cases it can handle?

It would certainly make sense to be able to say something like iris.load_cubes(file, ['var1', 'var2', 'var3']) and have it ignore all the other data variables. But it cannot do this at present.

@schlunma
Contributor Author

Thanks for all the details @pp-mo, this really helps a lot to understand what's going on here.

The problem, I think, is that you want to select 3 particular "data variables", which you know the variable names of, but you must also work out what other (non-data) variables are needed. Unfortunately that isn't easy, because it means interpreting the data in CF terms, which is frustratingly non-trivial.

Yes, I already tried that and fully agree!

Even so, I'm not clear why initial loading should be substantially slower than the direct load.

I was just surprised that using xarray as an additional layer is faster than not using it. From what I understand, this effectively does xarray.Dataset -> NcData -> Cubes. So it's a bit non-intuitive that NcData -> Cubes is slower than that.

However, if this would be of practical use, we could possibly revisit that approach + extend the cases it can handle ?

What we currently do in ESMValTool is load all cubes without any constraint and then extract one variable after some preprocessing. For data that contain just one variable (like CMIP model data) this is trivial (and here, loading all cubes is as fast as loading just the one variable).

This really only becomes a problem for "raw" climate model data where 10s or even 100s of variables are stored in one netCDF file. Here, the aforementioned preprocessing extracts the desired variable, but this is very slow since we load ALL variables initially. For some cases, we even need to extract more than one variable (to derive another quantity), so the workaround you mentioned above would not help in general.

So yes, being able to do something like iris.load_cubes(file, ['var1', 'var2', 'var3']) would really help. In the meantime, I would like to load these data with xarray or ncdata first, extract all the necessary variables, and then use ncdata to convert to iris cubes (see ESMValGroup/ESMValCore#2129 (comment)).

It's really great that we can do that now with ncdata! Thanks for all your work on that!!
