@Mikejmnez
Contributor

@Mikejmnez Mikejmnez commented Aug 12, 2025

With this PR, the following is true:

import xarray as xr
from requests_cache import CachedSession
session=CachedSession(cache_name='debug')
session.cache.clear()

dap4urls = ["dap4://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc", 
            "dap4://test.opendap.org/opendap/hyrax/data/nc/coads_climatology2.nc"]

ds = xr.open_mfdataset(dap4urls, engine='pydap', session=session, concat_dim='TIME', parallel=True, combine='nested', decode_times=False)

session.cache.urls()
>>>['http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc.dap?dap4.ce=COADSX%5B0%3A1%3A179%5D%3BCOADSY%5B0%3A1%3A89%5D%3BTIME%5B0%3A1%3A11%5D&dap4.checksum=true',
 'http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc.dmr',
 'http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology2.nc.dap?dap4.ce=COADSX%5B0%3A1%3A179%5D%3BCOADSY%5B0%3A1%3A89%5D%3BTIME%5B0%3A1%3A11%5D&dap4.checksum=true',
 'http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology2.nc.dmr']

So in DAP4 the dimensions are always batched, i.e. downloaded together in a single request per file.
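To check that one request really carries all three dimensions, the cached `.dap` URL listed above can be decoded with the standard library. This is only a minimal sketch: the URL is copied from the cache output, while the variable names (`url`, `query`, `dims`) are illustrative.

```python
from urllib.parse import urlsplit, parse_qs

# URL copied from the session.cache.urls() output above
url = (
    "http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc.dap"
    "?dap4.ce=COADSX%5B0%3A1%3A179%5D%3BCOADSY%5B0%3A1%3A89%5D%3BTIME%5B0%3A1%3A11%5D"
    "&dap4.checksum=true"
)

# parse_qs splits the query string and percent-decodes the values
query = parse_qs(urlsplit(url).query)

# The constraint expression lists one slice per dimension, separated by ';'
dims = query["dap4.ce"][0].split(";")
print(dims)  # ['COADSX[0:1:179]', 'COADSY[0:1:89]', 'TIME[0:1:11]']
```

The three entries in the constraint expression correspond to the three dimensions fetched in that single request.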

In addition, to preserve backwards-compatible behavior, I added a backend argument batch=True | False. When batch=True, all non-dimension arrays can be downloaded in the same response (ideal when streaming data to store locally).
When batch=False, which is the default, each non-dimension array is downloaded with its own HTTP request, as before. This is ideal in many scenarios when performing data exploration.

cache_session=CachedSession(cache_name='debug')

ds = xr.open_mfdataset(dap4urls, engine='pydap', session=cache_session, parallel=True, combine='nested', concat_dim="TIME", decode_times=False, batch=True)

len(cache_session.cache.urls())
>>> 4 # 1 dmr and 1 dap per file (2 files)

# triggers all non-dimension data to be downloaded in a single http request
ds.load()

len(cache_session.cache.urls())
>>> 6 # the previous 4, plus one extra request per file

When batch=False (the default), the last step (ds.load()) triggers individual downloads, one per non-dimension array.
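The request counts above can be reproduced with a back-of-the-envelope model. The sketch below is not pydap code: `expected_requests` is a hypothetical helper, it ignores any caching or retries, and the variable count of 4 used in the second call is a made-up figure for illustration.

```python
def expected_requests(n_files, n_dims, n_vars, batch, dims_batched=True):
    """Rough number of HTTP requests after opening and loading n_files.

    Hypothetical model: one .dmr (metadata) request per file, then either
    one batched .dap request or one request per array.
    """
    metadata = 1                                   # one .dmr per file
    dim_requests = 1 if dims_batched else n_dims   # dimensions
    var_requests = 1 if batch else n_vars          # non-dimension arrays
    return n_files * (metadata + dim_requests + var_requests)


# batch=True, 2 files: matches the 6 cached URLs reported above
print(expected_requests(n_files=2, n_dims=3, n_vars=4, batch=True))   # 6

# batch=False with an assumed 4 variables per file
print(expected_requests(n_files=2, n_dims=3, n_vars=4, batch=False))  # 12
```

Under this model the gap between the two modes grows linearly with the number of variables per file and the number of files.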

These changes allow a more performant download experience with xarray+pydap. However, most of these changes depend on a yet-to-be-released version of pydap (3.5.6). I want to check that things go smoothly here before making a new release, i.e. perhaps I will need to make a change to the backend base code. Update: pydap 3.5.6 has been released!

@github-actions github-actions bot added topic-backends CI Continuous Integration tools dependencies Pull requests that update a dependency file io labels Aug 12, 2025
@Mikejmnez Mikejmnez changed the title from "Pydap4 scale" to "[pydap backend] enables downloading/processing multiple arrays within single http request" Aug 12, 2025
@Mikejmnez Mikejmnez marked this pull request as ready for review August 13, 2025 07:11
@Mikejmnez
Contributor Author

Mikejmnez commented Aug 13, 2025

hmm - the test I see that fails (sporadically) concerns the following assertion:

Differing data variables:
L   group_1_var  (lon, lat) float64 16B ...
R   group_1_var  (lat, lon) float64 16B ...

where the groups have reverse ordering in the way dimensions show up ((lat,lon) vs (lon,lat)). Not sure if this is a pydap/PydapDataStore issue. I am imposing sorted into the get_dimensions method of the PydapDataStore. The local test ran fine (so nothing broke), but again this failing test did not show up on my testing...

Member

@shoyer shoyer left a comment

Thanks @Mikejmnez !

@shoyer
Member

shoyer commented Aug 18, 2025

hmm - the test I see that fails (sporadically) concerns the following assertion:

Differing data variables:
L   group_1_var  (lon, lat) float64 16B ...
R   group_1_var  (lat, lon) float64 16B ...

where the groups have reverse ordering in the way dimensions show up ((lat,lon) vs (lon,lat)). Not sure if this is a pydap/PydapDataStore issue. I am imposing sorted into the get_dimensions method of the PydapDataStore. The local test ran fine (so nothing broke), but again this failing test did not show up on my testing...

This is a little concerning! Not sure how this could be a bug on the Xarray side, unless we're using the wrong API for getting variable dimensions from Pydap.

I'm seeing the same error over here:
#10649

Not quite sure what to make of this, but seems to be a separate bug.

@Mikejmnez
Contributor Author

Mikejmnez commented Aug 18, 2025

Thanks @shoyer ! I am participating all week in a hackathon, but I will try to check and address your comments as fast as I can :)

@Mikejmnez
Contributor Author

Mikejmnez commented Sep 19, 2025

@shoyer I had a second go at this finally. Moved much of the logic to the backend.

Here is the current state of things:

  • This PR installs pydap from source. Why? I want to leave the door open for changes to the pydap backend that may arise from this PR, and include them in the new pydap release. Only when there is a general feeling that this PR is ready to be merged will I make a pydap release and revert to installing pydap from conda. More comments and requests for changes on this PR are welcome!
  • The failing test is unrelated to this PR. But I think I found the potential culprit in the dap4 metadata parser in pydap. I will spend today working on that. This needs to be fixed asap.

@Mikejmnez
Contributor Author

Mikejmnez commented Sep 26, 2025

@shoyer This is ready for further reviewing.

Pydap has a new release that fixes a bug in the backend xml parser. I think some additional work may be needed in the next couple of weeks, but that is unrelated to this PR anyway...

I did not know what to make of the Mypy failures, but these also fail on the main branch. Fixed in #10792

@Mikejmnez Mikejmnez force-pushed the pydap4_scale branch 2 times, most recently from aac3163 to 4b516b4 Compare September 30, 2025 20:38
@Mikejmnez Mikejmnez requested a review from shoyer September 30, 2025 20:41
@Mikejmnez
Contributor Author

Mikejmnez commented Sep 30, 2025

@shoyer Let me know if there is any feedback, concerns, further reviewing, etc.

This PR enables a new (non-default) feature that was added to the pydap backend over the span of several months, namely the ability to download multiple variables within a single request, according to the opendap spec. Without this feature, each variable is downloaded separately, which does not take advantage of the opendap protocol and can make pydap unusable when each remote file has more than ~2-3 variables and there are more than ~10 URLs to consolidate (for example via mds = xr.open_mfdataset(...) and then mds.to_zarr(...) or something).

This PR also makes it so that when accessing via the dap4 protocol, all dimensions are always downloaded within a single request by default. This is more performant than downloading each dimension with a separate request, and it again improves performance when "only opening" multiple remote files.
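As an illustration of what "all dimensions within a single request" means at the URL level, the sketch below builds one DAP4 data URL for several arrays and, for contrast, one URL per array. It only mimics the URL shape seen in the cache listing earlier in this thread; `dap4_data_url` is a hypothetical helper and is not how pydap constructs its requests internally.

```python
from urllib.parse import quote


def dap4_data_url(base_url, slices, checksum=True):
    """Build a .dap URL whose constraint expression covers every array in `slices`.

    `slices` maps an array name to (start, step, stop), all inclusive indices.
    """
    ce = ";".join(
        f"{name}[{start}:{step}:{stop}]"
        for name, (start, step, stop) in slices.items()
    )
    url = f"{base_url}.dap?dap4.ce={quote(ce, safe='')}"
    if checksum:
        url += "&dap4.checksum=true"
    return url


base = "http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc"
dims = {"COADSX": (0, 1, 179), "COADSY": (0, 1, 89), "TIME": (0, 1, 11)}

# One request for all dimensions
batched = dap4_data_url(base, dims)

# One request per dimension, for comparison
separate = [dap4_data_url(base, {name: sl}) for name, sl in dims.items()]

print(batched)
print(len(separate))  # 3
```

The batched URL reproduces the first entry of the cache listing, so N dimensions cost one round trip instead of N.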

~~~~~~~~~~~~

- Improved ``pydap`` backend behavior and performance when using :py:func:`open_dataset`, :py:func:`open_datatree` when downloading dap4 (opendap) data (:issue:`10628`, :pull:`10629`).
``batch=True|False`` is a new ``backend_kwarg`` that further enables downloading multiple arrays in single response. In addition ``checksums`` is added as optional argument to be passed to ``pydap`` backend.
Member

Can you help me understand -- why would a user not want to enable batch mode if they are using a new enough version of pydap?

Contributor Author

@Mikejmnez Mikejmnez Oct 7, 2025

That is a good question. The roadmap for pydap is for batch=True always for dap4, so a dap4 URL would automatically do this batch=True thingy. But right now, batch=True and batch=False follow slightly different pathways internally within pydap (some of it very old, as you may know). In the roadmap for pydap, there is definitely a major refactoring in sight.

FYI: even with batch=False (the default), dimension data is always batched (downloaded at once) when protocol=dap4, as long as a "new enough" version of pydap (>=3.5.6) is installed (see new line 177 on backends.pydap_). I think this is the first step toward making dap4 <--> batch=True in the future.
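The condition described here (dap4 protocol plus pydap >= 3.5.6) can be written out as a small predicate. This is a sketch for illustration only, not the code in the backend: `dims_always_batched` and `version_tuple` are hypothetical names, and the parser only handles plain release strings such as "3.5.6".

```python
def version_tuple(version):
    """Turn a plain release string like '3.5.6' into (3, 5, 6)."""
    return tuple(int(part) for part in version.split(".")[:3])


def dims_always_batched(protocol, pydap_version):
    """True when dimension data would be fetched in one request, per the rule above."""
    return protocol == "dap4" and version_tuple(pydap_version) >= (3, 5, 6)


print(dims_always_batched("dap4", "3.5.6"))  # True
print(dims_always_batched("dap4", "3.5.5"))  # False
print(dims_always_batched("dap2", "3.5.6"))  # False
```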

But enabling batch=True for streaming/downloading data that is being subset by Xarray (e.g. ds.isel(lat=slice1, lon=slice2).to_netcdf) can rapidly become complex (in particular in the presence of hierarchies). I have done plenty of testing (with both hierarchical data and data on staggered grids), and while things work well so far, "soft launching" this feature on xarray makes the most sense to me. It seems safer, at least.

@Mikejmnez
Contributor Author

@shoyer any further comments?

I'd be happy if at least some of the features within this PR are incorporated, especially the feature of always downloading all dimensions at once (i.e. a single dap url for all N dims instead of N dap urls for N dims) when dap4 is the protocol. That would make a significant performance difference. In that simple scenario, batch is no longer needed as an extra argument, so no extra logic is required from the user to get the performance gains.

In the general case (which this PR enables), the user needs to specify batch=True explicitly, which keeps the default "safe". An "unsafe" case is when the remote file is a virtual aggregation of many (nc) files, often with an ncml extension. In that scenario you want to download individual variables with individual dap requests...
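One way a user might encode that rule of thumb when choosing the batch argument is sketched below. This is purely a heuristic based on the comment above: `suggest_batch` is a hypothetical helper, the example.org URL is made up, and a file extension is not a reliable way to detect every virtual aggregation.

```python
from urllib.parse import urlsplit


def suggest_batch(url):
    """Suggest batch=False for URLs that look like ncml aggregations, else True."""
    path = urlsplit(url).path.lower()
    return not path.endswith(".ncml")


# A plain netCDF file served over dap4 (URL from earlier in this thread)
print(suggest_batch("dap4://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc"))  # True

# A made-up aggregation URL
print(suggest_batch("dap4://example.org/opendap/aggregations/collection.ncml"))  # False
```

The result could then be passed as the batch backend argument when opening the dataset.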

Development

Successfully merging this pull request may close these issues.

make pydap backend more opendap-like by downloading multiple variables in same http request
