Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

views of variables don't use the DiskArrays interface #274

Open
tiemvanderdeure opened this issue Feb 6, 2025 · 7 comments
Open

views of variables don't use the DiskArrays interface #274

tiemvanderdeure opened this issue Feb 6, 2025 · 7 comments

Comments

@tiemvanderdeure
Copy link

Calling view on a Variable returns a SubVariable from CommonDataModel, which doesn't implement the DiskArray interface.

This unfortunately means that any chunked operation (such as lazy raster operations) are extremely slow, as discussed in rafaqz/Rasters.jl#889

For example:

using Rasters, NCDatasets
ras = Rasters.create("ras1.nc", Float64, (X(1:100), Y(1:100)); force = true, chunks = (10, 10))
A = read(ras)
open(ras; write = true) do dest
    @time dest .= A
    @time view(dest, :,:) .= A
    @time dest .= dest
    return
end
  0.002006 seconds (10.72 k allocations: 560.047 KiB)
  0.237490 seconds (1.14 M allocations: 41.352 MiB, 3.64% gc time)
  0.228919 seconds (1.15 M allocations: 41.738 MiB, 3.57% gc time)

The last operation here is slow because we are copying a DiskArray to a DiskArray, which happens chunk by chunk, so view is called internally. So clearly this is not great.

Two possible way forward are to implement (parts of) the DiskArray interface for SubVariable, or to return a SubDiskArray from view on Variable. Arguable NCDatasets violates the DiskArray interface here.

@tiemvanderdeure
Copy link
Author

Just to demonstrate one possible fix: removing these view methods just totally fix the problem
After

Base.delete_method.(methods(view, (NCDatasets.CommonDataModel.AbstractVariable, Colon)))

The very same code as above gives

  0.002933 seconds (10.72 k allocations: 560.047 KiB)
  0.004322 seconds (21.53 k allocations: 982.578 KiB)
  0.008718 seconds (33.40 k allocations: 1.483 MiB)

@tiemvanderdeure
Copy link
Author

I'll just add an example with NCDatasets only, might be easier to see what is going on:

using NCDatasets
NCDataset("test_file.nc","c") do ds
    defVar(ds,"temp",rand(100,100),("lon","lat");
        chunksizes = [10,10]
    )
    @time variable(ds, "temp") .+= 1
end;

Before deleting the view methods: 0.781584 seconds (1.15 M allocations: 41.739 MiB, 1.31% gc time)
After deleting the view methods: 0.007317 seconds (33.43 k allocations: 1.484 MiB)

@tiemvanderdeure
Copy link
Author

@Alexander-Barth, would you like to share your thoughts here?

@Alexander-Barth
Copy link
Member

Personally, I do not make element-wise operations on types of NCDatasets. But a better integration with DiskArray would be very desirable but it is not that straightforward to implement.

views of CommonDataModel have also a special field attrib (with the NetCDF attributes). If attributes disappears we would break the documented API.

All the code is in CommonDataModel, maybe it is better to have the discussion there?
I think this issue from @rafaqz describes a similar (same?) problem:
JuliaGeo/CommonDataModel.jl#21

@rafaqz
Copy link
Member

rafaqz commented Mar 4, 2025

Needing a getproperties field is a real problem for shared APIs, and not really possible with DiskArrays.jl.

But you can easily make an attrib function get that metadata from the inner Variable? Why not switch to a function based interface?

Otherwise this will be another ecosystem integration blocker. In Rasters we have to wrap your variables to prevent this from happening.

(Also just realised a function based interface can get attribs after reshape or permutedims or whatever - a getproperty interface has the same problem with all of them)

@Alexander-Barth
Copy link
Member

A function based interface would be quite a massive breaking change. I like the ability to modify attributes as it were a dictionary. Maybe an approach would be that a view of a CFVariable (wrapping a DiskArray) should be also a CFVariable (wrapping the corresponding DiskArray view)?

@rafaqz
Copy link
Member

rafaqz commented Mar 11, 2025

CFVariable is not a DiskArray... I'm always confused by this conversation, because wrapping a DiskArray essentially makes it useless, broadcast and many other base methods wont be chunked - as happens with the SubVariable in this issue. We need an object that is a disk array and always stays a disk array. (Or alternatively implement broadcast and everything like DimensionalData.jl does so broadcast and other methods are properly forwarded to the inner array, and rewrap around view etc rather than having your own view object)

As with the old setfield! discussion, the syntax is nice but the outcome is many base methods still dont work (or are 100x slower) for your users, and its lots of work for me, @tiemvanderdeure and others to work around them.

If you switch to just using attrib(var) these fixes would be much easier. If not, yes it will hard. It means your wrappers always need to be outside for getproperty to work, but that breaks chunked broadcast and other dispatch that needs a disk array as the outside wrapper. So, an impasse.

For now we will just wrap and hide CDM variables in other geo packages, and make sure chunk handling works on top of them. See: rafaqz/Rasters.jl#892. We just really need disk arrays chunking to always work.

Or maybe I misunderstood... do you mean you keep the getproperty/setproperty interface for CFVariable only, and make Variable have to use attrib(var) (which can then get the atrib through a DiskArray)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants