Taking the diagonal of each 2-D array along a third dimension? Current approach uses deprecated groupby-squeeze #9173
-
I have a DataArray of three dimensions. The approach I came up with involves doing:
>>> fits
<xarray.Dataset> Size: 488B
Dimensions: (location_id: 4, parameter: 3, cov_i: 3, cov_j: 3)
Coordinates:
* parameter (parameter) <U2 24B 'b0' 'b1' 'b2'
* cov_i (cov_i) <U2 24B 'b0' 'b1' 'b2'
* cov_j (cov_j) <U2 24B 'b0' 'b1' 'b2'
* location_id (location_id) int64 32B 3068918 3071502 ... 3114697
Data variables:
curvefit_coefficients (location_id, parameter) float64 96B -7.477 ... -0...
curvefit_covariance (location_id, cov_i, cov_j) float64 288B 3.793e-05...
>>> fits.curvefit_covariance
<xarray.DataArray 'curvefit_covariance' (location_id: 4, cov_i: 3, cov_j: 3)> Size: 288B
array([[[ 3.79298999e-05, -1.87955006e-06, 6.07566477e-10],
[-1.87955006e-06, 9.85906372e-07, -4.07836737e-08],
[ 6.07566477e-10, -4.07836737e-08, 2.00518116e-09]],
[[ 3.82335861e-05, -1.81900026e-06, -3.32938668e-09],
[-1.81900026e-06, 9.83693989e-07, -4.09497824e-08],
[-3.32938668e-09, -4.09497824e-08, 2.03417609e-09]],
[[ 3.56544048e-05, -2.02137272e-06, 1.32904069e-08],
[-2.02137272e-06, 9.45842659e-07, -3.82653117e-08],
[ 1.32904069e-08, -3.82653117e-08, 1.82177935e-09]],
[[ 2.89862030e-05, -1.30506244e-06, -6.21364362e-09],
[-1.30506244e-06, 7.39687867e-07, -3.09647204e-08],
[-6.21364362e-09, -3.09647204e-08, 1.55403796e-09]]])
Coordinates:
* cov_i (cov_i) <U2 24B 'b0' 'b1' 'b2'
* cov_j (cov_j) <U2 24B 'b0' 'b1' 'b2'
* location_id (location_id) int64 32B 3068918 3071502 3113100 3114697
>>> cov_to_std = lambda cov_arr: np.sqrt(np.diag(cov_arr))
>>> fits["curvefit_std_dev"] = xr.apply_ufunc(cov_to_std, fits.curvefit_covariance.groupby("location_id"), input_core_dims=[["cov_i", "cov_j"]], output_core_dims=[["parameter"]])
<stdin>:1: UserWarning: The `squeeze` kwarg to GroupBy is being removed. Pass .groupby(..., squeeze=False) to disable squeezing, which is the new default, and to silence this warning.
>>> fits.curvefit_std_dev
<xarray.DataArray 'curvefit_std_dev' (location_id: 4, parameter: 3)> Size: 96B
array([[6.15872551e-03, 9.92928181e-04, 4.47792492e-05],
[6.18333131e-03, 9.91813485e-04, 4.51018413e-05],
[5.97113095e-03, 9.72544425e-04, 4.26823073e-05],
[5.38388364e-03, 8.60051084e-04, 3.94212881e-05]])
Coordinates:
* parameter (parameter) <U2 24B 'b0' 'b1' 'b2'
* location_id (location_id) int64 32B 3068918 3071502 3113100 3114697

This approach works great, except for the squeeze warning. From the looks of things, squeezing in GroupBy is on its way out, so I tried passing squeeze=False and squeezing manually inside the function instead:
>>> cov_to_std = lambda cov_arr: np.sqrt(np.diag(cov_arr.squeeze()))
>>> fits["curvefit_std_dev"] = xr.apply_ufunc(cov_to_std, fits.curvefit_covariance.groupby("location_id", squeeze=False), input_core_dims=[["cov_i", "cov_j"]], output_core_dims=[["parameter"]])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/charriso/micromamba/envs/ascat_env/lib/python3.9/site-packages/xarray/core/computation.py", line 1254, in apply_ufunc
return apply_groupby_func(this_apply, *args)
File "/home/charriso/micromamba/envs/ascat_env/lib/python3.9/site-packages/xarray/core/computation.py", line 616, in apply_groupby_func
applied_example, applied = peek_at(applied)
File "/home/charriso/micromamba/envs/ascat_env/lib/python3.9/site-packages/xarray/core/utils.py", line 205, in peek_at
peek = next(gen)
File "/home/charriso/micromamba/envs/ascat_env/lib/python3.9/site-packages/xarray/core/computation.py", line 615, in <genexpr>
applied: Iterator = (func(*zipped_args) for zipped_args in zip(*iterators))
File "/home/charriso/micromamba/envs/ascat_env/lib/python3.9/site-packages/xarray/core/computation.py", line 1270, in apply_ufunc
return apply_dataarray_vfunc(
File "/home/charriso/micromamba/envs/ascat_env/lib/python3.9/site-packages/xarray/core/computation.py", line 316, in apply_dataarray_vfunc
result_var = func(*data_vars)
File "/home/charriso/micromamba/envs/ascat_env/lib/python3.9/site-packages/xarray/core/computation.py", line 850, in apply_variable_ufunc
raise ValueError(
ValueError: applied function returned data with an unexpected number of dimensions. Received 1 dimension(s) but expected 2 dimensions with names ('location_id', 'parameter'), from:
array([6.158726e-03, 9.929282e-04, 4.477925e-05])

I have one way around this that works, but I don't like it as much because I have to build another DataArray and re-add the coordinates for the new dimension:
>>> def cov_to_std(cov_arr):
... stdevs = np.sqrt(np.diag(cov_arr.squeeze().data))
... return xr.DataArray(
... stdevs,
... dims=["parameter"],
... coords={"parameter": ["b0", "b1", "b2"]},
... name="curvefit_std_dev",
... )
...
>>> fits["curvefit_std_dev"] = fits.curvefit_covariance.groupby("location_id", squeeze=False).apply(cov_to_std)
>>> fits.curvefit_std_dev
<xarray.DataArray 'curvefit_std_dev' (location_id: 4, parameter: 3)> Size: 96B
array([[6.15872551e-03, 9.92928181e-04, 4.47792492e-05],
[6.18333131e-03, 9.91813485e-04, 4.51018413e-05],
[5.97113095e-03, 9.72544425e-04, 4.26823073e-05],
[5.38388364e-03, 8.60051084e-04, 3.94212881e-05]])
Coordinates:
* parameter (parameter) <U2 24B 'b0' 'b1' 'b2'
* location_id (location_id) int64 32B 3068918 3071502 3113100 3114697

It would be great if there were some other relatively "automatic" approach to this problem that avoids the need for squeezing. I feel like it should be relatively simple, but I've been at it for some hours now. I'm probably missing something.
-
If I understand correctly, the groupby is supposed to work around the fact that diag doesn't take an axis parameter. If so, try this:
def extract_variances(arr, dims=None, new_dim="cov"):
    if dims is None:
        dims = ["cov_i", "cov_j"]
    cov_size = min(size for name, size in arr.sizes.items() if name in dims)
    return xr.apply_ufunc(
        np.diag,
        arr,
        input_core_dims=[dims],
        output_core_dims=[[new_dim]],
        vectorize=True,
        dask="parallelized",
        dask_gufunc_kwargs={"output_sizes": {new_dim: cov_size}},
    )
arr = xr.DataArray([[[1, 0], [0, 1]], [[4, 0], [0, 4]]], dims=["dim0", "cov_i", "cov_j"])
np.sqrt(extract_variances(arr))
np.sqrt(extract_variances(arr.chunk({"dim0": 1}))).compute()

The magic here is vectorize=True, which loops np.diag over every 2-D (cov_i, cov_j) slice while the remaining dimensions are broadcast.
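For what it's worth, here is a minimal self-contained sketch of the same vectorized-diagonal idea (the array values, the `location_id` dim, and the `parameter` output dim are just illustrative, not taken from the dataset above):

```python
import numpy as np
import xarray as xr

# Two 2x2 "covariance" matrices stacked along location_id.
arr = xr.DataArray(
    [[[1.0, 0.0], [0.0, 4.0]],
     [[9.0, 0.0], [0.0, 16.0]]],
    dims=["location_id", "cov_i", "cov_j"],
)

variances = xr.apply_ufunc(
    np.diag,                              # extracts the diagonal of one 2-D slice
    arr,
    input_core_dims=[["cov_i", "cov_j"]],
    output_core_dims=[["parameter"]],
    vectorize=True,                       # loop np.diag over the remaining dims
)
std_devs = np.sqrt(variances)
print(std_devs.values)  # [[1. 2.]
                        #  [3. 4.]]
```

No groupby and no squeezing needed: `vectorize=True` wraps `np.diag` in `np.vectorize` with the core-dim signature, so it is applied once per `location_id` automatically.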