Extend rolling_exp to support pd.Timedelta objects with window halflife #10237

Open
wants to merge 7 commits into
base: main

abiasiol

  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst

Description

Extended rolling_exp to support pd.Timedelta objects for the window size when using window_type="halflife" along datetime dimensions, similar to pandas' ewm. This allows expressions like da.rolling_exp(time=pd.Timedelta(days=1), window_type="halflife").mean().

Implementation

  • Matches pandas implementation, allowing the operation only when:
    • window is a pd.Timedelta object
    • window_type is "halflife"
    • dimension is a datetime index
    • operation is mean
  • Takes advantage of numbagg's nanmean implementation, which allows alpha to be an array
  • Ports the _calculate_deltas function over rather than relying on pandas' private implementation
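
As a rough illustration of the array-of-alphas idea, the sketch below turns timestamp gaps into per-step smoothing factors. It is not the PR's code: `halflife_deltas` and `stepwise_alphas` are hypothetical helpers, and the mapping `alpha_i = 1 - 0.5 ** delta_i` is an assumption based on the halflife definition (an observation decays to half its weight after one halflife).

```python
from datetime import datetime, timedelta


def halflife_deltas(times, halflife):
    """Gaps between consecutive timestamps, expressed in halflives (hypothetical helper)."""
    # Dividing one timedelta by another yields a plain float.
    return [(later - earlier) / halflife for earlier, later in zip(times, times[1:])]


def stepwise_alphas(times, halflife):
    """Per-step smoothing factors, assuming alpha_i = 1 - 0.5 ** delta_i."""
    return [1.0 - 0.5 ** delta for delta in halflife_deltas(times, halflife)]


times = [datetime(2000, 1, 1), datetime(2000, 1, 2), datetime(2000, 1, 4)]
alphas = stepwise_alphas(times, timedelta(days=1))
print(alphas)  # [0.5, 0.75]
```

A gap of exactly one halflife gives alpha = 0.5, and a longer gap gives a larger alpha, so older observations lose more weight.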

Behavior Note

One difference from pandas' behavior: when the data contain NaN values and the halflife is very short, this implementation returns NaN, whereas pandas appears to carry the previous value forward. Returning NaN seems more appropriate to me, since users can fill the gaps afterwards if they need to.

Example demonstrating the difference:

import numpy as np
import pandas as pd
from xarray import DataArray

times = pd.date_range("2000-01-01", freq="1D", periods=21)
da = DataArray(
    np.random.random((21, 4)),
    dims=("time", "x"),
    coords=dict(time=times),
)
da = da.where(da > 0.2)
da.to_pandas().ewm(halflife=pd.Timedelta(minutes=1), times=da.time.values).mean()
da.rolling_exp(time=pd.Timedelta(minutes=1), window_type="halflife").mean().to_pandas()

abiasiol and others added 4 commits April 19, 2025 16:34
Added validation and calculation functions for halflife operations. Updated docstrings and type hints accordingly. Moved _calculate_deltas literally from pandas/core/window/ewm.py to not rely on internal pandas function.
Introduced new test cases to validate the behavior of rolling_exp when using Timedelta windows, specifically for the halflife window type.
Checks for compatibility between window type, window, index, and operation. Check results match pandas.

welcome bot commented Apr 20, 2025

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

abiasiol and others added 3 commits April 20, 2025 11:20
…compatibility with pandas < 2.2.0

pandas ewm can work with non-ns resolution from >= 2.2.0. Here we just test that this PR rolling_exp can work with non-ns resolution.
@max-sixty
Collaborator

thanks @abiasiol !

couple of quick questions:

  • why limit to halflife?
  • does it raise / handle indexes with uneven spacing?
  • why limit to mean?

@abiasiol
Author

> thanks @abiasiol !
>
> couple of quick questions:
>
> * does it raise / handle indexes with uneven spacing?

Hi @max-sixty !

It works with uneven spacing (the way that Pandas does):

import numpy as np
import pandas as pd
from xarray import DataArray

times = pd.date_range("2000-01-01", freq="1D", periods=21)
times_delta = pd.to_timedelta(np.random.randint(0, 12, size=len(times)), unit="h")
times = times + times_delta

da = DataArray(
    np.random.random((21, 4)),
    dims=("time", "x"),
    coords=dict(time=times, x=["a", "b", "c", "d"]),
)

np.allclose(
    da.rolling_exp(time=pd.Timedelta(hours=2), window_type="halflife").mean().values,
    da.to_pandas()
    .ewm(halflife=pd.Timedelta(hours=2), times=da.time.values)
    .mean()
    .values,
) # True
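
To make the uneven-spacing behavior concrete, here is a minimal pure-Python sketch of a time-aware exponentially weighted mean. It is neither the PR's nor pandas' code: both function names are hypothetical, it assumes normalized weights of the form `0.5 ** (age / halflife)`, and it ignores NaN handling. The first function spells the weights out explicitly; the second reaches the same result recursively with a per-step decay.

```python
import math
from datetime import datetime, timedelta


def ewm_mean_explicit(times, values, halflife):
    """Weighted mean at each step, with weights 0.5 ** (age / halflife)."""
    out = []
    for i, now in enumerate(times):
        weights = [0.5 ** ((now - t) / halflife) for t in times[: i + 1]]
        total = sum(w * v for w, v in zip(weights, values))
        out.append(total / sum(weights))
    return out


def ewm_mean_recursive(times, values, halflife):
    """Same quantity, updating running sums with a gap-dependent decay."""
    out = []
    num = den = 0.0
    prev = None
    for t, v in zip(times, values):
        # The decay depends on the actual elapsed time, not on the step count.
        decay = 1.0 if prev is None else 0.5 ** ((t - prev) / halflife)
        num = decay * num + v
        den = decay * den + 1.0
        out.append(num / den)
        prev = t
    return out


# Unevenly spaced timestamps: gaps of 3 h, 32 h and 13 h.
times = [
    datetime(2000, 1, 1, 0),
    datetime(2000, 1, 1, 3),
    datetime(2000, 1, 2, 11),
    datetime(2000, 1, 3, 0),
]
values = [1.0, 2.0, 4.0, 8.0]
halflife = timedelta(hours=2)

explicit = ewm_mean_explicit(times, values, halflife)
recursive = ewm_mean_recursive(times, values, halflife)
print(all(math.isclose(a, b) for a, b in zip(explicit, recursive)))  # True
```

Because the decay is computed from each gap, irregular timestamps need no resampling; a long gap just shrinks the weight of everything before it.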

@abiasiol
Author

> thanks @abiasiol !
>
> couple of quick questions:
>
> * why limit to halflife?
> * why limit to mean?

Reading the docstring of Pandas ewm, mean() should be the only "supported" operation, so I kept it simple and followed that.

> If times is provided, halflife and one of com, span or alpha may be provided.
>
> halflife: If times is specified, a timedelta convertible unit over which an observation decays to half its value. Only applicable to mean(), and halflife value will not apply to the other functions.

But let me take another look, and I'll get back to you.

@max-sixty
Collaborator

ah, great, it uses the numbagg feature which takes an array of alphas — happy to see that being used! I wrote it for myself but hadn't really integrated it into xarray

I don't fully understand why we're limited to halflife — all the window types are freely convertible to one another; though possibly I'm misunderstanding something. (and same thing with mean vs other ops, though am even less confident) — does pandas have a reason for this specificity?

I haven't looked in enough detail at the calcs, but assuming we're well-tested against the pandas implementation, that's sufficient

@abiasiol
Author

abiasiol commented May 4, 2025

I saw that the type hints for alphas indicated it could be an array, which was very helpful for this PR!

Regarding the window parameters (span, com, halflife): while they are related, applying timedeltas directly to span or com feels less intuitive to me, as these parameters seem more count-based. Pandas' API behaves a bit differently: when times is specified, it requires halflife (for scaling the time deltas) and optionally allows one of com/span/alpha. Currently, I believe our implementation in Xarray only permits specifying a single window_type parameter. If only halflife is provided with times, Pandas effectively defaults to a calculation equivalent to com=1 (that is, alpha=0.5, as in this PR).
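
For reference, the count-based conversions between the window types can be sketched as below, using the formulas given in the pandas ewm documentation. The function names are hypothetical, and `halflife` here is a number of periods, not a timedelta.

```python
import math


def alpha_from_com(com):
    """alpha = 1 / (1 + com)"""
    return 1.0 / (1.0 + com)


def alpha_from_span(span):
    """alpha = 2 / (span + 1)"""
    return 2.0 / (span + 1.0)


def alpha_from_halflife(halflife):
    """alpha = 1 - exp(-ln(2) / halflife), with halflife counted in periods."""
    return 1.0 - math.exp(-math.log(2.0) / halflife)


# com=1, span=3 and a one-period halflife all describe the same decay.
print(alpha_from_com(1), alpha_from_span(3), round(alpha_from_halflife(1), 12))
```

This shows why com=1 and alpha=0.5 are interchangeable, and why a timedelta halflife only reduces to a single alpha when the samples are evenly spaced.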

Concerning the EWM operations: Pandas has some inconsistent behavior. For operations other than mean, when times and halflife are provided, Pandas seems to ignore those and defaults to using a fixed com=1 for the calculation (as in the example below). Our current implementation does apply the time-scaling correctly to these other operations as well, but this leads to results that are different from Pandas' output for those specific operations (e.g., std, var).

Simple example (with sum, or std, ...) on equally spaced data:

import numpy as np
import pandas as pd
import xarray as xr

n = 20
times = pd.date_range("2020-01-01", periods=n, freq="1D")
da = xr.DataArray(np.arange(n), dims="time", coords={"time": times})
df = da.to_pandas()

pandas_1 = df.ewm(halflife=pd.Timedelta(days=1), times=df.index).sum()
pandas_2 = df.ewm(halflife=pd.Timedelta(days=2), times=df.index).sum()
pandas_com = df.ewm(com=1).sum()

print(np.allclose(pandas_1.values, pandas_com.values)) # True
print(np.allclose(pandas_2.values, pandas_com.values)) # True
print(np.allclose(pandas_1.values, pandas_2.values)) # True

# to do this, you need to comment out the operation kill-switch in this PR
xr_1 = da.rolling_exp(time=pd.Timedelta(days=1), window_type="halflife").sum().to_pandas()
xr_2 = da.rolling_exp(time=pd.Timedelta(days=2), window_type="halflife").sum().to_pandas()

print(np.allclose(xr_1.values, pandas_1.values)) # True
print(np.allclose(xr_2.values, pandas_2.values)) # False
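
The pattern in these results can be reproduced with a small pure-Python sketch. It assumes the exponentially weighted sum has the form `s_t = x_t + decay * s_{t-1}`, which is an assumption and not code from either library; `ewm_sum` is a hypothetical helper. On daily data, a one-day halflife gives a per-step decay of 0.5, identical to com=1, while a two-day halflife gives a different decay.

```python
def ewm_sum(values, decay):
    """Exponentially weighted running sum: s_t = x_t + decay * s_{t-1}."""
    out, running = [], 0.0
    for v in values:
        running = v + decay * running
        out.append(running)
    return out


values = [float(i) for i in range(20)]
step_days = 1.0  # evenly spaced daily samples

fixed_com = ewm_sum(values, decay=0.5)  # com=1, i.e. alpha=0.5
halflife_1d = ewm_sum(values, decay=0.5 ** (step_days / 1.0))
halflife_2d = ewm_sum(values, decay=0.5 ** (step_days / 2.0))

print(halflife_1d == fixed_com)  # True
print(halflife_2d == fixed_com)  # False
```

So a time-scaled sum agrees with a fixed com=1 calculation only when the halflife equals the sample spacing.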

I find the pandas.ewm API and its behavior in these cases somewhat confusing. My goal in this PR was to avoid that ambiguity by initially enabling the time-aware calculations (using halflife with time axes) only for the mean operation where the behavior is well-defined and consistent with Pandas. I would like to avoid the potentially confusing behavior for the other operations where parameters might be ignored.

We could enable other operations with time-aware calculations later. However, we would need to validate their results from scratch, and highlight that results will be different from current Pandas versions for those operations.

What are your thoughts on proceeding with this more limited implementation for now? We can expand the functionality later based on user requests for specific, currently missing operations. This approach feels safer for a first contribution by limiting the initial scope.
