fix: resolve pandas 3 DatetimeIndex-to-int64 unit regression #639
ebrattli wants to merge 7 commits into
In pandas 3, DatetimeIndex.to_numpy(np.int64), .astype(np.int64), and
np.array(index, dtype=np.int64) return the index's native resolution
(often microseconds) rather than nanoseconds. Any code that assumed
nanoseconds silently produced wrong values or will break when
Timestamp.value is eventually removed.
Introduce datetime_index_to_ns() in ts_utils/utility_functions.py as a
single explicit conversion point (index.as_unit("ns").asi8), then use
it in all affected modules:
- data_quality/outliers.py: fix x-axis construction in
_split_timeseries_into_time_and_value_arrays; also replace
np.linalg.inv with np.linalg.pinv in _calculate_hat_diagonal to
handle rank-deficient inputs without leaking LinAlgError
- data_quality/gaps_identification.py: fix timestamp diff in all three
gap detection functions
- detect/oscillation_detector.py: replace Timestamp.value list
comprehension with datetime_index_to_ns
- statistics/outliers_v1.py: fix DatetimeIndex cast before
standardization in _get_outlier_indices
- ts_utils/utility_functions.py: consolidate get_timestamps ns branch
to use the same helper
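A minimal sketch of what such a helper might look like, based only on the expression quoted above (index.as_unit("ns").asi8). The signature, type hints, and docstring are assumptions; the actual code in ts_utils/utility_functions.py may differ.

```python
import numpy as np
import pandas as pd


def datetime_index_to_ns(index: pd.DatetimeIndex) -> np.ndarray:
    """Return int64 nanoseconds since the Unix epoch, whatever the index's unit."""
    # as_unit("ns") converts the stored resolution; asi8 exposes the int64 values.
    return index.as_unit("ns").asi8


# Build the same two instants at two explicit resolutions.
idx_ns = pd.DatetimeIndex(["2024-01-01 00:00:00", "2024-01-01 00:00:01"]).as_unit("ns")
idx_us = idx_ns.as_unit("us")

# The helper output does not depend on the resolution the index is stored in.
ns_from_ns = datetime_index_to_ns(idx_ns)
ns_from_us = datetime_index_to_ns(idx_us)
```

Reading idx_us.asi8 directly would give microsecond integers, which is the kind of silent factor-of-1000 error the description refers to.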
Add regression tests that verify correct nanosecond output from the
helper directly, that gap detection identifies gaps of the right
absolute duration, and that oscillation frequencies are correct for
modern (2024) timestamps.
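As an illustration of the gap-duration regression test described above, here is a sketch under stated assumptions: find_gaps is a hypothetical detector written for this example, not one of the three functions in gaps_identification.py, and the helper is redefined so the block is self-contained.

```python
import numpy as np
import pandas as pd


def datetime_index_to_ns(index: pd.DatetimeIndex) -> np.ndarray:
    # One-line helper as quoted in the description.
    return index.as_unit("ns").asi8


def find_gaps(index: pd.DatetimeIndex, threshold: pd.Timedelta) -> list:
    """Hypothetical detector: (start, end) pairs whose spacing exceeds threshold."""
    diffs_ns = np.diff(datetime_index_to_ns(index))
    threshold_ns = threshold // pd.Timedelta(1, "ns")  # integer nanoseconds
    gap_positions = np.flatnonzero(diffs_ns > threshold_ns)
    return [(index[i], index[i + 1]) for i in gap_positions]


def test_gap_has_expected_absolute_duration():
    # One-minute sampling in 2024 with a single 10-minute hole (00:04 to 00:14).
    index = pd.DatetimeIndex(
        pd.date_range("2024-01-01 00:00", periods=5, freq="1min").tolist()
        + pd.date_range("2024-01-01 00:14", periods=5, freq="1min").tolist()
    )
    # The result must be identical whatever resolution the index is stored in.
    for unit in ("s", "us", "ns"):
        gaps = find_gaps(index.as_unit(unit), threshold=pd.Timedelta(minutes=5))
        assert len(gaps) == 1
        start, end = gaps[0]
        assert end - start == pd.Timedelta(minutes=10)
```

Looping over explicit units keeps the test meaningful on both pandas 2 and pandas 3, since it does not rely on whichever default resolution the installed version infers.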
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@            Coverage Diff             @@
##             main     #639      +/-   ##
==========================================
+ Coverage   90.64%   91.06%   +0.41%
==========================================
  Files         111      111
  Lines        4201     4185      -16
  Branches      552      552
==========================================
+ Hits         3808     3811       +3
+ Misses        250      231      -19
  Partials      143      143
/gemini review
Code Review
This pull request prepares the codebase for pandas 3.0 by standardizing timestamp-to-nanosecond conversions through a new utility function, datetime_index_to_ns, and updating linear algebra operations for better stability. Key changes include replacing deprecated .value and astype(np.int64) calls across several modules and adding regression tests for timestamp resolution. Review feedback suggests optimizing the hat matrix diagonal calculation to reduce memory complexity from O(n^2) to O(n) and replacing deprecated pd.Timedelta.value usage to ensure full future-proofing.
- Reduce hat diagonal memory from O(n²) to O(n) by computing np.sum(X_mat * pinv_X.T, axis=1) instead of materialising the full n×n hat matrix
- Replace deprecated pd.Timedelta.value with integer division pd.Timedelta(1, unit) // pd.Timedelta(1, "ns") in get_timestamps
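Both suggestions can be sketched in a few lines. The function names hat_diagonal and ns_per_unit are hypothetical and chosen for this example; only the two expressions themselves come from the review.

```python
import numpy as np
import pandas as pd


def hat_diagonal(X_mat: np.ndarray) -> np.ndarray:
    """Diagonal of H = X @ pinv(X) without forming the n x n matrix."""
    pinv_X = np.linalg.pinv(X_mat)  # shape (p, n)
    # H[i, i] = sum_j X[i, j] * pinv_X[j, i], i.e. a row-wise dot product.
    return np.sum(X_mat * pinv_X.T, axis=1)


def ns_per_unit(unit: str) -> int:
    """Nanoseconds in one `unit`, computed without Timedelta.value."""
    return pd.Timedelta(1, unit) // pd.Timedelta(1, "ns")


# Rank-deficient design matrix: the third column duplicates the second,
# so np.linalg.inv(X.T @ X) would fail while pinv still works.
x = np.arange(6, dtype=float)
X_mat = np.column_stack([np.ones_like(x), x, x])
leverage = hat_diagonal(X_mat)
```

The intermediate arrays here have shape (n, p) or (p, n), so for a fixed number of columns p the memory grows linearly with n. Because X @ pinv(X) is an orthogonal projection onto the column space of X, the leverages sum to the rank of X_mat, which gives a cheap sanity check.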
nithinb left a comment
This issue still exists in outliers.py and can be addressed as part of this PR.
- indsl/statistics/outliers.py:269 — replace .to_series().astype(np.int64) with
datetime_index_to_ns (same pandas 3 unit bug as outliers_v1.py, missed in first pass)
- indsl/data_quality/outliers.py:254 — replace inline .as_unit("ns").astype(np.int64).to_numpy()
with datetime_index_to_ns for consistency with all other call sites
Fixed in 298d3c4
nithinb left a comment
🦄
LGTM
Had a nit, but it is not blocking.
Replace try/except with a direct assertion — with pinv the function succeeds on rank-deficient input, so the guard is simply asserting the call returns a Series.
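A sketch of what the direct-assertion form of such a test could look like. leverage_series is a hypothetical stand-in written for this example and is not the indsl function under test; it only mimics the pinv-based behaviour on a deliberately rank-deficient design matrix.

```python
import numpy as np
import pandas as pd


def leverage_series(series: pd.Series) -> pd.Series:
    """Hypothetical stand-in for the function under test (not the indsl code)."""
    ns = series.index.as_unit("ns").asi8
    x = (ns - ns[0]) / 1e9  # elapsed seconds
    X_mat = np.column_stack([np.ones_like(x), x, x])  # duplicated column: rank 2
    pinv_X = np.linalg.pinv(X_mat)
    return pd.Series(np.sum(X_mat * pinv_X.T, axis=1), index=series.index)


def test_rank_deficient_input_returns_series():
    index = pd.date_range("2024-01-01", periods=8, freq="1min")
    data = pd.Series(np.linspace(0.0, 1.0, 8), index=index)
    # No try/except: an unexpected LinAlgError would fail the test on its own.
    result = leverage_series(data)
    assert isinstance(result, pd.Series)
    assert len(result) == len(data)
```

Dropping the try/except also means a regression back to np.linalg.inv would surface as a test error rather than being swallowed by the guard.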
risk review ok
Problem
In pandas 3, DatetimeIndex.to_numpy(np.int64), .astype(np.int64), and np.array(index, dtype=np.int64) return the index's native resolution (often microseconds) rather than nanoseconds. Code that assumed nanoseconds silently produced wrong values. Timestamp.value is also deprecated and will eventually be removed.
Affected functions:
- data_quality.extreme: wrong x-axis seconds, broken hat matrix on rank-deficient input
- data_quality.gaps_identification_{z_scores,modified_z_scores,iqr}: implicit unit in timestamp diffs
- detect.oscillation_detector: Timestamp.value list comprehension (deprecated)
- statistics.outliers_v1: .astype(np.int64) on DatetimeIndex before standardization
Fix
Add datetime_index_to_ns(index) in ts_utils/utility_functions.py as a single explicit conversion point (index.as_unit("ns").asi8), and use it everywhere. Also replace np.linalg.inv with np.linalg.pinv in the hat matrix calculation to handle rank-deficient inputs without raising LinAlgError.
All fixes restore pre-pandas-3 behavior (which was implicitly nanoseconds). The pinv change is the only intentional behavioral difference; it is strictly more robust for edge cases.
Tests added
- datetime_index_to_ns verifying nanosecond output
- [...] Timestamp.value removal)
- extreme() regression tests for modern timestamps and rank-deficient input