fix: resolve pandas 3 DatetimeIndex-to-int64 unit regression #639

Open
ebrattli wants to merge 7 commits into main from fix/pandas3-datetime-index-to-int64

@ebrattli
Contributor

Problem

In pandas 3, DatetimeIndex.to_numpy(np.int64), .astype(np.int64), and np.array(index, dtype=np.int64) return the index's native resolution (often microseconds) rather than nanoseconds. Code that assumed nanoseconds silently produced wrong values. Timestamp.value is also deprecated and will eventually be removed.
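As a rough illustration of the unit mismatch (a sketch, not code from this PR), the integer view of the same timestamps differs by a factor of 1000 between a microsecond-resolution and a nanosecond-resolution index. The index values are made up for the example, and `as_unit` is used only to force each resolution explicitly:

```python
import pandas as pd

# Two timestamps exactly one second apart (illustrative values).
index = pd.DatetimeIndex(["2024-01-01 00:00:00", "2024-01-01 00:00:01"])

index_us = index.as_unit("us")  # microsecond resolution
index_ns = index.as_unit("ns")  # nanosecond resolution

# The raw int64 view is expressed in the index's own unit.
step_us = int(index_us.asi8[1] - index_us.asi8[0])
step_ns = int(index_ns.asi8[1] - index_ns.asi8[0])

print(step_us, step_ns)  # 1000000 1000000000
```

Code that divides the first number by 1e9 to get seconds is off by a factor of 1000, and nothing raises.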

Affected functions:

  • data_quality.extreme: wrong x-axis seconds, broken hat matrix on rank-deficient input
  • data_quality.gaps_identification_{z_scores,modified_z_scores,iqr}: implicit unit in timestamp diffs
  • detect.oscillation_detector: Timestamp.value list comprehension (deprecated)
  • statistics.outliers_v1: .astype(np.int64) on DatetimeIndex before standardization

Fix

Add datetime_index_to_ns(index) in ts_utils/utility_functions.py as a single explicit conversion point (index.as_unit("ns").asi8), and use it everywhere. Also replace np.linalg.inv with np.linalg.pinv in the hat matrix calculation to handle rank-deficient inputs without raising LinAlgError.
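A minimal sketch of what such a helper could look like, based only on the expression quoted above. The type hints and docstring are assumptions; the actual implementation in ts_utils/utility_functions.py may differ:

```python
import numpy as np
import pandas as pd


def datetime_index_to_ns(index: pd.DatetimeIndex) -> np.ndarray:
    """Return the index as int64 nanoseconds since the epoch.

    Sketch only: converts to nanosecond resolution first, so the result
    does not depend on the index's native unit.
    """
    return index.as_unit("ns").asi8


# Usage: a second-resolution index still comes back in nanoseconds.
index = pd.DatetimeIndex(["2024-01-01"]).as_unit("s")
ns = datetime_index_to_ns(index)
print(ns[0])  # 1704067200000000000
```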

All fixes restore pre-pandas-3 behavior (which was implicitly nanoseconds). The pinv change is the only intentional behavioral difference — it's strictly more robust for edge cases.
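To make the pinv change concrete, here is a hedged sketch of a hat-matrix diagonal for a straight-line fit on rank-deficient input. The function name and the design-matrix layout (intercept column plus elapsed seconds) are assumptions for illustration, not the code in `_calculate_hat_diagonal`:

```python
import numpy as np


def hat_diagonal_sketch(x_seconds: np.ndarray) -> np.ndarray:
    """Leverage values for a straight-line fit (hypothetical helper)."""
    # Design matrix: intercept column plus the time axis.
    X = np.column_stack([np.ones(len(x_seconds)), x_seconds])
    # pinv is well defined even when X is rank-deficient.
    H = X @ np.linalg.pinv(X)
    return np.diag(H)


# All timestamps identical, so the elapsed-seconds column is all zeros
# and X has rank 1. Inverting X.T @ X directly would fail here because
# that matrix is exactly singular.
x_seconds = np.zeros(5)
leverage = hat_diagonal_sketch(x_seconds)
print(leverage)  # each entry is approximately 0.2
```

The leverages sum to the rank of the design matrix (1 here), which is what a projection onto its column space should give.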

Tests added

  • Direct unit test for datetime_index_to_ns verifying nanosecond output
  • Gap detection test asserting correct absolute gap duration (would have caught the unit bug)
  • Oscillation detector test with a 2024 timestamp index verifying correct frequencies (would catch Timestamp.value removal)
  • extreme() regression tests for modern timestamps and rank-deficient input
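The gap-duration check could look roughly like the sketch below. It does not call the real gap detection functions; `gap_seconds` is a hypothetical stand-in that only exercises the timestamp-diff step, and the timestamps are invented:

```python
import numpy as np
import pandas as pd


def gap_seconds(index: pd.DatetimeIndex) -> np.ndarray:
    """Spacing between consecutive timestamps in seconds (hypothetical)."""
    ns = index.as_unit("ns").asi8  # explicit nanoseconds
    return np.diff(ns) / 1e9


def test_gap_has_correct_absolute_duration():
    # One-minute sampling with a single one-hour gap, stored in microseconds.
    index = pd.DatetimeIndex(
        [
            "2024-03-01 00:00",
            "2024-03-01 00:01",
            "2024-03-01 01:01",
            "2024-03-01 01:02",
        ]
    ).as_unit("us")
    gaps = gap_seconds(index)
    # Asserting the absolute duration fails loudly if the unit is wrong;
    # a purely relative check (largest gap vs. median) would not.
    assert gaps.max() == 3600.0
    assert gaps.min() == 60.0
```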

In pandas 3, DatetimeIndex.to_numpy(np.int64), .astype(np.int64), and
np.array(index, dtype=np.int64) return the index's native resolution
(often microseconds) rather than nanoseconds. Any code that assumed
nanoseconds silently produced wrong values or will break when
Timestamp.value is eventually removed.

Introduce datetime_index_to_ns() in ts_utils/utility_functions.py as a
single explicit conversion point (index.as_unit("ns").asi8), then use
it in all affected modules:

- data_quality/outliers.py: fix x-axis construction in
  _split_timeseries_into_time_and_value_arrays; also replace
  np.linalg.inv with np.linalg.pinv in _calculate_hat_diagonal to
  handle rank-deficient inputs without leaking LinAlgError
- data_quality/gaps_identification.py: fix timestamp diff in all three
  gap detection functions
- detect/oscillation_detector.py: replace Timestamp.value list
  comprehension with datetime_index_to_ns
- statistics/outliers_v1.py: fix DatetimeIndex cast before
  standardization in _get_outlier_indices
- ts_utils/utility_functions.py: consolidate get_timestamps ns branch
  to use the same helper

Add regression tests that verify correct nanosecond output from the
helper directly, that gap detection identifies gaps of the right
absolute duration, and that oscillation frequencies are correct for
modern (2024) timestamps.
@ebrattli ebrattli requested a review from a team as a code owner May 15, 2026 10:46
@codecov

codecov Bot commented May 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91.06%. Comparing base (aac3d80) to head (beda183).
⚠️ Report is 7 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #639      +/-   ##
==========================================
+ Coverage   90.64%   91.06%   +0.41%     
==========================================
  Files         111      111              
  Lines        4201     4185      -16     
  Branches      552      552              
==========================================
+ Hits         3808     3811       +3     
+ Misses        250      231      -19     
  Partials      143      143              
Files with missing lines Coverage Δ
indsl/data_quality/gaps_identification.py 96.42% <100.00%> (ø)
indsl/data_quality/outliers.py 97.56% <100.00%> (+0.06%) ⬆️
indsl/detect/oscillation_detector.py 89.25% <ø> (ø)
indsl/statistics/outliers.py 89.01% <100.00%> (ø)
indsl/statistics/outliers_v1.py 85.88% <100.00%> (ø)
indsl/ts_utils/utility_functions.py 84.78% <100.00%> (+0.22%) ⬆️

... and 33 files with indirect coverage changes

@github-actions

github-actions Bot commented May 15, 2026

Unit Test Results

    24 files  +    9      24 suites  +9   34m 46s ⏱️ + 13m 18s
 1 222 tests +   10   1 222 ✅ +   10   0 💤 ±0  0 ❌ ±0 
21 360 runs  +6 411  21 348 ✅ +6 402  12 💤 +9  0 ❌ ±0 

Results for commit beda183. ± Comparison against base commit aac3d80.

♻️ This comment has been updated with latest results.

@ebrattli
Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request prepares the codebase for pandas 3.0 by standardizing timestamp-to-nanosecond conversions through a new utility function, datetime_index_to_ns, and updating linear algebra operations for better stability. Key changes include replacing deprecated .value and astype(np.int64) calls across several modules and adding regression tests for timestamp resolution. Review feedback suggests optimizing the hat matrix diagonal calculation to reduce memory complexity from O(n^2) to O(n) and replacing deprecated pd.Timedelta.value usage to ensure full future-proofing.

Comment thread indsl/data_quality/outliers.py Outdated
Comment thread indsl/ts_utils/utility_functions.py Outdated
- Reduce hat diagonal memory from O(n²) to O(n) by computing
  np.sum(X_mat * pinv_X.T, axis=1) instead of materialising the
  full n×n hat matrix
- Replace deprecated pd.Timedelta.value with integer division
  pd.Timedelta(1, unit) // pd.Timedelta(1, "ns") in get_timestamps
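Both changes in this commit can be sketched as follows. The random design matrix and the `ns_per_unit` name are assumptions for illustration; the two expressions themselves are the ones quoted in the commit message:

```python
import numpy as np
import pandas as pd

# Hat diagonal without materialising the n x n hat matrix.
rng = np.random.default_rng(0)
X_mat = np.column_stack([np.ones(6), rng.normal(size=6)])  # shape (6, 2)
pinv_X = np.linalg.pinv(X_mat)                             # shape (2, 6)

# Reference: build the full hat matrix and take its diagonal.
diag_full = np.diag(X_mat @ pinv_X)
# H[i, i] = sum_j X[i, j] * pinv_X[j, i], so a row-wise sum is enough.
diag_rowwise = np.sum(X_mat * pinv_X.T, axis=1)


def ns_per_unit(unit: str) -> int:
    """Nanoseconds in one `unit`, without Timedelta.value (hypothetical name)."""
    return pd.Timedelta(1, unit) // pd.Timedelta(1, "ns")


print(np.allclose(diag_full, diag_rowwise))  # True
print(ns_per_unit("ms"))                     # 1000000
```

The row-wise form only allocates arrays the size of the design matrix, which matters when n is large.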
arwassa previously approved these changes May 19, 2026
arwassa previously approved these changes May 28, 2026
@ebrattli ebrattli added the waiting-for-risk-review Waiting for a member of the risk review team to take an action label May 28, 2026
@nithinb nithinb self-assigned this May 29, 2026
@nithinb nithinb added the risk-review-ongoing Risk review is in progress label May 29, 2026

@nithinb nithinb left a comment

The same issue still exists in outliers.py. It can be addressed as part of this PR.

Comment thread indsl/data_quality/outliers.py Outdated
@nithinb nithinb added waiting-for-team Waiting for the submitter or reviewer of the PR to take an action and removed waiting-for-risk-review Waiting for a member of the risk review team to take an action labels May 29, 2026
- indsl/statistics/outliers.py:269 — replace .to_series().astype(np.int64) with
  datetime_index_to_ns (same pandas 3 unit bug as outliers_v1.py, missed in first pass)
- indsl/data_quality/outliers.py:254 — replace inline .as_unit("ns").astype(np.int64).to_numpy()
  with datetime_index_to_ns for consistency with all other call sites
@ebrattli
Contributor Author

Fixed in 298d3c4

indsl/statistics/outliers.py:269 — replaced .to_series().astype(np.int64) with datetime_index_to_ns, same fix as was applied to outliers_v1.py in the original commit. Also took the opportunity to apply datetime_index_to_ns consistently in data_quality/outliers.py as suggested in the inline thread.

@ebrattli ebrattli requested review from arwassa and nithinb May 29, 2026 13:04
arwassa previously approved these changes Jun 1, 2026
@ebrattli ebrattli added waiting-for-risk-review Waiting for a member of the risk review team to take an action and removed waiting-for-team Waiting for the submitter or reviewer of the PR to take an action labels Jun 1, 2026

@nithinb nithinb left a comment

🦄

LGTM. I had a nit, but it is not blocking.

Comment thread tests/data_quality/test_studentised_residuals.py Outdated
Replace try/except with a direct assertion — with pinv the function
succeeds on rank-deficient input, so the guard is simply asserting the
call returns a Series.
@nithinb

nithinb commented Jun 3, 2026

risk review ok

@nithinb nithinb added waiting-for-team Waiting for the submitter or reviewer of the PR to take an action and removed waiting-for-risk-review Waiting for a member of the risk review team to take an action labels Jun 3, 2026
Labels

risk-review-ongoing Risk review is in progress waiting-for-team Waiting for the submitter or reviewer of the PR to take an action

3 participants