[Analytics Backend] Timechart route end-to-end closure (1/30 → 30/30) by ahkcs · Pull Request #21747 · opensearch-project/OpenSearch

ahkcs · 2026-05-19T22:38:31Z

Summary

End-to-end coverage closure for PPL timechart on the analytics-engine route. Five coupled commits in analytics-backend-datafusion take CalciteTimechart{Command,PerFunction}IT from 1/30 → 30/30 ✅ on upstream/main. Each commit corresponds to a distinct Substrait-binding gap surfaced by progressive IT runs against a :run cluster with -Dtests.analytics.force_routing=true -Dtests.analytics.parquet_indices=true.

Commits

1. `[Analytics] Wire TIMESTAMPDIFF/TIMESTAMPADD + skip non-prefix-groupSet split`

Two coupled changes:

Register TIMESTAMPDIFF and TIMESTAMPADD as scalar functions (enum constants in ScalarFunction, added to STANDARD_PROJECT_OPS). Clears the capability-layer rejection at OpenSearchProjectRule.annotateExpr; isthmus-side substrait conversion is still gated by per-call adapters (commit 2).
OpenSearchAggregateSplitRule.onMatch() skips the PARTIAL/FINAL alternative when it would emit a row type that fails Volcano's typeMatchesInferred. Two unsafe shapes today:
- percentile_approx (2-arg aggregate whose FINAL phase needs (tdigest_state, percent_literal))
- Cross-family non-prefix groupSet — when a non-prefix group-key index lands on PARTIAL's agg-output slot whose SqlTypeFamily differs from the ORIGINAL input column's family (e.g. PPL timechart's no-by form: groupSet={2} on a [@timestamp:TIMESTAMP, cpu_usage:DOUBLE, SPAN(@timestamp):TIMESTAMP] input where the SUM aggregate's DOUBLE return at position 2 clashes with TIMESTAMP at the same input position).
  Same-family non-prefix cases (e.g. group={1} over two INTEGER columns + a NUMERIC agg) still pass through the split unchanged — PlanShapeTests.testJoinWithDifferentGroupKeys_multiShard verifies this. The skip lives in onMatch (not matches) so the SINGLE+SINGLETON alternative is still registered, giving the planner a coordinator-gather path for the skipped shapes.

2. `[Analytics Backend] Add TimestampDiffAdapter peephole for timechart per_*`

Peephole that constant-folds TIMESTAMPDIFF(out_unit, t, TIMESTAMPADD(in_unit, n, t)) to a BIGINT literal when both units are fixed-length (MICROSECOND…WEEK). Eliminates both PPL UDF references in one step — neither has a Substrait extension binding, so the unfolded calls would surface as Unable to convert call TIMESTAMPADD(...). OperatorAnnotation wrappers introduced by OpenSearchProjectRule are peeled at each operand before structural comparison.

3. `[Analytics Backend] UInt64 → Int64 bridge at partition-stream boundary`

Analytics-engine reduce-stage fragments crashed with Substrait error: Field '<name>' in Substrait schema has a different type (Int64) than the corresponding field in the table schema (UInt64) when the upstream PARTIAL fragment emitted an Arrow type Substrait/Calcite has no native equivalent for. Canonical case: DataFusion's row_number() physical op (output: UInt64) feeding a consumer plan that types ROW_NUMBER OVER as BIGINT/Int64.

Extends the existing schema_coerce::coerce_inferred_schema bridge — currently applied only at the parquet-scan boundary — to the partition-stream registration boundary used by every inter-stage exchange:

api::register_partition_stream runs coerce_inferred_schema on the producer-derived schema before session.register_partition and the IPC return path.
api::sender_send adds coerce_batch_to_sender_schema — casts incoming RecordBatch columns to match the sender's coerced schema via arrow::compute::cast. Without this, DataFusion's HashJoin / RepartitionExec strict-downcasts to Int64Type and panics with "primitive array".
DatafusionReduceSink.typesMatch relaxed to treat same-width Int (any signedness) as compatible, mirroring the existing Timestamp precision tolerance.

4. `[Analytics Backend] SpanAdapter — multi-unit fixed-length time bucket`

Pre-existing SpanAdapter handled time spans only when interval==1. Multi-unit spans (span=2m, span=12h, span=5d, ...) fell through unchanged. Rewrite for fixed-length units (s, m, h, d, w):

SPAN(t, N, '<unit>')
  → CAST(from_unixtime(CAST((to_unixtime(t) / B) * B AS DOUBLE)) AS <result>)
  where B = N * unit_seconds

Three component choices were each forced by a binding-time substrait constraint:

LOCAL_TO_UNIXTIME_OP / LOCAL_FROM_UNIXTIME_OP (package-local) instead of Calcite stdlib UNIX_SECONDS / TIMESTAMP_SECONDS. The latter bind to BigQuery-named substrait functions DF's consumer has no entries for.
No FLOOR between divide and multiply. Integer ÷ integer already truncates (correct for non-negative epoch seconds); FLOOR(i64) doesn't bind in substrait (DF's floor is floating-point only).
CAST(... AS DOUBLE) before from_unixtime. yaml signature is (fp64) → precision_timestamp<6>.

Variable-length units (M, q, y) intentionally fall through — bucketing by calendar month isn't N × fixed_seconds. Commit 5 handles those.

5. `[Analytics Backend] TimestampDiffAdapter — variable-month runtime rewrite`

Extends commit 2's peephole to handle variable-length inner units (MONTH/QUARTER/YEAR) when the out-unit is fixed-length. Folding isn't safe (Feb 2025 = 28 days, Oct 2025 = 31 days — value depends on calendar month). Atomic runtime rewrite:

TIMESTAMPDIFF(out_unit, t, TIMESTAMPADD(MONTH|QUARTER|YEAR, n, t))
  →  (to_unixtime(DATETIME_PLUS(t, INTERVAL n*m MONTH)) - to_unixtime(t)) * out_factor

where m converts QUARTER→3 / YEAR→12, INTERVAL_YEAR_MONTH is built via SqlIntervalQualifier(MONTH) (same idiom as EarliestLatestAdapter#makeIntervalAdd, which has prior wiring proof that DATETIME_PLUS(t, INTERVAL_MONTH) binds end-to-end via Substrait's standard add(timestamp, interval_year_month) into DataFusion's native interval add).

PPL's Chart#transformPerFunction always emits MILLISECOND out-unit, so the multiplier path (×1000) is the hot case; the divisor path (MINUTE→÷60 etc.) covers consumer code shapes that compute coarser durations.

Pass-rate impact

Verified locally against upstream/main + this PR's full stack:

Test class	Baseline	+ c1 (TIMESTAMPDIFF wire)	+ c2 (peephole)	+ c3 (UInt64 bridge)	+ c4 (SpanAdapter multi-unit)	+ c5 (variable-month rewrite)
`CalciteTimechartCommandIT`	1/18	1/18	5/18	18/18 ✅	18/18 ✅	18/18 ✅
`CalciteTimechartPerFunctionIT`	0/12	0/12	1/12	1/12	10/12	12/12 ✅
Combined	1/30	1/30	6/30	19/30	28/30	30/30 ✅

PlanShapeTests.testJoinWithDifferentGroupKeys_multiShard (the multi-shard join-with-non-prefix-groupSet plan-shape regression test) and the 12 other PlanShapeTests cases also pass — no regression to the existing split rule's coverage.

Sandbox check status

./gradlew check -p sandbox -Dsandbox.enabled=true results on this branch:

✅ All analytics-backend-datafusion / analytics-engine / analytics-framework / parquet-data-format / composite-engine unit tests pass (-Werror compile, spotless, checkstyle, forbiddenApis, licenseHeaders, dependencyLicenses, thirdPartyAudit all clean).
⚠️ Pre-existing failures in :sandbox:qa:analytics-engine-coordinator:internalClusterTest and :sandbox:plugins:composite-engine:internalClusterTest (9 + 2 tests) all surface the same Allocator[ROOT] closed with outstanding child allocators teardown leak. Identical failures reproduce on plain upstream/main without any of this PR's commits — confirmed in a separate run. Unrelated to the code paths this PR touches (SpanAdapter, TimestampDiffAdapter, OpenSearchAggregateSplitRule, register_partition_stream, sender_send, typesMatch).

Test plan

CalciteTimechartCommandIT — 18/18 on analytics-engine route
CalciteTimechartPerFunctionIT — 12/12 on analytics-engine route
PlanShapeTests (analytics-engine) — 13/13, no regression to non-prefix-groupSet handling
./gradlew check -p sandbox -Dsandbox.enabled=true — unit-test surface clean; pre-existing flakes in unrelated internalClusterTest modules

github-actions · 2026-05-19T22:39:13Z

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 46e797d.

Path	Line	Severity	Description
sandbox/plugins/analytics-backend-datafusion/src/main/java/org/opensearch/be/datafusion/DatafusionReduceSink.java	256	low	The typesMatch validation is relaxed to allow any same-bit-width integer mismatch (signed/unsigned) to pass silently. If this function acts as a correctness gate elsewhere, the blanket bypass could allow unintended data to flow through without type error. The rationale (UInt64 row_number vs Int64 BIGINT) is narrow and specific, but the predicate accepts all same-width Int pairs, which is broader than documented. No evidence of malicious intent; flagged as a minor logic anomaly worth a second review.

The table above displays the top 10 most important findings.

Total: 1 | Critical: 0 | High: 0 | Medium: 0 | Low: 1

Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.

⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

github-actions · 2026-05-19T22:39:57Z

PR Reviewer Guide 🔍

(Review updated until commit `cc0b9ab`)

Here are some key observations to aid the review process:

🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review Possible Issue The `bucketSecondsIfPositiveInteger` method returns `null` when `n <= 0`, but the caller at line 119 checks `bucketSeconds != null && bucketSeconds > 0L`. This creates a redundant check: if `bucketSeconds` is non-null, it is already guaranteed to be positive. However, the real issue is that when `n == 0`, the method returns `null` instead of `0L`, which might be a valid bucket size in some contexts. If `n == 0` is intentionally invalid, the check at line 119 is correct but redundant. If `n == 0` should be allowed, the method logic is incorrect. if (bucketSeconds != null && bucketSeconds > 0L) { return rewriteFixedLengthTimeBucket(rexBuilder, field, bucketSeconds, original.getType()); } } Possible Issue The `constantFold` method checks `outMs == 0L` at line 241, but `UNIT_TO_MILLIS` maps `MICROSECOND` to `0L` (line 74). This means any `TIMESTAMPDIFF` with `MICROSECOND` as the out-unit will return `null` from `constantFold`, even though the comment at line 244 suggests microseconds are "treated separately below" — but no separate handling exists. This causes valid microsecond-based diffs to fall through to the runtime rewrite path or be rejected, when they could be constant-folded. Long inMs = UNIT_TO_MILLIS.get(inUnit.toUpperCase(Locale.ROOT)); Long outMs = UNIT_TO_MILLIS.get(outUnit.toUpperCase(Locale.ROOT)); if (inMs == null \|\| outMs == null \|\| outMs == 0L) { return null; Arithmetic Overflow The `constantFold` method uses `Math.multiplyExact(inValue, inMs)` at line 247, which throws `ArithmeticException` on overflow. However, this exception is not caught, so it will propagate up and likely cause query failure. The caller at line 158 does not handle exceptions from `constantFold`. If `inValue` is large (e.g., `Long.MAX_VALUE` with unit `WEEK`), the multiplication overflows and crashes the adapter instead of gracefully falling back to the runtime rewrite. long totalMs = Math.multiplyExact(inValue, inMs); Possible Issue The `shouldSkipPartialFinalSplit` method checks `aggIdx >= aggCalls.size() \|\| k >= inputFields.size()` at line 114 and returns `true` (skip split) when out of bounds. However, this condition should never occur if the aggregate is well-formed — `groupSet` indices should always be valid input column indices, and `aggIdx` should always be within `aggCalls`. If this condition fires, it indicates a malformed plan, yet the code silently skips the split instead of failing fast. This could mask bugs in plan construction. if (aggIdx >= aggCalls.size() \|\| k >= inputFields.size()) { return true; // out of bounds → split would be structurally invalid

github-actions · 2026-05-19T22:40:20Z

PR Code Suggestions ✨

Latest suggestions up to cc0b9ab

Explore these optional code suggestions:

Category	Suggestion	Impact
General	Handle BigDecimal overflow exception The `longValueExact()` call can throw `ArithmeticException` when the `BigDecimal` value is out of range for `long`. Wrap this call in a try-catch to handle overflow gracefully and return `null` instead of propagating the exception. sandbox/plugins/analytics-backend-datafusion/src/main/java/org/opensearch/be/datafusion/SpanAdapter.java [138-152] long n; if (value instanceof BigDecimal bd) { if (bd.scale() > 0 && bd.stripTrailingZeros().scale() > 0) { return null; // non-integer interval not supported } - n = bd.longValueExact(); + try { + n = bd.longValueExact(); + } catch (ArithmeticException unused) { + return null; + } } else if (value instanceof Number num) { double d = num.doubleValue(); if (d != Math.floor(d)) { return null; } n = (long) d; } else { return null; } Suggestion importance[1-10]: 7 __ Why: The suggestion correctly identifies that `longValueExact()` can throw `ArithmeticException` for out-of-range values. Adding exception handling improves robustness by gracefully returning `null` instead of propagating the exception, which aligns with the method's contract of returning `null` for invalid inputs.	Medium
	Remove redundant positive check The condition `bucketSeconds > 0L` is redundant because `bucketSecondsIfPositiveInteger` already returns `null` when `n <= 0`. Remove the redundant check to simplify the logic. sandbox/plugins/analytics-backend-datafusion/src/main/java/org/opensearch/be/datafusion/SpanAdapter.java [118-121] Long bucketSeconds = bucketSecondsIfPositiveInteger(interval, unitSeconds); -if (bucketSeconds != null && bucketSeconds > 0L) { +if (bucketSeconds != null) { return rewriteFixedLengthTimeBucket(rexBuilder, field, bucketSeconds, original.getType()); } Suggestion importance[1-10]: 6 __ Why: The suggestion correctly identifies that `bucketSecondsIfPositiveInteger` already validates `n <= 0` and returns `null` in that case (line 153-155), making the `bucketSeconds > 0L` check redundant. However, the redundant check doesn't cause incorrect behavior and the improvement is minor.	Low
	Remove unreachable zero check The check `outMs == 0L` is unreachable because `UNIT_TO_MILLIS` only contains positive values (1L minimum for MILLISECOND). Remove this dead code condition to improve clarity. sandbox/plugins/analytics-backend-datafusion/src/main/java/org/opensearch/be/datafusion/TimestampDiffAdapter.java [241-243] -if (inMs == null \|\| outMs == null \|\| outMs == 0L) { +if (inMs == null \|\| outMs == null) { return null; } Suggestion importance[1-10]: 5 __ Why: The suggestion is correct that `UNIT_TO_MILLIS` only contains positive values (minimum 1L for MILLISECOND), making `outMs == 0L` unreachable. However, the check for MICROSECOND maps to 0L (line 74), so the condition is actually reachable and the suggestion is incorrect about it being dead code.	Low

Previous suggestions

Suggestions up to commit c75f6ad

Category	Suggestion	Impact
General	Remove redundant positive value check The condition `bucketSeconds > 0L` is redundant because `bucketSecondsIfPositiveInteger` already returns null for non-positive values (see the `n <= 0` check). This redundant check can be removed to simplify the logic. sandbox/plugins/analytics-backend-datafusion/src/main/java/org/opensearch/be/datafusion/SpanAdapter.java [118-121] Long bucketSeconds = bucketSecondsIfPositiveInteger(interval, unitSeconds); -if (bucketSeconds != null && bucketSeconds > 0L) { +if (bucketSeconds != null) { return rewriteFixedLengthTimeBucket(rexBuilder, field, bucketSeconds, original.getType()); } Suggestion importance[1-10]: 5 __ Why: The `bucketSeconds > 0L` check is indeed redundant since `bucketSecondsIfPositiveInteger` already ensures non-null implies positive. Removing it simplifies the code slightly without changing behavior.	Low

Suggestions up to commit e9b314c

Category	Suggestion	Impact
Possible issue	Validate schema field count before iteration If `target_schema` has fewer fields than the batch, accessing `batch.column(i)` will panic when `i >= batch.num_columns()`. Verify that both schemas have the same field count before iterating to prevent index-out-of-bounds panics. sandbox/plugins/analytics-backend-datafusion/rust/src/api.rs [969-985] +if batch.num_columns() != target_schema.fields().len() { + return Err(DataFusionError::Execution(format!( + "Schema field count mismatch: batch has {} columns, target schema has {} fields", + batch.num_columns(), + target_schema.fields().len() + ))); +} let mut columns: Vec<arrow::array::ArrayRef> = Vec::with_capacity(batch.num_columns()); for (i, target_field) in target_schema.fields().iter().enumerate() { let src = batch.column(i); if src.data_type() == target_field.data_type() { columns.push(src.clone()); } else { columns.push(arrow::compute::cast(src, target_field.data_type()).map_err(\|e\| { ... })?); } } Suggestion importance[1-10]: 8 __ Why: This is a valid safety concern. If `target_schema` has more fields than the batch, accessing `batch.column(i)` will panic. Adding an upfront validation prevents potential runtime panics and provides a clearer error message.	Medium
General	Prevent unnecessary overflow in unit conversion The multiplication `inValue * inMs` can overflow when both values are large, but the subsequent division by `outMs` might bring the result back into range. Consider performing the division before multiplication when `inMs` is divisible by `outMs` to avoid unnecessary overflow rejections. sandbox/plugins/analytics-backend-datafusion/src/main/java/org/opensearch/be/datafusion/TimestampDiffAdapter.java [240-244] +if (inMs % outMs == 0) { + long factor = inMs / outMs; + return Math.multiplyExact(inValue, factor); +} long totalMs = Math.multiplyExact(inValue, inMs); if (totalMs % outMs != 0) { return null; } return totalMs / outMs; Suggestion importance[1-10]: 7 __ Why: The suggestion correctly identifies that when `inMs` is divisible by `outMs`, performing division before multiplication can prevent overflow while maintaining correctness. This optimization is valid and improves robustness for large values.	Medium
General	Remove redundant positive value check The check `bucketSeconds > 0L` is redundant because `bucketSecondsIfPositiveInteger` already returns `null` when `n <= 0`. The null-check alone is sufficient, simplifying the condition. sandbox/plugins/analytics-backend-datafusion/src/main/java/org/opensearch/be/datafusion/SpanAdapter.java [118-121] Long bucketSeconds = bucketSecondsIfPositiveInteger(interval, unitSeconds); -if (bucketSeconds != null && bucketSeconds > 0L) { +if (bucketSeconds != null) { return rewriteFixedLengthTimeBucket(rexBuilder, field, bucketSeconds, original.getType()); } Suggestion importance[1-10]: 5 __ Why: The suggestion is correct that `bucketSecondsIfPositiveInteger` already returns `null` for non-positive values (line 153-155), making the `> 0L` check redundant. This is a minor code cleanup that improves readability.	Low

Suggestions up to commit 2817c6e

Category	Suggestion	Impact
Possible issue	Prevent silent overflow in multiplication The multiplication `n * unitSeconds` can silently overflow when both values are large positive longs, producing incorrect negative or wrapped results. Check for overflow before returning, or use `Math.multiplyExact()` to throw on overflow and handle it appropriately. sandbox/plugins/analytics-backend-datafusion/src/main/java/org/opensearch/be/datafusion/SpanAdapter.java [153-156] if (n <= 0) { return null; } -return n * unitSeconds; +try { + return Math.multiplyExact(n, unitSeconds); +} catch (ArithmeticException e) { + return null; +} Suggestion importance[1-10]: 8 __ Why: This is a valid concern as multiplying two large positive `long` values can silently overflow, producing incorrect results. Using `Math.multiplyExact()` with exception handling prevents silent data corruption in time bucket calculations, which could lead to incorrect query results.	Medium
	Validate column count before indexing If `batch.num_columns()` is less than `target_schema.fields().len()`, the loop will panic when calling `batch.column(i)` with an out-of-bounds index. Validate that the column counts match before iterating, or use `batch.columns().iter().zip(target_schema.fields())` to safely pair columns with fields. sandbox/plugins/analytics-backend-datafusion/rust/src/api.rs [969-985] +if batch.num_columns() != target_schema.fields().len() { + return Err(DataFusionError::Execution(format!( + "Column count mismatch: batch has {} columns but target schema has {} fields", + batch.num_columns(), + target_schema.fields().len() + ))); +} let mut columns: Vec<arrow::array::ArrayRef> = Vec::with_capacity(batch.num_columns()); for (i, target_field) in target_schema.fields().iter().enumerate() { let src = batch.column(i); if src.data_type() == target_field.data_type() { columns.push(src.clone()); } else { columns.push(arrow::compute::cast(src, target_field.data_type()).map_err(\|e\| { - DataFusionError::Execution(format!( + DatafusionError::Execution(format!( "Failed to cast partition stream column '{}' from {:?} to {:?}: {}", target_field.name(), src.data_type(), target_field.data_type(), e )) })?); } } Suggestion importance[1-10]: 8 __ Why: The suggestion identifies a potential panic when `batch.num_columns()` is less than `target_schema.fields().len()`. Adding an upfront validation prevents out-of-bounds access and provides a clear error message, improving robustness of the schema coercion logic.	Medium
	Handle ArithmeticException from longValueExact The `longValueExact()` call can throw `ArithmeticException` if the `BigDecimal` has a non-zero fractional part or exceeds `long` range. Wrap this in a try-catch block to return `null` gracefully instead of propagating an exception, maintaining consistency with the other validation paths. sandbox/plugins/analytics-backend-datafusion/src/main/java/org/opensearch/be/datafusion/SpanAdapter.java [138-152] long n; if (value instanceof BigDecimal bd) { if (bd.scale() > 0 && bd.stripTrailingZeros().scale() > 0) { return null; // non-integer interval not supported } - n = bd.longValueExact(); + try { + n = bd.longValueExact(); + } catch (ArithmeticException e) { + return null; + } } else if (value instanceof Number num) { double d = num.doubleValue(); if (d != Math.floor(d)) { return null; } n = (long) d; } else { return null; } Suggestion importance[1-10]: 7 __ Why: The suggestion correctly identifies that `longValueExact()` can throw `ArithmeticException` when the `BigDecimal` exceeds `long` range. However, the existing code already checks for non-integer values via `scale()`, making overflow the primary concern. Adding the try-catch improves robustness and maintains consistency with the validation pattern.	Medium

Suggestions up to commit 2817c6e

Category	Suggestion	Impact
Possible issue	Validate column count before iteration If `target_schema` has fewer fields than `batch.num_columns()`, accessing `batch.column(i)` will panic when `i` exceeds the batch's column count. Verify that both schemas have the same number of columns before iterating. sandbox/plugins/analytics-backend-datafusion/rust/src/api.rs [969-971] +if batch.num_columns() != target_schema.fields().len() { + return Err(DataFusionError::Execution(format!( + "Schema mismatch: batch has {} columns but target schema has {} fields", + batch.num_columns(), + target_schema.fields().len() + ))); +} let mut columns: Vec<arrow::array::ArrayRef> = Vec::with_capacity(batch.num_columns()); for (i, target_field) in target_schema.fields().iter().enumerate() { let src = batch.column(i); Suggestion importance[1-10]: 8 __ Why: Important safety check to prevent potential panic when `target_schema` and `batch` have mismatched column counts. The iteration uses `target_schema.fields().len()` but accesses `batch.column(i)`, which could panic if schemas diverge in size.	Medium
General	Handle potential ArithmeticException from longValueExact Calling `longValueExact()` can throw `ArithmeticException` if the `BigDecimal` has a non-zero fractional part or exceeds `Long` range. Wrap this call in a try-catch block to handle potential exceptions gracefully. sandbox/plugins/analytics-backend-datafusion/src/main/java/org/opensearch/be/datafusion/SpanAdapter.java [140-143] if (bd.scale() > 0 && bd.stripTrailingZeros().scale() > 0) { return null; // non-integer interval not supported } -n = bd.longValueExact(); +try { + n = bd.longValueExact(); +} catch (ArithmeticException e) { + return null; +} Suggestion importance[1-10]: 7 __ Why: Valid concern about `longValueExact()` potentially throwing `ArithmeticException` if the value exceeds `Long` range. The try-catch addition provides defensive error handling, though the preceding scale check already filters out non-integers.	Medium
General	Remove redundant positive value check The check `bucketSeconds > 0L` is redundant because `bucketSecondsIfPositiveInteger` already returns `null` when `n <= 0`. Remove the redundant check to simplify the condition. sandbox/plugins/analytics-backend-datafusion/src/main/java/org/opensearch/be/datafusion/SpanAdapter.java [118-121] Long bucketSeconds = bucketSecondsIfPositiveInteger(interval, unitSeconds); -if (bucketSeconds != null && bucketSeconds > 0L) { +if (bucketSeconds != null) { return rewriteFixedLengthTimeBucket(rexBuilder, field, bucketSeconds, original.getType()); } Suggestion importance[1-10]: 6 __ Why: The suggestion correctly identifies that `bucketSecondsIfPositiveInteger` already validates `n <= 0` and returns `null` in that case, making the `bucketSeconds > 0L` check redundant. This simplifies the code without changing behavior.	Low

github-actions · 2026-05-19T22:41:50Z

Persistent review updated to latest commit 2817c6e

github-actions · 2026-05-19T22:58:56Z

Persistent review updated to latest commit e9b314c

github-actions · 2026-05-19T23:06:20Z

❌ Gradle check result for e9b314c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2026-05-19T23:46:20Z

Persistent review updated to latest commit c75f6ad

github-actions · 2026-05-20T00:42:40Z

❌ Gradle check result for c75f6ad: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@timestamp

…t split Two coupled changes that together unblock the no-by family of PPL timechart queries on the analytics-engine route: 1. Register TIMESTAMPDIFF and TIMESTAMPADD as scalar functions - Add enum constants to ScalarFunction (analytics-framework SPI) - Add to STANDARD_PROJECT_OPS in DataFusionAnalyticsBackendPlugin Clears the capability-layer rejection at OpenSearchProjectRule.annotateExpr. isthmus-side substrait conversion still needs an adapter (separate commit): the call shape `TIMESTAMPDIFF(string, ts, ts)` is not in the default extension catalog. Per-function ITs progress past the capability check and now report "Unable to convert call TIMESTAMPADD(string, i32, precision_timestamp<0>?)" — confirms the capability layer is unblocked. 2. Skip OpenSearchAggregateSplitRule when groupSet is not a {0..gc-1} prefix - matches() now returns false when the ORIGINAL groupSet doesn't equal ImmutableBitSet.range(groupCount) The split rule was previously erroring out (IllegalArgumentException, "Type mismatch ... <aggType> -> <groupType>") for the no-by avg() shape PPL timechart produces, where the Project below the Aggregate keeps the raw @timestamp at position 0 and materializes SPAN(@timestamp) at a later position. FINAL re-used the ORIGINAL's groupSet against PARTIAL's output, dereferencing a position whose type changed from group-key (TIMESTAMP) to agg-result. Trying to remap FINAL's groupSet+aggCalls to PARTIAL's leading positions breaks aggregation semantics (COUNT/AVG need function-swap that AggregateDecompositionResolver only applies at substrait-conversion time). Skipping the split for non-prefix groupSets sacrifices distributed parallelism for correctness — same trade-off pattern as the existing percentile_approx skip. CalciteStatsCommandIT remains at 47/63 (no regression), and the 4 no-by + agg() timechart tests now produce correct data rows (currently failing only on the orthogonal datetime output-cast issue tracked at opensearch-project/sql#5420). Signed-off-by: Kai Huang <huangkaics@gmail.com>

…er_* Adds a peephole adapter that constant-folds the exact expression shape PPL timechart's per_second / per_minute / per_hour / per_day aggregations produce: TIMESTAMPDIFF(out_unit, t, TIMESTAMPADD(in_unit, n, t)) When both units are fixed-length (MICROSECOND through WEEK) and the inner TIMESTAMPADD's base is the same RexNode as the outer TIMESTAMPDIFF's start, the expression evaluates to the constant n * (in_ms / out_ms) regardless of t. The adapter materializes that BIGINT literal and lets it flow through Substrait — eliminating both PPL UDF references in one step (neither has a Substrait extension binding). OperatorAnnotation wrappers (AnnotatedProjectExpression introduced by OpenSearchProjectRule) are peeled at each operand before the structural comparison, so the wrapped inner TIMESTAMPADD remains recognizable instead of looking like an annotation RexCall whose operator is ANNOTATED_PROJECT_EXPR. Out of scope (call falls through unchanged): - Variable-length out_unit / in_unit (MONTH / QUARTER / YEAR) — fold isn't safe because milliseconds-per-month depends on which month t lands in. testTimechartPerSecondWithVariableMonthLengths is an example: span=1M with per_second expects different per-row results for Feb vs Oct. - Standalone TIMESTAMPADD (not nested inside TIMESTAMPDIFF) — no adapter registered for TIMESTAMPADD on its own. - Non-literal unit / n operands — the peephole requires both to be string / integer literals. Pass-rate impact on the subset this adapter targets (CalciteTimechartPerFunctionIT): - Before: 6 tests blocked at "Unable to convert call TIMESTAMPADD(string, i32, ...)". - After: 1 test now reaches client-side assertion (sql#5420 schema-string cast), 4 tests reveal the orthogonal multi-unit SPAN limitation (opensearch-project#21584's known gap), 1 test correctly stays blocked on variable-month-length. CalciteStatsCommandIT remains at 47/63 (no regression). Signed-off-by: Kai Huang <huangkaics@gmail.com>

Resolves the 13 by-clause failures in CalciteTimechartCommandIT (5/18 → 18/18) caused by DataFusion's row_number() physical op emitting UInt64 while Calcite types ROW_NUMBER OVER as BIGINT (Int64 signed). Two coupled changes mirror the existing schema_coerce::coerce_inferred_schema bridge the parquet-scan path uses, applied at the partition-stream boundary where the analytics-engine reduce stage reads from upstream PARTIAL fragments via NativeBridge.registerPartitionStream. 1. api::register_partition_stream — coerce the producer's physical schema via schema_coerce before registering the StreamingTable and before writing the IPC bytes back to Java. The Substrait consumer plan's `ensure_schema_compatibility` (datafusion-substrait 53) strict-checks each ReadRel's base_schema against the registered table; the coercion makes Int64-declared plans bind against this table. The Java-side typesMatch tripwire then sees the coerced schema and validates incoming batches against it (next change). 2. api::sender_send — coerce incoming RecordBatches to match the sender's (now-coerced) schema via arrow::compute::cast where types differ. Without this, DataFusion's HashJoin / RepartitionExec panics with "primitive array" on as_primitive::<Int64Type>() of a UInt64 column when the producer's actual batch reaches downstream operators. cast() is the existing Arrow kernel — no-op when the schemas already match, single-cast on the mismatched columns otherwise. Same width preserves zero-copy semantics modulo the bit-reinterpret arrow handles internally. 3. DatafusionReduceSink.typesMatch — relax Java's tripwire to treat same-width Int (any signedness) as compatible, mirroring the existing Timestamp(any precision) tolerance. Same rationale: divergence is harmless because the sender_send coercion produces a downstream- compatible batch. Pass-rate impact: - CalciteTimechartCommandIT: 5/18 → 18/18 ✅ (+13) - CalciteTimechartPerFunctionIT: 1/12 → 1/12 (residuals are opensearch-project#21584's multi-unit-SPAN gap + variable-month TIMESTAMPADD) - Combined timechart: 6/30 → 19/30 (63%) Signed-off-by: Kai Huang <huangkaics@gmail.com>

Closes the 9 CalciteTimechartPerFunctionIT failures of the shape: source=events_traffic | timechart span=2m per_<unit>(packets) [by host] The pre-existing SpanAdapter handled time spans only when interval==1 (e.g. `span=1m` → `date_trunc('minute', t)`). Multi-unit time spans (`span=2m`, `span=12h`, `span=5d`, ...) fell through unchanged, surfacing to isthmus as `Unable to convert call SPAN(precision_timestamp<0>?, i32, char<1>)`. DataFusion has `date_bin` for arbitrary-interval bucketing, but Calcite has no built-in `DATE_BIN` operator and adding one to the SqlOperator table is out of scope here. Rewrite for fixed-length units (s, m, h, d, w): SPAN(t, N, '<unit>') → CAST(from_unixtime(CAST((to_unixtime(t) / B) * B AS DOUBLE)) AS <result>) where B = N * unit_seconds. Component choices (each forced by a binding-time substrait constraint surfaced during iterative IT runs): 1. `UnixTimestampAdapter.LOCAL_TO_UNIXTIME_OP` + `RustUdfDateTimeAdapters .LOCAL_FROM_UNIXTIME_OP` instead of Calcite stdlib `UNIX_SECONDS` / `TIMESTAMP_SECONDS`. The latter bind to BigQuery-named substrait functions that DF's substrait consumer doesn't register; the locally- declared operators route through the existing `FunctionMappings.s` entries to DataFusion's native `to_unixtime` / `from_unixtime` UDFs. 2. No `FLOOR` between divide and multiply. Calcite integer ÷ integer already truncates toward zero (correct for non-negative epoch seconds, which all timechart-tested data is). FLOOR(i64) doesn't bind in substrait — DF's `floor` is floating-point only. 3. `CAST(... AS DOUBLE)` before `from_unixtime`. The yaml signature is `from_unixtime(fp64) -> precision_timestamp<6>`; without the cast, isthmus reports `Unable to convert call from_unixtime(i64?)`. Variable-length units (M, q, y) intentionally fall through — bucketing by calendar month is not a `N * fixed_seconds` operation. Those tests (testTimechartPerMonthWithSpecifiedSpan, testTimechartPerSecondWith VariableMonthLengths) need a real DataFusion `date_bin` interval-month rewrite, tracked separately. Pass-rate impact: - CalciteTimechartPerFunctionIT: 1/12 → 10/12 (+9) - CalciteTimechartCommandIT: 18/18 (no regression) - Combined timechart: 19/30 → 28/30 (93%) Signed-off-by: Kai Huang <huangkaics@gmail.com>

…rite Closes the 2 residual CalciteTimechartPerFunctionIT failures of the shape: source=events_traffic | timechart span=1M per_second(packets) source=bank | timechart timefield=birthdate span=1month per_day(balance) by gender The existing peephole folds TIMESTAMPDIFF(out_unit, t, TIMESTAMPADD(in_unit, n, t)) to a constant when both units are fixed-length. For variable-length inner units (MONTH/QUARTER/YEAR), folding isn't safe — the milliseconds-per-bucket value depends on which calendar month the row's bucket lands in (Feb 2025 = 28 days = 2_419_200 seconds; Oct 2025 = 31 days = 2_678_400 seconds). The unfolded TIMESTAMPDIFF call previously fell through and surfaced as "Unable to convert call TIMESTAMPADD(string, i32, precision_timestamp<0>?)" because PPL's TIMESTAMPADD/TIMESTAMPDIFF UDFs have no Substrait extension binding. This commit extends the adapter to rewrite atomically to a runtime form when in-unit is variable-length and out-unit is fixed-length: TIMESTAMPDIFF(out_unit, t, TIMESTAMPADD(MONTH|QUARTER|YEAR, n, t)) → (to_unixtime(DATETIME_PLUS(t, INTERVAL n*m MONTH)) - to_unixtime(t)) * out_factor where m converts QUARTER→3 / YEAR→12, INTERVAL_YEAR_MONTH is constructed via SqlIntervalQualifier(MONTH) (same idiom as EarliestLatestAdapter# makeIntervalAdd, which proves DATETIME_PLUS(t, INTERVAL_MONTH) binds end-to-end via Substrait's standard add(timestamp, interval_year_month) into DataFusion's native interval add), and out_factor scales the unix-seconds difference to the requested out-unit (MICROSECOND→×1_000_000, MILLISECOND→ ×1000, SECOND→×1, MINUTE→÷60, HOUR→÷3600, DAY→÷86400, WEEK→÷604800). Out-unit MONTH/QUARTER/YEAR (variable on both sides) is still rejected — fixed-second math doesn't apply. No timechart per_* shape needs this. PPL timechart's per_function always emits MILLISECOND out-unit (see Chart#transformPerFunction: spanMillis = timestampdiff(MILLISECOND, t, timestampadd(span.unit, span.value, t))). So the MILLISECOND multiplier path is the only hot case. The other out-unit factors cover consumer-code shapes that compute coarser durations from a variable-length bucket. Pass-rate impact: - CalciteTimechartPerFunctionIT: 10/12 → 12/12 ✅ (+2) - CalciteTimechartCommandIT: 18/18 (no regression) - Combined timechart: 28/30 → 30/30 ✅ Signed-off-by: Kai Huang <huangkaics@gmail.com>

github-actions · 2026-05-20T18:23:00Z

Persistent review updated to latest commit cc0b9ab

github-actions · 2026-05-20T19:16:18Z

✅ Gradle check result for cc0b9ab: SUCCESS

codecov · 2026-05-20T19:17:27Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.44%. Comparing base (be29fd0) to head (cc0b9ab).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main   #21747      +/-   ##
============================================
- Coverage     73.48%   73.44%   -0.04%     
+ Complexity    75078    75068      -10     
============================================
  Files          6012     6016       +4     
  Lines        340940   341072     +132     
  Branches      49076    49091      +15     
============================================
- Hits         250543   250505      -38     
- Misses        70409    70559     +150     
- Partials      19988    20008      +20

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

sandeshkr419

Thanks for making the changes.

@timestamp

…a date_bin Adds a third lowering path to `SpanAdapter` so dashboard logs-tab histogram queries with sub-second bucket widths (e.g. `span(@timestamp, 40ms)`, `span(@timestamp, 250us)`) work on the analytics-engine route. ## Problem PR opensearch-project#21747 (Timechart route end-to-end closure) added multi-unit time SPAN support for fixed-second units `s`, `m`, `h`, `d`, `w` via integer-seconds arithmetic, intentionally leaving sub-second units out of scope ( `to_unixtime` returns BIGINT seconds and would truncate sub-second precision). The dashboard logs-tab histogram issues queries like `stats count() by span(@timestamp, 40ms)`, which falls outside both the existing `N == 1` `date_trunc` path and the new fixed-second arithmetic path — surfacing as: ``` Unable to convert call SPAN(precision_timestamp<0>?, i32, char<1>) ``` ## Fix Adds a third lowering path checked after the existing two: - `interval == 1`, any unit → `date_trunc(<unit>, field)` (unchanged) - fixed-second + `interval > 1` → integer-seconds arithmetic (opensearch-project#21747, unchanged) - **sub-second + `interval > 1` → `date_bin("<N> <unit>", field)` (new)** DataFusion's `date_bin(stride, source)` accepts a string-form interval stride like `"40 milliseconds"` or `"250 microseconds"` natively. Emitting the stride as a plain string literal avoids substrait interval-type plumbing that would otherwise be needed for sub-second precision, and matches how DataFusion's native parser accepts strides at the SQL surface. Wiring: - `opensearch_scalar_functions.yaml` — new `date_bin` entry, `(stride: string, source: precision_timestamp<P>) → precision_timestamp<P>`. - `SpanAdapter.java` — `SUB_SECOND_UNIT_TO_DATE_BIN_STRIDE` map plus a `LOCAL_DATE_BIN_OP` Calcite SqlFunction, used by the new branch in `adapt()`. - `DataFusionFragmentConvertor.java` — `FunctionMappings.s(SpanAdapter.LOCAL_DATE_BIN_OP, "date_bin")`. Multi-unit month/quarter/year continue to fall through unchanged (variable bucket length needs `date_bin` with an interval-month stride; tracked separately). ## Test plan ### Unit (`SpanAdapterSubSecondTests`, 5/5 pass) - `testMillisecondMultiUnitRewritesToDateBin` — `40ms` lowers to `date_bin("40 milliseconds", field)`. - `testMicrosecondMultiUnitRewritesToDateBin` — `250us` lowers to `date_bin("250 microseconds", field)`. - `testSubSecondUnitOneStaysOnDateTruncPath` — `N == 1` for `us`/`ms` keeps using the cheaper `date_trunc` path (regression guard). - `testFractionalIntervalForSubSecondFallsThrough` — non-integer intervals are not rewritten (would be a planner-level bug; falls through to surface as a normal substrait binding error rather than silent miscalculation). - `testAdaptedDateBinCallPreservesOriginalReturnType` — Project rowType invariant. Full `analytics-backend-datafusion:test` suite passes (no regression of the timechart fixed-second arithmetic path). ### End-to-end (`:run` + `_plugins/_ppl` against a parquet-backed index) | Query | Result | |---|---| | `span(@timestamp, 1h)` | bucket = `00:00:00` (date_trunc path) | | `span(@timestamp, 2h)` | bucket = `00:00:00` (opensearch-project#21747 arithmetic path) | | `span(@timestamp, 40ms)` | **5 distinct 40ms buckets** (`00:07:56.080`, `00:07:56.480`, `00:07:57.200`, `00:07:58`, `00:08:00`) ← new path | | Exact dashboard query (filtered range + `stats count() by span(@timestamp, 40ms)`) | 4 buckets within filter range, no errors | Signed-off-by: Kai Huang <ahkcs@amazon.com>

@timestamp

…oute Sister to SpanCommandIT (which explicitly scopes to numeric span and the Rust kernel). This new IT exercises all three time-mode lowering paths in SpanAdapter: - N==1 / date_trunc — `span(@timestamp, 1h)`, `span(@timestamp, 1ms)` - N>1 fixed-second arithmetic — `span(@timestamp, 2h)` (opensearch-project#21747) - N>1 sub-second date_bin — `span(@timestamp, 40ms)`, `span(@timestamp, 250us)` (this PR) Adds a focused 5-row `span_time` dataset with explicit sub-second timestamps. The existing `calcs` dataset's `datetime0` field is whole- second only and can't exercise sub-second bucketing. Pairs with the SpanAdapterSubSecondTests JUnit suite — the unit tests assert the rewritten RexCall shape; this IT verifies the end-to-end analytics-engine path (PPL → CalciteRelNodeVisitor → SpanAdapter → Substrait → DataFusion) returns the right buckets. All 5 tests pass on a self-contained QA cluster: - testSpanUnitOneHourLowersToDateTrunc — 5 rows in one 1h bucket - testSpanMultiUnitHoursLowersToArithmetic — 5 rows in one 2h bucket - testSpanFortyMillisecondsBucketsEachRowDistinctly — 5 distinct 40ms buckets - testSpanMicrosecondsBucketsEachRowDistinctly — 5 distinct 250us buckets - testSpanOneMillisecondStaysOnDateTruncPath — regression guard (N=1 ms) Signed-off-by: Kai Huang <ahkcs@amazon.com>

ahkcs requested a review from a team as a code owner May 19, 2026 22:38

ahkcs changed the title ~~[Analytics Backend] UInt64 → Int64 bridge at partition-stream boundary~~ [Analytics Backend] Timechart route closure — UInt64↔Int64 bridge + multi-unit SpanAdapter May 19, 2026

ahkcs force-pushed the pr/analytics-uint64-int64-bridge branch from 2817c6e to e9b314c Compare May 19, 2026 22:56

ahkcs changed the title ~~[Analytics Backend] Timechart route closure — UInt64↔Int64 bridge + multi-unit SpanAdapter~~ [Analytics Backend] Timechart route end-to-end closure (1/30 → 30/30) May 19, 2026

ahkcs force-pushed the pr/analytics-uint64-int64-bridge branch from e9b314c to c75f6ad Compare May 19, 2026 23:44

sandeshkr419 reviewed May 20, 2026

View reviewed changes

Comment thread sandbox/plugins/analytics-backend-datafusion/rust/src/api.rs Outdated

ahkcs added 5 commits May 20, 2026 11:20

ahkcs force-pushed the pr/analytics-uint64-int64-bridge branch from c75f6ad to cc0b9ab Compare May 20, 2026 18:21

ahkcs requested a review from sandeshkr419 May 20, 2026 18:51

sandeshkr419 approved these changes May 20, 2026

View reviewed changes

sandeshkr419 merged commit b599c5e into opensearch-project:main May 20, 2026
16 of 18 checks passed

ahkcs deleted the pr/analytics-uint64-int64-bridge branch May 20, 2026 21:42

ahkcs mentioned this pull request May 20, 2026

[Analytics Backend / DataFusion] Wire sub-second multi-unit span() via date_bin #21765

Open

Conversation

ahkcs commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Commits

1. [Analytics] Wire TIMESTAMPDIFF/TIMESTAMPADD + skip non-prefix-groupSet split

2. [Analytics Backend] Add TimestampDiffAdapter peephole for timechart per_*

3. [Analytics Backend] UInt64 → Int64 bridge at partition-stream boundary

4. [Analytics Backend] SpanAdapter — multi-unit fixed-length time bucket

5. [Analytics Backend] TimestampDiffAdapter — variable-month runtime rewrite

Pass-rate impact

Sandbox check status

Test plan

Uh oh!

github-actions Bot commented May 19, 2026

PR Code Analyzer ❗

Uh oh!

github-actions Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Reviewer Guide 🔍

(Review updated until commit cc0b9ab)

Uh oh!

github-actions Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Code Suggestions ✨

Previous suggestions

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

Uh oh!

github-actions Bot commented May 20, 2026

Uh oh!

github-actions Bot commented May 20, 2026

Uh oh!

github-actions Bot commented May 20, 2026

Uh oh!

codecov Bot commented May 20, 2026

Codecov Report

Uh oh!

sandeshkr419 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ahkcs commented May 19, 2026 •

edited

Loading

1. `[Analytics] Wire TIMESTAMPDIFF/TIMESTAMPADD + skip non-prefix-groupSet split`

2. `[Analytics Backend] Add TimestampDiffAdapter peephole for timechart per_*`

3. `[Analytics Backend] UInt64 → Int64 bridge at partition-stream boundary`

4. `[Analytics Backend] SpanAdapter — multi-unit fixed-length time bucket`

5. `[Analytics Backend] TimestampDiffAdapter — variable-month runtime rewrite`

github-actions Bot commented May 19, 2026 •

edited

Loading

(Review updated until commit `cc0b9ab`)

github-actions Bot commented May 19, 2026 •

edited

Loading