[FEATURE] PPL per_* aggregation function support #4350

@dai-chen

Description

Is your feature request related to a problem?

PPL currently lacks support for per_* aggregation functions (per_second, per_minute, per_hour, per_day). These functions calculate rate-based metrics by normalizing aggregated values to specific time units, converting raw counts into meaningful per-unit rates.

Without these functions, users cannot easily perform the rate calculations that are common in performance monitoring, such as computing packets per second or requests per minute with the timechart command.

What solution would you like?

Implement the four per_* aggregation functions in PPL:

  • per_second(<value>) - Returns values normalized to per-second rate
  • per_minute(<value>) - Returns values normalized to per-minute rate
  • per_hour(<value>) - Returns values normalized to per-hour rate
  • per_day(<value>) - Returns values normalized to per-day rate

These functions should work exclusively with the timechart command, because they depend on its implicit timestamp field:

# Sample data
  {"_time":"2025-09-08T10:00:00", "packets":10},
  {"_time":"2025-09-08T10:00:05", "packets":60},
  {"_time":"2025-09-08T10:00:30", "packets":20},
  {"_time":"2025-09-08T10:00:50", "packets":30}

# Example 1
...
| eval _time=strptime(_time, "%Y-%m-%dT%H:%M:%S")
| timechart per_second(packets) span=1m

----------------------|--------------------
_time                 | per_second(packets)
----------------------|--------------------
2025-09-08T10:00:00   | 2                   # (10+60+20+30)/60s

# Example 2
timechart per_second(packets) span=20s

----------------------|--------------------
_time                 | per_second(packets)
----------------------|--------------------
2025-09-08T10:00:00   | 3.5                  # (10+60)/20s
2025-09-08T10:00:20   | 1                    # (20)/20s
2025-09-08T10:00:40   | 1.5                  # (30)/20s

# Example 3
timechart per_minute(packets), per_hour(packets), per_day(packets) span=1m

----------------------|---------------------|--------------------|--------------------
_time                 | per_minute(packets) | per_hour(packets)  | per_day(packets)
----------------------|---------------------|--------------------|--------------------
2025-09-08T10:00:00   | 120                 | 7200               | 172800
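
To make the expected semantics concrete, here is a minimal Python sketch that reproduces the three example tables from the sample data. It is only a model of the intended behavior, not the PPL implementation: the helper name `timechart_rate` and the `UNIT_SECONDS` table are hypothetical, and it assumes buckets are fixed-width and aligned to the epoch.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Seconds in the target unit of each proposed function.
UNIT_SECONDS = {"per_second": 1, "per_minute": 60, "per_hour": 3600, "per_day": 86400}

EPOCH = datetime(1970, 1, 1)

# Sample data from this issue: (_time, packets).
SAMPLE = [
    ("2025-09-08T10:00:00", 10),
    ("2025-09-08T10:00:05", 60),
    ("2025-09-08T10:00:30", 20),
    ("2025-09-08T10:00:50", 30),
]


def timechart_rate(events, func, span_seconds):
    """Sum values per fixed-width bucket, then scale each sum to the target unit."""
    sums = defaultdict(int)
    for ts, value in events:
        seconds = int((datetime.fromisoformat(ts) - EPOCH).total_seconds())
        bucket_start = seconds - seconds % span_seconds  # floor to the bucket start
        sums[bucket_start] += value

    unit_seconds = UNIT_SECONDS[func]
    return {
        (EPOCH + timedelta(seconds=start)).isoformat(): total * unit_seconds / span_seconds
        for start, total in sorted(sums.items())
    }


print(timechart_rate(SAMPLE, "per_second", 60))  # Example 1
print(timechart_rate(SAMPLE, "per_second", 20))  # Example 2
print(timechart_rate(SAMPLE, "per_day", 60))     # Example 3, last column
```

In this model every function reduces to `sum * unit_seconds / span_seconds`, which is why per_minute with span=1m returns the plain sum (120).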

What alternatives have you considered?

  • Manual calculation: Users can divide aggregated values by the time span themselves, but this requires knowing the time conversion factors.

Do you have any additional context?

Implementation approaches

  • Short-term solution: Rewrite per_* functions for fixed-width buckets
    • PPL timechart currently supports only the span option, which produces fixed-width intervals
    • Because the bucket width is known up front, per_* functions can be rewritten into arithmetic expressions at compile time
# Example
... | timechart per_second(packets) span=1m

=>

SELECT SUM(packets) / 60
...
GROUP BY SPAN(@timestamp, 1m)
  • Long-term solutions [TBD]
    • The primary challenge is the dynamic bucketing behavior of bin-options.
    • Option 1: Have the bucketing function output the bucket bounds, similar to the window function in Spark SQL
    • Option 2: Compute the bucket width dynamically, using the LEAD window function to find the next bucket's start time
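
The short-term rewrite above can be sketched in Python as a string-level transformation. This is an illustration only, not the planner code: `span_to_seconds` and `rewrite_per_function` are hypothetical names, only the s/m/h/d span units are handled, and a real implementation would rewrite the query plan rather than emit text.

```python
import re

# Seconds in the target unit of each proposed function.
UNIT_SECONDS = {"per_second": 1, "per_minute": 60, "per_hour": 3600, "per_day": 86400}

# Assumption: only fixed-width span units are handled in this sketch.
SPAN_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def span_to_seconds(span):
    """Convert a span literal such as '1m' or '20s' to a width in seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", span)
    if match is None:
        raise ValueError(f"unsupported span: {span!r}")
    count, unit = match.groups()
    return int(count) * SPAN_UNIT_SECONDS[unit]


def rewrite_per_function(func, field, span):
    """Return the arithmetic expression that replaces func(field) for a fixed span."""
    if func not in UNIT_SECONDS:
        raise ValueError(f"not a per_* function: {func!r}")
    span_seconds = span_to_seconds(span)
    unit_seconds = UNIT_SECONDS[func]
    if unit_seconds == 1:
        # per_second needs no scaling factor.
        return f"SUM({field}) / {span_seconds}"
    return f"SUM({field}) * {unit_seconds} / {span_seconds}"


print(rewrite_per_function("per_second", "packets", "1m"))  # SUM(packets) / 60
print(rewrite_per_function("per_hour", "packets", "20s"))   # SUM(packets) * 3600 / 20
```

Whether the generated division is integer or floating-point depends on the execution engine, so the real rewrite may need an explicit cast to produce results like 3.5 in Example 2.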

Metadata

Labels: PPL, enhancement
Status: In review