Add Derived Dataset Column Definitions #51

bcodell · 2023-12-24T02:48:05Z

Enable developers to define dataset columns that represent transformations of one or more other dataset columns.

The actual aql might look like the following:

{% set aql %}
using customer_stream
select all activity_1 (
    customer_id as customer_id,
    activity_at as activity_1_at
)
append first after activity_2 (
    activity_at as activity_2_at
)
derive (
    datediff('d', ${activity_1_at}, ${activity_2_at}) as time_to_activity_2_days
)
{% endset %}

The resulting dataset schema should be:

customer_id (str)
activity_1_at (ts)
activity_2_at (ts)
time_to_activity_2_days (float)
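
To make the `${...}` references in the derive clause concrete, here is a minimal Python sketch of how a derived expression could be checked against the base columns and rendered to plain SQL. The function name `render_derived_column` and the decision to raise on unknown columns are assumptions for illustration only, not how the project implements it.

```python
import re

# Matches ${column_name} placeholders as written in the derive clause.
PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def render_derived_column(expression: str, known_columns: set[str]) -> str:
    """Replace ${col} placeholders with bare column names.

    Hypothetical helper: raises ValueError if a placeholder names a column
    that is not among the known dataset columns.
    """
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in known_columns:
            raise ValueError(f"derived column references unknown column: {name}")
        return name

    return PLACEHOLDER.sub(substitute, expression)


# Base columns taken from the example dataset above.
base_columns = {"customer_id", "activity_1_at", "activity_2_at"}
sql = render_derived_column(
    "datediff('d', ${activity_1_at}, ${activity_2_at})", base_columns
)
print(sql)  # datediff('d', activity_1_at, activity_2_at)
```

Explicit placeholders keep column references distinguishable from the rest of the arbitrary SQL, so the referenced columns can be validated without parsing the SQL itself.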

Open questions:

How do we identify the data type of a derived column? The types of first-level dataset columns can be inferred because the data type of the attribute and any aggregation function applied are both known, but derived columns can (and should) be defined with arbitrary SQL.
How do we identify multiple derived columns? Currently columns are parsed on the assumption that a comma appears only at the end of a column alias, but derived columns will use arbitrary SQL (which can include commas), and that will break this parsing logic.
How do we apply aggregations to derived columns?
- Not supported for now. We first need to figure out the base dataset aggregation workflow semantics.
How necessary are these features in aql, if the goal is interfacing in a BI layer?
- Very. A code-centric interface is needed to enable automated maintenance/upkeep of dataset columns as they are canonized.
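
For the comma-parsing question, one possible direction is to split the derive body only on commas that sit outside parentheses and quoted strings. The Python sketch below shows the idea. The name `split_top_level` and the second column (`customer_label`) are hypothetical, and the sketch deliberately ignores double-quoted identifiers, SQL comments, and bracket syntax.

```python
def split_top_level(clause: str) -> list[str]:
    """Split on commas that are outside parentheses and single-quoted strings."""
    parts: list[str] = []
    current: list[str] = []
    depth = 0          # current parenthesis nesting level
    in_quote = False   # True while inside a single-quoted SQL string

    for ch in clause:
        if ch == "'":
            # A doubled '' escape toggles twice, so the state ends up unchanged.
            in_quote = not in_quote
        elif not in_quote:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif ch == "," and depth == 0:
                # Top-level comma: close the current column definition.
                parts.append("".join(current).strip())
                current = []
                continue
        current.append(ch)

    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts


# Hypothetical derive body with two columns; customer_label is made up.
derive_body = (
    "datediff('d', ${activity_1_at}, ${activity_2_at}) as time_to_activity_2_days, "
    "coalesce(${customer_id}, 'n/a, unknown') as customer_label"
)
for column in split_top_level(derive_body):
    print(column)
```

Here the commas inside `datediff(...)` and inside the quoted string are left alone, and only the comma between the two column definitions causes a split.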
