Overview
We should allow cube authors to declare materialization configuration via cube YAML. On deployment, DJ orchestrates the full three-step materialization flow behind the scenes: planning pre-agg records, scheduling a pre-agg Spark workflow per pre-agg, and scheduling the Druid cube ingestion workflow.
Cube YAML with Materialization
```yaml
name: ${prefix}my_cube
node_type: cube
metrics: [...]
dimensions: [...]
materialization:
  schedule: "0 6 * * *"        # cron schedule
  strategy: incremental_time   # or: full
  lookback_window: 1 DAY       # only relevant for incremental_time
  partition:                   # required if strategy is incremental_time
    dimension: shared.dims.date
    granularity: DAY           # or: HOUR
  backfill_from: "20250101"    # optional; if set, backfills from this date to today
```
Deployment Flow
When a cube with a materialization block is deployed, the deployment process:
1. `POST /preaggs/plan`: plans and creates pre-agg records
2. `POST /preaggs/{id}/materialize`: schedules the Spark workflow for each pre-agg
3. `POST /cubes/{name}/materialize`: schedules the Druid ingestion workflow
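The three calls above can be sketched roughly as follows. The endpoint paths come from this issue, but the request payloads, response shapes, and the injected `post` callable are assumptions for illustration:

```python
def deploy_materialization(post, cube_name: str, config: dict) -> list:
    """Run the three-step materialization flow for a deployed cube.

    `post(path, payload)` performs an HTTP POST against the DJ server and
    returns the decoded JSON response; it is injected so it can be faked.
    """
    # Step 1: plan and create pre-agg records (response shape is assumed)
    planned = post("/preaggs/plan", {"cube": cube_name, **config})
    preagg_ids = [record["id"] for record in planned]

    # Step 2: schedule a Spark workflow for each planned pre-agg
    for preagg_id in preagg_ids:
        post(f"/preaggs/{preagg_id}/materialize", {})

    # Step 3: schedule the Druid ingestion workflow for the cube itself
    post(f"/cubes/{cube_name}/materialize", {})
    return preagg_ids
```

Driving the flow from the deployment process rather than from the client keeps the three steps atomic from the cube author's point of view: they only declare the YAML block.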
Partition Resolution
The user declares the partition once at the cube level as a dimension reference. DJ derives the physical partition column and format for each pre-agg automatically by looking up how that dimension is linked on the upstream node. This means users think in terms of dimensions rather than physical column names, and the correct column is resolved per pre-agg without any additional configuration.
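As a rough illustration of that lookup, assuming a simplified dimension-link model (the field names here are invented for the sketch, not DJ's actual schema):

```python
from dataclasses import dataclass

@dataclass
class DimensionLink:
    """How a dimension is joined onto an upstream node (simplified/assumed)."""
    dimension: str      # fully qualified dimension node, e.g. "shared.dims.date"
    join_column: str    # physical column on the upstream node
    column_format: str  # physical format of that column, e.g. "yyyyMMdd"

def resolve_partition(links: list, partition_dimension: str) -> tuple:
    """Derive the physical partition column and format for one pre-agg by
    looking up how the declared dimension is linked on the upstream node."""
    for link in links:
        if link.dimension == partition_dimension:
            return (link.join_column, link.column_format)
    raise ValueError(f"Dimension {partition_dimension} is not linked on this node")
```

Because the lookup runs per pre-agg, two pre-aggs built from different upstream nodes can resolve the same declared dimension to different physical columns without any extra configuration.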
Validation
- `partition` is required when `strategy: incremental_time`; deployment fails with a clear error if it is omitted
- `lookback_window` is ignored if `strategy: full`
- If `backfill_from` and `backfill_to` are both set, the backfill runs between those two dates. If `backfill_to` is not set, it defaults to today.
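A minimal sketch of those validation rules, using the config keys from the YAML above (the error and normalization mechanics are assumptions, not DJ's actual implementation):

```python
from datetime import date

def validate_materialization(config: dict) -> dict:
    """Validate and normalize a materialization block per the rules above."""
    strategy = config.get("strategy", "full")

    # partition is required for incremental_time; fail deployment with a clear error
    if strategy == "incremental_time" and "partition" not in config:
        raise ValueError(
            "materialization.partition is required when strategy is incremental_time"
        )

    # lookback_window only applies to incremental_time; ignore it otherwise
    if strategy == "full":
        config.pop("lookback_window", None)

    # backfill_to defaults to today when backfill_from is set without it
    if "backfill_from" in config and "backfill_to" not in config:
        config["backfill_to"] = date.today().strftime("%Y%m%d")
    return config
```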
Pre-agg Level Spark Config
Pre-agg names are content-addressed and not user-controllable, so there is no stable handle for attaching per-pre-agg config in YAML. Instead, Spark execution hints for pre-agg computation are declared on dimension links (see #1910).