
Cube Materialization YAML Config #1912

@shangyian

Description

Overview

We should allow cube authors to declare materialization configuration via cube YAML. On deployment, DJ orchestrates the full three-step materialization flow behind the scenes: planning pre-agg records, scheduling the pre-agg Spark workflows, and scheduling the Druid cube ingestion workflow.

Cube YAML with Materialization

  name: ${prefix}my_cube
  node_type: cube
  metrics: [...]
  dimensions: [...]

  materialization:
    schedule: "0 6 * * *"           # cron schedule
    strategy: incremental_time      # or: full
    lookback_window: 1 DAY          # only relevant for incremental_time
    partition:                      # required if strategy is incremental_time
      dimension: shared.dims.date
      granularity: DAY              # or: HOUR
    backfill_from: "20250101"       # optional; if set, backfills from this date to today

Deployment Flow

When a cube with a materialization block is deployed, the deployment process:

  1. POST /preaggs/plan: this plans and creates pre-agg records
  2. POST /preaggs/{id}/materialize: this schedules the Spark workflow for each pre-agg
  3. POST /cubes/{name}/materialize: this schedules the Druid ingestion workflow
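The three calls above can be sketched as a single deployment routine. This is a hedged sketch, not DJ's actual implementation: the injected `post` helper, the request bodies, and the response shape of `/preaggs/plan` (a list of records with `id` fields) are all assumptions for illustration.

```python
# Sketch of the three-step deployment flow. `post` is an injected HTTP
# helper (path, json_body) -> parsed response; the response shape of
# /preaggs/plan (a list of {"id": ...} records) is an assumption.

def deploy_materialization(post, cube_name):
    # 1. Plan and create pre-agg records for the cube.
    preaggs = post("/preaggs/plan", {"cube": cube_name})
    # 2. Schedule the Spark workflow for each planned pre-agg.
    for preagg in preaggs:
        post(f"/preaggs/{preagg['id']}/materialize", {})
    # 3. Schedule the Druid ingestion workflow for the cube itself.
    post(f"/cubes/{cube_name}/materialize", {})
    return [p["id"] for p in preaggs]
```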

Partition Resolution

The user declares the partition once at the cube level as a dimension reference. DJ derives the physical partition column and format for each pre-agg automatically by looking up how that dimension is linked on the upstream node. This means users think in terms of dimensions rather than physical column names, and the correct column is resolved per pre-agg without any additional configuration.
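A minimal sketch of that lookup, assuming the upstream dimension links are available as a mapping from dimension reference to physical column metadata. The `dimension_links` structure and its `column`/`format` fields are hypothetical names, not DJ's actual schema:

```python
def resolve_partition(dimension_ref, dimension_links):
    """Resolve a cube-level dimension reference (e.g. shared.dims.date)
    to the physical partition column and format on the upstream node."""
    link = dimension_links.get(dimension_ref)
    if link is None:
        raise ValueError(f"no dimension link found for {dimension_ref}")
    # The physical column name and partition format come from how the
    # dimension is linked upstream, not from user configuration.
    return link["column"], link["format"]
```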

Validation

  • partition is required when strategy: incremental_time; deployment fails with a clear error if it is omitted
  • lookback_window is ignored if strategy: full
  • If both backfill_from and backfill_to are set, backfill runs over that date range. If backfill_to is not set, it defaults to today.
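The rules above could be enforced at deployment time roughly as follows. This is an illustrative sketch over a parsed materialization dict, with hypothetical function and error-message names, not DJ's actual validation code:

```python
from datetime import date

def validate_materialization(config):
    strategy = config.get("strategy", "full")
    # partition is mandatory for incremental_time; fail deployment otherwise.
    if strategy == "incremental_time" and "partition" not in config:
        raise ValueError("materialization.partition is required "
                         "when strategy is incremental_time")
    # lookback_window only applies to incremental_time; ignore it for full.
    if strategy == "full":
        config.pop("lookback_window", None)
    # backfill_to defaults to today when only backfill_from is given.
    if "backfill_from" in config and "backfill_to" not in config:
        config["backfill_to"] = date.today().strftime("%Y%m%d")
    return config
```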

Pre-agg Level Spark Config

Pre-agg names are content-addressed and not user-controllable, so there is no stable handle for attaching per-pre-agg config in YAML. Instead, Spark execution hints for pre-agg computation are declared on dimension links (see #1910).
