Skip to content

[build v3] Transitive CTE dependencies are not column-pruned #1906

@shangyian

Description

@shangyian

In the build_v3 pipeline, CTEs built from transitive dependencies pulled in from dimension links always emit all columns, even when only a subset is used downstream. This is because needed_columns_by_node is only populated for nodes that are directly part of the join path, and not for nodes discovered transitively by collect_refs.

Example

v3.calendar_spine is a dimension node with many columns, referenced inside v3.dim_expanded's SQL:

  -- v3.calendar_spine
  SELECT day_num, label, metadata_col1, metadata_col2, metadata_col3
  FROM v3.src_calendar

  -- v3.dim_expanded
  SELECT a.id, a.some_date, c.day_num
  FROM v3.src_expanded a
  JOIN v3.calendar_spine c ON a.date_key = c.day_num

A user queries a metric with v3.dim_expanded.some_date as a dimension. Only day_num is needed from v3.calendar_spine (as the join key inside v3.dim_expanded's body). But the generated SQL emits all 5 columns:

  -- Current (unoptimized)
  WITH
  v3_calendar_spine AS (
      SELECT day_num, label, metadata_col1, metadata_col2, metadata_col3
      FROM default.v3.src_calendar
  ),
  v3_dim_expanded AS (
      SELECT a.id, a.some_date, c.day_num
      FROM default.v3.src_expanded a
      JOIN v3_calendar_spine c ON a.date_key = c.day_num
  )
  ...

Expected:

  -- Optimized
  WITH
  v3_calendar_spine AS (
      SELECT day_num  -- only what v3_dim_expanded actually references
      FROM default.v3.src_calendar
  ),
  v3_dim_expanded AS (
      SELECT a.id, a.some_date, c.day_num
      FROM default.v3.src_expanded a
      JOIN v3_calendar_spine c ON a.date_key = c.day_num
  )
  ...

Fix Approach

After collect_refs gathers all transitive dep nodes, for each (node, upstream) pair, walk node's SQL AST to find which column references resolve to upstream's table alias and populate needed_columns_by_node with those columns. This analysis must happen before rewrite_table_references renames the aliases, so column refs can still be correlated to source nodes by original table name.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions