-
Notifications
You must be signed in to change notification settings - Fork 23
Description
In the build_v3 pipeline, CTEs built from transitive dependencies pulled in from dimension links always emit all columns, even when only a subset is used downstream. This is because needed_columns_by_node is only populated for nodes that are directly part of the join path, and not for nodes discovered transitively by collect_refs.
Example
v3.calendar_spine is a dimension node with many columns, referenced inside v3.dim_expanded's SQL:
-- v3.calendar_spine
SELECT day_num, label, metadata_col1, metadata_col2, metadata_col3
FROM v3.src_calendar
-- v3.dim_expanded
SELECT a.id, a.some_date, c.day_num
FROM v3.src_expanded a
JOIN v3.calendar_spine c ON a.date_key = c.day_num
A user queries a metric with v3.dim_expanded.some_date as a dimension. Only day_num is needed from v3.calendar_spine (as the join key inside v3.dim_expanded's body). But the generated SQL emits all 5 columns:
-- Current (unoptimized)
WITH
v3_calendar_spine AS (
SELECT day_num, label, metadata_col1, metadata_col2, metadata_col3
FROM default.v3.src_calendar
),
v3_dim_expanded AS (
SELECT a.id, a.some_date, c.day_num
FROM default.v3.src_expanded a
JOIN v3_calendar_spine c ON a.date_key = c.day_num
)
...
Expected:
-- Optimized
WITH
v3_calendar_spine AS (
SELECT day_num -- only what v3_dim_expanded actually references
FROM default.v3.src_calendar
),
v3_dim_expanded AS (
SELECT a.id, a.some_date, c.day_num
FROM default.v3.src_expanded a
JOIN v3_calendar_spine c ON a.date_key = c.day_num
)
...
Fix Approach
After collect_refs gathers all transitive dep nodes, for each (node, upstream) pair, walk node's SQL AST to find which column references resolve to upstream's table alias and populate needed_columns_by_node with those columns. This analysis must happen before rewrite_table_references renames the aliases, so column refs can still be correlated to source nodes by original table name.