Skip to content

fix(optimizer): resolve columns of a LATERAL subquery in the outer scope [CLAUDE]#7801

Closed
akiylah-armstead wants to merge 1 commit into
tobymao:mainfrom
akiylah-armstead:fix/qualify-unnest-with-ordinality-columns
Closed

fix(optimizer): resolve columns of a LATERAL subquery in the outer scope [CLAUDE]#7801
akiylah-armstead wants to merge 1 commit into
tobymao:mainfrom
akiylah-armstead:fix/qualify-unnest-with-ordinality-columns

Conversation

@akiylah-armstead

Copy link
Copy Markdown

Closes #7799.

Bug. qualify() could not resolve columns produced by a LATERAL (<subquery>)
source. This surfaced when transpiling Spark LATERAL VIEW POSEXPLODE to DuckDB:

sql = "SELECT pos, val FROM t LATERAL VIEW POSEXPLODE(t.arr) AS pos, val"
generated = parse_one(sql, read="spark").sql(dialect="duckdb")
# SELECT pos, val FROM t CROSS JOIN LATERAL
#   (SELECT pos - 1 AS pos, val FROM UNNEST(t.arr) WITH ORDINALITY AS _t0(val, pos))

qualify(parse_one(generated, read="duckdb"), dialect="duckdb",
        schema={"t": {"arr": "ARRAY<VARCHAR>"}})
# -> OptimizeError: Column 'pos' could not be resolved.

DuckDB itself resolves the query fine.

Root cause. In the generated SQL the lateral is Lateral(this=Subquery(...))
with no alias columns of its own. The optimizer derives a source's columns from
UDTF.selects, which only returned alias.columns:

class UDTF(DerivedTable):
    @property
    def selects(self):
        alias = self.args.get("alias")
        return alias.columns if alias else []

Because the Lateral has no alias columns, selects (and therefore
named_selects) was empty, so in the outer scope the lateral source resolved to
no columns and pos/val could not be qualified. The inner UNNEST ... WITH ORDINALITY AS _t0(val, pos) resolved correctly only because Unnest.selects
already special-cases the ordinality/offset column.

Fix. When a UDTF has no explicit alias columns, fall back to the columns of
the query it wraps (DerivedTable.selects, i.e. the inner Subquery/Query's
selects). This is general: it works regardless of alias names or column order,
and it leaves the existing aliased path (e.g. UNNEST(...) AS t(a, b),
LATERAL VIEW EXPLODE(...) v AS x) and Unnest.selects (which calls
super().selects) untouched, since those still have alias columns.

Tests

  • Added a regression test in tests/test_optimizer.py::test_qualify_columns
    using the exact repro from the issue. It fails before the fix
    (OptimizeError: Column 'pos' could not be resolved) and passes after,
    producing correctly-qualified SQL.
  • Full suite passes (tests/test_optimizer.py, tests/test_expressions.py,
    tests/test_build.py, tests/dialects/, tests/test_transpile.py,
    tests/test_lineage.py), and ruff check / ruff format --check are clean.

A `LATERAL (<subquery>)` source without explicit alias columns exposed no
columns to its enclosing scope, because `UDTF.selects` only returned
`alias.columns`. As a result qualify() raised `OptimizeError: Column ...
could not be resolved` for the columns of the wrapped query (e.g. the
WITH ORDINALITY column produced when transpiling Spark POSEXPLODE to DuckDB).

Fall back to the wrapped query's selects when a UDTF has no alias columns.

Fixes tobymao#7799
Comment on lines -108 to +113
return alias.columns if alias else []
if alias and alias.columns:
return alias.columns

# A UDTF without explicit alias columns (e.g. `LATERAL (<subquery>)`) exposes the
# columns produced by the query it wraps, so fall back to those.
return super().selects

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @akiylah-armstead, thank you for the PR but I don't think this is the right layer to solve this problem. I will take this on and ping you in the corresponding PR.

@georgesittas

Copy link
Copy Markdown
Collaborator

#7802

@georgesittas

Copy link
Copy Markdown
Collaborator

This PR wouldn't handle partial column alias lists well, such as in this query:

SELECT * FROM t CROSS JOIN LATERAL (SELECT 1 AS a, 2 AS b) AS x(c)

Whereas the optimizer already pushes alias names down to the subquery and projects b as well when expanding the star. Your approach would drop b:

from sqlglot import parse_one
from sqlglot.optimizer.qualify import qualify

sql = "SELECT * FROM t CROSS JOIN LATERAL (SELECT 1 AS a, 2 AS b) AS x(c)"
out = qualify(parse_one(sql, read="duckdb"), dialect="duckdb", schema={"t": {"k": "INT"}}).sql("duckdb")
print(out)
# SELECT "t"."k" AS "k", "x"."c" AS "c" FROM "t" AS "t" CROSS JOIN LATERAL (SELECT 1 AS "a", 2 AS "b") AS "x"("c")
# Note how "b" is dropped from the output

DuckDB includes this column:

import duckdb
con = duckdb.connect()
con.execute("CREATE TABLE t AS SELECT 1 AS k")
cur = con.execute("SELECT * FROM t CROSS JOIN LATERAL (SELECT 1 AS a, 2 AS b) AS x(c)")
print([d[0] for d in cur.description], cur.fetchall())   # ['k', 'c', 'b'] [(1, 1, 2)]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

LATERAL VIEW POSEXPLODE: qualify() cannot resolve a column in the SQL it generates for DuckDB

2 participants