Conversation

@fivetran-amrutabhimsenayachit (Collaborator) commented Oct 31, 2025

The current transpilation converts MAX_BY to ARG_MAX but loses struct field names, causing incorrect results in DuckDB.

bq --project_id fivetran-wild-west query --use_legacy_sql=false "SELECT MAX_BY(name, score) FROM UNNEST([STRUCT('Alice' AS name, 85 AS score), STRUCT('Bob', 92), STRUCT('Diana', 95)])"
+-------+
|  f0_  |
+-------+
| Diana |
+-------+

python3 -c "import sqlglot as sg; print(sg.transpile(\"SELECT MAX_BY(name, score) FROM UNNEST([STRUCT('Alice' AS name, 85 AS score), STRUCT('Bob', 92), STRUCT('Diana', 95)])\", read='bigquery', write='duckdb')[0])"
-->SELECT ARG_MAX(name, score) FROM (SELECT UNNEST([{'name': 'Alice', 'score': 85}, {'_0': 'Bob', '_1': 92}, {'_0': 'Diana', '_1': 95}], max_depth => 2))


duckdb -c "SELECT ARG_MAX(name, score) FROM (SELECT UNNEST([{'name': 'Alice', 'score': 85}, {'_0': 'Bob', '_1': 92}, {'_0': 'Diana', '_1': 95}], max_depth => 2))"
┌────────────────────────┐
│ arg_max("name", score) │
│        varchar         │
├────────────────────────┤
│ Alice                  │
└────────────────────────┘
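For reference, the names are already lost at parse time, before any DuckDB generation happens. A quick diagnostic using sqlglot's public API (illustration only, not part of the fix):

import sqlglot
from sqlglot import exp

ast = sqlglot.parse_one(
    "SELECT MAX_BY(name, score) FROM UNNEST([STRUCT('Alice' AS name, 85 AS score), STRUCT('Bob', 92)])",
    read="bigquery",
)
for struct in ast.find_all(exp.Struct):
    # The first struct round-trips with its field names; the shorthand one
    # carries only bare positional values, so generators have no names to emit.
    print(struct.sql("bigquery"))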

After testing the new code changes:

sqlglot % bq --project_id fivetran-wild-west query --use_legacy_sql=false "SELECT MAX_BY(name, score) FROM UNNEST([STRUCT('Alice' AS name, 85 AS score), STRUCT('Bob', 92), STRUCT('Diana', 95)])"
+-------+
|  f0_  |
+-------+
| Diana |
+-------+

sqlglot % python3 -c "import sqlglot as sg; print(sg.transpile(\"SELECT MAX_BY(name, score) FROM UNNEST([STRUCT('Alice' AS name, 85 AS score), STRUCT('Bob', 92), STRUCT('Diana', 95)])\", read='bigquery', write='duckdb')[0])"

SELECT ARG_MAX(name, score) FROM (SELECT UNNEST([{'name': 'Alice', 'score': 85}, {'name': 'Bob', 'score': 92}, {'name': 'Diana', 'score': 95}], max_depth => 2))


sqlglot % duckdb -c "SELECT ARG_MAX(name, score) FROM (SELECT UNNEST([{'name': 'Alice', 'score': 85}, {'name': 'Bob', 'score': 92}, {'name': 'Diana', 'score': 95}], max_depth => 2))"
┌────────────────────────┐
│ arg_max("name", score) │
│        varchar         │
├────────────────────────┤
│ Diana                  │
└────────────────────────┘

@georgesittas (Collaborator)

@fivetran-amrutabhimsenayachit this looks good, but upon reviewing it I realized that the issue is more general and doesn't only affect BigQuery -> DuckDB.

For instance, if we wanted to target Snowflake, the same issue would occur:

(sqlglot) ➜  sqlglot git:(main) bq query "SELECT MAX_BY(name, score) FROM UNNEST([STRUCT('Alice' AS name, 85 AS score), STRUCT('Bob', 92), STRUCT('Diana', 95)])"
+-------+
|  f0_  |
+-------+
| Diana |
+-------+
(sqlglot) ➜  sqlglot git:(main) sqlglot --read bigquery --write snowflake "SELECT MAX_BY(name, score) FROM UNNEST([STRUCT('Alice' AS name, 85 AS score), STRUCT('Bob', 92), STRUCT('Diana', 95)])"
SELECT MAX_BY(value['name'], value['score']) FROM TABLE(FLATTEN(INPUT => [OBJECT_CONSTRUCT('name', 'Alice', 'score', 85), OBJECT_CONSTRUCT('_0', 'Bob', '_1', 92), OBJECT_CONSTRUCT('_0', 'Diana', '_1', 95)])) AS _t0(seq, key, path, index, value, this)
(sqlglot) ➜  sqlglot git:(main) snow sql -q "SELECT MAX_BY(value['name'], value['score']) FROM TABLE(FLATTEN(INPUT => [OBJECT_CONSTRUCT('name', 'Alice', 'score', 85), OBJECT_CONSTRUCT('_0', 'Bob', '_1', 92), OBJECT_CONSTRUCT('_0', 'Diana', '_1', 95)])) AS _t0(seq, key, path, index, value, this)"
SELECT MAX_BY(value['name'], value['score']) FROM TABLE(FLATTEN(INPUT => [OBJECT_CONSTRUCT('name', 'Alice', 'score', 85), OBJECT_CONSTRUCT('_0', 'Bob', '_1', 92), OBJECT_CONSTRUCT('_0', 'Diana', '_1', 95)])) AS _t0(seq, key, path, index, value, this)
+---------------------------------------+
| MAX_BY(VALUE['NAME'], VALUE['SCORE']) |
|---------------------------------------|
| "Alice"                               |
+---------------------------------------+

There are two ways I see for solving this properly:

  1. At parse time, for BigQuery, ensure that the shorthand syntax is transformed into the verbose syntax. So, in the above example, all three struct literals would have the AS name and AS score names for the struct fields. One subtle risk here is that by doing so, we can increase the input's string length, possibly resulting in large payloads for big inputs that may fail to execute due to size constraints.
  2. Extract your logic into transforms.py, slightly refactoring it as needed, e.g., you don't need the is_bq_inline_struct logic to be pushed down to your helper. Then, you can reuse it across multiple dialects, like Snowflake, Presto, and so on, in combination with preprocess (see other examples). The disadvantage here is of course repetition, but it doesn't have the risk of (1).

I'd probably implement (2) for now because it's a smaller lift and safer. Let me know if I should clarify anything.
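For concreteness, a rough sketch of the shape (2) could take. The helper name and the AST details here are assumptions, not the PR's actual code; in current sqlglot, named struct fields parse to exp.PropertyEQ nodes, which is what the sketch relies on:

from sqlglot import exp


def inherit_struct_field_names(expression: exp.Expression) -> exp.Expression:
    # For each array literal, copy the field names declared on the first
    # (fully named) struct onto the positional fields of the structs that
    # follow it, so generators can emit names for all of them.
    for array in expression.find_all(exp.Array):
        structs = [e for e in array.expressions if isinstance(e, exp.Struct)]
        if len(structs) < 2:
            continue

        names = [f.this for f in structs[0].expressions if isinstance(f, exp.PropertyEQ)]
        if len(names) != len(structs[0].expressions):
            continue  # the first struct isn't fully named; nothing to inherit

        for struct in structs[1:]:
            struct.set(
                "expressions",
                [
                    f if isinstance(f, exp.PropertyEQ)
                    else exp.PropertyEQ(this=name.copy(), expression=f)
                    for name, f in zip(names, struct.expressions)
                ],
            )
    return expression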

@fivetran-amrutabhimsenayachit (Collaborator, Author)

> @fivetran-amrutabhimsenayachit this looks good, but upon reviewing it I realized that the issue is more general and doesn't only affect BigQuery -> DuckDB. [full comment quoted above]

This makes sense to me @georgesittas. I think the 2nd approach is more generic and would support all the dialects going forward.
If I understand correctly, you want me to write the generic helper and wire it into the BigQuery -> DuckDB path for now; for the other dialects, we can reuse the helper whenever we target their transpilation.

@georgesittas (Collaborator) commented Nov 4, 2025

> you want me to write the generic helper and wire it into the BigQuery -> DuckDB path for now; for the other dialects, we can reuse the helper whenever we target their transpilation.

It's trivial to use it in the other dialects as well: snowflake, presto, duckdb, hive, spark2. It's basically adding a preprocess entry in TRANSFORMS. Let's do it in one fell swoop.
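Concretely, the per-dialect change is a single TRANSFORMS entry. A minimal sketch of the wiring, assuming the helper sketched earlier ends up in transforms.py (shown here by patching DuckDB's generator; in the codebase it goes into each Generator's class body):

from sqlglot import exp, transforms
from sqlglot.dialects.duckdb import DuckDB


def inherit_struct_field_names(expression: exp.Expression) -> exp.Expression:
    return expression  # stand-in for the shared helper sketched earlier


# One preprocess entry per dialect is the whole change; the same line goes
# into the snowflake, presto, hive, and spark2 generators as well.
DuckDB.Generator.TRANSFORMS = {
    **DuckDB.Generator.TRANSFORMS,
    exp.Select: transforms.preprocess([inherit_struct_field_names]),
}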

@fivetran-amrutabhimsenayachit (Collaborator, Author)

@georgesittas: I moved the generic code into transforms.py and wired it into all the popular dialects.
The transpilation from BigQuery to these dialects succeeds, but for some of them the transpiled queries still don't run as expected, due to pre-existing issues in how the surrounding constructs are handled.
My observations from testing are below; I ran the transpiled queries directly against the corresponding databases wherever possible.

Test queries:

  1. SELECT * FROM UNNEST([STRUCT('Alice' AS name, 85 AS score), STRUCT('Bob', 92), STRUCT('Diana', 95)])
  2. SELECT * FROM UNNEST([STRUCT('Alice' AS name, STRUCT(85 AS math, 90 AS english) AS scores), STRUCT('Bob' AS name, STRUCT(92 AS math, 88 AS english) AS scores)])

✅ What's Working (8 dialects - verified):
Field name inheritance works perfectly. All 3 structs get field names:

  • Snowflake - OBJECT_CONSTRUCT('name', 'Alice', 'score', 85) (all 3 have names)
  • Presto/Trino/Athena - CAST(ROW(...) AS ROW(name VARCHAR, score INTEGER)) (all 3 have type casts)
  • DuckDB - {'name': 'Alice', 'score': 85} (all 3 have field names)
  • Spark2/Databricks - STRUCT('Alice' AS name, 85 AS score) (all 3 have AS clauses)
  • Hive - Field names added to AST but stripped by generator (expected behavior)

❌ What's Not Working (pre-existing bugs / dialect limitations):

  • Postgres/Redshift - no STRUCT→ROW conversion; Postgres uses ROW(), not STRUCT() [pre-existing on main]
  • ClickHouse - no UNNEST→arrayJoin conversion; ClickHouse uses arrayJoin(), not UNNEST() [pre-existing on main]
  • MySQL/Oracle/TSQL/Singlestore - these databases don't support STRUCT/UNNEST syntax at all [dialect limitation]

  Field names are correctly inherited in all of these dialects; the failures come from other missing conversions, unrelated to this PR.

⚠️ MAX_BY Queries (Spark2/Databricks/Hive)

  • MAX_BY queries currently fail: the output is SELECT MAX_BY(name, score) FROM EXPLODE(ARRAY(...)), which is invalid, since SELECT ... FROM EXPLODE() is not valid Spark syntax.
  • The valid form would be SELECT MAX_BY(name, score) FROM (SELECT INLINE(ARRAY(...))); exploding an array of structs needs INLINE or a LATERAL VIEW.
  • This is a pre-existing UNNEST→EXPLODE conversion bug (verified on main). Field names are correctly inherited; only the EXPLODE syntax is wrong, and it likely needs a separate PR.
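For anyone reproducing the Spark observation, the transpile call mirrors the DuckDB one above (output elided; per the above, main currently emits the FROM EXPLODE(...) form):

python3 -c "import sqlglot as sg; print(sg.transpile(\"SELECT MAX_BY(name, score) FROM UNNEST([STRUCT('Alice' AS name, 85 AS score), STRUCT('Bob', 92), STRUCT('Diana', 95)])\", read='bigquery', write='spark2')[0])"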

@georgesittas georgesittas merged commit fabbf05 into main Nov 7, 2025
7 checks passed
@georgesittas georgesittas deleted the RD-1050236-transpile-big-querys-max-by-aggregate-function-to-duck-db branch November 7, 2025 15:30