Conversation

@fivetran-amrutabhimsenayachit (Collaborator) commented Oct 31, 2025

The current transpilation converts MAX_BY to ARG_MAX but loses struct field names, causing incorrect results in DuckDB.

bq --project_id fivetran-wild-west query --use_legacy_sql=false "SELECT MAX_BY(name, score) FROM UNNEST([STRUCT('Alice' AS name, 85 AS score), STRUCT('Bob', 92), STRUCT('Diana', 95)])"
+-------+
|  f0_  |
+-------+
| Diana |
+-------+

python3 -c "import sqlglot as sg; print(sg.transpile(\"SELECT MAX_BY(name, score) FROM UNNEST([STRUCT('Alice' AS name, 85 AS score), STRUCT('Bob', 92), STRUCT('Diana', 95)])\", read='bigquery', write='duckdb')[0])"
-->SELECT ARG_MAX(name, score) FROM (SELECT UNNEST([{'name': 'Alice', 'score': 85}, {'_0': 'Bob', '_1': 92}, {'_0': 'Diana', '_1': 95}], max_depth => 2))


duckdb -c "SELECT ARG_MAX(name, score) FROM (SELECT UNNEST([{'name': 'Alice', 'score': 85}, {'_0': 'Bob', '_1': 92}, {'_0': 'Diana', '_1': 95}], max_depth => 2))"
┌────────────────────────┐
│ arg_max("name", score) │
│        varchar         │
├────────────────────────┤
│ Alice                  │
└────────────────────────┘
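For reference, the names are already lost at parse time, before any DuckDB generation happens. A quick diagnostic using sqlglot's public API (illustration only, not part of the fix):

import sqlglot
from sqlglot import exp

ast = sqlglot.parse_one(
    "SELECT MAX_BY(name, score) FROM UNNEST([STRUCT('Alice' AS name, 85 AS score), STRUCT('Bob', 92)])",
    read="bigquery",
)
for struct in ast.find_all(exp.Struct):
    # The first struct round-trips with its field names; the shorthand one
    # carries only bare positional values, so generators have no names to emit.
    print(struct.sql("bigquery"))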

After testing the new code changes:

sqlglot % bq --project_id fivetran-wild-west query --use_legacy_sql=false "SELECT MAX_BY(name, score) FROM UNNEST([STRUCT('Alice' AS name, 85 AS score), STRUCT('Bob', 92), STRUCT('Diana', 95)])"
+-------+
|  f0_  |
+-------+
| Diana |
+-------+

sqlglot % python3 -c "import sqlglot as sg; print(sg.transpile(\"SELECT MAX_BY(name, score) FROM UNNEST([STRUCT('Alice' AS name, 85 AS score), STRUCT('Bob', 92), STRUCT('Diana', 95)])\", read='bigquery', write='duckdb')[0])"

SELECT ARG_MAX(name, score) FROM (SELECT UNNEST([{'name': 'Alice', 'score': 85}, {'name': 'Bob', 'score': 92}, {'name': 'Diana', 'score': 95}], max_depth => 2))


sqlglot % duckdb -c "SELECT ARG_MAX(name, score) FROM (SELECT UNNEST([{'name': 'Alice', 'score': 85}, {'name': 'Bob', 'score': 92}, {'name': 'Diana', 'score': 95}], max_depth => 2))"
┌────────────────────────┐
│ arg_max("name", score) │
│        varchar         │
├────────────────────────┤
│ Diana                  │
└────────────────────────┘

@georgesittas (Collaborator)

@fivetran-amrutabhimsenayachit this looks good, but upon reviewing it I realized that the issue is more general and doesn't only affect BigQuery -> DuckDB.

For instance, if we wanted to target Snowflake, the same issue would occur:

(sqlglot) ➜  sqlglot git:(main) bq query "SELECT MAX_BY(name, score) FROM UNNEST([STRUCT('Alice' AS name, 85 AS score), STRUCT('Bob', 92), STRUCT('Diana', 95)])"
+-------+
|  f0_  |
+-------+
| Diana |
+-------+
(sqlglot) ➜  sqlglot git:(main) sqlglot --read bigquery --write snowflake "SELECT MAX_BY(name, score) FROM UNNEST([STRUCT('Alice' AS name, 85 AS score), STRUCT('Bob', 92), STRUCT('Diana', 95)])"
SELECT MAX_BY(value['name'], value['score']) FROM TABLE(FLATTEN(INPUT => [OBJECT_CONSTRUCT('name', 'Alice', 'score', 85), OBJECT_CONSTRUCT('_0', 'Bob', '_1', 92), OBJECT_CONSTRUCT('_0', 'Diana', '_1', 95)])) AS _t0(seq, key, path, index, value, this)
(sqlglot) ➜  sqlglot git:(main) snow sql -q "SELECT MAX_BY(value['name'], value['score']) FROM TABLE(FLATTEN(INPUT => [OBJECT_CONSTRUCT('name', 'Alice', 'score', 85), OBJECT_CONSTRUCT('_0', 'Bob', '_1', 92), OBJECT_CONSTRUCT('_0', 'Diana', '_1', 95)])) AS _t0(seq, key, path, index, value, this)"
SELECT MAX_BY(value['name'], value['score']) FROM TABLE(FLATTEN(INPUT => [OBJECT_CONSTRUCT('name', 'Alice', 'score', 85), OBJECT_CONSTRUCT('_0', 'Bob', '_1', 92), OBJECT_CONSTRUCT('_0', 'Diana', '_1', 95)])) AS _t0(seq, key, path, index, value, this)
+---------------------------------------+
| MAX_BY(VALUE['NAME'], VALUE['SCORE']) |
|---------------------------------------|
| "Alice"                               |
+---------------------------------------+

There are two ways I see for solving this properly:

  1. At parse time, for BigQuery, ensure that the shorthand syntax is transformed into the verbose syntax. So, in the above example, all three struct literals would have the AS name and AS score names for the struct fields. One subtle risk here is that by doing so, we can increase the input's string length, possibly resulting in large payloads for big inputs that may fail to execute due to size constraints.
  2. Extract your logic into transforms.py, slightly refactoring it as needed, e.g., you don't need the is_bq_inline_struct logic to be pushed down to your helper. Then, you can reuse it across multiple dialects, like Snowflake, Presto, and so on, in combination with preprocess (see other examples). The disadvantage here is of course repetition, but it doesn't have the risk of (1).

I'd probably implement (2) for now because it's a smaller lift and safer. Let me know if I should clarify anything.
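For concreteness, a rough sketch of the shape (2) could take. The helper name and the AST details here are assumptions, not the PR's actual code; in current sqlglot, named struct fields parse to exp.PropertyEQ nodes, which is what the sketch relies on:

from sqlglot import exp


def inherit_struct_field_names(expression: exp.Expression) -> exp.Expression:
    # For each array literal, copy the field names declared on the first
    # (fully named) struct onto the positional fields of the structs that
    # follow it, so generators can emit names for all of them.
    for array in expression.find_all(exp.Array):
        structs = [e for e in array.expressions if isinstance(e, exp.Struct)]
        if len(structs) < 2:
            continue

        names = [f.this for f in structs[0].expressions if isinstance(f, exp.PropertyEQ)]
        if len(names) != len(structs[0].expressions):
            continue  # the first struct isn't fully named; nothing to inherit

        for struct in structs[1:]:
            struct.set(
                "expressions",
                [
                    f if isinstance(f, exp.PropertyEQ)
                    else exp.PropertyEQ(this=name.copy(), expression=f)
                    for name, f in zip(names, struct.expressions)
                ],
            )
    return expression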

@fivetran-amrutabhimsenayachit (Collaborator, Author)

> @fivetran-amrutabhimsenayachit this looks good, but upon reviewing it I realized that the issue is more general and doesn't only affect BigQuery -> DuckDB. [full comment quoted above]

This makes sense to me @georgesittas. I think the 2nd approach is more generic and would support all the dialects going forward.
If I understand correctly, you want me to write the generic helper and wire it into the BigQuery -> DuckDB path for now; for the other dialects, we can reuse the helper whenever we target their transpilation.

@georgesittas (Collaborator) commented Nov 4, 2025

> you want me to write the generic helper and wire it into the BigQuery -> DuckDB path for now; for the other dialects, we can reuse the helper whenever we target their transpilation.

It's trivial to use it in the other dialects as well: snowflake, presto, duckdb, hive, spark2. It's basically adding a preprocess entry in TRANSFORMS. Let's do it in one fell swoop.
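Concretely, the per-dialect change is a single TRANSFORMS entry. A minimal sketch of the wiring, assuming the helper sketched earlier ends up in transforms.py (shown here by patching DuckDB's generator; in the codebase it goes into each Generator's class body):

from sqlglot import exp, transforms
from sqlglot.dialects.duckdb import DuckDB


def inherit_struct_field_names(expression: exp.Expression) -> exp.Expression:
    return expression  # stand-in for the shared helper sketched earlier


# One preprocess entry per dialect is the whole change; the same line goes
# into the snowflake, presto, hive, and spark2 generators as well.
DuckDB.Generator.TRANSFORMS = {
    **DuckDB.Generator.TRANSFORMS,
    exp.Select: transforms.preprocess([inherit_struct_field_names]),
}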

@fivetran-amrutabhimsenayachit (Collaborator, Author)

@georgesittas: I moved the generic code into transforms.py and wired it into all the popular dialects.
The transpilation from BigQuery to these dialects succeeds, but for some of them the transpiled queries still don't run as expected, due to pre-existing issues in how the surrounding constructs are handled.
My observations from testing are below; I ran the transpiled queries directly against the corresponding databases wherever possible.

Test queries:

  1. SELECT * FROM UNNEST([STRUCT('Alice' AS name, 85 AS score), STRUCT('Bob', 92), STRUCT('Diana', 95)])
  2. SELECT * FROM UNNEST([STRUCT('Alice' AS name, STRUCT(85 AS math, 90 AS english) AS scores), STRUCT('Bob' AS name, STRUCT(92 AS math, 88 AS english) AS scores)])

✅ What's Working (8 dialects - verified):
Field name inheritance works perfectly. All 3 structs get field names:

  • Snowflake - OBJECT_CONSTRUCT('name', 'Alice', 'score', 85) (all 3 have names)
  • Presto/Trino/Athena - CAST(ROW(...) AS ROW(name VARCHAR, score INTEGER)) (all 3 have type casts)
  • DuckDB - {'name': 'Alice', 'score': 85} (all 3 have field names)
  • Spark2/Databricks - STRUCT('Alice' AS name, 85 AS score) (all 3 have AS clauses)
  • Hive - Field names added to AST but stripped by generator (expected behavior)

❌ What's Not Working (pre-existing bugs / dialect limitations):

  • Postgres/Redshift - no STRUCT→ROW conversion; Postgres uses ROW(), not STRUCT() [pre-existing on main]
  • ClickHouse - no UNNEST→arrayJoin conversion; ClickHouse uses arrayJoin(), not UNNEST() [pre-existing on main]
  • MySQL/Oracle/TSQL/Singlestore - these databases don't support STRUCT/UNNEST syntax at all [dialect limitation]

  Field names are correctly inherited in all of these dialects; the failures come from other missing conversions, unrelated to this PR.

⚠️ MAX_BY Queries (Spark2/Databricks/Hive)

  • MAX_BY queries currently fail: the output is SELECT MAX_BY(name, score) FROM EXPLODE(ARRAY(...)), which is invalid, since SELECT ... FROM EXPLODE() is not valid Spark syntax.
  • The valid form would be SELECT MAX_BY(name, score) FROM (SELECT INLINE(ARRAY(...))); exploding an array of structs needs INLINE or a LATERAL VIEW.
  • This is a pre-existing UNNEST→EXPLODE conversion bug (verified on main). Field names are correctly inherited; only the EXPLODE syntax is wrong, and it likely needs a separate PR.
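For anyone reproducing the Spark observation, the transpile call mirrors the DuckDB one above (output elided; per the above, main currently emits the FROM EXPLODE(...) form):

python3 -c "import sqlglot as sg; print(sg.transpile(\"SELECT MAX_BY(name, score) FROM UNNEST([STRUCT('Alice' AS name, 85 AS score), STRUCT('Bob', 92), STRUCT('Diana', 95)])\", read='bigquery', write='spark2')[0])"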

@georgesittas georgesittas merged commit fabbf05 into main Nov 7, 2025
7 checks passed
@georgesittas georgesittas deleted the RD-1050236-transpile-big-querys-max-by-aggregate-function-to-duck-db branch November 7, 2025 15:30