fix(spark): parse month, day without leading zeros by deepyaman · Pull Request #7773 · tobymao/sqlglot

deepyaman · 2026-06-20T17:31:34Z

Regarding testing each dialect, I ended up testing Ibis backends, which largely supports the earlier conclusion that Spark 3+ (and Databricks) were the issues:

Verdict	Backends
✅ Works	DuckDB, Polars, BigQuery, MySQL, PostgreSQL, Trino, SingleStoreDB, Oracle, RisingWave
❌ Returns NULL silently	Impala, PySpark, Databricks
❌ Returns wrong date	Flink
— Not implemented	ClickHouse, SQLite, DataFusion, MS SQL Server, Druid, Exasol
— Not available	Materialize

Impala and Flink don't have SQLGlot dialect (although Ibis extends the Hive dialect for Impala), and Flink issues specifically run deeper. I can try to spend some time testing the backends Ibis doesn't implement this functionality for separately, if helpful.

Note: I did use Claude specifically to implement the additions in this PR, as well as to test across Ibis backends, but I have reviewed the code/implementation and think it makes sense.

Spark 3+ parses MM/dd strictly (single-digit months/days don't parse), unlike the lax %m/%d most dialects produce. Map Spark's MM/dd to a distinct canonical token (%mstrict/%dstrict) only when parsing, so the strict format roundtrips while lax %m/%d still becomes the lenient M/d. Formatting keeps the padded %m/%d -> MM/dd. The internal tokens degrade to %m/%d for every other dialect via the metaclass inverse fallback and the generic strtotime_sql. Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>

Replace the per-call local import and `assert isinstance(self.dialect, Spark)` in SparkGenerator.format_time with a TYPE_CHECKING-guarded `dialect: Spark` class annotation, removing the runtime overhead while keeping mypy happy. Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>

Removes the pyspark-specific StringToDate/StringToTimestamp override that mapped Python strftime tokens to lenient single-letter Spark specifiers. This is handled upstream by tobymao/sqlglot#7773, which makes Spark's StrToDate/StrToTime emit lenient `M`/`d` while keeping strict `MM`/`dd` only when round-tripping Spark's own SQL. Until that sqlglot release is available and the lower bound is bumped, the released sqlglot emits strict `MM/dd` and Spark returns NULL for single-digit month/day, so mark the single-digit `as_date` tests `notyet` for pyspark/databricks (mirroring impala). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… [revert before merge] Temporary: installs the patched sqlglot (tobymao/sqlglot#7773) in the test_pyspark job via a submodule-stripped clone + `uv add` path source, and drops the pyspark/databricks `notyet` markers so the single-digit `as_date` tests are expected to PASS. This proves the PR's generated SQL actually executes single-digit dates in strict (non-legacy) Spark, not just an identical-format inference. Revert this commit; it must not merge. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

CI on the override-free code revealed that Spark 3+ in strict (non-legacy) mode does not silently return NULL for single-digit %m/%d — it raises: SparkUpgradeException on Spark 3.5 and DateTimeException on Spark 4.0 (both subclasses of pyspark's base PySparkException, including the Spark Connect variant). Databricks raises DatabricksServerOperationError. Point the pyspark/databricks `notyet` markers at those exceptions instead of AssertionError so the single-digit `as_date` tests xfail cleanly until the sqlglot fix (tobymao/sqlglot#7773) is released and the lower bound is bumped. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The strict->lax canonical fallback was applied only to INVERSE_TIME_MAPPING, so BigQuery's `FORMAT '...'` clause (which uses INVERSE_FORMAT_MAPPING) leaked the internal token, e.g. Spark `TO_DATE(x, 'MM/dd/yyyy')` -> BigQuery `... FORMAT 'MMstrict/DDstrict/YYYY'`. Apply the same fallback to INVERSE_FORMAT_MAPPING via a shared helper so it degrades to 'MM/DD/YYYY'. Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>

…LAUDE Reuse dialect.STRICT_PARSE_TIME_EXPRESSIONS in SparkGenerator.format_time instead of a duplicate LENIENT_TIME_EXPRESSIONS tuple, removing the cross-module constant that had to be kept in sync by hand. Also document why the base strtotime_sql degrades only the strict tokens rather than routing the whole format through self.format_time(). Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>

Declare LENIENT_INVERSE_TIME_MAPPING/_TRIE on the base Dialect (mirroring STRICT_TIME_MAPPING and INVERSE_TIME_MAPPING) and build the trie in the metaclass. This drops the Spark-only attribute that forced a `dialect: Spark` type-narrowing annotation in SparkGenerator plus its TYPE_CHECKING import, and lets spark.py stop hand-building the trie. No behavior change. Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>

…CLAUDE Replace the `inverse_time_mapping is None and ...` guard with the same `inverse_time_mapping or ...` fallback idiom the base Generator.format_time uses. Same semantics (an explicitly-passed mapping still wins), but consistent with the base and without the extra clause. No behavior change. Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>

…AUDE Drop the "used by the generator's format_time override / trie built by the metaclass" note on Spark.LENIENT_INVERSE_TIME_MAPPING (no other mapping documents where it's consumed or that the metaclass builds its trie), and reword the base docstring's awkward "Lets e.g. Spark" to the parenthetical "(e.g. Spark)" form already used by STRICT_TIME_MAPPING. Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>

geooo109 · 2026-06-24T10:51:28Z

@deepyaman the solution seems more complicated as expected. We should only change the TIME_MAPPING and the INVERSE_TIME_MAPPING.

We should add in spark's TIME_MAPPING the "MM": "%mstrict", "dd": "%dstrict" + the INVERSE as we already do. Then, for the tests that fail in the case of spark as input we should add in the INVERSE_TIME_MAPPING for the output dialect the corresponding mapping for %mstrict and %dstrict.

deepyaman · 2026-06-24T15:45:32Z

@deepyaman the solution seems more complicated as expected. We should only change the TIME_MAPPING and the INVERSE_TIME_MAPPING.

We should add in spark's TIME_MAPPING the "MM": "%mstrict", "dd": "%dstrict" + the INVERSE as we already do. Then, for the tests that fail in the case of spark as input we should add in the INVERSE_TIME_MAPPING for the output dialect the corresponding mapping for %mstrict and %dstrict.

I think this is quite similar to what I had initially. The problem is that things break in the format direction:

>>> import sqlglot
>>> sqlglot.transpile("SELECT STRPTIME(x,'%Y-%m-%d')", read="duckdb", write="spark")
["SELECT TO_TIMESTAMP(x, 'yyyy-M-d')"]  # good
>>> sqlglot.transpile("SELECT DATE_FORMAT(x,'yyyy-MM-dd')", read="hive", write="spark")
["SELECT DATE_FORMAT(x, 'yyyy-M-d')"]  # bad; should be 'yyyy-MM-dd'

The added complexity is needed in order to support both parse and format directions; reimplementing the simplification, get 17 test failures (all in the format direction).

FYI what I tried for the "simple" approach is to update sqlglot/dialects/spark.py to remove STRICT_TIME_MAPPING, LENIENT_INVERSE_TIME_MAPPING and just do:

    TIME_MAPPING = {
         **Spark2.TIME_MAPPING,
         "MM": "%mstrict",
         "dd": "%dstrict",
     }

    INVERSE_TIME_MAPPING = {
         "%m": "M",
         "%d": "d",
     }

Not sure if you had something different in mind.

georgesittas · 2026-06-25T10:09:38Z

>>> sqlglot.transpile("SELECT DATE_FORMAT(x,'yyyy-MM-dd')", read="hive", write="spark")
["SELECT DATE_FORMAT(x, 'yyyy-M-d')"]  # bad; should be 'yyyy-MM-dd'

If Hive follows what Spark does, shouldn't your TIME_MAPPING and INVERSE_TIME_MAPPING changes go to Hive? This looks expected to me if you only change Spark.

geooo109 · 2026-06-25T10:30:54Z

@deepyaman yeah, I thought we have focused on spark only my bad. Hive as I mentioned is strict so we should move there the logic (you can test it also locally to verify like the rest of the dialects). So, the simple map approach I mentioned should work, without any additions.

deepyaman · 2026-06-25T12:59:16Z

>>> sqlglot.transpile("SELECT DATE_FORMAT(x,'yyyy-MM-dd')", read="hive", write="spark")
["SELECT DATE_FORMAT(x, 'yyyy-M-d')"]  # bad; should be 'yyyy-MM-dd'
If Hive follows what Spark does, shouldn't your TIME_MAPPING and INVERSE_TIME_MAPPING changes go to Hive? This looks expected to me if you only change Spark.

@georgesittas @geooo109 Yes, sorry, I missed that Hive was strict, since I haven't been testing anything directly using Hive (well, I guess Impala). I'm happy to update the mappings there.

Spark2 is still lenient, so I'll need to re-override that, and then back again for Spark (strict by default, unless you specify spark.sql.legacy.timeParserPolicy).

However, this can be illustrated using another dialect (most of the failing tests if take the "simple" fix aren't the mistaken Hive ones). For instance:

>>> import sqlglot
>>> sqlglot.transpile("SELECT STRPTIME(x, '%Y-%m-%d')", read="duckdb", write="spark")
['yyyy-M-d' ]  # good
>>> sqlglot.transpile("SELECT STRFTIME(x, '%Y-%m-%d')", read="duckdb", write="spark")
['yyyy-M-d']  # bad; should be 'yyyy-MM-dd'

georgesittas · 2026-06-25T13:15:27Z

I'm not sure I understand: why is M bad when going from duckdb to spark, given that the former is lenient? Also, @deepyaman, feel free to join our Slack to iterate on this faster. I think it'll be easier that way.

deepyaman · 2026-06-25T13:56:56Z

I'm not sure I understand: why is M bad when going from duckdb to spark, given that the former is lenient?

%m for formatting should map to MM, whereas %-m would map to M.

Also, @deepyaman, feel free to join our Slack to iterate on this faster. I think it'll be easier that way.

Will do!

georgesittas · 2026-06-25T14:01:25Z

Ok, in any case, the mechanism is quite clear: TIME_MAPPING tries to map dialect-specific formats to a canonical representation (think of it as a "format IR") that captures their semantics and then INVERSE_TIME_MAPPING is responsible for mapping that representation into another (or the same) dialect's formats.

By convention, the formats we use for TIME_MAPPING's values are compatible with Python's strftime spec, because these are common and easily understood, but if dialects expose more format variants that aren't covered by said spec, we need to create our own, like %mstrict.

I believe this context should be enough to handle this issue. It's simply a matter of finding the right representation for a given dialect format. If tests break, I believe that means that they're either incorrect or the chosen format is incorrect.

deepyaman · 2026-06-25T16:54:13Z

Ok, in any case, the mechanism is quite clear: TIME_MAPPING tries to map dialect-specific formats to a canonical representation (think of it as a "format IR") that captures their semantics and then INVERSE_TIME_MAPPING is responsible for mapping that representation into another (or the same) dialect's formats.

By convention, the formats we use for TIME_MAPPING's values are compatible with Python's strftime spec, because these are common and easily understood, but if dialects expose more format variants that aren't covered by said spec, we need to create our own, like %mstrict.

I believe this context should be enough to handle this issue. It's simply a matter of finding the right representation for a given dialect format. If tests break, I believe that means that they're either incorrect or the chosen format is incorrect.

The reason for introducing the STRICT_TIME_MAPPING and LENIENT_INVERSE_TIME_MAPPING is that the existing time mappings (TIME_MAPPING, INVERSE_TIME_MAPPING) are just used for capturing from/to dialect, not parse/format role. The specific code paths:

	read (dialect → canonical)	write (canonical → dialect)
parse role (`StrToTime`…)	`STRICT_TIME_MAPPING`	`LENIENT_INVERSE_TIME_MAPPING`
format role (`TimeToStr`…)	`TIME_MAPPING`	`INVERSE_TIME_MAPPING`

DuckDB %m has to become M when parsing and MM when formatting, so I don't think one mapping covers that. If you mean that the role piece could be solved by adding unique tokens for each role (e.g. %mparselenient, %mparsestrict, %mformatpadded, %mformatunpadded), but it would require adding adding support for that to each dialect.

Let me also follow up on Slack, if it helps clarify any misunderstandings I may have!

deepyaman · 2026-06-26T01:20:51Z

If you mean that the role piece could be solved by adding unique tokens for each role (e.g. %mparselenient, %mparsestrict, %mformatpadded, %mformatunpadded), but it would require adding adding support for that to each dialect.

Here's a Claude-generated prototype of what that could look like: https://github.com/deepyaman/sqlglot/tree/four-token-prototype

georgesittas · 2026-06-26T12:47:59Z

Ok, I get it now, thanks for clarifying. I wasn't thinking about the differences between strptime and strftime. This is somewhat annoying because it complicates things.

I need to think more about this. Probably going to circle around to this PR in a few days.

deepyaman added 5 commits June 10, 2026 17:05

fix(spark): parse month, day without leading zeros

4806819

fix(spark): also fix generation of exp.StrToTime

4c2e16b

chore(spark): narrow type to fix mypy attr-defined

af89ba3

deepyaman marked this pull request as ready for review June 20, 2026 17:31

deepyaman mentioned this pull request Jun 20, 2026

test: add single-digit month/day date/timestamp cast tests (#12004) ibis-project/ibis#12021

Draft

deepyaman added 2 commits June 20, 2026 14:56

georgesittas requested a review from geooo109 June 21, 2026 14:18

georgesittas assigned geooo109 Jun 21, 2026

geooo109 reviewed Jun 22, 2026

View reviewed changes

Comment thread sqlglot/generators/spark.py Outdated

Comment thread sqlglot/generators/spark.py Outdated

Comment thread sqlglot/generators/spark.py Outdated

deepyaman added 3 commits June 22, 2026 18:25

georgesittas self-assigned this Jun 26, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(spark): parse month, day without leading zeros#7773

fix(spark): parse month, day without leading zeros#7773
deepyaman wants to merge 10 commits into
tobymao:mainfrom
deepyaman:patch-1

deepyaman commented Jun 20, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

geooo109 commented Jun 24, 2026 •

edited

Loading

Uh oh!

deepyaman commented Jun 24, 2026

Uh oh!

georgesittas commented Jun 25, 2026

Uh oh!

geooo109 commented Jun 25, 2026 •

edited

Loading

Uh oh!

deepyaman commented Jun 25, 2026

Uh oh!

georgesittas commented Jun 25, 2026

Uh oh!

deepyaman commented Jun 25, 2026

Uh oh!

georgesittas commented Jun 25, 2026

Uh oh!

deepyaman commented Jun 25, 2026

Uh oh!

deepyaman commented Jun 26, 2026

Uh oh!

georgesittas commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

deepyaman commented Jun 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

geooo109 commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

deepyaman commented Jun 24, 2026

Uh oh!

georgesittas commented Jun 25, 2026

Uh oh!

geooo109 commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

deepyaman commented Jun 25, 2026

Uh oh!

georgesittas commented Jun 25, 2026

Uh oh!

deepyaman commented Jun 25, 2026

Uh oh!

georgesittas commented Jun 25, 2026

Uh oh!

deepyaman commented Jun 25, 2026

Uh oh!

deepyaman commented Jun 26, 2026

Uh oh!

georgesittas commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

deepyaman commented Jun 20, 2026 •

edited

Loading

geooo109 commented Jun 24, 2026 •

edited

Loading

geooo109 commented Jun 25, 2026 •

edited

Loading