Skip to content

fix(spark): parse month, day without leading zeros#7773

Open
deepyaman wants to merge 10 commits into
tobymao:mainfrom
deepyaman:patch-1
Open

fix(spark): parse month, day without leading zeros#7773
deepyaman wants to merge 10 commits into
tobymao:mainfrom
deepyaman:patch-1

Conversation

@deepyaman

@deepyaman deepyaman commented Jun 20, 2026

Copy link
Copy Markdown

Following up on #7739 (comment)

Regarding testing each dialect, I ended up testing Ibis backends, which largely supports the earlier conclusion that Spark 3+ (and Databricks) were the issues:

Verdict Backends
✅ Works DuckDB, Polars, BigQuery, MySQL, PostgreSQL, Trino, SingleStoreDB, Oracle, RisingWave
❌ Returns NULL silently Impala, PySpark, Databricks
❌ Returns wrong date Flink
— Not implemented ClickHouse, SQLite, DataFusion, MS SQL Server, Druid, Exasol
— Not available Materialize

Impala and Flink don't have SQLGlot dialect (although Ibis extends the Hive dialect for Impala), and Flink issues specifically run deeper. I can try to spend some time testing the backends Ibis doesn't implement this functionality for separately, if helpful.

Note: I did use Claude specifically to implement the additions in this PR, as well as to test across Ibis backends, but I have reviewed the code/implementation and think it makes sense.

Spark 3+ parses MM/dd strictly (single-digit months/days don't parse),
unlike the lax %m/%d most dialects produce. Map Spark's MM/dd to a distinct
canonical token (%mstrict/%dstrict) only when parsing, so the strict format
roundtrips while lax %m/%d still becomes the lenient M/d. Formatting keeps the
padded %m/%d -> MM/dd. The internal tokens degrade to %m/%d for every other
dialect via the metaclass inverse fallback and the generic strtotime_sql.

Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>
Replace the per-call local import and `assert isinstance(self.dialect, Spark)`
in SparkGenerator.format_time with a TYPE_CHECKING-guarded `dialect: Spark`
class annotation, removing the runtime overhead while keeping mypy happy.

Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>
@deepyaman deepyaman marked this pull request as ready for review June 20, 2026 17:31
deepyaman added a commit to deepyaman/ibis that referenced this pull request Jun 20, 2026
Removes the pyspark-specific StringToDate/StringToTimestamp override that
mapped Python strftime tokens to lenient single-letter Spark specifiers.
This is handled upstream by tobymao/sqlglot#7773,
which makes Spark's StrToDate/StrToTime emit lenient `M`/`d` while keeping
strict `MM`/`dd` only when round-tripping Spark's own SQL.

Until that sqlglot release is available and the lower bound is bumped, the
released sqlglot emits strict `MM/dd` and Spark returns NULL for single-digit
month/day, so mark the single-digit `as_date` tests `notyet` for
pyspark/databricks (mirroring impala).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
deepyaman added a commit to deepyaman/ibis that referenced this pull request Jun 20, 2026
… [revert before merge]

Temporary: installs the patched sqlglot (tobymao/sqlglot#7773) in the
test_pyspark job via a submodule-stripped clone + `uv add` path source, and
drops the pyspark/databricks `notyet` markers so the single-digit `as_date`
tests are expected to PASS. This proves the PR's generated SQL actually
executes single-digit dates in strict (non-legacy) Spark, not just an
identical-format inference.

Revert this commit; it must not merge.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
deepyaman added a commit to deepyaman/ibis that referenced this pull request Jun 20, 2026
CI on the override-free code revealed that Spark 3+ in strict (non-legacy)
mode does not silently return NULL for single-digit %m/%d — it raises:
SparkUpgradeException on Spark 3.5 and DateTimeException on Spark 4.0 (both
subclasses of pyspark's base PySparkException, including the Spark Connect
variant). Databricks raises DatabricksServerOperationError.

Point the pyspark/databricks `notyet` markers at those exceptions instead of
AssertionError so the single-digit `as_date` tests xfail cleanly until the
sqlglot fix (tobymao/sqlglot#7773) is released and the lower bound is bumped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The strict->lax canonical fallback was applied only to INVERSE_TIME_MAPPING,
so BigQuery's `FORMAT '...'` clause (which uses INVERSE_FORMAT_MAPPING) leaked
the internal token, e.g. Spark `TO_DATE(x, 'MM/dd/yyyy')` -> BigQuery
`... FORMAT 'MMstrict/DDstrict/YYYY'`. Apply the same fallback to
INVERSE_FORMAT_MAPPING via a shared helper so it degrades to 'MM/DD/YYYY'.

Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>
…LAUDE

Reuse dialect.STRICT_PARSE_TIME_EXPRESSIONS in SparkGenerator.format_time
instead of a duplicate LENIENT_TIME_EXPRESSIONS tuple, removing the cross-module
constant that had to be kept in sync by hand. Also document why the base
strtotime_sql degrades only the strict tokens rather than routing the whole
format through self.format_time().

Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread sqlglot/generators/spark.py Outdated
Comment thread sqlglot/generators/spark.py Outdated
Comment thread sqlglot/generators/spark.py Outdated
Declare LENIENT_INVERSE_TIME_MAPPING/_TRIE on the base Dialect (mirroring
STRICT_TIME_MAPPING and INVERSE_TIME_MAPPING) and build the trie in the
metaclass. This drops the Spark-only attribute that forced a `dialect: Spark`
type-narrowing annotation in SparkGenerator plus its TYPE_CHECKING import, and
lets spark.py stop hand-building the trie. No behavior change.

Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>
…CLAUDE

Replace the `inverse_time_mapping is None and ...` guard with the same
`inverse_time_mapping or ...` fallback idiom the base Generator.format_time
uses. Same semantics (an explicitly-passed mapping still wins), but consistent
with the base and without the extra clause. No behavior change.

Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>
…AUDE

Drop the "used by the generator's format_time override / trie built by the
metaclass" note on Spark.LENIENT_INVERSE_TIME_MAPPING (no other mapping documents
where it's consumed or that the metaclass builds its trie), and reword the base
docstring's awkward "Lets e.g. Spark" to the parenthetical "(e.g. Spark)" form
already used by STRICT_TIME_MAPPING.

Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>
@geooo109

geooo109 commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

@deepyaman the solution seems more complicated as expected. We should only change the TIME_MAPPING and the INVERSE_TIME_MAPPING.

We should add in spark's TIME_MAPPING the "MM": "%mstrict", "dd": "%dstrict" + the INVERSE as we already do. Then, for the tests that fail in the case of spark as input we should add in the INVERSE_TIME_MAPPING for the output dialect the corresponding mapping for %mstrict and %dstrict.

@deepyaman

Copy link
Copy Markdown
Author

@deepyaman the solution seems more complicated as expected. We should only change the TIME_MAPPING and the INVERSE_TIME_MAPPING.

We should add in spark's TIME_MAPPING the "MM": "%mstrict", "dd": "%dstrict" + the INVERSE as we already do. Then, for the tests that fail in the case of spark as input we should add in the INVERSE_TIME_MAPPING for the output dialect the corresponding mapping for %mstrict and %dstrict.

I think this is quite similar to what I had initially. The problem is that things break in the format direction:

>>> import sqlglot
>>> sqlglot.transpile("SELECT STRPTIME(x,'%Y-%m-%d')", read="duckdb", write="spark")
["SELECT TO_TIMESTAMP(x, 'yyyy-M-d')"]  # good
>>> sqlglot.transpile("SELECT DATE_FORMAT(x,'yyyy-MM-dd')", read="hive", write="spark")
["SELECT DATE_FORMAT(x, 'yyyy-M-d')"]  # bad; should be 'yyyy-MM-dd'

The added complexity is needed in order to support both parse and format directions; reimplementing the simplification, get 17 test failures (all in the format direction).

FYI what I tried for the "simple" approach is to update sqlglot/dialects/spark.py to remove STRICT_TIME_MAPPING, LENIENT_INVERSE_TIME_MAPPING and just do:

    TIME_MAPPING = {
         **Spark2.TIME_MAPPING,
         "MM": "%mstrict",
         "dd": "%dstrict",
     }

    INVERSE_TIME_MAPPING = {
         "%m": "M",
         "%d": "d",
     }

Not sure if you had something different in mind.

@georgesittas

Copy link
Copy Markdown
Collaborator
>>> sqlglot.transpile("SELECT DATE_FORMAT(x,'yyyy-MM-dd')", read="hive", write="spark")
["SELECT DATE_FORMAT(x, 'yyyy-M-d')"]  # bad; should be 'yyyy-MM-dd'

If Hive follows what Spark does, shouldn't your TIME_MAPPING and INVERSE_TIME_MAPPING changes go to Hive? This looks expected to me if you only change Spark.

@geooo109

geooo109 commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

@deepyaman yeah, I thought we have focused on spark only my bad. Hive as I mentioned is strict so we should move there the logic (you can test it also locally to verify like the rest of the dialects). So, the simple map approach I mentioned should work, without any additions.

@deepyaman

Copy link
Copy Markdown
Author
>>> sqlglot.transpile("SELECT DATE_FORMAT(x,'yyyy-MM-dd')", read="hive", write="spark")
["SELECT DATE_FORMAT(x, 'yyyy-M-d')"]  # bad; should be 'yyyy-MM-dd'

If Hive follows what Spark does, shouldn't your TIME_MAPPING and INVERSE_TIME_MAPPING changes go to Hive? This looks expected to me if you only change Spark.

@georgesittas @geooo109 Yes, sorry, I missed that Hive was strict, since I haven't been testing anything directly using Hive (well, I guess Impala). I'm happy to update the mappings there.

Spark2 is still lenient, so I'll need to re-override that, and then back again for Spark (strict by default, unless you specify spark.sql.legacy.timeParserPolicy).

However, this can be illustrated using another dialect (most of the failing tests if take the "simple" fix aren't the mistaken Hive ones). For instance:

>>> import sqlglot
>>> sqlglot.transpile("SELECT STRPTIME(x, '%Y-%m-%d')", read="duckdb", write="spark")
['yyyy-M-d' ]  # good
>>> sqlglot.transpile("SELECT STRFTIME(x, '%Y-%m-%d')", read="duckdb", write="spark")
['yyyy-M-d']  # bad; should be 'yyyy-MM-dd'

@georgesittas

Copy link
Copy Markdown
Collaborator

I'm not sure I understand: why is M bad when going from duckdb to spark, given that the former is lenient? Also, @deepyaman, feel free to join our Slack to iterate on this faster. I think it'll be easier that way.

@deepyaman

Copy link
Copy Markdown
Author

I'm not sure I understand: why is M bad when going from duckdb to spark, given that the former is lenient?

%m for formatting should map to MM, whereas %-m would map to M.

Also, @deepyaman, feel free to join our Slack to iterate on this faster. I think it'll be easier that way.

Will do!

@georgesittas

Copy link
Copy Markdown
Collaborator

Ok, in any case, the mechanism is quite clear: TIME_MAPPING tries to map dialect-specific formats to a canonical representation (think of it as a "format IR") that captures their semantics and then INVERSE_TIME_MAPPING is responsible for mapping that representation into another (or the same) dialect's formats.

By convention, the formats we use for TIME_MAPPING's values are compatible with Python's strftime spec, because these are common and easily understood, but if dialects expose more format variants that aren't covered by said spec, we need to create our own, like %mstrict.

I believe this context should be enough to handle this issue. It's simply a matter of finding the right representation for a given dialect format. If tests break, I believe that means that they're either incorrect or the chosen format is incorrect.

@deepyaman

Copy link
Copy Markdown
Author

Ok, in any case, the mechanism is quite clear: TIME_MAPPING tries to map dialect-specific formats to a canonical representation (think of it as a "format IR") that captures their semantics and then INVERSE_TIME_MAPPING is responsible for mapping that representation into another (or the same) dialect's formats.

By convention, the formats we use for TIME_MAPPING's values are compatible with Python's strftime spec, because these are common and easily understood, but if dialects expose more format variants that aren't covered by said spec, we need to create our own, like %mstrict.

I believe this context should be enough to handle this issue. It's simply a matter of finding the right representation for a given dialect format. If tests break, I believe that means that they're either incorrect or the chosen format is incorrect.

The reason for introducing the STRICT_TIME_MAPPING and LENIENT_INVERSE_TIME_MAPPING is that the existing time mappings (TIME_MAPPING, INVERSE_TIME_MAPPING) are just used for capturing from/to dialect, not parse/format role. The specific code paths:

read (dialect → canonical) write (canonical → dialect)
parse role (StrToTime…) STRICT_TIME_MAPPING LENIENT_INVERSE_TIME_MAPPING
format role (TimeToStr…) TIME_MAPPING INVERSE_TIME_MAPPING

DuckDB %m has to become M when parsing and MM when formatting, so I don't think one mapping covers that. If you mean that the role piece could be solved by adding unique tokens for each role (e.g. %mparselenient, %mparsestrict, %mformatpadded, %mformatunpadded), but it would require adding adding support for that to each dialect.

Let me also follow up on Slack, if it helps clarify any misunderstandings I may have!

@deepyaman

Copy link
Copy Markdown
Author

If you mean that the role piece could be solved by adding unique tokens for each role (e.g. %mparselenient, %mparsestrict, %mformatpadded, %mformatunpadded), but it would require adding adding support for that to each dialect.

Here's a Claude-generated prototype of what that could look like: https://github.com/deepyaman/sqlglot/tree/four-token-prototype

@georgesittas

Copy link
Copy Markdown
Collaborator

Ok, I get it now, thanks for clarifying. I wasn't thinking about the differences between strptime and strftime. This is somewhat annoying because it complicates things.

I need to think more about this. Probably going to circle around to this PR in a few days.

@georgesittas georgesittas self-assigned this Jun 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants