fix(spark): parse month, day without leading zeros#7773
Conversation
Spark 3+ parses MM/dd strictly (single-digit months/days don't parse), unlike the lax %m/%d most dialects produce. Map Spark's MM/dd to a distinct canonical token (%mstrict/%dstrict) only when parsing, so the strict format roundtrips while lax %m/%d still becomes the lenient M/d. Formatting keeps the padded %m/%d -> MM/dd. The internal tokens degrade to %m/%d for every other dialect via the metaclass inverse fallback and the generic strtotime_sql. Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>
Replace the per-call local import and `assert isinstance(self.dialect, Spark)` in SparkGenerator.format_time with a TYPE_CHECKING-guarded `dialect: Spark` class annotation, removing the runtime overhead while keeping mypy happy. Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>
Removes the pyspark-specific StringToDate/StringToTimestamp override that mapped Python strftime tokens to lenient single-letter Spark specifiers. This is handled upstream by tobymao/sqlglot#7773, which makes Spark's StrToDate/StrToTime emit lenient `M`/`d` while keeping strict `MM`/`dd` only when round-tripping Spark's own SQL. Until that sqlglot release is available and the lower bound is bumped, the released sqlglot emits strict `MM/dd` and Spark returns NULL for single-digit month/day, so mark the single-digit `as_date` tests `notyet` for pyspark/databricks (mirroring impala). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… [revert before merge] Temporary: installs the patched sqlglot (tobymao/sqlglot#7773) in the test_pyspark job via a submodule-stripped clone + `uv add` path source, and drops the pyspark/databricks `notyet` markers so the single-digit `as_date` tests are expected to PASS. This proves the PR's generated SQL actually executes single-digit dates in strict (non-legacy) Spark, not just an identical-format inference. Revert this commit; it must not merge. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CI on the override-free code revealed that Spark 3+ in strict (non-legacy) mode does not silently return NULL for single-digit %m/%d — it raises: SparkUpgradeException on Spark 3.5 and DateTimeException on Spark 4.0 (both subclasses of pyspark's base PySparkException, including the Spark Connect variant). Databricks raises DatabricksServerOperationError. Point the pyspark/databricks `notyet` markers at those exceptions instead of AssertionError so the single-digit `as_date` tests xfail cleanly until the sqlglot fix (tobymao/sqlglot#7773) is released and the lower bound is bumped. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The strict->lax canonical fallback was applied only to INVERSE_TIME_MAPPING, so BigQuery's `FORMAT '...'` clause (which uses INVERSE_FORMAT_MAPPING) leaked the internal token, e.g. Spark `TO_DATE(x, 'MM/dd/yyyy')` -> BigQuery `... FORMAT 'MMstrict/DDstrict/YYYY'`. Apply the same fallback to INVERSE_FORMAT_MAPPING via a shared helper so it degrades to 'MM/DD/YYYY'. Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>
…LAUDE Reuse dialect.STRICT_PARSE_TIME_EXPRESSIONS in SparkGenerator.format_time instead of a duplicate LENIENT_TIME_EXPRESSIONS tuple, removing the cross-module constant that had to be kept in sync by hand. Also document why the base strtotime_sql degrades only the strict tokens rather than routing the whole format through self.format_time(). Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>
Declare LENIENT_INVERSE_TIME_MAPPING/_TRIE on the base Dialect (mirroring STRICT_TIME_MAPPING and INVERSE_TIME_MAPPING) and build the trie in the metaclass. This drops the Spark-only attribute that forced a `dialect: Spark` type-narrowing annotation in SparkGenerator plus its TYPE_CHECKING import, and lets spark.py stop hand-building the trie. No behavior change. Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>
…CLAUDE Replace the `inverse_time_mapping is None and ...` guard with the same `inverse_time_mapping or ...` fallback idiom the base Generator.format_time uses. Same semantics (an explicitly-passed mapping still wins), but consistent with the base and without the extra clause. No behavior change. Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>
…AUDE Drop the "used by the generator's format_time override / trie built by the metaclass" note on Spark.LENIENT_INVERSE_TIME_MAPPING (no other mapping documents where it's consumed or that the metaclass builds its trie), and reword the base docstring's awkward "Lets e.g. Spark" to the parenthetical "(e.g. Spark)" form already used by STRICT_TIME_MAPPING. Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>
|
@deepyaman the solution seems more complicated as expected. We should only change the We should add in spark's |
I think this is quite similar to what I had initially. The problem is that things break in the format direction: >>> import sqlglot
>>> sqlglot.transpile("SELECT STRPTIME(x,'%Y-%m-%d')", read="duckdb", write="spark")
["SELECT TO_TIMESTAMP(x, 'yyyy-M-d')"] # good
>>> sqlglot.transpile("SELECT DATE_FORMAT(x,'yyyy-MM-dd')", read="hive", write="spark")
["SELECT DATE_FORMAT(x, 'yyyy-M-d')"] # bad; should be 'yyyy-MM-dd'The added complexity is needed in order to support both parse and format directions; reimplementing the simplification, get 17 test failures (all in the format direction). FYI what I tried for the "simple" approach is to update TIME_MAPPING = {
**Spark2.TIME_MAPPING,
"MM": "%mstrict",
"dd": "%dstrict",
}
INVERSE_TIME_MAPPING = {
"%m": "M",
"%d": "d",
}Not sure if you had something different in mind. |
If Hive follows what Spark does, shouldn't your |
|
@deepyaman yeah, I thought we have focused on spark only my bad. Hive as I mentioned is strict so we should move there the logic (you can test it also locally to verify like the rest of the dialects). So, the simple map approach I mentioned should work, without any additions. |
@georgesittas @geooo109 Yes, sorry, I missed that Hive was strict, since I haven't been testing anything directly using Hive (well, I guess Impala). I'm happy to update the mappings there. Spark2 is still lenient, so I'll need to re-override that, and then back again for Spark (strict by default, unless you specify However, this can be illustrated using another dialect (most of the failing tests if take the "simple" fix aren't the mistaken Hive ones). For instance: >>> import sqlglot
>>> sqlglot.transpile("SELECT STRPTIME(x, '%Y-%m-%d')", read="duckdb", write="spark")
['yyyy-M-d' ] # good
>>> sqlglot.transpile("SELECT STRFTIME(x, '%Y-%m-%d')", read="duckdb", write="spark")
['yyyy-M-d'] # bad; should be 'yyyy-MM-dd' |
|
I'm not sure I understand: why is |
Will do! |
|
Ok, in any case, the mechanism is quite clear: By convention, the formats we use for I believe this context should be enough to handle this issue. It's simply a matter of finding the right representation for a given dialect format. If tests break, I believe that means that they're either incorrect or the chosen format is incorrect. |
The reason for introducing the
DuckDB Let me also follow up on Slack, if it helps clarify any misunderstandings I may have! |
Here's a Claude-generated prototype of what that could look like: https://github.com/deepyaman/sqlglot/tree/four-token-prototype |
|
Ok, I get it now, thanks for clarifying. I wasn't thinking about the differences between I need to think more about this. Probably going to circle around to this PR in a few days. |
Following up on #7739 (comment)
Regarding testing each dialect, I ended up testing Ibis backends, which largely supports the earlier conclusion that Spark 3+ (and Databricks) were the issues:
Impala and Flink don't have SQLGlot dialect (although Ibis extends the Hive dialect for Impala), and Flink issues specifically run deeper. I can try to spend some time testing the backends Ibis doesn't implement this functionality for separately, if helpful.
Note: I did use Claude specifically to implement the additions in this PR, as well as to test across Ibis backends, but I have reviewed the code/implementation and think it makes sense.