
Fix tokenization of qualified identifiers with numeric prefix. #1803


Merged: 5 commits, Apr 11, 2025

Conversation

@romanb (Contributor) commented on Apr 9, 2025

This PR is a follow-up to #856. The remaining problem is that queries with qualified identifiers having numeric prefixes currently fail to parse due to incorrect tokenization. For example:

SELECT t.123abc FROM my_table t 

This is currently tokenized as

...
t (Word)
.123abc (Word)
...

whereas it should be tokenized as

...
t (Word)
. (Period)
123abc (Word)
...

Of course, the potential ambiguity of identifiers of the form 12e34, which on their own could be read as number tokens, also needs to be taken into account. If 12e34 is unqualified, it should be tokenized as a number (this is already the case), but in SELECT t.12e34 FROM my_table t it should be tokenized as a word, so that it can be parsed as a compound identifier in MySQL.

The only option I saw to solve these problems unambiguously was to give the private next_token function in the Tokenizer a reference to the previous token as a second argument. That context can then be used to disambiguate these cases and correctly decide what type of token to produce for dialects that support numeric prefixes. I included commentary and tests to help further clarify the situation.
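As a standalone sketch of the idea (this is not the crate's actual next_token signature; Token, looks_like_number, and classify are names invented for this illustration), the previous-token context might be used like this:

```rust
// Illustrative only: a run of alphanumeric characters is classified
// using the previous token as context. After a Period (i.e. in
// "t.12e34"), even a number-like run becomes a Word, so the parser can
// build a compound identifier from it.
#[derive(Debug, PartialEq)]
enum Token {
    Word(String),
    Number(String),
    Period,
}

/// True for runs that on their own would be number literals,
/// e.g. "1234" or exponent forms like "12e34".
fn looks_like_number(run: &str) -> bool {
    let digits = |s: &str| !s.is_empty() && s.chars().all(|c| c.is_ascii_digit());
    let mut parts = run.splitn(2, |c: char| c == 'e' || c == 'E');
    let mantissa = parts.next().unwrap_or("");
    let exponent = parts.next();
    digits(mantissa) && exponent.map_or(true, digits)
}

fn classify(run: &str, prev: Option<&Token>) -> Token {
    let after_period = matches!(prev, Some(Token::Period));
    if looks_like_number(run) && !after_period {
        Token::Number(run.to_string())
    } else {
        Token::Word(run.to_string())
    }
}

fn main() {
    // Unqualified: a number, as before.
    assert_eq!(classify("12e34", None), Token::Number("12e34".into()));
    // Qualified (previous token is a Period): a word.
    assert_eq!(classify("12e34", Some(&Token::Period)), Token::Word("12e34".into()));
    // "15to29" is never number-like, so it is a Word either way.
    assert_eq!(classify("15to29", None), Token::Word("15to29".into()));
}
```

The real change threads this context through as `buf.last().map(|t| &t.token)` in the tokenizer loop, as shown in the diff further down.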

Queries with qualified identifiers having numeric prefixes currently
fail to parse due to incorrect tokenization.

Currently, "t.123abc" tokenizes as "t" (Word) followed by ".123abc"
(Word).
@romanb force-pushed the qualified-identifier-numeric-prefix branch from bde5493 to 0279883 on April 9, 2025
Comment on lines 1932 to 1937
mysql().verified_stmt("SELECT t.15to29 FROM my_table AS t");
match mysql()
.parse_sql_statements("SELECT t.15to29 FROM my_table AS t")
.unwrap()
.pop()
{
Contributor:
Suggested change
mysql().verified_stmt("SELECT t.15to29 FROM my_table AS t");
match mysql()
.parse_sql_statements("SELECT t.15to29 FROM my_table AS t")
.unwrap()
.pop()
{
match mysql().verified_stmt("SELECT t.15to29 FROM my_table AS t") {

Does this form work the same, removing the second parse call?

@romanb (author):

Yes, that simplifies the test; I was simply unaware of the return value. Thanks for the suggestion.

@@ -895,7 +895,7 @@ impl<'a> Tokenizer<'a> {
};

let mut location = state.location();
while let Some(token) = self.next_token(&mut state)? {
while let Some(token) = self.next_token(&mut state, buf.last().map(|t| &t.token))? {
Contributor:

Could we instead implement this in the parser, as a special case for self.dialect.supports_numeric_prefix(), somewhat similar to what we do for BigQuery here? When it's time to combine identifiers into a compound identifier, we would check whether each subsequent identifier part is unquoted and prefixed by ., and if so drop the prefix. I imagine that would be done here.

I'm thinking that could be a smaller change, since the behavior needs a bit of extra context around the tokens, which the tokenizer isn't so good at handling cleanly.
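The parser-side alternative can be sketched as follows (illustrative only; compound_parts is an invented stand-in, not the crate's parse_compound_expr, and span bookkeeping is glossed over):

```rust
// Given a token stream where "t.15to29.16to30" was mis-lexed into
// ["t", ".15to29", ".16to30"], recover the compound-identifier parts by
// stripping the leading '.' from each unquoted part.
fn compound_parts(raw: &[&str]) -> Vec<String> {
    raw.iter()
        .map(|&part| part.strip_prefix('.').unwrap_or(part).to_string())
        .collect()
}

fn main() {
    let parts = compound_parts(&["t", ".15to29", ".16to30"]);
    assert_eq!(parts, vec!["t", "15to29", "16to30"]);
}
```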

@romanb (author) commented on Apr 10, 2025:

Thank you for the review and the suggestion. The latest commit implements an alternative approach within Parser#parse_compound_expr. However, I have to admit that I personally prefer the previous fix in the tokenizer, for a few reasons:

  1. The Tokenizer is part of the public API of the crate, and without fixing this in the tokenizer, it will continue to exhibit the following behavior with the MySQL dialect:

    • t.1to2 tokenizes to t (Word), .1to2 (Word)

    • t.1e2 tokenizes to t (Word), .1e2 (Number)

    • t.`1to2` tokenizes to t (Word), . (Period), 1to2 (Word)

    • t.`1e2` tokenizes to t (Word), . (Period), 1e2 (Word)

      This could be very surprising for users and could arguably be considered incorrect (when using the MySQL dialect).

  2. The handling of Word and Number tokens in parse_compound_expr is not the most elegant, since it has to split off the . and construct the correct off-by-one spans accordingly.

  3. Unqualified identifiers that start with digits are already handled in the tokenizer, and I think they have to be (see Support identifiers beginning with digits in MySQL #856). Handling the same problem of misinterpreted tokens for qualified identifiers in the parser seems a bit disconnected, like a downstream workaround for the tokenizer producing incorrect tokens, at least as I currently perceive it (see point 1).

Let me know which solution you prefer (with or without the latest commit) or if you have another idea.

Contributor:

I think that's fair; can we revert to the tokenizer version in that case, to keep the tokenizer behavior non-surprising?

@romanb (author):

Done!

#[test]
fn parse_qualified_identifiers_with_numeric_prefix() {
// Case 1: Qualified column name that starts with digits.
mysql().verified_stmt("SELECT t.15to29 FROM my_table AS t");
Contributor:

Could we add a test case for the behavior of multiple accesses, e.g. t.15to29.16to30?
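The chained case can be illustrated with a small self-contained splitter (invented names, not the crate's tokenizer) that produces the token stream the fix is expected to emit for such an input:

```rust
// Expected post-fix token stream for "t.15to29.16to30":
// Word("t"), Period, Word("15to29"), Period, Word("16to30").
#[derive(Debug, PartialEq)]
enum Tok {
    Word(String),
    Period,
}

fn tokenize(input: &str) -> Vec<Tok> {
    let mut toks = Vec::new();
    for chunk in input.split_inclusive('.') {
        match chunk.strip_suffix('.') {
            Some(word) => {
                if !word.is_empty() {
                    toks.push(Tok::Word(word.to_string()));
                }
                toks.push(Tok::Period);
            }
            None => toks.push(Tok::Word(chunk.to_string())),
        }
    }
    toks
}

fn main() {
    let toks = tokenize("t.15to29.16to30");
    assert_eq!(
        toks,
        vec![
            Tok::Word("t".into()),
            Tok::Period,
            Tok::Word("15to29".into()),
            Tok::Period,
            Tok::Word("16to30".into()),
        ]
    );
}
```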

@romanb (author):

Good idea, I added such a test.

@romanb force-pushed the qualified-identifier-numeric-prefix branch from 9c9b602 to aab48d4 on April 10, 2025
@iffyio (Contributor) left a review:

LGTM! Thanks @romanb!
cc @alamb

@iffyio iffyio merged commit bbc80d7 into apache:main Apr 11, 2025
9 checks passed
@romanb romanb deleted the qualified-identifier-numeric-prefix branch April 11, 2025 19:18