
Fix tokenization of qualified identifiers with numeric prefix. #1803


Merged: 5 commits, Apr 11, 2025

Conversation

@romanb (Contributor) commented on Apr 9, 2025

This PR is a follow-up to #856. The remaining problem is that queries with qualified identifiers having numeric prefixes currently fail to parse due to incorrect tokenization. For example:

SELECT t.123abc FROM my_table t 

This is currently tokenized as

...
t (Word)
.123abc (Word)
...

whereas it should be tokenized as

...
t (Word)
. (Period)
123abc (Word)
...

Of course, the potential ambiguity of identifiers of the form 12e34, which on their own could be read as number tokens, also needs to be taken into account. If 12e34 is unqualified, it should be tokenized as a number (this is already the case), but in SELECT t.12e34 FROM my_table t it should be tokenized as a word, so that it can be parsed as a compound identifier in MySQL.

The only option I saw to solve these problems unambiguously was to give the private next_token function in the Tokenizer a reference to the previous token as a second argument. That context can then be used to disambiguate these cases and correctly decide what type of token to produce for dialects that support numeric prefixes. I included commentary and tests to help further clarify the situation.
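As a standalone sketch of the idea (this is not the crate's actual next_token signature; Token, looks_like_number, and classify are names invented for this illustration), the previous-token context might be used like this:

```rust
// Illustrative only: a run of alphanumeric characters is classified
// using the previous token as context. After a Period (i.e. in
// "t.12e34"), even a number-like run becomes a Word, so the parser can
// build a compound identifier from it.
#[derive(Debug, PartialEq)]
enum Token {
    Word(String),
    Number(String),
    Period,
}

/// True for runs that on their own would be number literals,
/// e.g. "1234" or exponent forms like "12e34".
fn looks_like_number(run: &str) -> bool {
    let digits = |s: &str| !s.is_empty() && s.chars().all(|c| c.is_ascii_digit());
    let mut parts = run.splitn(2, |c: char| c == 'e' || c == 'E');
    let mantissa = parts.next().unwrap_or("");
    let exponent = parts.next();
    digits(mantissa) && exponent.map_or(true, digits)
}

fn classify(run: &str, prev: Option<&Token>) -> Token {
    let after_period = matches!(prev, Some(Token::Period));
    if looks_like_number(run) && !after_period {
        Token::Number(run.to_string())
    } else {
        Token::Word(run.to_string())
    }
}

fn main() {
    // Unqualified: a number, as before.
    assert_eq!(classify("12e34", None), Token::Number("12e34".into()));
    // Qualified (previous token is a Period): a word.
    assert_eq!(classify("12e34", Some(&Token::Period)), Token::Word("12e34".into()));
    // "15to29" is never number-like, so it is a Word either way.
    assert_eq!(classify("15to29", None), Token::Word("15to29".into()));
}
```

The real change threads this context through as `buf.last().map(|t| &t.token)` in the tokenizer loop, as shown in the diff further down.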

Queries with qualified identifiers having numeric prefixes currently
fail to parse due to incorrect tokenization.

Currently, "t.123abc" tokenizes as "t" (Word) followed by ".123abc"
(Word).
@romanb force-pushed the qualified-identifier-numeric-prefix branch from bde5493 to 0279883 on April 9, 2025
Comment on lines 1932 to 1937
mysql().verified_stmt("SELECT t.15to29 FROM my_table AS t");
match mysql()
.parse_sql_statements("SELECT t.15to29 FROM my_table AS t")
.unwrap()
.pop()
{
Contributor:
Suggested change
mysql().verified_stmt("SELECT t.15to29 FROM my_table AS t");
match mysql()
.parse_sql_statements("SELECT t.15to29 FROM my_table AS t")
.unwrap()
.pop()
{
match mysql().verified_stmt("SELECT t.15to29 FROM my_table AS t") {

Does this form work the same, removing the second parse call?

@romanb (author):

Yes, that simplifies the test; I was simply unaware of the return value. Thanks for the suggestion.

@@ -895,7 +895,7 @@ impl<'a> Tokenizer<'a> {
};

let mut location = state.location();
while let Some(token) = self.next_token(&mut state)? {
while let Some(token) = self.next_token(&mut state, buf.last().map(|t| &t.token))? {
Contributor:

Could we instead implement this in the parser, as a special case for self.dialect.supports_numeric_prefix(), somewhat similar to what we do for BigQuery here? When it's time to combine identifiers into a compound identifier, we would check whether each subsequent identifier part is unquoted and prefixed by ., and if so drop the prefix. I imagine that would be done here.

I'm thinking that could be a smaller change, since the behavior needs a bit of extra context around the tokens, which the tokenizer isn't so good at handling cleanly.
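The parser-side alternative can be sketched as follows (illustrative only; compound_parts is an invented stand-in, not the crate's parse_compound_expr, and span bookkeeping is glossed over):

```rust
// Given a token stream where "t.15to29.16to30" was mis-lexed into
// ["t", ".15to29", ".16to30"], recover the compound-identifier parts by
// stripping the leading '.' from each unquoted part.
fn compound_parts(raw: &[&str]) -> Vec<String> {
    raw.iter()
        .map(|&part| part.strip_prefix('.').unwrap_or(part).to_string())
        .collect()
}

fn main() {
    let parts = compound_parts(&["t", ".15to29", ".16to30"]);
    assert_eq!(parts, vec!["t", "15to29", "16to30"]);
}
```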

@romanb (author) commented on Apr 10, 2025:

Thank you for the review and the suggestion. The latest commit implements an alternative approach within Parser#parse_compound_expr. However, I have to admit that I personally prefer the previous fix in the tokenizer, for a few reasons:

  1. The Tokenizer is part of the public API of the crate, and without fixing this in the tokenizer, it will continue to exhibit the following behavior with the MySQL dialect:

    • t.1to2 tokenizes to t (Word), .1to2 (Word)

    • t.1e2 tokenizes to t (Word), .1e2 (Number)

    • t.`1to2` tokenizes to t (Word), . (Period), 1to2 (Word)

    • t.`1e2` tokenizes to t (Word), . (Period), 1e2 (Word)

      This could be very surprising for users and could arguably be considered incorrect (when using the MySQL dialect).

  2. The handling of Word and Number tokens in parse_compound_expr is not the most elegant, since it has to split off the . and construct the correct off-by-one spans accordingly.

  3. Unqualified identifiers that start with digits are already handled in the tokenizer, and I think they have to be (see Support identifiers beginning with digits in MySQL #856). Handling the same problem of misinterpreted tokens for qualified identifiers in the parser seems a bit disconnected, like a downstream workaround for the tokenizer producing incorrect tokens, at least as I currently perceive it (see point 1).

Let me know which solution you prefer (with or without the latest commit) or if you have another idea.

Contributor:

I think that's fair; can we revert to the tokenizer version in that case, to keep the tokenizer behavior non-surprising?

@romanb (author):

Done!

#[test]
fn parse_qualified_identifiers_with_numeric_prefix() {
// Case 1: Qualified column name that starts with digits.
mysql().verified_stmt("SELECT t.15to29 FROM my_table AS t");
Contributor:

Could we add a test case for the behavior of multiple accesses, e.g. t.15to29.16to30?
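The chained case can be illustrated with a small self-contained splitter (invented names, not the crate's tokenizer) that produces the token stream the fix is expected to emit for such an input:

```rust
// Expected post-fix token stream for "t.15to29.16to30":
// Word("t"), Period, Word("15to29"), Period, Word("16to30").
#[derive(Debug, PartialEq)]
enum Tok {
    Word(String),
    Period,
}

fn tokenize(input: &str) -> Vec<Tok> {
    let mut toks = Vec::new();
    for chunk in input.split_inclusive('.') {
        match chunk.strip_suffix('.') {
            Some(word) => {
                if !word.is_empty() {
                    toks.push(Tok::Word(word.to_string()));
                }
                toks.push(Tok::Period);
            }
            None => toks.push(Tok::Word(chunk.to_string())),
        }
    }
    toks
}

fn main() {
    let toks = tokenize("t.15to29.16to30");
    assert_eq!(
        toks,
        vec![
            Tok::Word("t".into()),
            Tok::Period,
            Tok::Word("15to29".into()),
            Tok::Period,
            Tok::Word("16to30".into()),
        ]
    );
}
```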

@romanb (author):

Good idea, I added such a test.

@romanb force-pushed the qualified-identifier-numeric-prefix branch from 9c9b602 to aab48d4 on April 10, 2025
@iffyio (Contributor) left a review:

LGTM! Thanks @romanb!
cc @alamb

@iffyio iffyio merged commit bbc80d7 into apache:main Apr 11, 2025
9 checks passed
@romanb romanb deleted the qualified-identifier-numeric-prefix branch April 11, 2025 19:18