Fix tokenization of qualified identifiers with numeric prefix. #1803
Conversation
Queries with qualified identifiers having numeric prefixes currently fail to parse due to incorrect tokenization. Currently, "t.123abc" tokenizes as "t" (Word) followed by ".123abc" (Number).
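For concreteness, a minimal sketch (using the crate's public tokenizer API; the exact debug output may differ by version) that reproduces the reported token stream:

```rust
use sqlparser::dialect::MySqlDialect;
use sqlparser::tokenizer::Tokenizer;

fn main() {
    // Before this fix: Word("t") followed by Number(".123abc"),
    // which the parser then fails on.
    let tokens = Tokenizer::new(&MySqlDialect {}, "t.123abc")
        .tokenize()
        .unwrap();
    println!("{tokens:?}");
}
```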
Force-pushed from bde5493 to 0279883.
tests/sqlparser_mysql.rs (Outdated)

```rust
mysql().verified_stmt("SELECT t.15to29 FROM my_table AS t");
match mysql()
    .parse_sql_statements("SELECT t.15to29 FROM my_table AS t")
    .unwrap()
    .pop()
{
```
Suggested change:

```diff
-mysql().verified_stmt("SELECT t.15to29 FROM my_table AS t");
-match mysql()
-    .parse_sql_statements("SELECT t.15to29 FROM my_table AS t")
-    .unwrap()
-    .pop()
-{
+match mysql().verified_stmt("SELECT t.15to29 FROM my_table AS t") {
```
Does this format work the same way, removing the need for the second parse call?
Yes, that simplifies the test; I was simply unaware of the return value. Thanks for the suggestion.
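For reference, a sketch of the simplified shape (the match arms here are illustrative placeholders, not the actual test body):

```rust
// `verified_stmt` parses the SQL, checks that it round-trips, and returns
// the parsed `Statement`, so the second `parse_sql_statements` call and the
// `.pop()` are unnecessary.
match mysql().verified_stmt("SELECT t.15to29 FROM my_table AS t") {
    Statement::Query(query) => {
        // ...assertions on the projection would go here...
        let _ = query;
    }
    stmt => panic!("unexpected statement: {stmt:?}"),
}
```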
```diff
@@ -895,7 +895,7 @@ impl<'a> Tokenizer<'a> {
         };

         let mut location = state.location();
-        while let Some(token) = self.next_token(&mut state)? {
+        while let Some(token) = self.next_token(&mut state, buf.last().map(|t| &t.token))? {
```
Can we instead implement this in the parser, as a special case for `self.dialect.supports_numeric_prefix()`? Somewhat similar to what we do for BigQuery here: when it's time to combine identifiers into a compound identifier, we check whether each subsequent identifier part is unquoted and prefixed by `.`, and if so we drop the prefix. I imagine that would be done here. I'm thinking that could be a smaller change, since the behavior needs a bit of extra context around the tokens, which the tokenizer isn't so good at handling cleanly.
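For illustration, a rough sketch of that parser-side idea (the helper name and shape are hypothetical, not the crate's actual parsing code):

```rust
// Hypothetical helper: in a dialect where `supports_numeric_prefix()` is
// true, an unquoted compound-identifier part that was tokenized with a
// leading `.` such as ".15to29" has the prefix dropped and becomes the
// identifier part "15to29".
fn strip_compound_part_prefix(part: &str) -> Option<&str> {
    part.strip_prefix('.')
}

fn main() {
    assert_eq!(strip_compound_part_prefix(".15to29"), Some("15to29"));
    assert_eq!(strip_compound_part_prefix("15to29"), None);
}
```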
Thank you for the review and the suggestion. The latest commit implements an alternative approach within `Parser::parse_compound_expr`. However, I have to admit that I personally prefer the previous fix in the tokenizer, for a few reasons:

1. The `Tokenizer` is part of the public API of the crate, and without fixing this in the tokenizer, it will continue to exhibit the following behavior with the MySQL dialect:
   - `t.1to2` tokenizes to `t` (Word), `.1to2` (Word)
   - `t.1e2` tokenizes to `t` (Word), `.1e2` (Number)
   - ``t.`1to2` `` tokenizes to `t` (Word), `.` (Period), `1to2` (Word)
   - ``t.`1e2` `` tokenizes to `t` (Word), `.` (Period), `1e2` (Word)

   This could be very surprising for users and could arguably be considered incorrect (when using the MySQL dialect); a small sketch making these cases observable follows the list below.
2. The handling of `Word` and `Number` tokens in `parse_compound_expr` is not the most elegant, having to split off the `.` and create the correct off-by-one spans accordingly.
3. Unqualified identifiers that start with digits are already handled in the tokenizer, and I think they have to be (see Support identifiers beginning with digits in MySQL #856). Handling the same problem of misinterpreted tokens for qualified identifiers in the parser seems a bit disconnected, and like a downstream workaround for the tokenizer producing incorrect tokens; at least that is how I currently perceive it (see point 1).
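As referenced above, a small sketch (assuming the crate's public tokenizer API) that prints the token stream for each of the four cases:

```rust
use sqlparser::dialect::MySqlDialect;
use sqlparser::tokenizer::Tokenizer;

fn main() {
    // Prints the token stream described in the list above for each query.
    for sql in ["t.1to2", "t.1e2", "t.`1to2`", "t.`1e2`"] {
        let tokens = Tokenizer::new(&MySqlDialect {}, sql).tokenize().unwrap();
        println!("{sql} -> {tokens:?}");
    }
}
```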
Let me know which solution you prefer (with or without the latest commit) or if you have another idea.
I think that's fair; we can revert back to the tokenizer version in that case, to keep the tokenizer behavior non-surprising.
Done!
tests/sqlparser_mysql.rs (Outdated)

```rust
#[test]
fn parse_qualified_identifiers_with_numeric_prefix() {
    // Case 1: Qualified column name that starts with digits.
    mysql().verified_stmt("SELECT t.15to29 FROM my_table AS t");
```
Could we add a test case for the behavior of multiple accesses, e.g. `t.15to29.16to30`?
Good idea, I added such a test.
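For reference, a minimal sketch of what such a test could look like (modeled on the existing round-trip tests; the name and exact cases in the committed test may differ):

```rust
#[test]
fn parse_qualified_identifiers_with_multiple_numeric_prefixes() {
    // Multiple accesses where each part starts with digits.
    mysql().verified_stmt("SELECT t.15to29.16to30 FROM my_table AS t");
}
```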
Force-pushed from 9c9b602 to aab48d4.
This reverts commit aab48d4.
This PR is a follow-up to #856. The remaining problem is that queries with qualified identifiers having numeric prefixes currently fail to parse due to incorrect tokenization. For example, `SELECT t.123abc FROM my_table AS t` is currently tokenized as `t` (Word), `.123abc` (Number), whereas it should be tokenized as `t` (Word), `.` (Period), `123abc` (Word).

Of course, the potential ambiguity of identifiers of the form `12e34`, i.e. identifiers that on their own could be read as number tokens, also needs to be taken into account. If `12e34` is unqualified, it should be tokenized as a number (this is already the case), but in `SELECT t.12e34 FROM my_table t` it should be tokenized as a word as well, to be able to successfully parse it as a compound identifier in MySQL.

The only option I saw to solve these problems unambiguously was to give the private `next_token` function in the `Tokenizer` a reference to the previous token as context in its second argument, which can then be used to disambiguate these cases and correctly decide what type of token to produce for dialects that support numeric prefixes. I included commentary and tests to help further clarify the situation.
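To illustrate the disambiguation rule described above, a small sketch (not the actual `next_token` implementation; the helper name is made up):

```rust
use sqlparser::tokenizer::Token;

// Illustrative only: with the previous token available as context, `.12e34`
// encountered right after the Word `t` can be re-read as a Period followed
// by the identifier `12e34`, while a bare `12e34` with no preceding Word
// stays a Number.
fn dot_starts_compound_access(prev: Option<&Token>) -> bool {
    matches!(prev, Some(Token::Word(_)))
}
```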