Tokenizer panics when given weird unicode #70

Open
SnirkImmington opened this issue Jun 24, 2018 · 0 comments
Labels
- area: lex (Issues which affect the lexing and tokenizing of code)
- issue: bug (Bug report)
- points: 1 (Simple or straightforward changes to the code)
- priority: low (Consider higher priority issues first)

Comments

SnirkImmington (Collaborator) commented Jun 24, 2018

Example:

fn main()
    let m̞͉̮y̛̛͟Ȅ̐ͫd̺̊͡gͩ͞ͅy᷂᷆᷆V̴ͮ̚ȁ̖᷂r̉ͦ͜ = 6

This actually causes a panic in the tokenizer, because I didn't write it to return a Result, and there are a number of cases in next() which check for various properties of the peeked character:

        if peek == '\n' {
            // ... handle newlines
        }
        else if peek.is_number() {
            self.parse_float_literal()
        } else if peek == '_' || peek.is_letter() {
            self.parse_keyword_or_ident()
        } else if char_is_symbol(peek) {
            self.parse_symbol()
        } else {
            panic!("Unknown character `{:?}` in next_line", peek);
        }

There's a (somewhat hardcoded) check, char_is_symbol, but we also check peek.is_letter(), which is defined in the unicode_categories crate. The default case, where the panic is hit, is reached when a token begins with a Unicode character that is not in the letter category and doesn't satisfy any of the other conditions.
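For illustration (using std's char::is_alphabetic() as a rough stand-in for unicode_categories' is_letter()): the combining marks in the example identifier are in Unicode category Mn (Mark, nonspacing), not L (Letter), so they fail the letter check and fall through to the panic arm:

```rust
fn main() {
    // 'm' is in the Letter category, so the ident branch accepts it.
    assert!('m'.is_alphabetic());
    // U+0334 COMBINING TILDE OVERLAY is category Mn (Mark, nonspacing),
    // so it fails every branch and falls through to the panic!() arm.
    assert!(!'\u{0334}'.is_alphabetic());
    assert!(!'\u{0334}'.is_numeric());
    assert!(!'\u{0334}'.is_whitespace());
}
```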

The tokenizer should be changed to return a Result type, so that it can give at least a basic indication of "this symbol should not be here". This could also cover lexer errors for other forms of whitespace.
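A minimal sketch of that direction (the names Tokenizer, Token, and LexError are hypothetical, and std's is_alphabetic()/is_ascii_digit() stand in for the unicode_categories checks): the panic arm becomes an Err value carrying the offending character and its position, so the caller decides how to report it:

```rust
use std::iter::Peekable;
use std::str::CharIndices;

#[derive(Debug, PartialEq)]
enum Token {
    Newline,
    Number(String),
    Ident(String),
    Symbol(char),
}

#[derive(Debug, PartialEq)]
enum LexError {
    UnknownChar { ch: char, position: usize },
}

struct Tokenizer<'a> {
    chars: Peekable<CharIndices<'a>>,
}

impl<'a> Tokenizer<'a> {
    fn new(src: &'a str) -> Self {
        Tokenizer { chars: src.char_indices().peekable() }
    }

    /// Returns None at end of input; otherwise Ok(token), or Err for a
    /// character no branch recognizes (previously the panic!() case).
    fn next_token(&mut self) -> Option<Result<Token, LexError>> {
        let &(pos, peek) = self.chars.peek()?;
        if peek == '\n' {
            self.chars.next();
            Some(Ok(Token::Newline))
        } else if peek.is_whitespace() {
            self.chars.next();
            self.next_token() // skip other whitespace
        } else if peek.is_ascii_digit() {
            Some(Ok(Token::Number(self.take_while(|c| c.is_ascii_digit()))))
        } else if peek == '_' || peek.is_alphabetic() {
            Some(Ok(Token::Ident(
                self.take_while(|c| c == '_' || c.is_alphanumeric()),
            )))
        } else if "+-*/=<>()".contains(peek) {
            self.chars.next();
            Some(Ok(Token::Symbol(peek)))
        } else {
            self.chars.next(); // consume it so the caller can keep lexing
            Some(Err(LexError::UnknownChar { ch: peek, position: pos }))
        }
    }

    fn take_while(&mut self, pred: impl Fn(char) -> bool) -> String {
        let mut out = String::new();
        while let Some(&(_, c)) = self.chars.peek() {
            if !pred(c) {
                break;
            }
            out.push(c);
            self.chars.next();
        }
        out
    }
}

fn main() {
    // "m" followed by U+0334 (a combining mark): the mark is rejected as a
    // recoverable error instead of panicking, and lexing continues after it.
    let mut t = Tokenizer::new("m\u{0334}y = 6");
    assert_eq!(t.next_token(), Some(Ok(Token::Ident("m".into()))));
    assert_eq!(
        t.next_token(),
        Some(Err(LexError::UnknownChar { ch: '\u{0334}', position: 1 }))
    );
    assert_eq!(t.next_token(), Some(Ok(Token::Ident("y".into()))));
    assert_eq!(t.next_token(), Some(Ok(Token::Symbol('='))));
    assert_eq!(t.next_token(), Some(Ok(Token::Number("6".into()))));
    assert_eq!(t.next_token(), None);
}
```

Consuming the unknown character before returning Err lets the lexer resume at the next character, so one stray mark produces one error rather than aborting the whole tokenizer.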

@SnirkImmington SnirkImmington added area: lex Issues which affect the lexing and tokenizing of code priority: low Consider higher priority issues first issue: bug Bug report labels Jun 24, 2018
@SnirkImmington SnirkImmington added the points: 1 Simple or straightforward changes to the code label Jan 9, 2019