Tokenizer panics when given weird unicode #70

Open
SnirkImmington opened this issue Jun 24, 2018 · 0 comments
Labels
- area: lex (Issues which affect the lexing and tokenizing of code)
- issue: bug (Bug report)
- points: 1 (Simple or straightforward changes to the code)
- priority: low (Consider higher priority issues first)

Comments

SnirkImmington (Collaborator) commented Jun 24, 2018

Example:

fn main()
    let m̞͉̮y̛̛͟Ȅ̐ͫd̺̊͡gͩ͞ͅy᷂᷆᷆V̴ͮ̚ȁ̖᷂r̉ͦ͜ = 6

This actually causes a panic in the tokenizer, because I didn't write it to return a Result, and there are a number of cases in next() which check for various properties of the peeked character:

        if peek == '\n' {
            // ... handle newlines
        }
        else if peek.is_number() {
            self.parse_float_literal()
        } else if peek == '_' || peek.is_letter() {
            self.parse_keyword_or_ident()
        } else if char_is_symbol(peek) {
            self.parse_symbol()
        } else {
            panic!("Unknown character `{:?}` in next_line", peek);
        }

There's a (somewhat hardcoded) check, char_is_symbol, but we also check peek.is_letter(), which is defined in the unicode_categories crate. The default case, where the panic is hit, is reached when a token begins with a Unicode character that is not in the letter category and doesn't satisfy any of the other conditions.
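For illustration (using std's char::is_alphabetic() as a rough stand-in for unicode_categories' is_letter()): the combining marks in the example identifier are in Unicode category Mn (Mark, nonspacing), not L (Letter), so they fail the letter check and fall through to the panic arm:

```rust
fn main() {
    // 'm' is in the Letter category, so the ident branch accepts it.
    assert!('m'.is_alphabetic());
    // U+0334 COMBINING TILDE OVERLAY is category Mn (Mark, nonspacing),
    // so it fails every branch and falls through to the panic!() arm.
    assert!(!'\u{0334}'.is_alphabetic());
    assert!(!'\u{0334}'.is_numeric());
    assert!(!'\u{0334}'.is_whitespace());
}
```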

The tokenizer should be changed to return a Result type, so that it can give at least a basic indication of "this symbol should not be here". This could also cover lexer errors for other forms of whitespace.
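A minimal sketch of that direction (the names Tokenizer, Token, and LexError are hypothetical, and std's is_alphabetic()/is_ascii_digit() stand in for the unicode_categories checks): the panic arm becomes an Err value carrying the offending character and its position, so the caller decides how to report it:

```rust
use std::iter::Peekable;
use std::str::CharIndices;

#[derive(Debug, PartialEq)]
enum Token {
    Newline,
    Number(String),
    Ident(String),
    Symbol(char),
}

#[derive(Debug, PartialEq)]
enum LexError {
    UnknownChar { ch: char, position: usize },
}

struct Tokenizer<'a> {
    chars: Peekable<CharIndices<'a>>,
}

impl<'a> Tokenizer<'a> {
    fn new(src: &'a str) -> Self {
        Tokenizer { chars: src.char_indices().peekable() }
    }

    /// Returns None at end of input; otherwise Ok(token), or Err for a
    /// character no branch recognizes (previously the panic!() case).
    fn next_token(&mut self) -> Option<Result<Token, LexError>> {
        let &(pos, peek) = self.chars.peek()?;
        if peek == '\n' {
            self.chars.next();
            Some(Ok(Token::Newline))
        } else if peek.is_whitespace() {
            self.chars.next();
            self.next_token() // skip other whitespace
        } else if peek.is_ascii_digit() {
            Some(Ok(Token::Number(self.take_while(|c| c.is_ascii_digit()))))
        } else if peek == '_' || peek.is_alphabetic() {
            Some(Ok(Token::Ident(
                self.take_while(|c| c == '_' || c.is_alphanumeric()),
            )))
        } else if "+-*/=<>()".contains(peek) {
            self.chars.next();
            Some(Ok(Token::Symbol(peek)))
        } else {
            self.chars.next(); // consume it so the caller can keep lexing
            Some(Err(LexError::UnknownChar { ch: peek, position: pos }))
        }
    }

    fn take_while(&mut self, pred: impl Fn(char) -> bool) -> String {
        let mut out = String::new();
        while let Some(&(_, c)) = self.chars.peek() {
            if !pred(c) {
                break;
            }
            out.push(c);
            self.chars.next();
        }
        out
    }
}

fn main() {
    // "m" followed by U+0334 (a combining mark): the mark is rejected as a
    // recoverable error instead of panicking, and lexing continues after it.
    let mut t = Tokenizer::new("m\u{0334}y = 6");
    assert_eq!(t.next_token(), Some(Ok(Token::Ident("m".into()))));
    assert_eq!(
        t.next_token(),
        Some(Err(LexError::UnknownChar { ch: '\u{0334}', position: 1 }))
    );
    assert_eq!(t.next_token(), Some(Ok(Token::Ident("y".into()))));
    assert_eq!(t.next_token(), Some(Ok(Token::Symbol('='))));
    assert_eq!(t.next_token(), Some(Ok(Token::Number("6".into()))));
    assert_eq!(t.next_token(), None);
}
```

Consuming the unknown character before returning Err lets the lexer resume at the next character, so one stray mark produces one error rather than aborting the whole tokenizer.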

@SnirkImmington SnirkImmington added area: lex Issues which affect the lexing and tokenizing of code priority: low Consider higher priority issues first issue: bug Bug report labels Jun 24, 2018
@SnirkImmington SnirkImmington added the points: 1 Simple or straightforward changes to the code label Jan 9, 2019