Support Unicode identifiers in Aster source files #39
Description
What
I've been working through the audit findings and one of the items (non-ASCII rejection in the lexer) led to a bigger question: should Aster support Unicode identifiers?
We just landed a fix to allow non-ASCII characters inside string literals (emoji, CJK, accented chars, etc.), but identifiers are still ASCII-only. That means let café = 42 is a lexer error, even though most modern languages (Python, Rust, Swift, Go, Java, Kotlin) have supported Unicode identifiers for years.
Why this matters
Non-English-speaking developers are forced to transliterate every identifier into Latin characters. That hurts readability for the people writing and maintaining that code. It also blocks idiomatic mathematical notation (Greek letters for variables, etc.).
RFC
There's an RFC at unicode-identifiers.md covering the full design:
- Source encoding: UTF-8 required (already effectively true)
- Identifier rules: UAX #31 Default Identifiers (same approach as Python PEP 3131, Rust RFC 2457)
- Normalization: NFC (matching Rust and Swift)
- Confusable detection: deferred to a future lint pass, not in scope for v1
- Dependencies: unicode-ident crate (UAX #31 tables) and unicode-normalization crate (NFC)
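To illustrate why the normalization bullet matters, here is a small std-only sketch. The constant and function names are made up for this example and are not from the RFC. It shows that two visually identical spellings of café are different byte sequences, so an interner that compares raw bytes would treat them as two distinct identifiers. Normalizing both to NFC before interning (for instance with the nfc() adapter from the unicode-normalization crate) is what would make them compare equal; that step is not performed here.

```rust
// "café" with é as a single precomposed code point (U+00E9).
const COMPOSED: &str = "caf\u{e9}";
// "café" with a plain 'e' followed by a combining acute accent (U+0301).
const DECOMPOSED: &str = "cafe\u{301}";

/// Hypothetical helper: the comparison a naive interner would do,
/// i.e. byte-for-byte equality with no normalization.
fn same_bytes(a: &str, b: &str) -> bool {
    a.as_bytes() == b.as_bytes()
}

/// Hypothetical helper: (byte length, code point count) of a spelling.
fn shape(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}
```

Without a normalization step, let café = 1 typed with one input method and a later use of café typed with another could fail to resolve to the same binding.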
Implementation scope
The main work is in the lexer:
- Replace col byte-offset tracking with a proper byte-offset cursor
- Use unicode_ident::is_xid_start(ch) for identifier start detection (currently ch.is_ascii_alphabetic() || ch == '_')
- Use unicode_ident::is_xid_continue(ch) for identifier continue (currently ch.is_ascii_alphanumeric() || ch == '_')
- NFC-normalize collected identifiers before keyword lookup and interning
- Update span tracking throughout for multi-byte characters
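The lexer changes above could look roughly like the following sketch. It is not Aster's actual lexer: Span, scan_ident, and the two predicate functions are hypothetical names, and the predicates use std's char::is_alphabetic / char::is_alphanumeric as stand-ins so the example builds without external crates. A real implementation would call unicode_ident::is_xid_start and unicode_ident::is_xid_continue instead. Note that '_' is XID_Continue but not XID_Start, so the explicit underscore check on the start predicate would presumably have to stay. NFC normalization is omitted here.

```rust
/// Half-open byte-offset span into the source: [start, end).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

// Stand-in for unicode_ident::is_xid_start (NOT the real UAX #31 table).
// The underscore is checked explicitly because '_' is not XID_Start.
fn is_ident_start(ch: char) -> bool {
    ch.is_alphabetic() || ch == '_'
}

// Stand-in for unicode_ident::is_xid_continue (NOT the real UAX #31 table).
fn is_ident_continue(ch: char) -> bool {
    ch.is_alphanumeric() || ch == '_'
}

/// Scans one identifier beginning at byte offset `pos`.
/// Returns None if `pos` is out of range, falls inside a multi-byte
/// character, or does not point at an identifier start character.
pub fn scan_ident(src: &str, pos: usize) -> Option<Span> {
    // str::get rejects offsets that are not on a char boundary.
    let rest = src.get(pos..)?;
    let mut chars = rest.char_indices();

    let (_, first) = chars.next()?;
    if !is_ident_start(first) {
        return None;
    }

    // Advance the cursor by each character's UTF-8 width, not by 1.
    let mut end = pos + first.len_utf8();
    for (offset, ch) in chars {
        if !is_ident_continue(ch) {
            break;
        }
        end = pos + offset + ch.len_utf8();
    }
    Some(Span { start: pos, end })
}
```

For the source let café = 42; this returns the span 4..9 for café (five bytes, four characters), and slicing the source with that span yields the identifier text directly. Because spans stay in byte offsets, later stages that slice the source by span should see no difference.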
The formatter, parser, and type checker should work without changes since they operate on token spans (byte offsets into the source). Codegen emits names as UTF-8 strings into DWARF debug info, which should also work, but needs testing.
Migration
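No breaking changes. ASCII is a subset of UTF-8, and ASCII letters and digits fall within the identifier characters UAX #31 allows. The one caveat is a leading underscore: _ is XID_Continue but not XID_Start, so the lexer needs to keep accepting it explicitly at the start of an identifier. With that in place, every existing program stays valid.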
Labels
- enhancement
- lexer