Support Unicode identifiers in Aster source files #39

@brianp

What

I've been working through the audit findings and one of the items (non-ASCII rejection in the lexer) led to a bigger question: should Aster support Unicode identifiers?

We just landed a fix to allow non-ASCII characters inside string literals (emoji, CJK, accented chars, etc.), but identifiers are still ASCII-only. That means let café = 42 is a lexer error, even though most modern languages (Python, Rust, Swift, Go, Java, Kotlin) have supported Unicode identifiers for years.

Why this matters

Non-English-speaking developers are forced to transliterate every identifier into Latin characters. That hurts readability for the people writing and maintaining that code. It also blocks idiomatic mathematical notation (Greek letters for variables, etc.).

RFC

There's an RFC at unicode-identifiers.md covering the full design:

  • Source encoding: UTF-8 required (already effectively true)
  • Identifier rules: UAX #31 Default Identifiers (same approach as Python PEP 3131, Rust RFC 2457)
  • Normalization: NFC (matching Rust and Swift)
  • Confusable detection: deferred to a future lint pass, not in scope for v1
  • Dependencies: unicode-ident crate (UAX #31 tables) and unicode-normalization crate (NFC)
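
To illustrate why the normalization bullet matters, here is a minimal std-only sketch. It shows that the same visible identifier can arrive as two different code point sequences, so a lexer that interns raw bytes would treat them as distinct names. The helper name code_points is hypothetical and not part of the RFC; an actual implementation would presumably normalize with the unicode-normalization crate instead of inspecting code points by hand.

```rust
/// Hypothetical helper: lists the Unicode scalar values of a string.
/// Used only to make the difference between spellings visible.
pub fn code_points(s: &str) -> Vec<u32> {
    s.chars().map(|c| c as u32).collect()
}

/// Byte-for-byte comparison, which is all an interner sees if the lexer
/// skips normalization.
pub fn byte_equal(a: &str, b: &str) -> bool {
    a.as_bytes() == b.as_bytes()
}

// Precomposed spelling: 'e' with acute accent as one scalar (U+00E9).
pub const COMPOSED: &str = "caf\u{e9}";
// Decomposed spelling: plain 'e' followed by a combining acute (U+0301).
pub const DECOMPOSED: &str = "cafe\u{301}";

// Assumption: with the unicode-normalization crate, applying NFC to both
// spellings before interning would yield the composed form for each, so the
// two would resolve to the same identifier.
```

Both constants render as café, yet byte_equal(COMPOSED, DECOMPOSED) is false, which is the case step 4 of the implementation scope is meant to handle.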

Implementation scope

The main work is in the lexer:

  1. Replace col byte-offset tracking with a proper byte-offset cursor
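  2. Use unicode_ident::is_xid_start(ch) || ch == '_' for identifier start detection (currently ch.is_ascii_alphabetic() || ch == '_'). The explicit '_' check has to stay: U+005F is XID_Continue but not XID_Start, so is_xid_start alone would reject identifiers with a leading underscore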
  3. Use unicode_ident::is_xid_continue(ch) for identifier continue (currently ch.is_ascii_alphanumeric() || ch == '_')
  4. NFC-normalize collected identifiers before keyword lookup and interning
  5. Update span tracking throughout for multi-byte characters
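
The lexer steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not Aster's actual lexer: Span and scan_identifier are hypothetical names, and the two predicates are std-only stand-ins that approximate (but do not equal) the XID_Start and XID_Continue tables the unicode-ident crate would provide. NFC normalization (step 4) is left out because it needs the external crate.

```rust
/// Half-open byte range into the source text (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

// Stand-in for `unicode_ident::is_xid_start(ch) || ch == '_'`.
// `char::is_alphabetic` is only an approximation of XID_Start.
fn is_ident_start(ch: char) -> bool {
    ch == '_' || ch.is_alphabetic()
}

// Stand-in for `unicode_ident::is_xid_continue(ch)`.
// `char::is_alphanumeric` is only an approximation of XID_Continue.
fn is_ident_continue(ch: char) -> bool {
    ch == '_' || ch.is_alphanumeric()
}

/// Scans one identifier starting at byte offset `pos`.
/// Returns its byte span, or None if no identifier starts there
/// (including when `pos` is not on a character boundary).
pub fn scan_identifier(src: &str, pos: usize) -> Option<Span> {
    let rest = src.get(pos..)?;
    let mut chars = rest.char_indices();
    let (_, first) = chars.next()?;
    if !is_ident_start(first) {
        return None;
    }
    // Advance by each character's UTF-8 length, never by a fixed 1.
    let mut end = pos + first.len_utf8();
    for (offset, ch) in chars {
        if !is_ident_continue(ch) {
            break;
        }
        end = pos + offset + ch.len_utf8();
    }
    Some(Span { start: pos, end })
}
```

For the source text let café = 42, calling scan_identifier at byte offset 4 yields a span covering bytes 4..9, because é occupies two bytes in UTF-8. Slicing the source with that span gives back café intact.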

The formatter, parser, and type checker should work without changes since they operate on token spans (byte offsets into the source). Codegen emits names as UTF-8 strings into DWARF debug info, which should also work, but needs testing.
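
One place where byte-offset spans and multi-byte characters interact is diagnostics, where a human-readable line and column must be derived from a byte offset. The sketch below shows one way to do that; line_col is a hypothetical helper, and counting columns in Unicode scalar values (rather than bytes or grapheme clusters) is an assumption, not something the RFC specifies.

```rust
/// Converts a byte offset into a 1-based (line, column) pair, counting
/// columns in Unicode scalar values. Returns None if `offset` is out of
/// range or falls inside a multi-byte character.
pub fn line_col(src: &str, offset: usize) -> Option<(usize, usize)> {
    // `get` fails on out-of-range offsets and on non-boundary offsets.
    let prefix = src.get(..offset)?;
    let line = prefix.matches('\n').count() + 1;
    // Byte index where the current line begins.
    let line_start = prefix.rfind('\n').map_or(0, |i| i + 1);
    let col = prefix[line_start..].chars().count() + 1;
    Some((line, col))
}
```

With this approach the stored spans stay as plain byte offsets, and the character-aware work happens only when a position is reported to the user.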

Migration

No breaking changes. ASCII is a subset of UTF-8, and every ASCII identifier character is accepted under the UAX #31 rules, provided a leading '_' stays allowed alongside XID_Start. Every existing program stays valid.

Labels

  • enhancement
  • lexer

Labels

  • enhancement: New feature or request
  • medium: Should get done, not urgent
  • parser: Parsing, syntax, lexer
  • rfc: Tied to a specific RFC or design doc
