Support Unicode identifiers in Aster source files #39

## What

I've been working through the audit findings and one of the items (non-ASCII rejection in the lexer) led to a bigger question: should Aster support Unicode identifiers?

We just landed a fix to allow non-ASCII characters inside string literals (emoji, CJK, accented chars, etc.), but identifiers are still ASCII-only. That means `let café = 42` is a lexer error, even though most modern languages (Python, Rust, Swift, Go, Java, Kotlin) have supported Unicode identifiers for years.

## Why this matters

Non-English-speaking developers are forced to transliterate every identifier into Latin characters. That hurts readability for the people writing and maintaining that code. It also blocks idiomatic mathematical notation (Greek letters for variables, etc.).

## RFC

There's an RFC at `unicode-identifiers.md` covering the full design:

- **Source encoding:** UTF-8 required (already effectively true)
- **Identifier rules:** UAX #31 Default Identifiers (same approach as Python PEP 3131, Rust RFC 2457)
- **Normalization:** NFC (matching Rust and Swift)
- **Confusable detection:** deferred to a future lint pass, not in scope for v1
- **Dependencies:** `unicode-ident` crate (UAX #31 tables) and `unicode-normalization` crate (NFC)
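To illustrate why the normalization bullet matters, here is a minimal std-only sketch showing that the same visible identifier can arrive as two different byte sequences. The constant and function names are hypothetical and not part of the RFC. The actual NFC step needs the `unicode-normalization` crate and is not performed here.

```rust
/// `café` with é as a single precomposed code point (U+00E9). This is the NFC form.
pub const PRECOMPOSED: &str = "caf\u{00E9}";

/// `café` with é spelled as `e` followed by a combining acute accent (U+0301).
pub const DECOMPOSED: &str = "cafe\u{0301}";

/// Lists the code points of a string, which is handy when two identifiers
/// look identical on screen but do not compare equal.
pub fn code_points(s: &str) -> Vec<u32> {
    s.chars().map(u32::from).collect()
}

/// Byte-for-byte comparison, i.e. what keyword lookup and interning would see
/// if identifiers were NOT normalized first.
pub fn is_same_identifier_unnormalized(a: &str, b: &str) -> bool {
    a.as_bytes() == b.as_bytes()
}
```

Without normalization the two spellings are distinct names, so `let café = 42` typed on one keyboard might not resolve against `café` typed on another. With the crate, something like `ident.nfc().collect::<String>()` (via the `UnicodeNormalization` trait) should map both spellings to the precomposed form before interning.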

## Implementation scope

The main work is in the lexer:

1. Replace the `col`-based position tracking with a proper byte-offset cursor, so positions stay correct when a character occupies more than one byte
2. Use `unicode_ident::is_xid_start(ch) || ch == '_'` for identifier start detection (currently `ch.is_ascii_alphabetic() || ch == '_'`). The explicit `_` check has to stay, because `_` is XID_Continue but not XID_Start
3. Use `unicode_ident::is_xid_continue(ch)` for identifier continue (currently `ch.is_ascii_alphanumeric() || ch == '_'`)
4. NFC-normalize collected identifiers before keyword lookup and interning
5. Update span tracking throughout for multi-byte characters
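The scanning steps above can be sketched roughly as follows. This is a std-only approximation, not the actual Aster lexer: `Span` and `scan_identifier` are hypothetical names, and `char::is_alphabetic` / `char::is_alphanumeric` are stand-ins for `unicode_ident::is_xid_start` / `is_xid_continue` (they are close to, but not the same as, the XID properties; for example they reject combining marks). The NFC step is left out because it needs the `unicode-normalization` crate.

```rust
/// Half-open byte range into the source text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

// Stand-in for `unicode_ident::is_xid_start(ch) || ch == '_'`.
fn is_ident_start(ch: char) -> bool {
    ch == '_' || ch.is_alphabetic()
}

// Stand-in for `unicode_ident::is_xid_continue(ch)`.
fn is_ident_continue(ch: char) -> bool {
    ch == '_' || ch.is_alphanumeric()
}

/// Scans one identifier starting at byte offset `pos`.
/// Returns its byte span, or `None` if no identifier starts there
/// (including when `pos` is out of range or not on a char boundary).
pub fn scan_identifier(src: &str, pos: usize) -> Option<Span> {
    let rest = src.get(pos..)?;
    let mut chars = rest.char_indices();

    let (_, first) = chars.next()?;
    if !is_ident_start(first) {
        return None;
    }

    // Advance by each character's UTF-8 width, never by a fixed 1.
    let mut end = pos + first.len_utf8();
    for (offset, ch) in chars {
        if !is_ident_continue(ch) {
            break;
        }
        end = pos + offset + ch.len_utf8();
    }
    Some(Span { start: pos, end })
}
```

For the source `let café = 42`, scanning at byte 4 yields the span `4..9`: four characters, five bytes. The identifier text would then be NFC-normalized before keyword lookup and interning.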

The formatter, parser, and type checker should work without changes since they operate on token spans (byte offsets into the source). Codegen emits names as UTF-8 strings into DWARF debug info, which should also work, but needs testing.
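One place where byte offsets and characters can still diverge is anything that turns a span into a human-readable position, such as diagnostics. A possible helper, assuming 1-based lines and columns counted in `char`s (the name `line_col` and that convention are assumptions, not taken from the codebase):

```rust
/// Converts a byte offset into a 1-based (line, column) pair, counting the
/// column in characters rather than bytes.
/// Returns `None` if `offset` is out of range or falls inside a multi-byte char.
pub fn line_col(src: &str, offset: usize) -> Option<(usize, usize)> {
    let before = src.get(..offset)?;

    // Line number: one more than the count of newlines before the offset.
    let line = before.matches('\n').count() + 1;

    // Byte index where the current line starts.
    let line_start = before.rfind('\n').map_or(0, |i| i + 1);

    // Column: characters between the line start and the offset, 1-based.
    let col = before[line_start..].chars().count() + 1;
    Some((line, col))
}
```

In `let café = 42`, the `=` sits at byte offset 10 but is the 10th character on the line, whereas a byte-based column would report 11. Code that only slices the source with spans is unaffected by this.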

## Migration

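No breaking changes. ASCII is a subset of UTF-8, and every ASCII letter and digit is covered by the UAX #31 properties. The one exception is `_`, which is XID_Continue but not XID_Start, so the lexer must keep accepting it as a start character explicitly. With that in place, every existing program stays valid.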
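The "no breaking changes" claim can be checked mechanically by sweeping all 128 ASCII characters and comparing the old rules against the new ones. Below is a sketch of such a regression check. `ascii_regressions` is a hypothetical helper, and it takes the new predicates as parameters so that the real `unicode_ident` functions could be plugged in.

```rust
// The current ASCII-only rules quoted in this issue.
fn old_start(ch: char) -> bool {
    ch.is_ascii_alphabetic() || ch == '_'
}

fn old_continue(ch: char) -> bool {
    ch.is_ascii_alphanumeric() || ch == '_'
}

/// Returns the ASCII characters that the old rules accept but the new rules
/// reject, as `(start_regressions, continue_regressions)`.
/// Both vectors being empty means no existing identifier can break.
pub fn ascii_regressions(
    new_start: impl Fn(char) -> bool,
    new_continue: impl Fn(char) -> bool,
) -> (Vec<char>, Vec<char>) {
    let ascii = || (0u8..=127).map(char::from);

    let start = ascii()
        .filter(|&c| old_start(c) && !new_start(c))
        .collect();
    let cont = ascii()
        .filter(|&c| old_continue(c) && !new_continue(c))
        .collect();
    (start, cont)
}
```

In the lexer's test suite this could be called as `ascii_regressions(|c| c == '_' || unicode_ident::is_xid_start(c), unicode_ident::is_xid_continue)` with the expectation that both vectors come back empty. Dropping the `c == '_'` part should surface `_` in the start list, which is the one ASCII case worth guarding.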

## Labels

- enhancement
- lexer
