perf: investigate VM optimization opportunities#44

Merged
danielsreichenbach merged 1 commit into main from perf/phase5-vm-optimizations
Feb 25, 2026
Conversation

@danielsreichenbach
Member

Summary

  • Add #[inline] annotations to table lookup functions (get, get_int, get_str, get_hash, next)
  • Investigate and document four optimization approaches with measured results

Investigation Results

What was tried

| Optimization | Result | Reason |
| --- | --- | --- |
| `#[inline]` on table lookups | No change | LLVM already inlines within-crate in release builds |
| Cache upvalue GcRefs at frame entry | Slower (`Vec::clone` adds 18M heap allocs for fib(35)) | Borrow checker prevents holding a slice while mutating state |
| `#[inline]` on precall/poscall | Slower | `execute()` code bloat causes instruction cache pressure |
| Eliminate EntryState branch in `arena.get()` | Not possible without `unsafe` | Compiler cannot prove a generation match implies Occupied |

Root cause analysis

perf stat comparison on fib(35):

| Metric | PUC-Rio | rilua | Ratio |
| --- | --- | --- | --- |
| Instructions | 12.5B | 32.6B | 2.6x |
| Branch misses | 6.0M | 12.3M | 2.0x |
| Cache misses | 184K | 389K | 2.1x |
| IPC | 4.53 | 4.19 | 0.92x |

The ~2.2x overhead is dominated by instruction count (2.6x more code executed), not cache misses or branch mispredictions. The extra instructions come from:

  • Bounds-checked arena access on every GC object read (`Vec::get` + generation comparison)
  • `Val` enum discriminant checks on every value operation
  • Result propagation through the `?` operator
  • EntryState matching (Occupied vs Free) in `arena.get()`

These are distributed across every VM instruction and are the structural cost of safe Rust without unsafe.

Test plan

  • All 1325 tests pass
  • PUC-Rio test suite (all.lua) passes
  • Clippy clean
  • A/B benchmarks show no regression

Add #[inline] annotations to table.get(), get_int(), get_str(),
get_hash(), and next(). These are called on every OP_GETTABLE,
OP_SETTABLE, and table iteration step.

Benchmarking shows LLVM already inlines these within the crate in
release builds, so no measurable performance change in same-crate
usage. The annotations ensure inlining for cross-crate consumers.

Investigation of other optimizations (upvalue ref caching, precall
inlining, arena EntryState elimination) found:

- Upvalue ref caching: Vec::clone at frame entry adds 18M heap
  allocations for fib(35), making it slower. Borrow checker prevents
  holding a slice reference while mutating state.
- precall/poscall inlining: increases execute() code size, causing
  instruction cache pressure. Net negative.
- Arena EntryState elimination: requires unsafe to skip the redundant
  Occupied/Free match (compiler cannot prove the invariant that
  generation match implies Occupied).

perf stat shows the ~2.2x overhead is structural: rilua executes 2.6x
more instructions than PUC-Rio for the same work, due to bounds-checked
arena access, Val enum discriminant checks, and Result propagation.
These are distributed across every VM instruction and cannot be
eliminated without unsafe code.
danielsreichenbach merged commit 5191147 into main on Feb 25, 2026
3 checks passed
