perf: investigate VM optimization opportunities#44

Merged
danielsreichenbach merged 1 commit into main from perf/phase5-vm-optimizations
Feb 25, 2026
Conversation

@danielsreichenbach
Member

Summary

  • Add #[inline] annotations to table lookup functions (get, get_int, get_str, get_hash, next)
  • Investigate and document four optimization approaches with measured results

Investigation Results

What was tried

| Optimization | Result | Reason |
| --- | --- | --- |
| `#[inline]` on table lookups | No change | LLVM already inlines within-crate in release builds |
| Cache upvalue GcRefs at frame entry | Slower (`Vec::clone` adds 18M heap allocs for fib(35)) | Borrow checker prevents holding a slice while mutating state |
| `#[inline]` on precall/poscall | Slower | `execute()` code bloat causes instruction cache pressure |
| Eliminate EntryState branch in `arena.get()` | Not possible without `unsafe` | Compiler cannot prove a generation match implies Occupied |

Root cause analysis

perf stat comparison on fib(35):

| Metric | PUC-Rio | rilua | Ratio |
| --- | --- | --- | --- |
| Instructions | 12.5B | 32.6B | 2.6x |
| Branch misses | 6.0M | 12.3M | 2.0x |
| Cache misses | 184K | 389K | 2.1x |
| IPC | 4.53 | 4.19 | 0.92x |

The ~2.2x overhead is dominated by instruction count (2.6x more code executed), not cache misses or branch mispredictions. The extra instructions come from:

  • Bounds-checked arena access on every GC object read (`Vec::get` + generation comparison)
  • `Val` enum discriminant checks on every value operation
  • Result propagation through the `?` operator
  • EntryState matching (Occupied vs Free) in `arena.get()`

These are distributed across every VM instruction and are the structural cost of safe Rust without unsafe.

Test plan

  • All 1325 tests pass
  • PUC-Rio test suite (all.lua) passes
  • Clippy clean
  • A/B benchmarks show no regression

Add #[inline] annotations to table.get(), get_int(), get_str(),
get_hash(), and next(). These are called on every OP_GETTABLE,
OP_SETTABLE, and table iteration step.

Benchmarking shows LLVM already inlines these within the crate in
release builds, so no measurable performance change in same-crate
usage. The annotations ensure inlining for cross-crate consumers.

Investigation of other optimizations (upvalue ref caching, precall
inlining, arena EntryState elimination) found:

- Upvalue ref caching: Vec::clone at frame entry adds 18M heap
  allocations for fib(35), making it slower. Borrow checker prevents
  holding a slice reference while mutating state.
- precall/poscall inlining: increases execute() code size, causing
  instruction cache pressure. Net negative.
- Arena EntryState elimination: requires unsafe to skip the redundant
  Occupied/Free match (compiler cannot prove the invariant that
  generation match implies Occupied).

perf stat shows the ~2.2x overhead is structural: rilua executes 2.6x
more instructions than PUC-Rio for the same work, due to bounds-checked
arena access, Val enum discriminant checks, and Result propagation.
These are distributed across every VM instruction and cannot be
eliminated without unsafe code.
danielsreichenbach merged commit 5191147 into main on Feb 25, 2026
3 checks passed
