Lesson 8: Loop Optimization #560

sampsyo · 2025-08-24T22:55:01Z

sampsyo
Aug 24, 2025
Maintainer

jeffreyqdd · 2025-10-26T18:52:30Z


Source Code: https://github.com/jeffreyqdd/rust_bril/

What I Did

I implemented Loop Invariant Code Motion as a separate compiler pass that runs after Dead Code Elimination (DCE) and GVN (I'll talk about this later). This pass identifies computations inside loops that do not depend on the loop iteration and safely moves them outside the loop to improve runtime performance. Because my code is in SSA form, I can naively move constants into the loop preheader, allowing me to iteratively move instructions that depend on loop-invariant outputs into the preheader.

The LICM implementation runs in four main phases:

Reaching Definitions Analysis: Simplified due to the program’s SSA form, which makes dataflow tracking straightforward.
Backedge Detection and Natural Loop Construction: Identifies loops by finding backedges in the control flow graph and expanding them into their natural loop regions.
Filtering Non-Natural Loops: Ensures only well-formed, reducible loops are processed.
Invariant Instruction Hoisting: Detects loop-invariant instructions and moves them to the loop preheader when it is safe to do so.
Example: In order of left to right, original code, LICM, LICM + DCE & GVN.

Correctness: This pass has been tested on ALL bril benchmarks, and the outputs match those of the original code. This pass is also tested in combination with DCE/LVN.

Global Value Numbering Example
I extended my Local Value Numbering (LVN) implementation to perform Global Value Numbering (GVN) across the entire control flow graph (in my feeble attempt to improve wall-clock execution time). My GVN supports:

constant expression evaluation
copy propagation
commutativity awareness

Example: In order of left to right, original code, DCE & GVN.

Correctness: This pass has been tested on ALL bril benchmarks, and the outputs match those of the original code.
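
For readers who want to see what commutativity awareness and constant expression evaluation can look like in a value-numbering table, here is a minimal Python sketch. It is not the code from the linked Rust repository: the helper names (`canonical_key`, `fold`) and the op sets are hypothetical, and Bril's 64-bit wrapping arithmetic is ignored for brevity.

```python
# Hypothetical value-numbering helpers; names and op sets are illustrative only.
COMMUTATIVE = {"add", "mul", "eq", "and", "or"}

FOLDABLE = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}


def canonical_key(op, arg_numbers):
    """Build a table key from an op and the value numbers of its operands.

    Operands of commutative ops are sorted, so `add a b` and `add b a`
    map to the same key and therefore the same value number.
    """
    if op in COMMUTATIVE:
        arg_numbers = sorted(arg_numbers)
    return (op, *arg_numbers)


def fold(op, const_args):
    """Evaluate `op` at compile time if it is foldable and all operands are ints.

    Returns None when folding is not possible (unknown op, e.g. div, which
    could trap, or a non-constant operand).
    """
    if op in FOLDABLE and all(isinstance(c, int) for c in const_args):
        return FOLDABLE[op](*const_args)
    return None
```

Non-commutative ops such as `sub` keep their operand order, so `sub a b` and `sub b a` stay distinct.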

Trickiest Part

The trickiest part of implementing LICM was converting out of SSA form, which required me to:

Add preheaders
Remap jumps TO the preheader, while ensuring jumps in the loop body are not changed.
Remap phi labels before pushing into id instructions.

To support this, I extended my Basic Block structure with two additional fields:

A preheader instruction array – holds instructions to be inserted before the loop.
A loop backedge flag – marks edges that represent loop backedges

I personally think this is a necessary but jank solution arising from the tight coupling between my compiler’s CFGs, dominance relationships, and label mappings, which effectively make my AbstractFunction structure immutable.

I think my compiler would benefit immensely from one more pass that rebuilds the CFG, which could reduce the label and block overhead introduced by:

Generating fresh labels for unnamed blocks used in φ-nodes,
Emitting preamble code for SSA conversion, and
Inserting loop preheaders during LICM.

Benchmarking

I used hyperfine, a command-line benchmarking tool that aggregates statistics across multiple runs. The benchmarks were performed on an Apple M2 Pro (6P + 4E) with 16 GB RAM, with the window open and fans set to max. I had scripts monitoring CPU Freq and Thermal Throttling to ensure all benchmarks were run fairly. I ran with a multiprocessing queue of 3 workers to guarantee performance core scheduling, and to minimize cache pollution between the benchmarks.

Each Bril file was run under 5 different compilation configurations:

original - unmodified code
ssa - in and out of ssa, no optimizations
loop - only LICM
dce & gvn
all - GVN, DCE, and LICM

The results were unfortunate given the amount of time I spent on task 8.

In the worst case, converting into and out of SSA greatly increased dynamic instruction count and, consequently, runtime. However, some programs benefited immensely, e.g., Delannoy.

In general, it was better to leave the programs untouched, because of the overhead introduced by the SSA round trip.

📊 Average metrics across all benchmarks:
   • original    :  1754632 instructions,   8.65ms ± 0.99ms
   • ssa         :  2159383 instructions,   9.34ms ± 0.91ms
   • loop        :  2082021 instructions,   9.10ms ± 0.68ms
   • lvn & dce   :  2022191 instructions,   8.92ms ± 0.69ms
   • all         :  1951181 instructions,   8.86ms ± 0.89ms

🚀 Speedup relative to SSA baseline:
   • original    : 1.08x
   • loop        : 1.03x
   • lvn & dce   : 1.05x
   • all         : 1.05x

Interestingly enough, running optimizations made the timings more stable: the average spread dropped from ±0.99ms for the original programs to ±0.68ms with LICM alone.

Conclusion?
LICM made the programs faster, but still not enough to counteract the overhead of converting to and from SSA. I think I will need to write a copy-propagation pass and a redundant basic-block elimination pass to achieve actual wall-clock speedup reliably.

Most of the shorter programs could not be benchmarked properly because their timings were wildly unstable. Looking at the CPU workloads, more time was spent in the system calls (timing, running the process, cleaning up) than the actual program itself.

Generative AI

I used Copilot to help me write the plotting function and write the benchmark scaffolding.

Summary

I think I deserve a Michelin star for the time I put into this project to get the optimizations working correctly and performing a comprehensive benchmark.

1 reply

jeffreyqdd Oct 26, 2025

dce_lvn_then_loop.zip

Explore benchmark plots^

magg1egao · 2025-10-28T02:33:38Z


Team Members
Serena Zhang (syz8), Maggie Gao (mg2447), Jacqueline Wen (jw2347)

Source code URL

Pass

For this assignment, we implemented LICM with LLVM. We noticed that LLVM has a lot of pre-built functions, such as llvm::hasLoopInvariantOperands, and we used them to make our lives a little easier. We implemented the LICM optimization as described on the website, using the same skeleton as Task 7.

Working with LLVM for this assignment was pretty frustrating. Namely, we had to juggle dynamically casting between const and non-const instructions (honestly this is pretty terrible C++ coding) as some built in LLVM functions required the parameters to be const, while others did not.

Testing

As a first sanity check, we wrote a simple C++ program and compared the intermediate representations with and without the pass. When we manually compared them, we saw that the optimization was being applied.

To more rigorously test our code, we used Embench. Here are our results:

Comparison Type	aha-mont64	crc32	cubic	edn	huffbench	matmult-int	md5sum	minver	nbody	nettle-aes	nettle-sha256	nsichneu	picojpeg	primecount	qrduino	sglib-combined	slre	st	statemate	tarfind	ud	wikisort
Original Speed	2.45	1.95	191.96	9.23	8.28	8.86	5.79	29.79	500.62	6.29	5.81	8.26	10.47	3.07	6.69	4.33	5.38	115.43	9.62	8.87	9.24	38.85
Optimized Speed	2.45	1.94	190.00	9.57	8.33	8.78	5.36	29.84	513.06	6.11	5.78	8.25	10.54	3.07	7.00	4.57	6.28	113.86	12.81	8.88	9.20	37.85

We also created a graph (with normalized values) to help visualize the results:

It seems that we were able to speed up some of the programs with our LICM optimization. However, there are also many programs where the speed is unaffected or increased by the optimization. Overall, we observed no substantial results with embench. Perhaps more loop optimizations are necessary for obvious speedups in programs.

Hardest Part

The most challenging part of this assignment was trying to navigate the complexity of LLVM and understanding how its internal analyses interacted. While conceptually the LICM algorithm is straightforward, implementing it in LLVM required careful management. Another challenge was setting up and running Embench for testing. The documentation and benchmarking workflow were initially confusing and took a fair amount of time. However, after much trial and error, we were finally able to get Embench working and collect performance data to test our optimization.

Michelin Star

We believe that we deserve a Michelin star for completing the specification of this assignment. Although we didn't see a desirable outcome, we still learned a lot by completing this assignment.

tobiwg · 2025-10-28T04:44:35Z

Team: Adnan Al Armouti, Tobias Weinberg

Loop-Invariant Code Motion (LICM) for Bril

For this assignment, we built a Loop-Invariant Code Motion (LICM) optimization pass for Bril, written completely in Python.
The idea was simple: find computations inside loops that never change between iterations, and move (or hoist) them outside the loop so they only execute once.

What we did

We started by constructing a control-flow graph and computing dominators to detect backedges (u → h) and identify natural loops.
For each loop, we checked which instructions were loop-invariant — meaning they’re pure (no side effects or exceptions) and all their operands come from outside the loop or from other invariant instructions.
Once identified, those invariants were cloned into a preheader block that executes before the loop, and deleted from inside the loop.
We also added a small local DCE pass afterward to clean up unused values.
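
To make the dominator and backedge step concrete, here is a minimal Python sketch. It is an illustration, not the team's code: it assumes the CFG is a dict mapping every block name to its list of successors, and the function names are made up for this example.

```python
def dominators(succs, entry):
    """Iterative dataflow: dom[n] = {n} | intersection of dom[p] over preds p.

    `succs` maps every block name to a list of successor names
    (every block must appear as a key, even with an empty list).
    """
    nodes = set(succs)
    preds = {n: set() for n in nodes}
    for n, ss in succs.items():
        for s in ss:
            preds[s].add(n)

    # Start from "everything dominates everything" and shrink to a fixpoint.
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            pred_doms = [dom[p] for p in preds[n]]
            new = {n} | (set.intersection(*pred_doms) if pred_doms else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def back_edges(succs, entry):
    """An edge u -> h is a backedge when h dominates u."""
    dom = dominators(succs, entry)
    return [(u, h) for u in succs for h in succs[u] if h in dom[u]]
```

On a CFG like `{"entry": ["header"], "header": ["body", "exit"], "body": ["header"], "exit": []}`, the only backedge found is `("body", "header")`.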

What went wrong (and how we fixed it)

At first, things broke in entertaining ways:

We accidentally hoisted phi nodes and branch conditions, which quickly destroyed correctness.
Missing dominance checks caused instructions to move before their definitions.
Hoisting div led to divide-by-zero exceptions in some benchmarks.

We fixed this by tightening our criteria — only hoisting pure arithmetic or logical ops (add, mul, eq, etc.), skipping phi and div, and re-running DCE afterward.
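
A conservative invariance check along these lines might look like the following Python sketch. The allowlist contents and the function name are assumptions for illustration, instructions are plain Bril-JSON-style dicts, and the sketch only decides which instructions are invariant. It does not perform the dominance checks needed before actually moving them.

```python
# Assumed allowlist: pure ops that cannot trap. div and phi are deliberately absent.
SAFE_OPS = {"const", "id", "add", "sub", "mul",
            "eq", "lt", "gt", "le", "ge", "and", "or", "not"}


def loop_invariants(loop_instrs):
    """Return the loop-invariant instructions, dependencies first.

    An instruction qualifies when its op is allowlisted, its destination
    has exactly one definition in the loop, and every operand is either
    defined outside the loop or produced by an already-invariant instruction.
    """
    defs_in_loop = {}
    for instr in loop_instrs:
        if "dest" in instr:
            defs_in_loop.setdefault(instr["dest"], []).append(instr)

    invariant, inv_dests = [], set()
    changed = True
    while changed:  # iterate to convergence
        changed = False
        for instr in loop_instrs:
            dest = instr.get("dest")
            if dest is None or dest in inv_dests:
                continue
            if instr.get("op") not in SAFE_OPS:
                continue
            if len(defs_in_loop[dest]) != 1:
                continue
            args = instr.get("args", [])
            if all(a not in defs_in_loop or a in inv_dests for a in args):
                invariant.append(instr)
                inv_dests.add(dest)
                changed = True
    return invariant
```

Keeping the result in discovery order means the hoisted instructions can be appended to the preheader as-is, since each one appears after the invariants it depends on.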

Results

We evaluated our pass on the full Bril benchmark suite:

Metric	Count / Value
Total Benchmarks	66
Improved	20
Unchanged	32
Regressed	1
Incorrect Results (LICM)	5
Missing Results (LICM)	2
Timeout Results (LICM)	8
Total Absolute Improvement	3350 instructions
Average Improvement	63.21 instructions
Mean Percent Improvement	3.68 %

About one-third of programs sped up, about half stayed the same, and only one regressed; the rest produced incorrect, missing, or timed-out results.

Takeaways

Building LICM from scratch made us appreciate how much infrastructure (CFGs, dominators, SSA reasoning) modern compilers already have.
It was satisfying to see real data showing the benefit of moving just a few instructions.

GenAI disclaimer

We used ChatGPT (GPT-5) to help debug and structure our implementation — especially to clarify dominance logic, SSA rules, and control-flow algorithms.
It occasionally suggested LLVM-specific details that didn’t apply to Bril, so we verified and corrected those manually.

SolidLao · 2025-10-28T18:13:33Z

Team Members

Ning Wang (nw366), Jiale Lao (jl4492)

Source Code

https://github.com/NingWang0123/cs6120/tree/main/assa8

Implementation Summary

We implemented Loop Invariant Code Motion (LICM), a classic loop optimization that hoists loop-invariant computations out of loops to reduce redundant work. The implementation works on Bril programs in SSA form.

Key Components

ssa.py: Converts Bril programs to SSA form using dominance frontiers for phi node placement and variable renaming.
out_of_ssa.py: Converts programs out of SSA form using parallel copy scheduling to handle phi nodes, including cycle detection.
licm.py: Implements the LICM optimization with the following steps:
1. Convert function to SSA form
2. Find natural loops using back-edge detection in the dominator tree
3. Ensure each loop has a unique preheader block
4. Identify loop-invariant instructions (pure operations with all operands defined outside the loop)
5. Hoist instructions that dominate all loop exits to the preheader
6. Convert back out of SSA form
helpers.py: Provides CFG construction, dominance analysis, and utility functions
main.py: Driver program that applies LICM to Bril programs
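
The parallel copy scheduling mentioned for out_of_ssa.py can be sketched roughly as follows. This is a generic textbook-style version in Python, not the code from the linked repository. The function name and the temporary name `__tmp` are hypothetical, and the temporary is assumed not to clash with any program variable.

```python
def sequentialize(copies, temp="__tmp"):
    """Turn parallel copies {dest: src} into an ordered list of (dest, src) moves.

    All copies are meant to happen simultaneously, so a destination may only
    be written once no pending copy still needs to read its old value.
    Cycles (e.g. a swap) are broken by saving one value in `temp`.
    """
    pending = {d: s for d, s in copies.items() if d != s}  # drop self-copies
    moves = []
    while pending:
        # A destination is safe to overwrite if no pending copy reads it.
        ready = [d for d in pending if d not in pending.values()]
        if ready:
            d = ready[0]
            moves.append((d, pending.pop(d)))
        else:
            # Only cycles remain: save one destination's old value in temp
            # and redirect whoever reads it, which makes it safe to write.
            d = next(iter(pending))
            moves.append((temp, d))
            for k in pending:
                if pending[k] == d:
                    pending[k] = temp
    return moves
```

For the swap `{"a": "b", "b": "a"}` this yields `[("__tmp", "a"), ("a", "b"), ("b", "__tmp")]`, which is the usual three-move swap.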

Testing Methodology

We tested the implementation on programs from the Bril benchmark suite (bril/benchmarks/core/) that contain loop-invariant computations suitable for LICM optimization. Our testing approach includes:

Correctness Verification: Compare output of original vs optimized programs to ensure semantic equivalence
Dynamic Instruction Counting: Use brili's profiling mode to measure actual instruction execution counts
Performance Measurement: Calculate speedup as original_instructions / optimized_instructions

Evaluation Results

We successfully optimized 8 benchmark programs that contain loop-invariant computations. All optimizations preserved program correctness while achieving measurable performance improvements.

Benchmark Results

Program	Dynamic Instructions Saved	Speedup	Description
quadratic.bril	76	1.107x	Quadratic formula in loop with constant computations
sum-sq-diff.bril	199	1.070x	Sum of squares computation with invariant operations
sum-to-ten.bril	10	1.075x	Simple counting loop with hoistable arithmetic
loopfact.bril	7	1.064x	Factorial with loop-invariant comparisons
check-primes.bril	364	1.045x	Prime checking with repeated computations
sum-digits.bril	9	1.043x	Digit extraction with invariant operations
pascals-row.bril	4	1.028x	Pascal's triangle row generation
permutation.bril	2	1.016x	Permutation counting

Aggregate Performance:

Total Dynamic Instructions Saved: 671 instructions
Average Speedup: 1.056x (5.6% improvement)
Best Speedup: 1.107x on quadratic.bril (10.7% improvement)
Speedup Range: 1.016x to 1.107x
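
As a quick cross-check of the aggregates, the numbers from the table can be recomputed with a few lines of Python. The values are copied from the table above; the `speedup` helper simply restates the original_instructions / optimized_instructions formula and its name is made up here.

```python
def speedup(original_instructions, optimized_instructions):
    """Speedup as defined in the testing methodology."""
    return original_instructions / optimized_instructions


# (dynamic instructions saved, reported speedup), copied from the table.
results = {
    "quadratic": (76, 1.107),
    "sum-sq-diff": (199, 1.070),
    "sum-to-ten": (10, 1.075),
    "loopfact": (7, 1.064),
    "check-primes": (364, 1.045),
    "sum-digits": (9, 1.043),
    "pascals-row": (4, 1.028),
    "permutation": (2, 1.016),
}

total_saved = sum(saved for saved, _ in results.values())
average_speedup = sum(s for _, s in results.values()) / len(results)
best = max(results, key=lambda name: results[name][1])

print(total_saved, round(average_speedup, 3), best)
```

Note that the average here is an arithmetic mean of the per-program ratios, which matches the reported 1.056x.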

Visualizations Using GenAI

We generated a visualization of the LICM optimization results:

Speedup Comparison

Shows the speedup achieved for each of the 8 successfully optimized programs, with:

Horizontal bar charts comparing speedup ratios
Dynamic instruction reduction for each program
Best performer: quadratic.bril at 1.107x speedup

Challenges

We chose Bril, so there were not many challenges in implementing this code: we were already familiar with Bril.
From reading other discussions about LLVM, it seems that this modern compiler is both complex and elegant.
Next time we should try to implement these functions in LLVM.

Michelin Star

We have correct implementations, comprehensive tests, evaluations, and visualizations. We think we deserve a Michelin star.

Mond45 · 2025-10-29T00:07:41Z

Source code: https://github.com/Mond45/llvm-pass-skeleton/tree/loop

What I Did

I implemented an LLVM pass that performs Loop-Invariant Code Motion (LICM) using the new pass manager.
My pass runs on each loop within a function. The first step involves collecting loop-invariant instructions using the iteration-to-convergence algorithm. This process is simplified by LLVM's helper functions, such as Loop::isLoopInvariant and isSafeToSpeculativelyExecute. The latter was particularly useful for detecting whether an instruction may have side effects.

I also used LLVM's utility functions to create a loop preheader as needed, and moved the collected loop-invariant instructions there once the identification step was complete.

The Hardest Part

The most challenging part was figuring out why my pass wasn't performing any transformations, even on a simple example where an instruction was certainly loop-invariant. I eventually discovered that under -O0, the mem2reg pass isn't executed, meaning variables remain in load / store form rather than being promoted to SSA registers, making it difficult to detect loop-invariant instructions.

To fix this, I included mem2reg to run before my LICM pass. This was also tricky, as it required navigating LLVM's documentation and distinguishing between classes meant for the legacy and the new pass managers. Another issue I faced was that mem2reg still didn't seem to execute even after being explicitly included. I later fixed this by supplying an additional argument to clang, as explained here.

GenAI Usage

I used ChatGPT as a reference for finding relevant LLVM functions. It was mostly helpful, although in some cases it suggested classes intended for the legacy pass manager.

Testing and Benchmarking

I used the PolyBench C benchmark suite to test and evaluate the performance of my LICM optimization pass.
The suite includes several algorithms that heavily use loops such as matrix multiplication and matrix decomposition. I used the suite's reference output generator to verify the correctness of the optimized programs.

For performance evaluation, I wrote a script to run each benchmark five times, collect the median execution time, and compute the speedup (as the ratio between the median times of the optimized and baseline runs). A graph showing the speedup for each benchmark is shown below:

The overall result, computed as the geometric mean of speedups across all benchmarks is 1.1234.
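
For reference, the median-then-geometric-mean aggregation can be sketched in Python as below. This is not the actual benchmarking script. The function names are made up, and the sketch assumes the ratio is taken as baseline over optimized so that values above 1 mean the optimized build is faster.

```python
import math
import statistics


def median_speedup(baseline_times, optimized_times):
    """Speedup from repeated runs: median baseline time over median optimized time.

    Assumption: baseline / optimized, so > 1.0 means the optimized build is faster.
    """
    return statistics.median(baseline_times) / statistics.median(optimized_times)


def geomean(ratios):
    """Geometric mean, the usual way to average ratios across benchmarks."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```

The geometric mean is used because a 2x speedup on one benchmark and a 2x slowdown on another should cancel out to 1.0, which an arithmetic mean would not give.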

Conclusion

I believe I deserve a Michelin star for this work. I successfully implemented an LLVM LICM pass, verified its correctness, and provided a performance evaluation.

arw274 · 2025-10-29T00:41:39Z

Team Members

Nate Young (nty3) and Amanda Wang (arw274).

Source Code

https://github.com/arw274/cs6120_tasks/tree/main/task8

Implementation Summary

We implemented LICM for Bril in Python. Our implementation consisted of two main components: one for detecting natural loops and another for executing LICM. The first part involved finding all backedges, and then for each backedge, we iteratively accumulate predecessors of the tail until we return to the head to form the loop. For executing LICM, we create preheaders wherever necessary, and then identify loop-invariant instructions to move to the corresponding preheader.

Testing

We tested our optimization on all core and float Bril benchmarks and verified that our LICM did not break on any of the benchmarks. To cherry-pick a couple of benchmarks on which LICM actually resulted in a speedup:

benchmark	baseline	licm	speedup
quadratic	785	577	1.36x
primes-between	574100	449983	1.28x
fizz-buzz	3652	2929	1.25x
pascals-row	146	125	1.17x
sum-sq-diff	3038	2640	1.15x
pythagorean_triple	61518	54015	1.14x
lcm	2326	2071	1.12x

The Hardest Part

We thought about edge cases for a while, and didn't fully resolve them here, even though it didn't fail on any benchmarks. In particular, our code for detecting natural loops initially would construct a new loop for each backedge, even if multiple backedges pointed to the same header (say, if there was an if-else statement at the end of a for loop). To resolve this, we just merged all loops that shared the same header. However, it seems like this could be improved if a program contained nested loops that shared the same header (say, if we had several nested do-while loops). In this case, the loop-invariant instructions for the inner and outer loops could be different, and we'd ideally have separate pre-headers. We didn't deal with this because of time constraints though.
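
The loop construction and the merge-by-header workaround described above could be sketched like this. It is an illustrative Python version rather than the code in the linked repository: the CFG is assumed to be given as a dict from block name to predecessor list, and the function names are made up.

```python
def natural_loop(preds, tail, head):
    """Blocks of the natural loop for backedge tail -> head.

    Walk predecessors backwards from the tail; the head is seeded into the
    body up front, so the walk never escapes past it.
    """
    body = {head, tail}
    stack = [tail] if tail != head else []
    while stack:
        node = stack.pop()
        for p in preds.get(node, ()):
            if p not in body:
                body.add(p)
                stack.append(p)
    return body


def loops_by_header(preds, backedges):
    """Merge the loops of all backedges that share a header into one loop."""
    loops = {}
    for tail, head in backedges:
        loops.setdefault(head, set()).update(natural_loop(preds, tail, head))
    return loops
```

With an if-else at the end of a loop body, the two backedges produce `{h, a, b}` and `{h, a, c}`, and merging gives a single loop `{h, a, b, c}` with one preheader. As noted above, this also merges genuinely nested loops that share a header, which is the unresolved case.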

Michelin Star

We believe we deserve a Michelin star for carefully handling/thinking about edge cases and for producing an implementation of LICM that was successful on all core Bril benchmarks.

Nikil-Shyamsunder · 2025-10-29T01:10:57Z

Code: https://github.com/Nikil-Shyamsunder/compilers-cs6120/tree/main/llvm/loopFusion

What I Did

I implemented a simplified Loop Fusion Pass in LLVM that analyzes candidate loops for fusion and merges them when appropriate. The pass uses LoopInfo to identify top-level loops and checks adjacency by verifying whether the first loop’s unique exit block leads directly to the preheader of the next loop. It then uses ScalarEvolution to ensure that both loops have identical trip counts before fusing. If they qualify, the pass clones the body of the second loop into the first, remapping operands and unifying them under a single induction variable. The pass assumes canonicalized loops (via LoopSimplifyPass and LCSSAPass) with clean preheaders, headers, and latches to ensure structural consistency.
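
At the source level, the effect of fusing two adjacent loops with identical trip counts looks roughly like the Python sketch below. It only illustrates the transformation on a toy scale-then-ReLU example. The function names are hypothetical, and the real pass of course operates on LLVM IR rather than Python.

```python
def unfused(xs, w):
    """Two adjacent loops with the same trip count: scale, then ReLU."""
    ys = [0.0] * len(xs)
    for i in range(len(xs)):
        ys[i] = xs[i] * w
    for i in range(len(xs)):
        ys[i] = max(ys[i], 0.0)
    return ys


def fused(xs, w):
    """Same computation after fusion: one loop, one induction variable."""
    ys = [0.0] * len(xs)
    for i in range(len(xs)):
        ys[i] = max(xs[i] * w, 0.0)
    return ys
```

Fusion is legal here because iteration i of the second loop only reads what iteration i of the first loop wrote. A loop that read `ys[i + 1]` in its second half could not be fused this way.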

Testing

I tested the pass using a simple program that performs matrix-matrix multiplication followed by ReLU 100 times inside a loop. After applying the pass, I verified through the .ll output that the loop was successfully removed. I also tested on loops that had different trip counts or that had operations in between the loops to make sure that they did not get fused. For benchmarking, I ran a Python script that executed 1,000 runs (100,000 loop executions total), comparing baseline and optimized builds. The optimized version achieved a 1.07% speedup, which appears to be statistically significant: the 95% confidence intervals of the two means, reported below, do not overlap. I tried to calculate the total dynamic instruction count, but couldn't figure out how. It's still a fairly small improvement, but the number of trials is quite large.

Matrix-Vector Multiply Benchmark
Comparing optimized vs baseline compilation

============================================================
Benchmarking: Baseline (no optimization pass)
============================================================

Statistics:
  Mean:   117686.79 μs
  Median: 117378.50 μs
  StdDev: 3339.69 μs
  Min:    114468 μs
  Max:    153952 μs

============================================================
Benchmarking: Optimized (loop fusion pass)
============================================================

Statistics:
  Mean:   116430.37 μs
  Median: 115711.00 μs
  StdDev: 3683.90 μs
  Min:    114462 μs
  Max:    173330 μs

============================================================
COMPARISON SUMMARY
============================================================

Baseline mean:  117686.79 μs
Optimized mean: 116430.37 μs

Speedup:        1.011x
Improvement:    1.07% faster

95% Confidence Intervals:
  Baseline:  117686.79 ± 207.00 μs
  Optimized: 116430.37 ± 228.33 μs

Result: Difference appears? statistically significant
============================================================
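
The confidence intervals in the summary can be reproduced from the reported means and standard deviations with the short Python sketch below. It assumes n = 1000 samples per build (from the 1,000 runs mentioned above) and a normal approximation with z = 1.96. Both are assumptions about how the script computed them, although they do reproduce the printed values.

```python
import math


def ci95_half_width(stddev, n, z=1.96):
    """Half-width of a normal-approximation 95% confidence interval for a mean."""
    return z * stddev / math.sqrt(n)


n = 1000  # assumed number of timed runs per build
base_mean, base_sd = 117686.79, 3339.69
opt_mean, opt_sd = 116430.37, 3683.90

base_hw = ci95_half_width(base_sd, n)
opt_hw = ci95_half_width(opt_sd, n)

# The intervals overlap if the top of the optimized interval
# reaches the bottom of the baseline interval.
intervals_overlap = opt_mean + opt_hw >= base_mean - base_hw

speedup = base_mean / opt_mean
improvement_pct = 100 * (base_mean - opt_mean) / base_mean

print(f"baseline ± {base_hw:.2f}, optimized ± {opt_hw:.2f}")
print(f"speedup {speedup:.3f}x, {improvement_pct:.2f}% faster, overlap: {intervals_overlap}")
```

Non-overlapping intervals are a conservative sign that the difference is real, though the large Max values in both runs suggest a few outliers that a median-based comparison would handle more robustly.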

Hardest Part

The most challenging aspect was understanding LLVM’s internal structure and pass pipeline. Unlike Bril, LLVM’s ecosystem is complex but provides powerful tools like ScalarEvolution and LoopInfo. Setting up the correct pipeline (using LoopSimplifyPass and LCSSAPass) was essential to make both the analysis and transformation work properly. Documentation and the complexity of the infrastructure made the learning curve steep, but understanding how to integrate different analyses was rewarding.

Michelin Star

I think I deserve a Michelin star for implementing a nontrivial compiler optimization in LLVM. The pass makes some strong assumptions about the form of the program, but it works and I found a way to test and benchmark it effectively. I also made sure to learn and use the other passes and helper constructs in the LLVM ecosystem to make this work.

tf-mac · 2025-10-29T01:54:52Z

Source Code: HERE

Implementation:

I implemented LICM in LLVM. The implementation here was surprisingly trivial once I took advantage of LLVM's internal features (loop finding, preheader finding). I essentially had LLVM find the preheader, walked all the instructions in the loop, and added an instruction to the hoist list only if none of its operands are defined inside the loop (unless those operands are themselves being moved). Once that's done, I iterate through the list and move them all to the preheader.

Testing:

I ran my code on all of the embench benchmarks, and observed a moderate speedup in performance across the board. The changes were not dramatic however, and only a few examples saw significant changes. I suspect this is because while I performed LICM, I did not remove dead loops.

GAI Statement:

I used ChatGPT again for references to methods and types. A good use of this was finding out how loop passes work, and discovering the "move before" function in LLVM, which made this implementation a lot easier. However, it failed pretty badly at embench, completely misunderstanding how to run it (thinking it used make).

Michelin Star:

I believe I deserve a Michelin star for this as I implemented LICM and performed rigorous testing using embench.

YoruCathy · 2025-10-29T02:29:23Z

Implementation code: Link

What I did

I implemented a Loop-Invariant Code Motion (LICM) pass in LLVM that identifies computations inside loops which don’t depend on the loop iteration and moves them to the loop preheader. I used LLVM’s analyses like LoopInfo, DominatorTree, and AliasAnalysis to ensure safety and correctness. I tested the pass on a subset of Embench benchmarks. The results showed small speedups on some arithmetic kernels, which suggests the pass is hoisting code as intended.

Evaluation

I evaluated on a subset of Embench (aha-mont64, crc32, cubic, edn, matmult-int, minver, and nbody). I ran each benchmark both 3 times and 20 times; the results are quite similar, so I'll present the results for 3 runs here. I recorded the run time in seconds for the baseline (w/o LICM) and the implementation (w/ LICM), and calculated the median and the speedup.

The results are as follows:

=== Summary (Median-based) ===
     benchmark  baseline_median_s  licm_median_s  speedup_median
0   aha-mont64               6.69           6.49           1.031
1        crc32               5.52           5.46           1.011
2        cubic               0.02           0.02           1.000
3          edn               6.90           7.19           0.960
4  matmult-int               2.93           3.03           0.967
5       minver               0.81           0.83           0.976
6        nbody               0.07           0.08           0.875

I plotted them out for better visualization:

The results show that my LICM pass achieves modest gains of about 1–3% on aha-mont64 and crc32, which suggests that the pass successfully hoisted loop-invariant computations there. The other benchmarks are flat or slightly slower (ratios between 0.875 and 1.000). The improvements may be limited because LLVM’s default optimization pipeline already performs similar transformations, leaving little redundant work for my pass to eliminate. The slowdowns may be due to measurement noise (cubic and nbody run for under 0.1 s, so a 0.01 s difference dominates the ratio) or additional analysis overhead. It could also be that these benchmarks are simple enough that there is not much room to optimize.

The hardest part

The implementation is rather trivial, as other people in this thread mentioned. I found it quite tricky to run the Embench benchmarks. I used a lot of help from GPT to figure it out.

Gen AI usage

I used ChatGPT to help me do this homework. I used it to:

Write a script to run the pass on Embench
Make the plots
Write a readme for my code
It did a really good job with Embench (quite different from what @tf-mac mentioned) if you give it enough context, such as what is in the code, what the error logs say, and any other information that you think might be helpful.

zc579 · 2025-10-29T02:43:44Z

What I did
I implemented an Induction Variable Elimination pass in LLVM.
The task itself was not particularly difficult, but I still spent a significant amount of time setting up and debugging the environment.
In the implementation, I first check the loop structure to ensure it is a natural loop. Then I identify variables that fit the form K = A × i + B, and rewrite them as:
base = A * Start + B
step = A * C
r = phi [base, preheader], [r + step, latch]
Finally, I replace all the old instructions with the new recurrence variable r.
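
To see why the rewrite is valid, here is a small Python sketch that compares the original form with the recurrence. It only models the arithmetic, not the LLVM pass itself. The function names are made up, and C stands for the constant step of the loop counter i, which is assumed to start at Start.

```python
def direct(A, B, start, C, trips):
    """Original form: recompute K = A * i + B on every iteration."""
    values = []
    i = start
    for _ in range(trips):
        values.append(A * i + B)
        i += C
    return values


def strength_reduced(A, B, start, C, trips):
    """Rewritten form: r starts at base and advances by step each iteration."""
    base = A * start + B   # computed once, in the preheader
    step = A * C           # also computed once, in the preheader
    values = []
    r = base               # phi [base, preheader], [r + step, latch]
    for _ in range(trips):
        values.append(r)
        r += step
    return values
```

Both functions produce the same sequence, but the second one replaces a multiply and an add per iteration with a single add.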
Hardest part
The main difficulty lies in the fact that a loop can contain multiple variables of the form A × i + B, but there is only one induction variable. Therefore, it is necessary to rely on ScalarEvolution to determine whether a variable has the canonical add-recurrence form {Start,+,C} and whether its step C is a constant.
Additionally, when inserting new instructions, they must appear before all their uses and respect the dominance relationship.
To ensure correctness, I insert all base and step computations in the preheader using:
Instruction *PHInsert = Preheader->getTerminator();
This guarantees that all inserted instructions dominate their uses and that the generated IR is valid.
Result

test.ll

out.ll

SyphonArch · 2025-10-29T03:24:37Z

Team & Code

Jake Hyun | Codebase

What I Did

I implemented Loop-Invariant Code Motion (LICM) for Bril and built a harness to evaluate it across benchmark suites. The pass identifies natural loops via back-edges (tail -> head where head dominates tail), computes a dependency-closed set of invariant instructions, creates a fresh preheader for each loop, and hoists safe, pure, single-def instructions out of the loop. Safety checks include purity (no effects, conservative op allowlist), unique def of destinations within the loop, and dominance criteria so the hoisted values are available at all loop entries.

I also added an optional SSA mode: convert to SSA, run LICM, then convert back. The harness can additionally run the raw SSA program as a baseline. This allows comparing "After vs Before" (original program) and "After vs SSA" (benefit of LICM beyond SSA form alone).

Testing and Results

Correctness: The harness runs original and optimized programs with identical #ARGS: inputs, compares outputs, and collects static and dynamic instruction counts (brili -p). On the full benchmark set (122 programs), every case passed equivalence checks.

Target programs: 122
Successful optimizations: 122/122

Geometric-mean ratios (we use GM because these are relative speedup/slowdown ratios; it aggregates multiplicative effects across programs). Lower is better when After/Before or After/SSA are reported:

Without SSA wrapping (After/Before):
- Static: 1.017x (slight increase in static count)
- Dynamic: 0.992x (slight decrease in executed instructions)
With SSA wrapping:
- After/Before — Static: 1.149x, Dynamic: 1.130x (SSA structure raises counts vs the original non-SSA)
- After/SSA — Static: 0.965x, Dynamic: 0.894x (LICM removes redundancy beyond SSA alone)

Observations

Behavior is preserved across the suite (122/122 Good!).
Without SSA, we see a small static overhead (ratio > 1) and near break-even dynamics (ratio ~ 1), with slight dynamic reductions on average.
With SSA, LICM outperforms the raw SSA baseline on average (After/SSA < 1.0), indicating it removes redundancy introduced by SSA form.
From the SSA plot, programs with larger SSA dynamic overhead tend to benefit more from LICM (lower After/SSA).

Example run summaries:

# Core suite only, no SSA wrapper
python test_licm.py benchmarks/core/ --out results_licm_core.json --plot --png licm_core.png
Matplotlib is building the font cache; this may take a moment.
Target programs: 66
100%|███████████████████████████████████████████████████████| 66/66 [00:20<00:00,  3.22it/s]
Successful optimizations: 63/66
Static Instr Ratio (GM): 1.004x
Dynamic Instr Ratio (GM): 0.987x
Wrote results to results_licm_core.json
Saved plot to licm_core.png

# All suites, with SSA wrapper and SSA baseline comparison
python test_licm.py benchmarks/core benchmarks/float benchmarks/long benchmarks/mem benchmarks/mixed --plot --png ssa_licm_all.png --ssa --out results_ssa_licm.json
Target programs: 122
100%|███████████████████████████████████████████████| 122/122 [01:51<00:00,  1.10it/s]
Successful optimizations: 122/122
Static Instr Ratio (GM): 1.149x
Dynamic Instr Ratio (GM): 1.130x
Static vs SSA (GM): 0.965x
Dynamic vs SSA (GM): 0.894x
Wrote results to results_ssa_licm.json
Saved plot to ssa_licm_all.png

Plots:

Hardest Part

Getting loops right: self-loops kept "grabbing" the entry/exit blocks (spotted in squares.bril). Limiting the loop to nodes dominated by the header (then closing over predecessors) made the sets sane.
Preheaders in the wild: reusing a predecessor picked an exit or made the header unreachable (orders.bril, legendre.bril). We now always synthesize a fresh preheader, retarget only external preds, and skip hoisting if there aren’t any.
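
The "always synthesize a fresh preheader" rule can be sketched in Python as follows. This is an illustration on a successor-map CFG, not the code from the linked codebase. The function name and the `.preheader` naming scheme are assumptions.

```python
def add_preheader(succs, header, loop_body, name=None):
    """Insert a fresh preheader block in front of `header`.

    Only edges coming from outside the loop are retargeted; backedges from
    inside `loop_body` keep pointing at the header. Returns (new_succs,
    preheader_name), or (succs, None) when the header has no external
    predecessors, in which case hoisting should be skipped.
    """
    pre = name or f"{header}.preheader"
    if pre in succs:
        raise ValueError(f"block name {pre!r} already exists")

    external = [p for p, ss in succs.items()
                if header in ss and p not in loop_body]
    if not external:
        return succs, None

    new_succs = {
        block: [pre if (s == header and block not in loop_body) else s
                for s in ss]
        for block, ss in succs.items()
    }
    new_succs[pre] = [header]
    return new_succs, pre
```

Returning `None` when there are no external predecessors covers the case where the loop header is the function's first block, so there is no edge to redirect.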

Generative AI

I used ChatGPT to port the SSA harness from lesson 6 to this LICM lesson, adapting it for the new optimization while maintaining similar structure and functionality. This accelerated iteration and helped spot subtle correctness issues.

Michelin Star

Runs across all 122 benchmarks, with and without SSA, and validates end-to-end (equivalence, JSON stats, plots). It's a compact, disciplined LICM pass that does real work across the suite, and I claim a Michelin star for completeness.

0 replies

maheshejs · 2025-10-29T03:49:24Z

maheshejs
Oct 29, 2025

Loop-Invariant Code Motion in SSA Land!
Code: https://github.com/maheshejs/cs6120pr

LICM

I picked Bril and implemented the loop-invariant code motion optimization in Racket. I added three passes to my compiler pipeline: finding natural loops, loop-invariant analysis, and hoisting invariants. Loop-invariant analysis is done in SSA form, which simplifies the analysis. Hoisting invariants also happens in SSA form, before hoisting phis.

Finding Natural Loops

I implemented natural loop detection following the SCC definition of a natural loop: a natural loop is a strongly connected component (SCC) of the CFG, with a single-entry point called the header which dominates all nodes in the SCC. First, I compute dominators for every node in the CFG. Then, I iteratively identify back edges, edges where the target dominates the source. For each strongly connected component (SCC), computed with Racket's Graph Library, I compute the unique header node as the intersection of the SCC with the dominator sets. If a header exists, I filter back edges ending at that header and pick one whose source’s dominator set fully contains all sources of such edges; this is done to handle nested loops. This back edge is temporarily removed to isolate the loop. The loop body is identified as the SCC nodes, and exits are computed as nodes in the SCC whose successors lie outside the SCC. The process recurses on a copy of the CFG with the back edge removed, building a hash mapping fresh loop identifiers to (header, body, exits) triples until all natural loops are discovered.
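
For comparison, here is a minimal Python sketch of the textbook back-edge construction of natural loops. It is not the SCC-based recursion described above, and it is not the Racket code. It assumes a CFG given as a dict of successor lists with every block reachable from the entry; all names are hypothetical.

```python
def dominators(succs, entry):
    """Iterative dataflow: dom[b] = {b} | intersection of dom[p] over preds p."""
    blocks = list(succs)
    preds = {b: [] for b in blocks}
    for b, ss in succs.items():
        for s in ss:
            preds[s].append(b)
    dom = {b: set(blocks) for b in blocks}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            if b == entry:
                continue
            incoming = [dom[p] for p in preds[b]]
            new = {b} | (set.intersection(*incoming) if incoming else set())
            if new != dom[b]:
                dom[b], changed = new, True
    return dom, preds


def natural_loops(succs, entry):
    """Return {(tail, header): body} for every back edge tail -> header."""
    dom, preds = dominators(succs, entry)
    loops = {}
    for tail, ss in succs.items():
        for header in ss:
            if header not in dom[tail]:
                continue  # target does not dominate source: not a back edge
            # Walk predecessors backwards from the tail, stopping at the header.
            body, stack = {header}, [tail]
            while stack:
                n = stack.pop()
                if n not in body:
                    body.add(n)
                    stack.extend(preds[n])
            loops[(tail, header)] = body
    return loops


# Hypothetical CFG: an outer loop containing a self-loop on "inner".
cfg = {"entry": ["outer"], "outer": ["inner", "exit"],
       "inner": ["inner", "latch"], "latch": ["outer"], "exit": []}
loops = natural_loops(cfg, "entry")
```

On this CFG the self-loop yields the body `{"inner"}` and the outer back edge yields `{"outer", "inner", "latch"}`, so the nested loop shows up as its own entry.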

Loop-Invariant Analysis
I implemented loop-invariant analysis following Steve Chong’s lecture on loop optimizations. The analysis iterates over each loop and each assignment x := v1 op v2 to determine whether it is invariant. An assignment is invariant if, for each operand v1 and v2, one of the following holds: it is a constant, all definitions reaching it are outside the loop, or its single reaching definition is itself loop-invariant.

The procedure works in three steps. First, I identify all const operations in Bril and mark them invariant for their loops. Second, I check whether the reaching definition of each operand lies outside the loop. In SSA form, this is simplified because each operand has only one definition. So, I extended SSA’s rename-variables to include basic block IDs, which allows me to quickly check whether a definition belongs to the current loop. Third, I iteratively mark assignments as invariant if each operand’s definition lies outside the loop or is already marked invariant, repeating this process until no further changes occur. This guarantees that all loop invariants are identified, preparing for code motion or other loop optimizations.
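
The three steps can be sketched as a small fixed-point loop. This is an illustration in Python rather than the Racket implementation. It assumes Bril-style instruction dicts and a precomputed set of names defined inside the loop. The `PURE` whitelist is my own conservative assumption (it leaves out `div`, which can fail at run time).

```python
def find_invariants(loop_instrs, defined_in_loop):
    """Fixed-point marking of loop-invariant SSA instructions.

    loop_instrs: list of Bril-style dicts for the instructions of one loop.
    defined_in_loop: set of SSA names whose single definition is in the loop.
    Returns the set of destination names judged invariant.
    """
    # Assumed whitelist of side-effect-free ops; anything else is skipped.
    PURE = {"const", "id", "add", "sub", "mul", "and", "or", "not",
            "eq", "lt", "gt", "le", "ge"}
    invariant = set()
    changed = True
    while changed:
        changed = False
        for ins in loop_instrs:
            dest = ins.get("dest")
            if dest is None or dest in invariant or ins.get("op") not in PURE:
                continue
            # Each operand is defined outside the loop or already invariant.
            # A const has no args, so it qualifies immediately.
            if all(a not in defined_in_loop or a in invariant
                   for a in ins.get("args", [])):
                invariant.add(dest)
                changed = True
    return invariant


# Hypothetical loop body; i1 stands for a phi result defined in the header.
loop = [
    {"op": "add", "dest": "k", "type": "int", "args": ["n", "one"]},
    {"op": "const", "dest": "one", "type": "int", "value": 1},
    {"op": "add", "dest": "i2", "type": "int", "args": ["i1", "one"]},
    {"op": "mul", "dest": "t", "type": "int", "args": ["k", "i2"]},
    {"op": "print", "args": ["t"]},
]
result = find_invariants(loop, {"one", "k", "i1", "i2", "t"})
```

Here `k` is only discovered on the second sweep, after `one` has been marked, which is why the loop repeats until nothing changes.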

Hoisting Invariants
For safe hoisting of loop-invariant instructions, three conditions must hold:

1. The definition dominates all of its uses.
2. No other definitions of the same variable exist in the loop.
3. The instruction dominates all loop exits.

In SSA form, conditions (1) and (2) are automatically satisfied, so only (3) must be explicitly verified. To do this, for a given assignment, I extract the basic block ID from the destination (as in the loop-invariant analysis) and check that this block dominates all loop exits computed during natural loop identification. Once verified, the remaining safe invariants are hoisted to the loop preheader, using the same mechanism employed for hoisting phi nodes.
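
The exit-dominance check is small enough to show in full. The sketch below assumes dominator sets are available as a dict from block to the set of blocks that dominate it; the function name and the example CFG are hypothetical.

```python
def safe_to_hoist(def_block, loop_exits, dom):
    """True if the defining block dominates every exiting block of the loop.

    dom: dict mapping block -> set of blocks that dominate it.
    loop_exits: blocks inside the loop that have a successor outside it.
    """
    return all(def_block in dom[e] for e in loop_exits)


# Hypothetical while-style loop: the header "hdr" is the only exiting block.
dom = {
    "entry": {"entry"},
    "hdr": {"entry", "hdr"},
    "body": {"entry", "hdr", "body"},
    "exit": {"entry", "hdr", "exit"},
}
header_ok = safe_to_hoist("hdr", ["hdr"], dom)
body_ok = safe_to_hoist("body", ["hdr"], dom)
```

In this shape of loop, an instruction in the header passes the check, while one in the body does not, because the loop can exit without ever running the body.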

Testing
I tested LICM using:

turnt with Bril examples and handwritten cases,
brili to interpret transformed programs and confirm correctness, and
brench to run SSA roundtrip on all Bril core benchmarks, checking correctness and dynamic instruction counts.

Results
Compared to my task 6’s SSA baseline:

Overall, LICM improves 1/4 of the benchmarks. In gebmm, the dynamic instruction count worsens by 1 instruction. This could be due to nested loops in the benchmark, which might require duplicate hoisting.

Hardest Part
For this task, the main challenge was reasoning about the loop optimization in SSA form, which was nontrivial. However, this made me appreciate the benefits of SSA form more, since it really simplifies some analyses.

Conclusion
I believe I deserve a Michelin star for successfully implementing a loop optimization, loop-invariant code motion (LICM), fully integrated with all my previous tasks (SSA, dominance utilities, global LVN+DCE).

0 replies

az275 · 2025-10-31T01:22:11Z

az275
Oct 31, 2025

Code

I implemented a LICM pass in LLVM. I made use of several of LLVM's utility functions, e.g. isSafeToSpeculativelyExecute and hasLoopInvariantOperands, which helped to simplify the implementation.

There were several challenging aspects of this task: specifically, working with LLVM, and testing/debugging. I spent a fair amount of time trying to figure out why my LICM pass was not doing anything; this turned out to be because a mem2reg pass needed to be run before the LICM pass in order for loop-invariant instructions to be identified. It's also necessary to compile with the -O1 flag or higher.

I tested on all programs in the PolyBench benchmark suite. (1) I validated correctness, meaning that my LICM pass does not change the result of any of the computations, and (2) I ran each program with and without the LICM pass 5x, and computed the mean execution times and percentage speedup of the LICM optimization over the baseline. See the script for running the benchmarks, and the results. I sort of reinvented the wheel here; the benchmark suite has some utilities for running benchmarks, but it was fairly trivial with some help from ChatGPT to write a new script to compile and run everything with the desired flags.

My LICM pass does not seem to generate much speedup on these benchmarks in practice. It does indeed move some instructions out of the loops, but I suspect that the logic may be too conservative. Another factor is variance in measurements across runs. However, I haven't had time to investigate this further.

GenAI: I used ChatGPT to help write the script to run the PolyBench benchmarks. I gave it an example of the correct compiler commands to use for one benchmark, and asked it to run each baseline and LICM optimization 5 times and compute the mean runtimes and percentage speedups. It did a passable job; I'm sure we can further decrease verbosity/improve extensibility, but it works just fine.

Star? I think so! The results don't show consistent or significant speedups, but I learned more about LLVM and did some pretty complete testing/benchmarking, so I think it was a good effort.

0 replies

itingtsai · 2025-10-31T02:26:05Z

itingtsai
Oct 31, 2025

Source Code

Implementation

I implemented the recommended LICM pass, which follows the standard rule: an instruction is loop-invariant if all the values it depends on come from outside the loop or from other already-invariant instructions. The pass works in four steps:

1. It first calls simplifyLoop() to make sure each loop has a proper preheader and exit blocks before running the optimization.
2. It repeatedly scans the loop, marking any instruction as invariant if it is safe to remove, has no side effects, is not a PHI node or terminator, and all its operands are either constants, defined outside the loop, or already marked invariant.
3. Before moving anything, it verifies that the instruction dominates all its uses, is not used by PHI nodes, and that its basic block dominates all loop exits.
4. Safe invariant instructions are then moved to the loop preheader.

The pass is conservative. It skips anything that might have side effects or isn’t safe to remove, so it doesn’t hoist loads or other risky operations. I tested it on three of the Embench benchmarks and confirmed that it detects and moves invariant instructions in normalized loops.

Testing

The LICM pass was tested using three benchmarks from Embench: cubic, matmult-int, and nbody. The testing pipeline followed three steps. First, each benchmark’s source code was compiled to LLVM Intermediate Representation (IR). Second, the custom LICM pass, implemented as skeleton-pass inside SkeletonPass.dylib, was applied to the generated IR using the LLVM opt tool. Finally, the original and transformed IR files were compared using diff to quantify the modifications. The pass successfully transformed all three benchmarks:

1. cubic: from 275 to 281 lines, with 42 lines changed and 3 loop exit blocks added.
2. matmult-int: from 297 to 303 lines, with 32 lines changed and 3 loop exit blocks added.
3. nbody: from 403 to 440 lines, with 117 lines changed and 23 loop exit blocks added.

0 replies

CynyuS · 2025-10-31T04:20:34Z

CynyuS
Oct 31, 2025

open source: https://github.com/CynyuS/loop_opt

Cynthia Shao and Jonathan Brown worked together on this task.

Summary

We implemented LICM in LLVM. Furthermore, we created .sh scripts to automate running the benchmarks for the Embench Suite. Overall, this task was very fun and playing around with LLVM and benchmarking libraries was a great learning experience!

LICM

The first step to implementing LICM was to identify when code was loop-invariant. I first took all the loop-invariant instructions and stored them in a vector, then removed them and inserted them before the preheader terminator. Then I quickly realized that my loop-invariant implementation… didn’t work. I realized I had to skip any instr that is a terminator, for which mayHaveSideEffects is true, or that is a store/load. After this, we had to make sure that the instr wasn’t used within the loop.

Testing and Benchmarking

For testing, we used the llvm-test-suite to test our pass's correctness on 2463 designs. This test suite also conveniently functions as a benchmark, but we decided to use Embench instead!

For benchmarking, we used the Embench suite. This was a simple way to benchmark our pass against a baseline of clang at -O0 with no optimization passes. We had to write a new script that uses Python's time module to capture the runtime of these programs, as the original benchmark times in milliseconds were captured on a much slower processor. The good news is that the LICM pass resulted in an overall speedup across a variety of programs, and no slowdowns!!

BENCHMARK COMPARISON: Baseline (-O0) vs With LICM Pass (-O0 + LICM skeleton pass)


Benchmark            Baseline (ms)   With Pass (ms)  Speedup    % Change
--------------------------------------------------------------------------------
aha-mont64           2.00            2.00            1.000        +0.00%
crc32                3.00            2.00            1.500       -33.33%
cubic                2.00            2.00            1.000        +0.00%
edn                  3.00            3.00            1.000        +0.00%
huffbench            2.00            2.00            1.000        +0.00%
matmult-int          3.00            3.00            1.000        +0.00%
md5sum               2.00            2.00            1.000        +0.00%
minver               2.00            2.00            1.000        +0.00%
nbody                2.00            1.00            2.000       -50.00%
nettle-aes           2.00            2.00            1.000        +0.00%
nettle-sha256        2.00            2.00            1.000        +0.00%
nsichneu             2.00            2.00            1.000        +0.00%
picojpeg             4.00            4.00            1.000        +0.00%
primecount           4.00            4.00            1.000        +0.00%
qrduino              4.00            4.00            1.000        +0.00%
sglib-combined       3.00            3.00            1.000        +0.00%
slre                 3.00            3.00            1.000        +0.00%
st                   2.00            2.00            1.000        +0.00%
statemate            2.00            2.00            1.000        +0.00%
tarfind              2.00            2.00            1.000        +0.00%
ud                   2.00            2.00            1.000        +0.00%
wikisort             2.00            2.00            1.000        +0.00%
--------------------------------------------------------------------------------

Geometric Mean Speedup: 1.051x
Arithmetic Mean Speedup: 1.068x

Pass IMPROVED performance by 5.12%
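
The summary figures can be reproduced from the Speedup column of the table. The snippet below is a quick cross-check written for this post, not the team's benchmarking script.

```python
import math

# Speedup column copied from the table above: 20 benchmarks at 1.000,
# crc32 at 1.500, and nbody at 2.000.
speedups = [1.0] * 20 + [1.5, 2.0]

# Geometric mean via the mean of logs; arithmetic mean for comparison.
geo_mean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
arith_mean = sum(speedups) / len(speedups)

print(f"Geometric mean speedup: {geo_mean:.3f}x")    # 1.051x
print(f"Arithmetic mean speedup: {arith_mean:.3f}x")  # 1.068x
print(f"Improvement: {(geo_mean - 1) * 100:.2f}%")    # 5.12%
```

Because 20 of the 22 ratios are exactly 1.000, the geometric mean is just the 22nd root of 1.5 × 2.0 = 3.0, so the whole aggregate comes from crc32 and nbody.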

Hardest Part

We struggled while researching LoopPass and LLVM to find the functions necessary to complete this assignment. Something that was helpful was using the autocomplete that would list all possible methods / fields of a given variable. Reading through this as well as researching was significantly helpful in discovering the methods and types to implement our LICM pass. We also struggled to set up and find appropriate benchmarking methods and correctness verification suites, and we were wondering what methods were used to formally verify LLVM passes.

Self Assessment

We believe we deserve a Michelin star because we are proud of our work! We brainstormed an implementation of LICM within LLVM and tested it thoroughly, using the LLVM lit test suite for correctness and the Embench suite for performance. We also practiced good team coding practices with git, and learned a lot about LLVM and the benchmarking process, ensuring correctness across all programs! We believe we deserve a Michelin star for the tenacity and persistence put into the task.

0 replies

pedropontesgarcia · 2025-10-31T16:45:51Z

pedropontesgarcia
Oct 31, 2025

Loop-invariant code motion

By Helen and Pedro, at our Codeberg repo

We implemented a loop-invariant code motion (LICM) pass that operates on Bril programs in SSA.

Evaluation

The pass was tested in isolation by comparing SSA+LICM to an SSA baseline. We did not want to rely on the performance of our out-of-SSA pass as it would likely produce noisier results. We evaluated using the core Bril benchmarks, and the pass produced correct output in all cases. Figure 1 displays the benchmarking results as percentage speedup of SSA+LICM over SSA.

Figure 1: Percentage speedup on benchmarks.

We observed that nearly half of the benchmarks showed speedups, a few of them above the 10% threshold. In particular, we found that loop-heavy benchmarks had relatively large performance gains. For example, quadratic.bril has five loop-invariant instructions in the main loop of its square root function that are identified and hoisted to the preheader by the LICM pass, resulting in a speedup of over 10%.

We initially also found that some benchmarks showed slowdowns. The implementation of the pass (see below) depends on several passes that can add overhead, and further work was needed to undo the simplifications that the LICM pass depends on. We found that, in particular, loop canonicalization sometimes added empty blocks that ended up unused, as well as unnecessary jump instructions. We were able to revert these additions after the LICM pass by implementing an empty block removal pass and a redundant jump/return removal pass. With these passes added, none of the benchmarks showed slowdowns.

Implementation

The Bril development suite is undergoing a major redesign that created significant task dependencies and delayed testing of loop-related code. The redesign involved adapting the entire code base to new interfaces and was very challenging. Nonetheless, we believe that managing the technical debt now instead of postponing it further was the correct decision to move forward with the project.

Workflow aside, we began our loop journey by ensuring that blocks had a canonical form, beginning with a label and ending with a terminator instruction. This was desirable because the loop canonicalization pass needed to insert blocks, and repointing predecessors and successors was much easier in this canonical block form. We also built an interface to insert empty blocks given a set of predecessors and a successor.

Afterwards, we began work on the analysis pass to detect natural loops. We used dominance analysis to identify backedges and construct the loop body for a given header block, and also stored information about preheader candidates, exiting blocks, latches, and exit blocks. We used the LLVM loop terminology guide to identify edge cases.

The next step was canonicalizing loops. We built a pass to ensure that every natural loop had:

a unique header, not shared by any other loops;
a unique preheader with the header as its only successor;
a unique latch, with a backedge to the header;
dedicated exit blocks with no predecessors outside the loop.

We modeled these conditions after the LLVM LoopSimplify pass. The pass was designed to be as non-conservative as possible; for example, if unique preheaders, latches, or exits already exist, the pass does not create new ones.

Finally, we got to implementing LICM. Since the code was in SSA form, some of the conditions became trivial; for instance, the reaching definitions analysis is redundant since there is only one definition per variable. We identified loop-invariant instructions by iterating to convergence, excluding instructions with side effects and PHI nodes. We were relatively conservative in this assessment, which meant that thanks to the SSA guarantees, we were always able to move an instruction marked as loop-invariant to the preheader. Lastly, we designed the pass itself, collecting loops, canonicalizing them, and conducting LICM on each of them.

Afterwards, we worked to undo some of the loop simplification as we found it added performance overhead. As discussed above, we implemented an empty block removal pass and a redundant jump removal pass.
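
As an illustration of the cleanup idea, here is a minimal Python sketch of redundant jump removal over a Bril function's instruction list in its JSON form. It is not the pass from the repo; it assumes that falling through to the next label is acceptable after cleanup, and the function name is hypothetical.

```python
def remove_redundant_jumps(instrs):
    """Drop every `jmp` whose target is the label on the very next line.

    instrs: a Bril function's "instrs" list (dicts with "op" or "label").
    Falling through to the next label has the same effect as the jump.
    """
    out = []
    for i, ins in enumerate(instrs):
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        if (ins.get("op") == "jmp" and nxt is not None
                and nxt.get("label") == ins["labels"][0]):
            continue  # redundant: control would reach the label anyway
        out.append(ins)
    return out


# Hypothetical function body as left behind by loop canonicalization.
instrs = [
    {"label": "preheader"},
    {"op": "const", "dest": "c", "type": "int", "value": 4},
    {"op": "jmp", "labels": ["header"]},
    {"label": "header"},
    {"op": "print", "args": ["c"]},
    {"op": "jmp", "labels": ["header"]},
]
cleaned = remove_redundant_jumps(instrs)
```

Only the jump from `preheader` into `header` is dropped; the back edge at the end stays because nothing follows it. Each removed jump saves one dynamic instruction per execution of that block, which is the kind of overhead described above.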

Challenges and star

Both the redesign and the natural loop detection and canonicalization were very major challenges. We initially considered simply implementing LICM in LLVM, which already has those passes as well as helper functions to determine whether an instruction is loop-invariant or safe to hoist. But ultimately, we felt that a large part of our learning about loops would have been greatly diminished had we used LLVM. We are also consistent in our choice of not using generative AI for this project. We worked hard on this and are proud of our evolving code base, which throughout this redesign has become very significantly more flexible and usable. We find our loop detection and canonicalization well-documented, elegant, modular, and simple in most ways. And it is also performant, demonstrating significant speedups in about half of the benchmarks in the suite. In summary, despite the two-day delay on delivery, we are happy with our work and believe we deserve a star!

0 replies
