Replies: 24 comments 30 replies
-
My Pass "Testing" Thoughts
-
        
Pass

This week was a bit simpler because of exams, but I wrote a quick pass that counts the number of instructions in each basic block and calls a runtime function with the index of the block and its instruction count, to be displayed. It inserts the function call at the top of each basic block with the corresponding arguments, and I verified that the function calls are exactly what we would expect in the modified LLVM code dump. I also executed these manually for testing.

Conclusions

I spent a bunch of time trying to figure out how we could best generate temporary files in
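As a sketch of what the runtime side of such a pass might look like (the hook name `bb_report` and its signature are my invention, not the author's), the inserted calls could target a function like this:

```c
#include <stdio.h>

/* Hypothetical runtime hook: the pass would insert a call to this at the
 * top of each basic block, passing the block's index and its statically
 * counted number of instructions. */
static long total_instrs = 0;

void bb_report(int block_index, int n_instrs) {
    total_instrs += n_instrs;
    printf("block %d: %d instructions\n", block_index, n_instrs);
}

/* Running total reported so far (useful for a final summary). */
long bb_total(void) { return total_instrs; }
```

Linking a file like this into the instrumented program is all the runtime side needs; the IRBuilder call the pass emits just has to match the signature.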
  
-
(in collaboration with @lisarli and @dhan0779) Code can be found here. We wrote our LLVM pass to target a specific program: a serial implementation of a toy particle simulation where the particles repel each other. Compiling this program to LLVM IR with the -O3 flag, we saw this instruction sequence:

```llvm
  %6 = fmul <2 x double> %5, %5
  %7 = extractelement <2 x double> %6, i64 1
  %8 = extractelement <2 x double> %5, i64 0
  %9 = tail call double @llvm.fmuladd.f64(double %8, double %8, double %7)
```

which corresponds to this C++ code:

```cpp
double dx = neighbor->x - particle->x;
double dy = neighbor->y - particle->y;
double r2 = dx * dx + dy * dy;
```

and compiles down to this ARM assembly:

```asm
	fsub.2d	v0, v0, v1
	fmul.2d	v1, v0, v0
	mov	d1, v1[1]
	fmadd	d1, d0, d0, d1
```

We thought this was a bit odd, since we're essentially redoing a multiplication operation. In theory, it seems like this should be equivalent to the following instruction sequence, which doesn't redo the multiplication on the first element of the %5 vector:

```llvm
  %6 = fmul <2 x double> %5, %5
  %7 = extractelement <2 x double> %6, i64 1
  %8 = extractelement <2 x double> %6, i64 0
  %9 = fadd %7, %8
```

Although we figured there was probably a good reason for this (such as the first version being faster somehow, or the two versions not being completely equivalent due to some floating-point error), we wanted to try both versions and see if our version would run faster (or generate different assembly) than the version output by  For our pass, we found the pattern-matching constructs a bit clunky to work with, so we just used  After running  the generated LLVM IR is as shown:

```llvm
  %6 = fmul <2 x double> %5, %5
  %7 = extractelement <2 x double> %6, i64 1
  %8 = extractelement <2 x double> %6, i64 0
  %9 = fadd double %7, %8
```

and the final ARM assembly output is as shown:

```asm
	fsub.2d	v0, v0, v1
	fmul.2d	v1, v0, v0
	faddp.2d	d1, v1
```

Notably, the ARM assembly uses a  We ran the simulation a few times at different particle counts with the same starting seed. Unfortunately, the results seem to be within the margin of error. I guess we did not end up finding a new LLVM optimization. We ran the simulation a few more times for those particle counts (N), and it seemed that for smaller particle counts our altered version was slightly faster, while for larger particle counts the original version was slightly faster; this difference could be chalked up to any number of confounding factors. We believe our work deserves a Michelin star or two because of the effort we made in inspecting the LLVM IR generated from various programs, finding generated instructions in the O3-optimized code that seemed like they might be inefficient, writing a transformation pass that makes a hypothesized optimization, and evaluating that hypothesized optimization. Note: Originally, we tried to optimize the LLVM IR snippet using a vector reduction intrinsic to sum up the elements of the
  
-
My pass was pretty simple: I randomly changed Add, Sub, Mul, and SDiv operations. I grabbed some C programs from the internet that seemed like they would be easy to test. Honestly, the more I tried testing, the more I regretted creating this pass: it was very annoying to test. I had to manually change some conditional statements in the code to avoid segfaults. It is obvious that this would not be very testable, especially when scaling to more complex programs, but I didn't realize how quickly programs would get too complex to test. The most complex program I got to was factorial. Testing was just manually looking at the prints and outputs. Overall this was silly, but a good way to get somewhat familiar with LLVM, especially the setup and basic instruction parsing/modification/creation. I believe this work receives a Michelin star; it wasn't extremely ambitious, but it meets the requirements.
  
-
The hardest part of this assignment was setting up LLVM, having it use my pass, and figuring out the CMake stuff. The posted tutorials were helpful, but I had to do some extra stuff I don't exactly understand to get it working. For this task I implemented a function inlining pass. I thought this would be interesting because there are different ways to configure this pass, such as how many rounds of inlining to do, as well as the instruction threshold for choosing when to inline. Starting out, I thought this was going to require a fair bit of thought: stitching code together would require generating new temps for the function arguments, making the right assignments, and replacing all of the callee's  I might go back and implement the inlining myself, but I expect it'll take a while to figure out. With my spare time I benchmarked the pass a bit. I grabbed some benchmarks from here and ran them for different configurations of inlining iterations and inlining thresholds. I got some graphs that look like these: In general, though, inlining seems like a pretty decent optimization averaged across the 4 benchmarks I tried: More intense benchmarking, with greater care toward curating a "representative" set of benchmarks, would probably reveal more precise trends that would inform how a "real compiler" would want to do inlining, but I thought it was cool to see some preliminary results.
  
-
        
Pass

I was not too ambitious this week. I implemented a heap instrumentation pass, which replaces the original malloc/free invocations with a customized version. The custom heap interface is defined in Rust as a thin wrapper over the system allocator (actually pretty boring; it calls libc anyway, I just wanted to play around with the Rust FFI a bit). It takes an extra function-name argument to keep track of heap statistics per function. An extra dump function is inserted right before the main function exits to print out the profiling result, including the total bytes allocated and the number of malloc/free calls made per function. The Rust crate is compiled into a cdylib and linked with the object file compiled from C. I have not done too many serious tests for this; I just ran it against a couple of C programs and manually checked the results. In LLVM, I mainly played around with the IRBuilder for

Conclusion

I think the hardest part of this lesson might be reading the LLVM docs and finding the right API to use. I sometimes used AI tools to help with that. This lesson did help me gain some familiarity with LLVM.
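The Rust wrapper itself isn't shown, but the same idea can be sketched in C (the names `traced_malloc`/`traced_free` are placeholders of mine, not the actual interface):

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical replacements the pass could swap in for malloc/free.
 * The extra `fn` argument tags each call with the calling function's
 * name so statistics can be kept per function. */
static size_t bytes_allocated = 0;
static unsigned n_mallocs = 0, n_frees = 0;

void *traced_malloc(size_t size, const char *fn) {
    bytes_allocated += size;
    n_mallocs++;
    printf("[heap] %s: malloc(%zu)\n", fn, size);
    return malloc(size);
}

void traced_free(void *p, const char *fn) {
    n_frees++;
    printf("[heap] %s: free\n", fn);
    free(p);
}

/* Accessors a dump function (inserted before main returns) could print. */
size_t heap_bytes(void)     { return bytes_allocated; }
unsigned heap_mallocs(void) { return n_mallocs; }
unsigned heap_frees(void)   { return n_frees; }
```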
  
-
The task. For this task, I implemented a pass which adds, to each loop header, a statement to print out "Here we go again...". This helps to instill in the user a sense of déjà vu at each new iteration, and an increased frustration that the running code seems to take one step back for every two steps forward: "It already ran those instructions, doesn't it have something new to do?"

The implementation. While a simple and unambitious task, there were a few tricky parts:

The evaluation. I ran it on some simple example programs, and to find a real-world program where the output would nevertheless be interpretable, I decided to run it on an old version of GNU yes. (I removed some GNU coreutils-specific parts to make it easy to compile.) Here's an example: As you can see, there are many loops even in this simple program.
  
-
Group (@ngernest, @katherinewu312, @samuelbreckenridge)

We implemented a pass which replaces every division instruction with a  In other words, our pass turns instructions like:

```llvm
%11 = fdiv float 5.0, %10         ; Note that %10 may store the value 0
```

into:

```llvm
%11 = fcmp oeq float %10, 0.0
%12 = fdiv float 5.0, %10
%13 = select i1 %11, float 0.0, float %12     ; equivalent to the C ternary `%13 = %11 ? 0.0 : %12;`
```

We support both unsigned/signed integer division ( To aid debugging, every time our pass replaces a division instruction with  Implementing this pass was relatively straightforward, although we realized we  Simple example:

```c
int main() {
    float zero = 0;
    float result = 5 / zero;
    int use_site = (int) result + 2;
    return use_site;
}
```

If we use

```llvm
define i32 @main() {
  ...
  ret i32 poison
}
```

After performing various local optimizations, the compiler determines that this function returns a  However, if we use

```llvm
define i32 @main() {
  ...
  ret i32 2
}
```

The resultant LLVM code returns a concrete non-zero value, which is expected! This is because our  More complicated examples:
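In C terms, the fcmp/select rewrite behaves like a guarded division; a minimal sketch (the name `safe_div` is mine):

```c
/* C analogue of the pass's rewrite: the quotient is forced to 0.0 when
 * the divisor is zero, mirroring the fcmp oeq + select sequence. */
float safe_div(float x, float y) {
    return (y == 0.0f) ? 0.0f : x / y;
}
```

With this guard in place, the 5/0 in the example above yields 0.0f, so `(int) result + 2` is a well-defined 2 rather than a poison value.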
  
-
LLVM Pass

I am detailing the implementation of a basic version of compiler-enforced cooperation. To minimize the occurrence of head-of-line blocking for user-facing applications, recent works have shifted the task of context-switching between threads out of the operating system and into a userspace scheduler. Userspace schedulers with better knowledge of an application can make better scheduling decisions and make faster context switches (e.g., 30 cycles) using userspace thread libraries (e.g., Boost). This practice has been called compiler-enforced cooperation, as it relies on compiler instrumentation to enable the fast context switches. There are currently three implementations of compiler-enforced cooperation in the literature: (1) in the first, a sending thread sets a signal (writes to a shared memory location) which is periodically polled by a recipient thread at compiler-inserted probe points, directing the recipient to yield the core; (2) in the second, the polling thread checks the CPU's time-stamp counter at each probe point (e.g., using the rdtsc instruction) to check whether it has exceeded a predefined time quantum (e.g., 5 us). In this basic implementation, probe points are inserted every 5 LLVM IR instructions. The rdtsc instruction is called at every probe point, and a histogram of inter-probe-point times is printed at program completion; the goal was to see what the distribution of times between probe point invocations looked like. Similar approaches could be useful for ensuring adequate probe point density for real compiler-enforced cooperation passes or compiler-enabled code profiling.

Generative AI Disclaimer: I had some questions for ChatGPT after using it. ChatGPT said that o3-mini-high is about as good as an advanced LeetCoder. It also said we can expect 353,000 metric tons of CO2 per year (~83,500 gasoline-powered cars driving 11,500 miles each year at 22 MPG) if 800M people each made one LLM query per night. ChatGPT said they aim to hit 1 billion users by the end of this year, and I think most users are making more than one query (it took 7 for this assignment). Also, 4o just scaled the 0.0003 kWh for a single inference by 800,000,000 queries x 365 days. Servers consume energy while idle, energy is used for other infrastructure (e.g., cooling), and this doesn't factor in the embodied carbon of all the GPUs, so probably not such an accurate estimate.
  
-
Automatic memoization of pure functions

Code: https://github.com/ethanuppal/cs6120/tree/main/lesson7

Note: There are technically two "versions" of this pass,  I'm super proud of the pass I made. I spent over 13 hours on it. I used Rust to implement an LLVM pass that conservatively identifies pure functions with up to three integer parameters and automatically memoizes them to a configurable degree. By default, it only memoizes a parameter

For example, consider the following function:

```c
int add1(int a) {
    return a + 1;
}
```

It is "pure": it has no side effects. I used conservative heuristics on the instructions to determine whether a function was pure. There is some low-hanging fruit for improvement here. For example, I didn't do an initial pass to identify potential pure functions and then resolve interprocedural calls, so if a pure function calls another pure function it will not be identified as pure. I decided to only focus on pure functions with 3 or fewer  Then, I inject basic blocks into each function and hooks at each return. At the start, I check if the memoization tables even apply to the parameters, and if they do, whether we have already cached something. At each return, I inefficiently replace it with a branch on whether the parameters are in bounds of the memoization tables (reusing the same boolean variable from the header block; maybe I should have just recomputed it so as not to hamper register allocation) and, if so, whether it can use the cache. View the full memoized source code for the
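Roughly, the injected header block and return hooks behave like this hand-written version of add1 (a sketch under my own assumptions; the bounded table and in-bounds flag mimic the description above, not the pass's exact codegen):

```c
#define MEMO_SIZE 64  /* illustrative table size; the pass makes this configurable */

static int memo_val[MEMO_SIZE];
static int memo_set[MEMO_SIZE];

int add1(int a) {
    /* injected header block: does the table apply, and is it populated? */
    int in_bounds = (a >= 0 && a < MEMO_SIZE);
    if (in_bounds && memo_set[a])
        return memo_val[a];            /* cache hit */
    int result = a + 1;                /* original function body */
    if (in_bounds) {                   /* injected hook at the return */
        memo_val[a] = result;
        memo_set[a] = 1;
    }
    return result;
}
```

Out-of-bounds arguments simply fall through to the original body, which is why the header check matters.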
  
-
My Code is Here: https://github.com/mt-xing/cs6120/tree/main/llvm-pass-skeleton

This was an unfortunately busy week for me (2 grading sessions + my own prelim), and as a result I'm attempting something quite unambitious. In the very first week of CS 2112, Prof. Myers would always tell the freshmen that any good compiler will take an expression of the form  My pass searches every instruction to see if it is a binary op, and if so, whether it is a multiplication command. If both are true, it then checks if the right operand is a constant integer, and finally, whether it is 1) greater than 0, 2) less than or equal to the largest integer that can be safely stored as a double, and 3) whether the floor of the base-2 log of the value is equal to the base-2 log of the value. In other words, it is a multiplication by a power of 2 that I can safely compute with doubles without worrying about loss of precision. Anything that matches this pattern will be rewritten into a left shift, keeping the left operand the same and using the constant computed value of the base-2 log as the right operand. Like the example shown in class, I do not bother removing the old operand, instead relying upon dead code elimination to do that. For testing, I first wrote a manual test case with multiplication by many different integers. I verified manually by looking at the IR that only the ones that were powers of two were rewritten. I also verified that the original operands became dead code, unused in the future. Also, I ran the code to ensure the outputs matched before and after my optimization. I then, with the help of Annabel, found a nontrivial linear algebra library written in C. Compiling this with my pass, I also checked the output of the IR to see if it looked reasonable. The hardest part was getting the LLVM project to run on my computer at all.

I initially tried to make it work in Ubuntu on WSL, but no matter what I did, the pass would either print nothing or crash complaining about running out of memory. I tried cleaning everything, including CMake and LLVM, off my machine and reinstalling everything from scratch. It still would not work. I got this far in class and could not figure out why. I next tried to make it work on Windows, but first, the  So, out of options, I dipped into $15 of my leftover free Azure credits to spin up a Linux VM in the cloud. SSHing into this box meant I finally got the thing to compile and run, and that's where I ended up doing all my work.
  
-
For this assignment I aimed to create a simple profiler tool. The idea is that an instruction-count tool can keep track of the number of instructions that all functions in a C program execute, to give a rough idea of where bottlenecks might be occurring, or for simple empirical evidence in the analysis of algorithms. For my real-world example I implemented insertion sort in C and executed the sort on one array of 10 elements and another of 100 elements; the profiler tool identified that sorting 10 elements led to the execution of 1,570 instructions, whereas sorting 100 elements led to the execution of 110,065 instructions (in line with an O(n^2) algorithm). Overall, I did not aim for a super ambitious implementation this week, but I'm still happy with the results and about having a way to measure instruction count automatically for any C program.
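The quadratic growth the profiler observed can be sanity-checked at the source level by counting insertion sort's inner-loop steps directly (a rough analogue: this counts element shifts, not IR instructions):

```c
#include <stdlib.h>

/* Count the inner-loop shifts insertion sort performs. On reverse-sorted
 * input this is exactly n*(n-1)/2, matching the O(n^2) growth seen in
 * the profiled instruction counts. */
long insertion_sort_steps(int *a, int n) {
    long steps = 0;
    for (int i = 1; i < n; i++) {
        int key = a[i], j = i - 1;
        while (j >= 0 && a[j] > key) {
            a[j + 1] = a[j];
            j--;
            steps++;
        }
        a[j + 1] = key;
    }
    return steps;
}

/* Worst case: a reverse-sorted array of length n. */
long steps_for_reversed(int n) {
    int *a = malloc((size_t)n * sizeof(int));
    for (int i = 0; i < n; i++) a[i] = n - i;
    long s = insertion_sort_steps(a, n);
    free(a);
    return s;
}
```

Going from n=10 to n=100 multiplies the worst-case step count by 110x (45 vs 4950), roughly the ~70x jump the profiler reported on the author's inputs.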
  
-
Code here. For this week's tasks, getting LLVM running on my computer was maybe the hardest part. Once I got it all running, though, the task itself was not difficult, especially since I used the template already given to us in the Skeleton.cpp file. I took all adds in a file and turned them into subtracts with the same arguments. The most difficult thing was certainly handling the difficult-to-parse LLVM syntax and the opacity of the documentation. I also have very limited experience with C++, so, though the syntax isn't as unintuitive there, I struggled with the one-two punch of two new language syntaxes. For testing, I used my own toy test cases, and ran the pass on a C file I found online, from this repo, which does matrix and vector manipulations. I worked (distantly; more like we consulted with each other) with Michael on this project, and he took the repo and stuck a lot of the files together into one big megafile, which I used to make sure everything ran smoothly.
  
-
Getting LLVM on my computer was nearly trivial. I use NixOS, so it was just a matter of adding  For the task, I initially wanted to make a pass that would replace the logical  Instead, I opted for something less ambitious and just made a pass that will print every time a function is called. The idea is that someone could possibly use this as a simple way to check something like how many times each function is called when their program runs, without going and adding print statements in every function. The actual implementation wasn't that difficult, as I mainly just modified the skeleton code in the tutorial. For testing, I made a simple C program that had 4 different ways to calculate the nth Fibonacci number, and got this output: I believe this work deserves a Michelin Star because, although it is a bit unambitious, this plugin could be useful in some (admittedly contrived) situations.
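Hand-instrumenting a naive Fibonacci shows what such a pass automates: a counter bump (or print) at every function entry (the counter is my addition for illustration):

```c
/* What the inserted hook amounts to: bump a counter at function entry. */
static long fib_calls = 0;

long fib(int n) {
    fib_calls++;  /* the pass inserts the equivalent of this line */
    return n < 2 ? n : fib(n - 1) + fib(n - 2);
}

/* Number of calls made while computing fib(n). */
long calls_for(int n) {
    fib_calls = 0;
    fib(n);
    return fib_calls;
}
```

For the naive recursive version, the call count grows exponentially (computing fib(5) already takes 15 calls), which is exactly the kind of thing this pass makes visible without editing the source.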
  
-
I implemented an LLVM pass that rewrites floating-point division operations into multiplications by the reciprocal, essentially transforming x / y into x * (1 / y). To test my implementation, I used a simple C program containing several division cases: one with a variable divisor (a / b), one with a constant divisor (a / 4.0), and one with an expression that simplifies to a constant (a / (2.0 + 2.0)). I verified correctness by inspecting the emitted LLVM IR, which showed that non-constant divisions were replaced with a reciprocal computation followed by a multiplication, while constant cases were optimized directly by LLVM. I also ran the transformed binary to check for performance improvements, observing that the modified IR produced equivalent results with the potential for better runtime efficiency on non-constant operations. This work would be even cooler if I could extend the pass to handle more cases. For instance, I could avoid unnecessary floating-point extensions (the current implementation sometimes promotes operations to double and then truncates them back to float), thereby keeping the calculations entirely in float precision. Moreover, optimizing more complex cases, like hoisting the reciprocal calculation out of a loop when the divisor is computed at runtime but remains constant across iterations, could further boost performance. The hardest part of this task was safely modifying the IR without invalidating iterators; I solved this by first collecting the target instructions and erasing them afterward.
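One caveat worth noting about this transform: the reciprocal introduces an extra rounding step, so the result can drift from true division by an ulp or so. A quick C comparison (helper names mine):

```c
#include <math.h>

/* The pass's rewrite: x / y  ->  x * (1 / y). Division rounds once;
 * the rewritten form rounds the reciprocal and then the product. */
double div_direct(double x, double y) { return x / y; }
double div_recip(double x, double y)  { return x * (1.0 / y); }

double recip_error(double x, double y) {
    return fabs(div_direct(x, y) - div_recip(x, y));
}
```

This is presumably why LLVM itself only performs this rewrite under fast-math-style flags (or when the divisor makes it exact, e.g. powers of two).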
  
-
I wrote an LLVM pass that goes over each instruction in a program, and if it is an integer multiplication or division with at least one operand that's a power of 2, it's converted to a bit-shifting operation. I identify whether an integer value is a power of 2 by zero-extending it to a 64-bit unsigned integer with  I tested and evaluated the pass by writing a test program that generated 10 million operations that are a mix of integer multiplication (with the power of 2 randomly in either operand), signed integer division, and unsigned integer division. The result of each operation is output to a file, along with the runtime on the last line. I compiled and ran the test program with and without the SkeletonPass plugin included and diffed the resulting output files. In the end, it did not seem like the transformations had much of an effect. Averaged over 3 trials, running with the plugin took 4.68s while running without it took 4.71s. When I removed writing the output to files (and only included a print of the elapsed time), which may bias the results, and increased the number of operations to 10 billion, running with the plugin took 80.29s while running without it took 81.58s, so a very minor speedup for a contrived situation.
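A correctness subtlety worth flagging for the signed case (as a general pitfall, not a claim about this particular pass): C's signed division truncates toward zero, while an arithmetic right shift rounds toward negative infinity, so a naive sdiv-to-shift rewrite miscomputes negative dividends unless a bias is added first:

```c
/* Naive rewrite: wrong for negative x, e.g. -7 / 4 == -1 but -7 >> 2 == -2.
 * (Right-shifting a negative int is implementation-defined in C, but is an
 * arithmetic shift on mainstream compilers.) */
int sdiv_naive_shift(int x, int k) { return x >> k; }

/* Corrected rewrite: add 2^k - 1 to negative dividends before shifting,
 * which is essentially what compilers emit for signed division by 2^k. */
int sdiv_biased_shift(int x, int k) {
    int bias = (x >> 31) & ((1 << k) - 1);  /* 2^k - 1 if x < 0, else 0 */
    return (x + bias) >> k;
}
```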
  
-
https://github.com/smd21/cs6120_homework

Once I got everything running, I implemented a pretty barebones pass that just prints out a message when a function is called or a branch is taken. I tried to get this pass to print out the name of the function/branch, but ran into some issues due to not really understanding how stuff is stored in the API. I have a sense of where I went wrong (I think I made some faulty assumptions), but am still working on fully implementing this functionality. For testing, I ran my program on a couple of different C files that call various functions from main and have loops/if statements/etc. to ensure the IR had branch statements. I inspected the output manually and made sure it seemed to be what I expected. (Honestly, I kind of used this as an opportunity to just see the difference between all the control flow instructions in LLVM, lol.)
  
-
I implemented some conversions into bit operations where possible, like multiplication -> left shift, division -> right shift, or modulo -> bitwise AND. I relied a lot on the isPowerOf2 helper that LLVM comes with, which proved kind of handy. The implementation was fairly simple; a lot of issues were just from setting things up. I also attempted some relatively basic constant folding, but encountered some weird behaviors. I had a line  I tested it on a few manually written programs to see that the instructions were being replaced and the original instructions were being removed. I've been trying to get this running on a more "industrial-strength" program, specifically x264. It seems kind of hard to know for sure if you compiled something with only the pass you want, especially in a more complicated make system. I've found some benchmarks for x264 online (phoronix-test-suite), which I'm going to try running once they finish downloading. Hopefully there will be some differences in performance, but I'm a little pessimistic about this, since I expect multiplication and division operations to already be fairly efficient in hardware compared to other instructions.
  
-
        
Summary

I made a tool that checks if an instruction is a multiplication instruction; if so, I print out that a multiplication instruction was found. I also had a part of my pass where, if an instruction was a binary operator and its arguments were commutable, I would swap the arguments. To find these methods in the LLVM internal rep, I spent a good amount of time on the BinaryOperator doxygen.

Testing

I took a matrix multiplication benchmark from a benchmarks repo to test my pass. For each of the dynamically executed multiplication instructions, I indeed saw my expected printed output. I also made sure to print out the LLVM IR code for this C program with and without my pass being registered; for the binary operator instructions where swapping the arguments was equivalent, I made sure that these instructions had their arguments swapped.

Hardest Part

For me, the hardest part was navigating the documentation and figuring out the types of methods and general C++ syntax. I have never written any C++ before, so this was cool, but also a little overwhelming. I feel like it would be really cool to get good at playing around with the LLVM internal rep, though! I think my work deserves a star!
  
-
Here is my code: link

I am very late with this assignment. What I wanted to do was not necessarily ambitious, but I was curious to try something like it. There is a renderer called Mitsuba3. I wanted to do something to the IR of the renderer so that I could visually see the effect of the change. My original plan was to make all diffuse objects look darker; another idea was to make all green objects red. I did not manage to do exactly that, because the renderer uses complex data types and it was not obvious how to scale a color vector. I was hoping that I would be able to figure it out by explicitly multiplying the vector by a constant in the cpp code and comparing the IR of this function with the original one, but the difference between the two IRs was too hard to interpret. So, I decided to scale a scalar function by a constant. It took me some time to make it work, to be honest. For a while I could see no change when compiling with my pass. Later, I figured out that the reason was the variant of Mitsuba I was using (llvm_ad_rgb should have been replaced with scalar_rgb). Now it works. Here are the two different images: the original one and the modified one.
  
-
        
Overview

This is very late, for which we both apologize. For our pass, we implemented a handful of analyses and optimizations on multiplication and division operations, as well as safety checks on division. First, we implemented a checker for division by a constant 0, which handles such cases by just setting the expected quotient to zero. For this, we considered a few options, including preserving the dividend, or even setting the result to some arbitrary constant. Ultimately, we ended up doing the zero resetting and also outputting an informative message. For cases of dividing by a nonzero constant, we implemented a right-shifting optimization, such that we right-shift as much as possible before continuing with the remaining division. Every time a division instruction with a constant divisor is identified, it is noted, along with the corresponding optimization. For multiplication, we implemented left-shifting for multiplication done by a constant on either the LHS or RHS. We also put in constant propagation of sorts (that is to say,  Here's a little snippet of the outputs given as a program is optimized:

Testing

During development, we tested against a simple handmade calculator program we wrote, which is accessible in our code. Subsequently, we put in benchmarks for fast exponentiation and factorial computation, and are testing on a handful of others we found in the LLVM benchmarks repository. Given how short these programs already are, it's not too clear whether this greatly optimizes the code, but it does preserve correctness, as outputs remain consistent!

Takeaways

I (somewhat surprisingly) found this more challenging than a lot of the other exercises. While I find pure C++ in general to be pretty straightforward (minus debugging), it was difficult for me to wrap my head around some of the syntax and the different methods. However, after taking some time to actually review the object structure of program representations and the different keywords, this just turned into plain old C++, and it became a relatively relaxing exercise.
  
-
code
  
-
I implemented a super simple inliner fuzzer for LLVM.¹ The goal is to help explore the inliner decision space. Several ideas for using this include:

But I did not have time to explore any of these cool things. Right now, the pass uses an RNG that can be seeded at compile time, e.g. like so:

Correctness. This was tested on two hand-written C programs: a simple Fibonacci program and a simple addition program. It's more likely than not correct, as I used several built-in LLVM utilities and modeled my code off of the built-in LLVM

Footnotes
  






-
Share your experiences getting started with writing an LLVM pass here!