Provably Correct Peephole Optimizations with Alive #492
Replies: 22 comments 7 replies
-
Critique: This is a very cool bit of work, especially in terms of finding real bugs in LLVM. I'm always interested when bugs are found in common compilers, as I have to wonder how many poor developers were gaslit into thinking their own code wasn't working when it was the compiler's fault. My biggest critique of this work is that it introduces a whole new language just for peephole optimizations, specifically ones that fit within a specific template system. I'm not familiar with the breakdown of optimization types in LLVM, but in this class alone we've seen a large number of optimizations that operate beyond a basic block, and those also seem to be the most bug-prone. While the authors have clearly shown that there are still bugs in these peephole optimizations to find, I have to wonder if a system for proving the correctness of larger optimizations would be more useful (I'm sure it would be much harder to build, though). Discussion Question: The paper presents a new language that uses templates to write certain peephole optimizations. Do we think people would be receptive to learning a separate language just to code up this subset of one particular type of optimization, especially since every such optimization would need to fit within the templates? Even within local optimizations on a single basic block, is the template language expressive enough to encode all, or even most, optimizations that can be done at this level?
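To ground the expressiveness question, a transformation in the Alive DSL is essentially a source template, an arrow, and a target template over pattern variables, with types left implicit. Here is a minimal rewrite in that style (my own illustrative example, not one drawn from the paper):

```
Name: AddToShl (illustrative)
%r = add %x, %x
=>
%r = shl %x, 1
```

Anything that can't be phrased as such a pattern, plus an optional `Pre:` condition on the constants, falls outside what the templates can express, which is exactly the worry raised above.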
-
Discussion Question: The authors mention that they use the SMT solver to infer Alive types for variables (§3.2), although they use a classical (unification-based) type inference algorithm to find explicit types (as required by LLVM) when generating C++ code (§4). Maybe a combination of SMT + classical PL techniques is the way to get the best of both worlds: the proof automation from the SMT solver, but also the (proven) correctness guarantees from the latter.
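To make the contrast concrete, the type inference of §3.2 boils down to satisfiability over bit-width variables; roughly (my paraphrase, not the paper's exact notation), an instruction like `%r = add %x, %y` contributes constraints of the form

$$w_{\%r} = w_{\%x} = w_{\%y} \;\wedge\; w_{\%x} \geq 1$$

and Alive enumerates the feasible width assignments, verifying the optimization once per typing, while a unification-based algorithm instead computes a single most-general typing, which is what the generated C++ needs.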
-
I was most interested by the discussion of attributes and undefined behavior in this paper. I hadn't previously considered the role of undefined behavior in compiler optimizations, so it was cool to see how it was handled in LLVM/Alive; it seems like a powerful construct, albeit difficult to work with. One of the harder ideas to grasp was that reasoning about undefined behavior introduces some asymmetry between source and target in our optimizations: correctness needs to be satisfied universally over target undefs but existentially over source undefs, and similarly we would like as many attributes as possible in the source but as few as possible in the target, which is a bit nonintuitive. Because of this I appreciated section 3.4 and the algorithm for attribute inference, which I found to be a really nice example of Alive taking over something that is hard to reason about. That said, it is a bit unclear to me what kind of performance benefits might be expected from strengthening postconditions; it would have been nice to see some evaluation of how many extra optimizations are enabled. I also would have been interested in more motivation of the Alive paradigm of verifying developer optimizations in light of work on automatic optimization generation (mentioned in related work). Discussion questions: The related work section cites work on automatically generating peephole optimizations going back to 1984, so why are developer-created optimizations still the focus of work from 2015? The paper mentions incorrect interpretation of the reference and/or failure to match the semantics intended by the LLVM developers as a threat to the validity of Alive; is there any solution to this?
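The asymmetry shows up concretely with attributes: dropping one in the target is always sound, while adding one must be proven. A minimal Alive-style illustration (my own example, not from the paper):

```
%r = add nsw %x, %y
=>
%r = add %x, %y
```

The source is poison on signed overflow, and anything refines poison, so this verifies; the reverse direction, which introduces nsw in the target, would be rejected unless a precondition rules out overflow. Section 3.4's inference is about mechanically finding the strongest set of target attributes that still verifies.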
-
Critique: I really liked the way the paper was structured; the description of the language was very clear and made a lot of sense to me. I was particularly excited by the details about how undef and poison values were treated, and the strong guarantee that the set of possible undefined values stays the same for all undefined variables was some cool attention to detail. I could see a much more relaxed scheme that instead allows undefined values to be arbitrarily chosen. I also especially liked the section about automatically inferring attributes. Discussion Questions:
-
Critique: This paper was really fascinating and promising. The correctness of a compiler is usually just assumed, and it seems like the authors gave some demonstrations of a new, lower-cost paradigm for helping ensure correctness (as opposed to more expensive, bottom-up verification like CompCert). For the scope of the paper, I thought they accomplished a lot. I was especially impressed by how their constraints managed to handle normal undefinedness and deferred undefinedness (poison values), even across some memory operations. These constraints were clear, and it intuitively makes sense why they are equivalent to transformation correctness. I don’t know if this is reasonable to expect (since it wasn’t the authors’ focus, of course), but I had somewhat hoped for some correctness guarantees on entire passes, and not just on “peephole optimizations” within one specific pass. These smaller optimizations are obviously super important and prone to error (based on the 8 mistakes they found in existing LLVM implementations), but I’d imagine the majority of bugs come from slightly more global analyses. I’d be excited if something like the three-condition correctness model (used in Theorems 1 & 2) could be generalized to bigger transformations beyond simple expression optimizations. Questions: How do you think these peephole transformation correctness constraints could generalize to non-peephole (more global) transformations?
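For reference, the three-condition model has roughly this shape, writing $\phi$ for the precondition, $\delta$ for definedness, $\rho$ for poison-freeness, and $\iota$ for the result value (my paraphrase of §3.3, not the paper's exact symbols):

$$\phi \wedge \delta_{src} \Rightarrow \delta_{tgt}, \qquad \phi \wedge \delta_{src} \wedge \rho_{src} \Rightarrow \rho_{tgt}, \qquad \phi \wedge \delta_{src} \wedge \rho_{src} \Rightarrow \iota_{tgt} = \iota_{src}$$

Generalizing this beyond peepholes would mean finding analogues of $\delta$, $\rho$, and $\iota$ for whole control-flow graphs rather than single expression DAGs.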
-
I thought this paper was a fun read. I think it is neat that the tool hooks nicely into LLVM by using LLVM syntax; I'd imagine this makes it easy and approachable for compiler developers to use. Even though the paper does target a "small" subset of optimizations and code, I think any system that makes meaningful progress in specifying and verifying such a huge and widely-used project as LLVM is a big win. The fact that the authors caught real bugs with their tool is evidence enough of its utility, imo. So I kinda disagree with others in the thread who have implied that the omission of control-flow-sensitive optimizations is a weakness; to me, it seems like the authors set a reasonable scope for the system and accomplished it well. I read this paper as evidence that specifying and verifying even "simple" optimizations like the ones discussed is difficult enough; I wonder what must go into doing it for "harder" optimizations. One control-flow-sensitive optimization I can see the authors' approach being applied to is loop-invariant code motion. It seems like Alive would want to check the constraints about correctness, definedness, and poison-freeness for all iterations of the loop before and after the code motion. The problem setup of identifying code that is loop-invariant and then moving it outside the loop would fit nicely into the syntax of Alive that is already proposed; something like the sketch at the end of this comment (using not-precise syntax and not SSA). I'm not sure about other optimizations, though. This paper is about finding bugs in optimizations on LLVM IR. I imagine similar bugs can arise in parts of the compiler that translate between two languages that have different semantics (a frontend and IR, or IR and machine code). Do you think the authors' approach can be applied to these translations as well? What complications could arise?
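Filling in the kind of sketch gestured at above, deliberately in made-up, loop-aware pseudo-Alive (the real tool accepts neither loops nor this syntax):

```
; hypothetical syntax, not accepted by Alive
Pre: %a, %b not written inside loop
loop:
  %t = mul %a, %b
  %s = add %s0, %t
=>
%t = mul %a, %b
loop:
  %s = add %s0, %t
```

The definedness and poison-freeness obligations would then have to hold for the hoisted instruction even on executions where the loop body never runs, which is precisely the classic speculation hazard of LICM.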
-
I think this was an interesting and, in a way, satisfying paper to read. I find it frustrating that proving code correctness reliably is so difficult, but it's reassuring to know that, at least in some applications and cases, it is possible—even though I've read in this discussion thread that SMT solvers are not always reliable. It is impressive (and terrifying) that each of these peephole optimizations requires a manually specified pattern. I think it’s very cool that the authors were able to use their method to find existing bugs and prevent new ones from emerging. I'm also curious to learn how much speedup the peephole optimizations provide. Discussion question: Coming up with optimizable patterns seems to require a lot of effort, even though there are some tools for automatic generation. It also appears that the set of all useful patterns has not yet been exhausted. I wonder what the source of these new patterns is. Are they related to new features of the language? Is it possible to mathematically describe the set of all useful patterns? |
-
Critique: I thought this paper was pretty cool, as the ability to formally prove optimizations and systematically generate counterexamples for incorrect optimizations seems pretty powerful. It was especially interesting how they encapsulated and formalized undefined behavior in LLVM. However, using SMT solvers seems pretty expensive, and the authors themselves mention in Section 6 that Alive can take hours for large bitwidths, resulting in a number of limitations. These limitations, in combination with the fact that using Alive requires learning a new DSL, make me wonder how usable and widely adopted Alive would be. I think it also would have been nice to see a comparison of execution time of compiled code if all InstCombine optimizations could have been implemented in Alive, as maybe this would yield some insight into their comment that “[w]hen a postcondition is strengthened, more instructions retain undefined behavior attributes, permitting additional optimizations to be performed”. Discussion Question: The authors mention potential "threats to validity" in Alive that could arise due to incorrect formalizations of LLVM semantics. This seems a bit counterintuitive for a system that is intended to prove the correctness of LLVM optimizations. How would we verify the correctness of Alive itself, maybe with empirical testing?
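On the strengthening point, here is a self-constructed instance of what that quoted sentence means in practice:

```
%r = mul nsw %x, 2
=>
%r = shl nsw %x, 1
```

The shl is poison exactly when the mul is (shifting out a bit that disagrees with the resulting sign bit is the same event as signed overflow of %x * 2), so Alive-style inference can keep the nsw on the target, and later passes that rely on nsw remain applicable. A cautious developer writing the C++ by hand might drop the attribute and silently disable those downstream optimizations.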
-
I enjoyed this paper; I felt like the authors were trying to thread the needle of proving optimizations correct while minimizing engineering cost (both in terms of time spent building the proofs and in easing adoption, by allowing developers to write optimizations in a language similar to what they are used to writing instead of having to learn a completely new proof language to show correctness). It is also nice to see that DSLs can be adopted in smaller-scope problems and make an impact, identifying and tackling real-world problems in compilers. Discussion question:
-
I thought this paper was written well and was probably one of the smoother ones to read. I was impressed by the fact that they were actually able to uncover that some of the optimizations were incorrect. But my main question was whether this was worth it, especially if it requires learning a separate language and isn't plug-and-play. They hand-translated 300+ optimizations... I think including an approximation of the time spent would be good information to include so we could gauge the cost of learning such a tool. I guess that could be a discussion question: do other people think this would be good information to include? One thing that does bother me a little is that it seems many things are hard to verify because they are hard to formalize. Is verification worth the heavy cost if in the end it can only be used for verifying "easier" things? I also wonder how many of these bugs could have been found by hand. My experience with DSLs, which might not translate completely here, is that programming in them requires a pretty decent formal understanding of how the verification works. If you're already at the point of rewriting optimizations with such knowledge, could it be cheaper to prove by hand? I guess there is some comfort in knowing a computer verified it, but who is verifying the verifier, and who is verifying that? When can we be convinced the verification tool is completely correct?
-
I thought this was a very interesting paper, since we've focused a lot on optimizations but not on their correctness. I think the biggest overhead is that there is a completely new language to learn, especially since I'm not too familiar with LLVM IR, but it does make sense that this was meant for compiler developers already working at the IR level of LLVM (which, as others have said, is a plus for Alive). I was probably most impressed by the way they handled undefined behavior such as poison values and undefs. Generating C++ code is also a plus, and I'm curious about the potential speedup that these peephole optimizations provide; there is a section at the end saying that they provide a 7% speedup with gcc, and I'm wondering how much it would be if all InstCombine optimizations were implemented. Question: Given the success of Alive in verifying LLVM’s peephole optimizations, how can the approach be adapted to automate the verification of optimizations in other compiler frameworks or domains? What adaptations or innovations might be necessary to generalize Alive?
-
I thought this was a quite strong paper, especially considering that even though it was limited to simple types (no floats or aggregate types), it was able to reveal a significant number of bugs in real-world code. I was curious to learn more about how Alive is used now, and while doing more research I came across the second iteration, Alive2. In the Alive paper, the authors had noted in the Related Work section that translation validation was a promising approach, but required a large overhead in proof machinery being executed during every compilation. Alive2 builds off this idea, doing bounded translation validation, where it limits resource consumption by reducing complexity, such as unrolling loops to a bound. Though this is only a partial analysis, it seems to be a success, also finding quite a few bugs in the LLVM code base. What’s the best way to ensure correctness for sources with floats and aggregate types? Alive not supporting aggregate types seems like a large limitation to me. Is this an instance where it might make more sense to use random testing vs. formal verification for gaining more confidence in correctness?
-
I liked this paper! Especially the opening line and the on-the-nose-ness of compiler hackers solving a problem with a new programming language (and accompanying compiler). Something I liked is that I thought they did a good job of (in their abbreviated words) "capturing, but largely hiding, the detailed semantics of undefined behavior". While the thrust of the paper was about proving peephole optimizations correct or incorrect, one of the most interesting parts of the results section was how many attributes they were able to add to the outputs of already correct optimizations. I thought that was a testament to their user-friendly formal methods tool: it captured the semantics of UB (that is why it could add UB-related attributes) but it did all of its reasoning about UB on its own, finding the optimal set of attributes for a user-specified optimization. I also found Alive to be a fairly elegant language. I understood it immediately and found it to be a terser and more informative alternative to a C++ LLVM pass. The biggest flaw for me was the emphasis on "aggressively taking advantage of" undefined behavior. Mostly because I think languages like C and C++ that have lots of UB are waning in popularity in favor of languages that might have no UB at all by the time they make it to the LLVM IR. Questions: Should LLVM developers check Alive code into the repo and use compile-time-generated C++ instead of hand-written C++? Even if the resulting compiler is slower? The authors mention that MulDivRem optimizations were both the most buggy and the most time-consuming to prove correct/incorrect. How valuable is this tool if it takes seconds to run? Hours? How valuable is it if only the hours-long runs are finding bugs in your code? |
-
I thought this paper was a really good read! I do wish I had more background knowledge about LLVM, since I think some of the more interesting parts probably went a bit over my head. Still, I felt that the paper did a good job of explaining things and was pretty approachable. It’s really cool that Alive was able to expose some bugs within the InstCombine pass! Compiler correctness is typically something developers (or at least I) take for granted, so I found it pretty striking that Alive is able to guarantee correctness. The authors do mention that Alive can run pretty slowly while verifying certain optimizations, but honestly I think this is a worthwhile tradeoff for the assurance that your compiler is actually compiling correctly. I saw online that the authors have an Alive2 paper; based on my two-minute skim, Alive2 incorporates a lot of extensions the authors didn't handle in this paper (e.g., floating point) and caught ~40 more errors in LLVM! Discussion questions:
-
Critique: Compiler verification is an incredibly fascinating field, and it was super cool to see this deep insight into the topic. It strikes me that so far in what we have covered in 6120, we have primarily focused on the implementation of optimizations, that is to say, making our IR code better in certain ways (more optimal) but not necessarily rigidly showing that this is correct. PL, compilers, and verification go hand-in-hand, so this paper definitely touched on a lot of important points at the intersection of the three fields. I found it to be a great read. In the first few pages alone it was really fascinating to see how one could begin with the underlying setup of LLVM, and subsequently stack a fairly advanced type system, semantics, and predicate logic on top to create the infrastructure that is the Alive DSL. Having said that, it does seem that the amount of information presented right at the start could be spread out a bit more. Section 3 presents a really good overview of verifying both the soundness and correctness of peephole optimizations, using both the type system and predicate logic. This approach feels in part almost like an axiomatic approach towards determining the correctness of given compiler optimizations, and does so in a very interesting way. Tangentially to the material itself, I thought this section of the paper did a great job of building up both intuition and notation for encoding conditions into the language and subsequently guaranteeing correctness. Altogether, I think that the paper does a great job of unifying a few heavy-duty perspectives on optimizing compilers, presents a fairly involved and complex language to stack upon and help the LLVM infrastructure, and certainly creates the right formal environment to prove correctness. Discussion Question: This kept popping up for me as I went through the paper, but are there parallels that one can draw to axiomatic semantics and Hoare triples with the notion of soundness as defined within Alive? (Especially see 3.3.{2, 3} and Theorem 2.) To some extent, from a layman's perspective, this does seem to be the obvious way to reason about proving a compiler optimization: showing that, under a precondition, the transformed code establishes the same result the source would.
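One way to phrase the parallel (my own rendering, not notation from the paper): the precondition together with the source's definedness and poison-freeness constraints plays the role of the Hoare precondition, the rewrite is the command, and refinement of results is the postcondition:

$$\{\phi \wedge \delta_{src} \wedge \rho_{src}\}\;\; src \rightsquigarrow tgt \;\;\{\delta_{tgt} \wedge \rho_{tgt} \wedge \iota_{tgt} = \iota_{src}\}$$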
-
I was super impressed by the paper. Their framework for writing and proving optimizations is both cool and found actual bugs in LLVM. Their implementation effort and persistence show. I was wondering a little about how they could replace SMT solvers. They do note at the end (and in parts within) that SMT solvers are expensive (like hours-level expensive) and some of their constraints grow uncomfortably superlinearly. I think using an SMT solver was definitely the right choice because this paper already contributes so much, but I was wondering if, even at the cost of slightly limiting the applicable scope, a specialized solver could be designed and implemented just for "peephole optimization correctness in the context of compilers with simple rules on how poison values and undefined values propagate".
-
This paper was actually pretty fun to read because it tackles something super practical—how do we make sure compiler optimizations don’t break things? The fact that LLVM’s peephole optimizations are mostly written by hand, with correctness depending on a mix of intuition and testing, honestly surprised me. It makes sense that bugs slip through, and I really liked that this paper doesn’t just point out the problem—it actually builds a usable tool (Alive) to help fix it. What I really enjoyed was how Alive feels like it bridges formal verification and real-world compiler development. The idea of using SMT solvers to check whether transformations preserve semantics is really cool, and it makes Alive more than just a static checker—it actually proves correctness. They also ran Alive on hundreds of real LLVM optimizations, found bugs, and even got some fixes pushed upstream. That kind of impact makes a research paper feel way more relevant. Alive primarily focuses on integer-based transformations, but LLVM also performs optimizations on floating-point operations and aggregate types (like structs and arrays). How could Alive be extended to verify transformations on floating-point expressions while accounting for rounding errors and IEEE-754 quirks? Similarly, what challenges would arise if Alive were adapted to reason about optimizations involving aggregate types, especially when memory layout or aliasing comes into play? |
-
Critique: I thought this paper was very interesting! Even with very little knowledge of LLVM (or anything compiler-related in general ...), it was surprisingly readable, although I needed to read about how poison and undef worked on my own. I thought the technique for automatically generating and converting constraints was very innovative. Discussion question: |
-
Critique: I did not understand what this paper is saying because I do not use LLVM or SMTs. I would have liked for them to have mentioned how important peephole optimizations are in the grand scheme of a compiler (i.e., what proportion of all optimizations are peephole optimizations). Discussion Question: What does it mean that, when there is an undef value, "the compiler is free to pick a convenient value for each use of undef to enable aggressive optimizations"? I am curious about examples of the aggressive optimizations that can be enabled and how picking any value maintains correctness. I would think that an optimization would have to hold for all values to maintain correctness, because the optimization can't risk changing the functionality of the program by not accounting for a certain value.
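To make the quoted sentence concrete (my own Alive-style example, not one from the paper): correctness only requires every value the target can produce to be one the source could have produced, so the compiler gets to pick the undef. For example,

```
%r = and %x, undef
=>
%r = and %x, 0
```

is fine, because choosing 0 for the undef is one of the source's allowed behaviors, and it lets the result fold to a constant. By contrast, folding `mul %x, undef` to `undef` would be wrong: when %x is even, every source value is even, but the undef on the target side could be read as odd.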
-
Critique: The paper was really well written, and I appreciated the attention it gave to treating undefined behavior carefully. Question: There are a lot of compiler passes that are more complicated than peephole optimizations. Could Alive be extended to prove a wider range of optimizations?
-
Post your comments here!
Discussion leads: @Annacaro22 @katherinewu312 @calciiium @InnovativeInventor