diff --git a/content/blog/2025-11-06-tbaa.md b/content/blog/2025-11-06-tbaa.md new file mode 100644 index 000000000..cd7fbbcdf --- /dev/null +++ b/content/blog/2025-11-06-tbaa.md @@ -0,0 +1,79 @@ ++++ +title = "Approximating a Pareto-Optimal Solution to Alias Analysis" + +[extra] +latex = false +bio = """ +Cynthia Shao is an undergraduate junior studying ECE and CS at Cornell. In her free time she likes to crochet, knit, make fun matcha drinks, rock climb, and perform Chinese martial arts. + +Jonathan Brown is an undergraduate junior studying ECE and CS at Cornell. In his free time he likes to program games, play the piano, and play badminton. + +Pedro Pontes García is an undergraduate senior studying CS at Cornell. As a pseudo-math major, he appreciates the formal universe of PL. Depending on the season, he enjoys skiing, canoe camping, and wandering into the woods in any form. +""" + +[[extra.authors]] +name = "Cynthia Shao" +[[extra.authors]] +name = "Jonathan Brown" +[[extra.authors]] +name = "Pedro Pontes García" ++++ + +## Introduction + +Aliasing is a central issue for programming contexts that involve reasoning about memory. In particular, compilers are unable to immediately determine whether two variables are aliased to the same address in memory, which prevents most kinds of simple optimizations, such as basic dead code elimination (DCE) or common subexpression elimination (CSE). Alias analysis is a general set of techniques to determine which variables are guaranteed to not be aliased, as opposed to variables that may alias. Type-based alias analysis (TBAA) is one such technique, proposed by Diwan, McKinley, and Moss in [their PLDI '98 paper](https://dl.acm.org/doi/10.1145/277650.277670). It is based on simple typing principles for type-safe languages, and performs surprisingly close to the theoretical limit of alias analysis. + +## Type-based alias analysis + +Let L be a hypothetical type-unsafe C-like language. 
Consider the following function in L, which illustrates the problem of aliasing:

```c
bool foo(int* a, long* b) {
    *a = 1; // instruction 1
    *b = 2; // instruction 2
    return *a == 1;
}
```

The behavior of the function `foo` depends entirely on whether `a` and `b` point to the same memory address. If they do not, then instruction 2 cannot affect `*a`, so the reload of `*a` in the return statement is redundant: the compiler may fold the return expression to `true`. (The same would necessarily hold if `a` and `b` were not pointers at all.) However, if they are aliases, that folding would produce an incorrect result: instruction 2 overwrites the value at the shared memory address, so the function should return false, but the folded version incorrectly returns true.

Type-based alias analysis proposes the following simple test to determine whether two pointers may alias:

1. Language L is type-safe.
1. Pointer variables `a` and `b` have different types.
1. (1) and (2) imply that `a` and `b` do not alias.

We revise our example by redefining L to be type-safe. In that case, we can be certain that `a` and `b` are not aliases, since they have different types. Therefore, an optimization pass such as CSE can safely replace the reload of `*a` with the constant 1, folding the return value to `true` while preserving correctness.

TBAA extends to languages with support for subtyping; in this case, we must introduce an additional constraint: two pointers of types `T*` and `U*` may alias only if `T` and `U` are related by the subtype relation. For instance, if `U` is a subtype of `T`, then aliasing is possible, because an object of type `U` can be referenced through a pointer of type `T`; if `T` and `U` are disjoint in the subtype hierarchy, then their pointers cannot alias. TBAA also extends naturally to the field types of structured data; we do not discuss the specifics of that extension in this brief summary.
## Critique

### Merits

This paper evaluates the tradeoff between precision and speed in alias analysis. Precise alias analysis is an inherently [undecidable problem](https://dl.acm.org/doi/10.1145/186025.186041), so every practical technique must approximate. To frame this tradeoff, we borrow a concept from economics called Pareto optimality, represented by a curve called the Pareto frontier that bounds the set of achievable solutions: along the frontier, improving one objective requires sacrificing the other. Here the two objectives are precision and computational cost, and undecidability places a hard ceiling on precision. The paper's main contribution is an empirical upper bound on how much precision any alias analysis could add; measured against that bound, TBAA is approximately Pareto-optimal, since it is cheap to compute yet comes close to the precision limit.
+
Figure 1: Pareto frontier with precision and computational cost as axes. The red curve is the Pareto frontier, representing the optimal tradeoffs: no technique can improve both precision and speed simultaneously beyond this boundary. The orange dot is TBAA, the purple dot is Types and fields, and the blue dot is Flow Insensitive. The grey line is the hypothetical upper bound. The points are plotted relative to one another, since costs were not given concretely in the paper.
+