sampsyo · pedropontesgarcia · Nov 7, 2025
diff --git a/content/blog/2025-11-06-tbaa.md b/content/blog/2025-11-06-tbaa.md
@@ -0,0 +1,79 @@
++++
+title = "Approximating a Pareto-Optimal Solution to Alias Analysis"
+
+[extra]
+latex = false
+bio = """
+Cynthia Shao is an undergraduate junior studying ECE and CS at Cornell. In her free time she likes to crochet, knit, make fun matcha drinks, rock climb, and perform Chinese martial arts. 
+
+Jonathan Brown is an undergraduate junior studying ECE and CS at Cornell. In his free time he likes to program games, play the piano, and play badminton.
+
+Pedro Pontes García is an undergraduate senior studying CS at Cornell. As a pseudo-math major, he appreciates the formal universe of PL. Depending on the season, he enjoys skiing, canoe camping, and wandering into the woods in any form.
+"""
+
+[[extra.authors]]
+name = "Cynthia Shao"
+[[extra.authors]]
+name = "Jonathan Brown"
+[[extra.authors]]
+name = "Pedro Pontes García"
++++
+
+## Introduction
+
+Aliasing is a central issue in any programming context that involves reasoning about memory. In particular, a compiler cannot always determine whether two pointer variables refer to the same address in memory, which blocks many otherwise simple optimizations, such as basic dead code elimination (DCE) or common subexpression elimination (CSE), whenever loads and stores are involved. Alias analysis is a family of techniques for determining which variables are guaranteed not to alias, as opposed to variables that may alias. Type-based alias analysis (TBAA) is one such technique, proposed by Diwan, McKinley, and Moss in [their PLDI '98 paper](https://dl.acm.org/doi/10.1145/277650.277670). It relies on simple typing principles for type-safe languages, and in the paper's evaluation it performs surprisingly close to the limit of what a more precise alias analysis could achieve.
+
+## Type-based alias analysis
+
+Let L be a hypothetical type-unsafe C-like language. Consider the following function in L, which illustrates the problem of aliasing:
+
+```c
+bool foo(int* a, long* b) {
+    *a = 1; // instruction 1
+    *b = 2; // instruction 2
+    return *a == 1;
+}
+```
+
+The behavior of the function `foo` depends entirely on whether `a` and `b` point to the same memory address. If they do not, then instruction 2 cannot affect the return value, and a compiler that can treat the store as dead (for example, because `*b` is never read afterwards) can eliminate it; this would necessarily be the case if `a` and `b` were not pointers. However, if they are aliases, then eliminating instruction 2 would produce an incorrect result: instruction 2 must overwrite the value at the shared memory address, which makes the return value false, whereas removing it makes the return value incorrectly true.
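+
+To make the dependence on aliasing concrete, here is a minimal sketch in standard C. It assumes a variant of `foo` in which both parameters are `int*`, so that passing aliased pointers is well-defined in C; the names `foo_same_type` and `check_foo_examples` are hypothetical and do not come from the paper.
+
+```c
+#include <assert.h>
+#include <stdbool.h>
+
+/* Hypothetical variant of `foo`: both parameters are `int*`, so calling it
+ * with two pointers to the same object is well-defined standard C. */
+bool foo_same_type(int *a, int *b) {
+    *a = 1;         /* instruction 1 */
+    *b = 2;         /* instruction 2 */
+    return *a == 1; /* true only if instruction 2 did not overwrite *a */
+}
+
+/* Hypothetical helper: exercises both the non-aliased and aliased cases. */
+void check_foo_examples(void) {
+    int x = 0, y = 0;
+    assert(foo_same_type(&x, &y));  /* distinct objects: *a is still 1 */
+    assert(!foo_same_type(&x, &x)); /* same object: *a was overwritten with 2 */
+}
+```
+
+Unless the compiler can prove that `a` and `b` are distinct, it must keep both the store through `b` and the reload of `*a`.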
+
+Type-based alias analysis rests on the following simple argument for ruling out aliasing:
+
+1. Language L is type-safe.
+1. Pointer variables `a` and `b` have different types.
+1. (1) and (2) imply that `a` and `b` do not alias.
+
+We revise our example by redefining L to be type-safe. In that case, we can be certain that `a` and `b` are not aliases, since they have different types. Therefore, a compiler pass such as DCE can safely eliminate instruction 2, provided the store is otherwise dead, while preserving correctness.
+
+TBAA extends to languages with subtyping, at the cost of a weaker rule: two pointers of types `T*` and `U*` may alias only if `T` and `U` are related by subtyping. For instance, if `U` is a subtype of `T`, then aliasing is possible, because an object of type `U` can be referenced through a pointer of type `T`; if `T` and `U` are disjoint in the subtype hierarchy, then their pointers cannot alias. TBAA also extends naturally to field types for structured data types; we do not discuss the specifics of that extension in this brief summary.
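+
+The subtyping rule above can be sketched as a small may-alias query in C. This is an illustration under stated assumptions, not the paper's implementation: it assumes a single-inheritance hierarchy, and the `Type` descriptor, the functions `is_subtype` and `may_alias`, and the example types are all hypothetical.
+
+```c
+#include <assert.h>
+#include <stdbool.h>
+#include <stddef.h>
+
+/* Hypothetical type descriptor for a single-inheritance hierarchy. */
+typedef struct Type {
+    const char *name;
+    const struct Type *parent; /* NULL for a root type */
+} Type;
+
+/* Returns true if `sub` is `super` or a (transitive) subtype of it. */
+bool is_subtype(const Type *sub, const Type *super) {
+    for (const Type *t = sub; t != NULL; t = t->parent) {
+        if (t == super) {
+            return true;
+        }
+    }
+    return false;
+}
+
+/* Pointers to `t` and `u` may alias only if the types are related. */
+bool may_alias(const Type *t, const Type *u) {
+    assert(t != NULL && u != NULL);
+    return is_subtype(t, u) || is_subtype(u, t);
+}
+
+/* Hypothetical hierarchy: Circle and Square extend Shape; Account is unrelated. */
+const Type SHAPE = {"Shape", NULL};
+const Type CIRCLE = {"Circle", &SHAPE};
+const Type SQUARE = {"Square", &SHAPE};
+const Type ACCOUNT = {"Account", NULL};
+```
+
+In this sketch, `may_alias(&SHAPE, &CIRCLE)` holds because a `Circle` can be referenced through a `Shape` pointer, while `may_alias(&CIRCLE, &SQUARE)` does not, since neither type is a subtype of the other.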
+
+## Critique
+
+### Merits
+
+This paper evaluates the tradeoff between precision and speed in alias analysis. Precise alias analysis is inherently an [undecidable problem](https://dl.acm.org/doi/10.1145/186025.186041), so every practical analysis is an approximation. We borrow the concept of Pareto optimality from economics to frame this tradeoff: the Pareto frontier is the curve of solutions for which one objective cannot be improved without worsening another. In this case the two objectives are precision and computational cost. Undecidability caps the precision any analysis can reach, and the paper's main contribution is to estimate an upper bound on how much a more precise analysis could still gain and to show that TBAA comes close to that bound at low cost. In that sense, we read TBAA as approximating a Pareto-optimal solution.
+
+<p align="center">
+<img width="512" height="422" alt="image" src="https://github.com/user-attachments/assets/c315a223-141e-4e26-a8d4-a3c5ad8afd6e" /><br />
+<span style="font-size: 12px;"><b>Figure 1:</b> Pareto frontier with axes of precision and computational cost. The red curve is the Pareto frontier, representing optimal tradeoffs: no technique can improve both precision and speed simultaneously beyond this boundary. The orange dot is TBAA, purple is types and fields, and blue is flow-insensitive. The grey line is the hypothetical upper bound. The points are plotted relative to one another, as cost was not concretely given in the paper.</span>
+</p>
+
+As mentioned in the [first paper reading](https://dl.acm.org/doi/10.1145/2528521.1508275) of the year, evaluating the effect of an optimization pass is often difficult because of confounding factors. The authors argue that static precision, such as the number of pairs reported as "may-alias", should not be the sole metric for evaluating an alias analysis. Instead, they argue for a holistic evaluation that combines dynamic and limit-based measurements with the impact on the downstream optimizations that the analysis enables.
+
+The paper additionally explores the tradeoff between higher-precision techniques, such as field name analysis and flow-insensitive analysis, and simple type-based alias analysis. The surprising conclusion is that the more precise techniques did not improve results enough to justify their added cost, and that simple TBAA was sufficient to enable the downstream optimizations.
+
+Finally, the elegance and simplicity of TBAA make it particularly attractive for production compilers. The analysis requires minimal infrastructure (just type information, which compilers already maintain), integrates seamlessly with existing optimization passes, and scales gracefully to large codebases. This combination of effectiveness and practicality explains its widespread adoption.
+
+### Shortcomings
+
+Although the paper introduces several new evaluation metrics, it misses some; most significantly, the evaluation does not discuss compile-time impact. While the authors argue that TBAA is computationally efficient, they provide no empirical data on compilation-time overhead. Given that compile time is a critical practical concern, this omission is surprising.
+
+The paper's evaluation is also limited to Modula-3 programs, which constrains the generalizability of findings. Modern languages with richer type systems (parametric polymorphism, higher-order types, dependent types) or different memory models might exhibit different cost-benefit tradeoffs. The paper's focus on type-safe contexts is methodologically sound for controlled evaluation, but leaves open questions about TBAA's effectiveness in languages with more complex aliasing patterns.
+
+Additionally, while the paper demonstrates that TBAA approaches the "limit" of alias analysis, this limit itself is measured against other static techniques. The paper doesn't deeply explore the semantic gap between what static alias analysis can achieve and what runtime aliasing behavior actually occurs. There may be aliasing patterns that are statically unprovable but dynamically rare, or vice versa. Understanding this gap could inform adaptive or profile-guided approaches to alias analysis.
+
+Lastly, the paper's treatment of structured types and field sensitivity, as mentioned earlier, lacks the depth given to pointer aliasing. Field-level aliasing patterns can be quite different from whole-object aliasing, and the dismissal of field-sensitive analysis as too expensive might overlook opportunities for selective application where it provides high value.
+
+## Discussion and further work
+
+In terms of direct follow-up work, a later paper co-authored by McKinley, "The Limits of Alias Analysis for Scalar Optimizations" (2004), examines how the precision of alias analysis affects compiler optimizations such as loop-invariant code motion, copy propagation, global value numbering, loop unrolling, and global variable replacement. Furthermore, the original authors published "Using Types to Analyze and Optimize Object-Oriented Programs" (2001), which appears to expand on the idea of using TBAA beyond the analyses shown in the original paper, and further explores how type information can guide alias analysis. In terms of how TBAA is used in the field, LLVM includes a TBAA pass that consumes type metadata; Clang attaches `!tbaa` metadata to loads and stores; and in those contexts, standard passes such as GVN and LICM benefit from TBAA too. This technique seems to have heavily influenced certain toolchains and languages, such as Swift’s compiler. We also found that it is largely unused in dynamically typed languages and in unsafe Rust blocks, where type safety cannot be guaranteed. In terms of performance, TBAA remains a compelling feature: it is fast, safe, effective, and scales well.