Skip to content

Conversation

@yjyolandeyan
Copy link

@yjyolandeyan yjyolandeyan commented Oct 22, 2025

Description

Added support for SQL IN operator in IndexJoinOptimizer to enable index join optimization for IN predicates in join conditions. This change covers both the Java planner (IndexJoinOptimizer.java) and the native execution path (PrestoToVeloxQueryPlan.cpp).

Changes include:

  • Java side: Modified IndexJoinOptimizer.java to recognize SpecialFormExpression.Form.IN and extract lookup variables from IN expressions
  • Native side: Added isIn() helper function and logic in PrestoToVeloxQueryPlan.cpp to convert IN expressions to InIndexLookupCondition
  • Tests: Added comprehensive test coverage in TestNativeIndexJoinLogicalPlanner.java, TestSequenceStorageLogicalPlanner.java, and PrestoToVeloxQueryPlanTest.cpp

Motivation and Context

Presto now supports RewriteConstantArrayContainsToInExpression that converts CONTAINS(ARRAY[...], column) into SQL IN expressions when the array contains only constants. However, IndexJoinOptimizer previously only recognized CONTAINS expressions, not IN expressions.

This meant that:

  1. Direct SQL IN predicates in join conditions (e.g., o.custkey IN (1, 2, 3)) were not eligible for index join optimization
  2. CONTAINS expressions converted to IN by the rewrite rule also lost index join optimization eligibility

This change ensures that both direct SQL IN and rewritten CONTAINS→IN expressions can benefit from index join optimization.

Impact

Performance: Queries using IN predicates in join conditions can now leverage index joins, potentially improving performance significantly for queries with selective IN predicates on indexed columns.

User-facing: Users can now write cleaner SQL using the IN operator directly instead of the less intuitive CONTAINS(ARRAY[...], column) syntax, while still getting index join optimization.

API: No public API changes. This is an internal optimization enhancement.

Test Plan

  1. Added unit tests in TestNativeIndexJoinLogicalPlanner.java covering:

    • IN with constant values only
    • IN with probe-side variables (mixed and all variables)
    • Verification that index joins are created correctly
  2. Added unit tests in TestSequenceStorageLogicalPlanner.java covering:

    • IN operator with various scenarios
    • Testing with rewrite_constant_array_contains_to_in_expression enabled to verify CONTAINS→IN conversion works correctly
  3. Added C++ unit test in PrestoToVeloxQueryPlanTest.cpp to verify parsing of IndexJoinNode with IN operator

  4. Manually tested on production-like queries to verify:

    • With rewrite flag OFF: CONTAINS expressions produce index joins
    • With rewrite flag ON: CONTAINS converted to IN still produce index joins
    • Direct SQL IN expressions produce index joins

All tests pass successfully.

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality. (N/A - no new user-facing features)
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed. (Pending)
  • If adding new dependencies, verified they have an OpenSSF Scorecard score of 5.0 or higher. (N/A - no new dependencies)
  • E2E: 20251022_051122_00008_mqhze (with rewrite = true) -> index join

20251022_050924_00006_mqhze (with rewrite = false) -> index join

Release Notes

== NO RELEASE NOTE ==

@sourcery-ai
Copy link
Contributor

sourcery-ai bot commented Oct 22, 2025

Reviewer's Guide

This PR extends index join optimization to support SQL IN predicates by updating both the Java planner and the native execution path to recognize IN expressions as lookup conditions, and adds unit tests across Java and C++ to validate the new behavior.

Sequence diagram for index join optimization with IN predicate

sequenceDiagram
    participant Planner as "IndexJoinOptimizer"
    participant Expr as "CallExpression (IN)"
    participant Context as "LookupContext"

    Planner->>Expr: Check getDisplayName() == "IN"
    Expr-->>Planner: Return arguments
    Planner->>Expr: Check isVariable(arguments[0])
    Expr-->>Planner: Return true/false
    alt arguments[0] is variable
        Planner->>Context: Add arguments[0] to lookupVariables
    end
Loading

Class diagram for updated IndexJoinOptimizer logic

classDiagram
    class IndexJoinOptimizer {
        +optimizeJoin()
        +isVariable(Expression): boolean
        +getLookupVariables()
    }
    class CallExpression {
        +getDisplayName(): String
        +getArguments(): List<Expression>
    }
    class VariableReferenceExpression {
    }
    IndexJoinOptimizer --> CallExpression
    IndexJoinOptimizer --> VariableReferenceExpression
    CallExpression --> Expression

    %% Highlight new IN support
    IndexJoinOptimizer : +Handles CONTAINS expressions
    IndexJoinOptimizer : +Handles IN expressions (new)
Loading

File-Level Changes

Change Details Files
Enable IN predicate recognition in the Java IndexJoinOptimizer
  • Detect SpecialFormExpression.Form.IN in call expressions
  • Validate argument count and variable position for IN
  • Extract and register the lookup variable from IN predicates
IndexJoinOptimizer.java
Translate IN expressions into InIndexLookupCondition in native planner
  • Add isIn() helper to identify IN nodes
  • Implement conversion of IN expressions to InIndexLookupCondition
PrestoToVeloxQueryPlan.cpp
Add test coverage for IN-based index join optimization
  • Extend TestNativeIndexJoinLogicalPlanner with IN predicate plan assertions
  • Add IN operator scenarios to TestSequenceStorageLogicalPlanner
  • Introduce C++ unit test in PrestoToVeloxQueryPlanTest.cpp for IN handling
TestNativeIndexJoinLogicalPlanner.java
TestSequenceStorageLogicalPlanner.java
PrestoToVeloxQueryPlanTest.cpp

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `presto-tests/src/test/java/com/facebook/presto/tests/TestNativeIndexJoinLogicalPlanner.java:147-145` </location>
<code_context>
                             filter(tableScan("lineitem")),
                             indexSource("orders"))));

+            assertPlan("" +
+                            "SELECT *\n" +
+                            "FROM (\n" +
+                            "  SELECT *\n" +
+                            "  FROM lineitem\n" +
+                            "  WHERE partkey % 8 = 0) l\n" +
+                            joinType + " JOIN orders o\n" +
+                            "  ON l.orderkey = o.orderkey\n" +
+                            "  AND o.custkey IN (1, l.partkey, 3)\n",
+                    anyTree(indexJoin(
+                            filter(tableScan("lineitem")),
+                            indexSource("orders"))));
+
             assertPlan("" +
</code_context>

<issue_to_address>
**suggestion (testing):** Missing negative test cases for IN expressions in index joins.

Please add tests to cover cases where IN expressions should not result in index joins, such as with non-constant arrays, unsupported types, or empty arrays.

Suggested implementation:

```java
            assertPlan("" +
                            "SELECT *\n" +
                            "FROM (\n" +
                            "  SELECT *\n" +
                            "  FROM lineitem\n" +
                            "  WHERE partkey % 8 = 0) l\n" +
                            joinType + " JOIN orders o\n" +
                            "  ON l.orderkey = o.orderkey\n" +
                            "  AND o.custkey IN (1, l.partkey, 3)\n",
                    anyTree(indexJoin(
                            filter(tableScan("lineitem")),
                            indexSource("orders"))));

            // Negative test: IN list is a column reference (non-constant array)
            assertPlan("" +
                            "SELECT *\n" +
                            "FROM lineitem l\n" +
                            joinType + " JOIN orders o\n" +
                            "  ON l.orderkey = o.orderkey\n" +
                            "  AND o.custkey IN (l.partkey)\n",
                    anyTree(
                            join(
                                    tableScan("lineitem"),
                                    tableScan("orders"))));

            // Negative test: IN list contains unsupported type (e.g., VARCHAR)
            assertPlan("" +
                            "SELECT *\n" +
                            "FROM lineitem l\n" +
                            joinType + " JOIN orders o\n" +
                            "  ON l.orderkey = o.orderkey\n" +
                            "  AND o.custkey IN ('a', 'b', 'c')\n",
                    anyTree(
                            join(
                                    tableScan("lineitem"),
                                    tableScan("orders"))));

            // Negative test: IN list is empty
            assertPlan("" +
                            "SELECT *\n" +
                            "FROM lineitem l\n" +
                            joinType + " JOIN orders o\n" +
                            "  ON l.orderkey = o.orderkey\n" +
                            "  AND o.custkey IN ()\n",
                    anyTree(
                            join(
                                    tableScan("lineitem"),
                                    tableScan("orders"))));

```

- If your test framework requires a specific matcher to assert the absence of an index join, you may need to adjust `anyTree(join(...))` to match your conventions.
- If the `joinType` variable is not defined for these negative cases, you may need to specify the join type explicitly (e.g., "INNER", "LEFT", etc.).
- If the empty IN list syntax (`IN ()`) is not supported by your SQL parser, you may need to use a workaround such as `IN (NULL)` or another method to simulate an empty list.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@yjyolandeyan
Copy link
Author

currently added logic for both rewrite CONTAINS to IN (will be a part of SuperExpression and matched before normal non-equal expressions like CONTAINS), and IN (as a regular expression, so query is written as key IN xx). Happy to remove the regular IN case if that's not needed.

@zacw7 zacw7 self-requested a review October 22, 2025 19:01
@steveburnett
Copy link
Contributor

Thanks for the release note! Nit of formatting:

== NO RELEASE NOTE ==

Copy link
Member

@zacw7 zacw7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG. Thanks for adding it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants