Skip to content

Join Probe Side Bitmask Cache #77

Description

@peterxcli

https://dl.acm.org/doi/epdf/10.1145/3626246.3653395

For now, we considered the predicate cache only for filtered scansover relations. Redshift, however, also implements Semi-Join Filtersthat push hash joins into table scans on the probe side. Semi-joinfilters eliminate rows without a join partner early on: they evaluatethe join predicate during the scan and test if a tuple is part of thehash table built for the join. Redshift builds a bloom filter [ 12 ] whileinserting the tuples on the build side of a hash join and passes thefilter to the table scan on the probe side. Figure 12 shows the queryexecution plan for the query from before with a semi-join filter.Since Redshift evaluates the semi-join filters during the vector-ized scan, we can also cache the result. The predicate cache canexploit filters and index the join, i.e., the next time Redshift pro-cesses a query with the same join, only the tuples with a join partnerare scanned. Thus, the predicate cache combines the benefits ofscan-based caching with the ones of a query-based result cache.However, this comes at a cost: the keys for the cached entrymust now include relevant parts of the join and build side to match.

We include the join predicate o_orderkey = l_orderkey and arepresentation of the build side. For the build side of the join, thekey stores the scanned table and its filter predicates. With the semi-join filter, the cached entries become 100 times more selective, andonly 0.02 % of the rows in lineitem are accessed. Entries with andwithout a semi-join filter are stored in the same cache, and wechoose the most selective matching entry.Combining the predicate cache with semi-join filters also hasdisadvantages: (1) the hit rate reduces as the cached keys are morecomplex and also depend on filter predicates on other tables, and(2) entries with a semi-join filter are invalidated by inserts, deletes,and updates on the build side of the join. As modern cloud datawarehouses often store the data in a star- or snowflake-like schema,semi-join filters can eliminate many rows early on, and includingjoins in the predicate cache is beneficial for many queries. Hence,we must weigh the benefits of including the join and reducingthe number of rows to scan against the disadvantages of a morecomplex key and the increased invalidation rate.

Image

the cache key format:

Image

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions