Join Probe Side Bitmask Cache

> https://dl.acm.org/doi/epdf/10.1145/3626246.3653395

For now, we considered the predicate cache only for filtered scansover relations. Redshift, however, also implements Semi-Join Filtersthat push hash joins into table scans on the probe side. Semi-joinfilters eliminate rows without a join partner early on: they evaluatethe join predicate during the scan and test if a tuple is part of thehash table built for the join. Redshift builds a bloom filter [ 12 ] whileinserting the tuples on the build side of a hash join and passes thefilter to the table scan on the probe side. Figure 12 shows the queryexecution plan for the query from before with a semi-join filter.Since Redshift evaluates the semi-join filters during the vector-ized scan, we can also cache the result. The predicate cache canexploit filters and index the join, i.e., the next time Redshift pro-cesses a query with the same join, only the tuples with a join partnerare scanned. Thus, the predicate cache combines the benefits ofscan-based caching with the ones of a query-based result cache.However, this comes at a cost: the keys for the cached entrymust now include relevant parts of the join and build side to match. 

We include the join predicate o_orderkey = l_orderkey and arepresentation of the build side. For the build side of the join, thekey stores the scanned table and its filter predicates. With the semi-join filter, the cached entries become 100 times more selective, andonly 0.02 % of the rows in lineitem are accessed. Entries with andwithout a semi-join filter are stored in the same cache, and wechoose the most selective matching entry.Combining the predicate cache with semi-join filters also hasdisadvantages: (1) the hit rate reduces as the cached keys are morecomplex and also depend on filter predicates on other tables, and(2) entries with a semi-join filter are invalidated by inserts, deletes,and updates on the build side of the join. As modern cloud datawarehouses often store the data in a star- or snowflake-like schema,semi-join filters can eliminate many rows early on, and includingjoins in the predicate cache is beneficial for many queries. Hence,we must weigh the benefits of including the join and reducingthe number of rows to scan against the disadvantages of a morecomplex key and the increased invalidation rate.


<img width="377" height="213" alt="Image" src="https://github.com/user-attachments/assets/14cc5a62-ad10-41e3-bbd9-5cb9e7a1cf43" />

the cache key format:

<img width="367" height="106" alt="Image" src="https://github.com/user-attachments/assets/d0284e3e-60c8-49cd-a111-6b3bedb831be" />

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Join Probe Side Bitmask Cache #77

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Join Probe Side Bitmask Cache #77

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions