Description
Right now, since 1 partition means 1 repository, we know that joins (by repository) can only match rows within the same partition.
Even so, we currently iterate and try to join all of the partitions together.
Imagine we have 3 partitions.
These are the rows returned by each partition on the left side of a join.
- P1: 45
- P2: 5
- P3: 50
These are the rows returned by each partition on the right side of a join.
- P1: 55
- P2: 15
- P3: 30
We are joining 100 rows with 100 rows, which produces 10,000 rows that are then filtered by the join conditions (but we still perform those 10k iterations).
Instead, if we did this per partition, these would be the rows produced (before being filtered by the conditions):
- P1: 2475
- P2: 75
- P3: 1500
The total number of rows produced is 4,050, roughly 40% of the 10,000 generated before. The gap grows enormously as the number of partitions and rows increases.
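As a sanity check, here is a tiny Go program (Go being the language of gitbase and go-mysql-server) that reproduces the arithmetic above; all the numbers come straight from the example:

```go
package main

import "fmt"

func main() {
	left := []int{45, 5, 50}   // rows per partition, left side of the join
	right := []int{55, 15, 30} // rows per partition, right side of the join

	// Current behavior: every left row is combined with every right row,
	// regardless of partition.
	sumLeft, sumRight := 0, 0
	for i := range left {
		sumLeft += left[i]
		sumRight += right[i]
	}
	fmt.Println("all partitions joined together:", sumLeft*sumRight) // 10000

	// Per-partition joins: only rows of the same partition are combined.
	perPartition := 0
	for i := range left {
		perPartition += left[i] * right[i]
	}
	fmt.Println("joined per partition:", perPartition) // 4050
}
```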
What could we do?
A rule that runs at the end of the analysis and transforms joins (the ones left after the squash) into something like:
```
Concat
 |- InnerJoin
     |- PartitionTable(TableA)
     |- PartitionTable(TableB)
```
PartitionTable is a table that will only return the rows for one partition.
Concat is a node that will iterate over all partitions and transform all of its Table children into PartitionTable. Then, all the rows of each partition will be put together and returned to the user. This will also happen in parallel.
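To make the idea concrete, here is a minimal sketch of such a rule. Every name in it (Node, Table, InnerJoin, Concat, wrapJoinsInConcat) is a hypothetical placeholder, not the actual gitbase or go-mysql-server API:

```go
package main

import "fmt"

// Hypothetical, heavily simplified node types; the real gitbase /
// go-mysql-server interfaces look different.
type Node interface{ String() string }

type Table struct{ Name string }

func (t *Table) String() string { return "Table(" + t.Name + ")" }

type InnerJoin struct{ Left, Right Node }

func (j *InnerJoin) String() string {
	return fmt.Sprintf("InnerJoin(%s, %s)", j.Left, j.Right)
}

// Concat runs its child subtree once per partition and concatenates
// the results.
type Concat struct{ Child Node }

func (c *Concat) String() string { return fmt.Sprintf("Concat(%s)", c.Child) }

// wrapJoinsInConcat is the proposed rule: any join left after the
// squash gets wrapped in a Concat node. The rewrite of Table children
// into PartitionTable happens later, when Concat iterates partitions.
func wrapJoinsInConcat(n Node) Node {
	if j, ok := n.(*InnerJoin); ok {
		return &Concat{Child: j}
	}
	return n
}

func main() {
	plan := &InnerJoin{Left: &Table{"TableA"}, Right: &Table{"TableB"}}
	fmt.Println(wrapJoinsInConcat(plan))
	// Concat(InnerJoin(Table(TableA), Table(TableB)))
}
```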
Essentially, Concat is like an Exchange. The only way it differs is that it can handle binary nodes, not only unary ones. This is something that cannot be done in go-mysql-server but can be done here, since we know for certain that the partitions are the same for each table.
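And here is a sketch of how Concat could execute under those assumptions: run the join subtree once per partition, in parallel, and append everything into a single result. Again, Partition, Row, the Rows method and fakeJoin are made up for illustration:

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical types for illustration only.
type Partition string
type Row []interface{}

// partitioned is the join subtree with every Table child replaced by a
// PartitionTable bound to the given partition.
type partitioned interface {
	Rows(p Partition) ([]Row, error)
}

// concatRows executes the child once per partition, in parallel, and
// appends all the resulting rows into a single result.
func concatRows(child partitioned, partitions []Partition) ([]Row, error) {
	var (
		mu       sync.Mutex
		wg       sync.WaitGroup
		out      []Row
		firstErr error
	)
	for _, p := range partitions {
		wg.Add(1)
		go func(p Partition) {
			defer wg.Done()
			rows, err := child.Rows(p)
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				if firstErr == nil {
					firstErr = err
				}
				return
			}
			out = append(out, rows...)
		}(p)
	}
	wg.Wait()
	if firstErr != nil {
		return nil, firstErr
	}
	return out, nil
}

// fakeJoin simulates a per-partition join producing the row counts
// from the example above.
type fakeJoin struct{ counts map[Partition]int }

func (f fakeJoin) Rows(p Partition) ([]Row, error) {
	return make([]Row, f.counts[p]), nil
}

func main() {
	join := fakeJoin{counts: map[Partition]int{"P1": 2475, "P2": 75, "P3": 1500}}
	rows, err := concatRows(join, []Partition{"P1", "P2", "P3"})
	if err != nil {
		panic(err)
	}
	fmt.Println(len(rows)) // 4050
}
```

In the real engine the rows would stream through row iterators instead of being materialized in slices, but the shape would be the same.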
I called it Concat, but the name is pretty lame, so we should think of a better one, like PartitionExchange, BinaryExchange or something like that.
This should make non-squashed joins much, much faster, and in real-life applications you will have many of them, because their leaves will be subqueries.