CometHashJoin always selects BuildRight which causes potential performance regression #1382
Comments
@hayman42 what a great find. I have not observed this myself even at SF10000, probably because by default we were falling back to SMJ. Would you be able to compare the plan with Comet shuffle disabled?
@parthchandra With Comet shuffle disabled, the plan is almost the same as vanilla Spark's because Comet SHJ is replaced with Spark SHJ, and thus Spark's performance is preserved. Here is the plan with Comet shuffle disabled (the image is not attached, so I put text instead).
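(For reference, "Comet shuffle disabled" here means turning off the Comet shuffle flag, roughly as in the line below; the exact flag name is my recollection of the Comet configuration docs and may differ by version.)
--conf spark.comet.exec.shuffle.enabled=false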
I have another question regarding your comment. Is it OK to use Comet SMJ on a large dataset? Or did you just disable Comet shuffle? I observed that Comet SMJ is much slower than Spark's, which is why I am trying to use SHJ.
Yeah, I was afraid that would be the case. Interesting that Spark gets the plan right but it gets messed up with Comet. Afaik, Comet itself does not do any of the build side planning. Maybe it should, which is what you've tried to do here. I'm not the expert on this, I'm afraid. @viirya any thoughts?
It should be OK to use Comet SMJ. We may be spilling too soon for Comet SMJ, causing the slower performance. @kazuyukitanimura any thoughts on this?
We have not enabled SHJ by default because we haven't implemented spilling, IIRC.
@kazuyukitanimura I am not sure, but I think the slowness comes from the CometExchange that is executed after the join with BuildLeft. These are the metrics for the original Comet CometExchange:
shuffle records written: 65,254,713
number of spills: 2,619
shuffle write time total (min, med, max): 29.7 s (2 ms, 43 ms, 138 ms)
number of input batches: 100,000
records read: 65,254,713
memory pool time total (min, med, max): 1.2 m (25 ms, 113 ms, 183 ms)
local bytes read total (min, med, max): 513.1 MiB (4.3 MiB, 7.7 MiB, 9.5 MiB)
fetch wait time total (min, med, max): 0 ms (0 ms, 0 ms, 0 ms)
remote bytes read total (min, med, max): 3.4 GiB (37.6 MiB, 52.2 MiB, 54.1 MiB)
repartition time total (min, med, max): 50.8 s (21 ms, 53 ms, 315 ms)
decoding and decompression time total (min, med, max): 3.3 m (1.6 s, 2.4 s, 8.6 s)
local blocks read: 171,154
spilled bytes: 2,250,148,478,976
remote blocks read: 1,160,828
data size total (min, med, max): 4.4 GiB (4.5 MiB, 6.8 MiB, 7.2 MiB)
native shuffle writer time total (min, med, max): 4.8 m (100 ms, 341 ms, 1.5 s)
number of partitions: 2,000
encoding and compression time total (min, med, max): 1.9 m (34 ms, 102 ms, 744 ms)
remote reqs duration total (min, med, max): 1.2 m (339 ms, 639 ms, 2.2 s)
shuffle bytes written total (min, med, max): 3.9 GiB (2.6 MiB, 6.2 MiB, 6.8 MiB)
And here are the metrics with my change:
It is weird that most of the metrics, including spill size and execution time, get 7-8x higher. I don't know why this happens, but I am trying to figure it out.
Spilling greatly affects the speed. We need to understand why this is the case.
I remember that Comet query planning doesn't change the build side on HashJoin: datafusion-comet/spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala, lines 2994 to 2995 at f099e6e.
If you turn off AQE, will it affect the build side?
@viirya It seems AQE does not affect the build side. I confirmed both Spark and Comet behave the same as before with AQE disabled. |
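(For completeness, disabling AQE here just means setting the standard Spark flag:)
--conf spark.sql.adaptive.enabled=false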
Spilling 16 TB ( |
Describe the bug
First of all, thank you for such a great project. I am currently doing some research to see whether our team can make use of DataFusion Comet for our workload.
As mentioned in hashjoin, it is important to keep the build-side table as small as possible. I am not sure if it is intended, but Comet's current implementation always chooses BuildRight unless building the right side is impossible. This causes a performance regression for queries like TPC-H q9.
Additionally, I tried to make some modifications to RewriteJoin so that build side selection is based on the size of each table, but then other bugs appear (a rough sketch of the idea is included below).
Below are the metrics from CometHashJoin.
Before
BuildRight is selected even if the right table is much larger.
After modification (I referred to Gluten's source code; see https://github.com/hayman42/datafusion-comet/blob/main/spark/src/main/scala/org/apache/comet/rules/RewriteJoin.scala)
BuildLeft is selected and, as a result, CometHashJoin has become faster,
but afterwards I found other bugs that make the job even slower. I made the changes without a deep understanding of this project, so I think that is the reason.
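For reference, here is a rough sketch of the size-based build side selection I experimented with (this is my own illustration, not the actual Comet or Gluten code; names such as BuildSideHint are hypothetical and the statistics lookup is simplified):

// Hypothetical sketch: pick the smaller side as the hash join build side.
import org.apache.spark.sql.catalyst.optimizer.{BuildLeft, BuildRight, BuildSide}
import org.apache.spark.sql.catalyst.plans.{InnerLike, JoinType, LeftOuter, RightOuter}
import org.apache.spark.sql.execution.SparkPlan

object BuildSideHint {
  // Estimated size from the linked logical plan's statistics; treat "unknown" as
  // very large so a side without statistics is never preferred for building.
  private def estimatedSize(plan: SparkPlan): BigInt =
    plan.logicalLink.map(_.stats.sizeInBytes).getOrElse(BigInt(Long.MaxValue))

  // Simplified legality checks; Spark's planner handles more join types (semi, anti, existence).
  private def canBuildLeft(joinType: JoinType): Boolean = joinType match {
    case _: InnerLike | RightOuter => true
    case _ => false
  }

  private def canBuildRight(joinType: JoinType): Boolean = joinType match {
    case _: InnerLike | LeftOuter => true
    case _ => false
  }

  // Returns the preferred build side, or None if neither side can be built.
  def choose(joinType: JoinType, left: SparkPlan, right: SparkPlan): Option[BuildSide] = {
    val leftIsSmaller = estimatedSize(left) <= estimatedSize(right)
    (canBuildLeft(joinType), canBuildRight(joinType)) match {
      case (true, true)  => Some(if (leftIsSmaller) BuildLeft else BuildRight)
      case (true, false) => Some(BuildLeft)
      case (false, true) => Some(BuildRight)
      case _             => None
    }
  }
}

The idea is simply to build the smaller side whenever the join type allows either side, which is what Spark's own planner does when it selects SHJ.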
Steps to reproduce
Run TPC-H SF200 q9 with the following configs.
Most of our workload is TB~PB scale, so I used multiple executors to test scalability.
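The full config list is not included here; as a rough illustration only, the Comet-related flags involved look something like the lines below (flag names follow the Comet docs as I remember them and may differ by version):
--conf spark.plugins=org.apache.spark.CometPlugin
--conf spark.comet.enabled=true
--conf spark.comet.exec.enabled=true
--conf spark.comet.exec.shuffle.enabled=true
--conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager
--conf spark.sql.join.preferSortMergeJoin=false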
I added
--conf spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold=64MB
for the vanilla Spark test so that SHJ is always chosen.
Expected behavior
Ideally it should behave like vanilla Spark's SHJ. The query with Spark SHJ is almost 8x faster in the setting above.
Additional context
Below are the details for each setting
Comet (BuildRight - BuildRight - BuildRight)
Spark (BuildLeft - BuildRight - BuildLeft)
Comet with custom RewriteRule (BuildLeft - BuildRight - BuildRight (the left table has become larger for an unknown reason))