You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Currently many Presto (JDBC based) connectors such as mysql, postgres, Oracle are single threaded. When doing aggregation calculation, Presto will do a full table scan, then do the aggregation on all records returned from the data source, which is very slow in some case. For example, for TPCH 100G orders 150M records table stored in Postgresql, if we do the following count aggregation query on presto, it took 8 minutes 30 seconds to return the result.
While in postgresql itself, it only took 5 seconds for the results to return.
tpch100g=# EXPLAIN ANALYSE select o_custkey, count() from orders where to_char(o_orderdate, 'mm/yyyy') = '03/1997' group by o_custkey having count() > 4;
QUERY PLAN
Custom Scan (Citus Adaptive) (cost=0.00..0.00 rows=0 width=0) (actual time=5134.805..5134.806 rows=5 loops=1)
Task Count: 32
Tasks Shown: One of 32
-> Task
Node: host=myhost port=5432 dbname=tpch100g
-> Finalize GroupAggregate (cost=122383.11..125236.20 rows=7492 width=16) (actual time=1396.918..1396.918 rows=0 loops=1)
Group Key: o_custkey
Filter: (count(*) > 4)
Rows Removed by Filter: 55564
-> Gather Merge (cost=122383.11..124808.74 rows=19534 width=16) (actual time=1356.274..1378.625 rows=58563 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial GroupAggregate (cost=121383.09..121554.01 rows=9767 width=16) (actual time=1354.141..1363.419 rows=19521 loops=3)
Group Key: o_custkey
-> Sort (cost=121383.09..121407.51 rows=9767 width=8) (actual time=1354.135..1356.468 rows=20041 loops=3)
Sort Key: o_custkey
Sort Method: quicksort Memory: 1721kB
Worker 0: Sort Method: quicksort Memory: 1699kB
Worker 1: Sort Method: quicksort Memory: 1704kB
-> Parallel Seq Scan on orders_102808 orders (cost=0.00..120735.84 rows=9767 width=8) (actual time=0.305..1346.003 rows=20041 loops=3)
Filter: (to_char((o_orderdate)::timestamp with time zone, 'mm/yyyy'::text) = '03/1997'::text)
Rows Removed by Filter: 1542604
Planning Time: 0.086 ms
Execution Time: 1397.120 ms
Planning Time: 1.381 ms
Execution Time: 5134.832 ms
(26 rows)
In the above example, it's 8.5 minutes in Presto vs 5 seconds in Postgresql Citus cluster. It's a big difference.
It will be nice to have a switch that can be turned on/off in each session, that we could decide if we want to have Presto to return the full table do the aggregation, or we could push down to the data source to do the aggregation, and only returns the final results.
The text was updated successfully, but these errors were encountered:
Uh oh!
There was an error while loading. Please reload this page.
Currently many Presto (JDBC based) connectors such as mysql, postgres, Oracle are single threaded. When doing aggregation calculation, Presto will do a full table scan, then do the aggregation on all records returned from the data source, which is very slow in some case. For example, for TPCH 100G orders 150M records table stored in Postgresql, if we do the following count aggregation query on presto, it took 8 minutes 30 seconds to return the result.
While in postgresql itself, it only took 5 seconds for the results to return.
In the above example, it's 8.5 minutes in Presto vs 5 seconds in Postgresql Citus cluster. It's a big difference.
It will be nice to have a switch that can be turned on/off in each session, that we could decide if we want to have Presto to return the full table do the aggregation, or we could push down to the data source to do the aggregation, and only returns the final results.
The text was updated successfully, but these errors were encountered: