Can a deduplication query be optimized in Trino, or is Spark the only solution? #26850
nikita-sheremet-java-developer asked this question in Q&A
Here is a table:
- Parquet size: about 14 GB
- Row count: about 505 million
- Duplicated rows (same emails with different cases): about 14,000
And the SQL:
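The query itself is not shown above, so here is only a minimal sketch of what a case-insensitive email dedup typically looks like in Trino. The table name `my_table`, the column names, and the CTE name `deduplicated` (inferred from the `count(*)` query mentioned further down) are all assumptions, not the actual query:

```sql
-- Sketch only: my_table, email, and the CTE name deduplicated are assumed,
-- not taken from the original post.
WITH deduplicated AS (
    SELECT *
    FROM (
        SELECT
            t.*,
            row_number() OVER (PARTITION BY lower(email) ORDER BY email) AS rn
        FROM my_table AS t
    ) AS ranked
    WHERE rn = 1  -- keep one row per lowercased email
)
SELECT * FROM deduplicated;
```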
I tried several cluster setups (the Trino coordinator had 64 GB RAM):
So the query concentrates on a single worker and eats all of the query memory. In production I have seen about 71680 MB of usage and errors like `The node may have crashed or be under too much load.` Locally I have seen `Query exceeded per-node memory limit of ...`. So even with 256 GB of memory I cannot finish this query with Trino.
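For context, an aggregation-based variant of the same dedup (again a sketch with the same assumed names, not the original query) avoids materializing per-partition row numbers and may have a lower per-node memory footprint:

```sql
-- Aggregation-based variant (sketch; my_table and the column names are assumed):
-- one output row per lowercased email, picking arbitrary values for the
-- remaining columns instead of ranking rows with a window function.
SELECT
    lower(email)          AS email,
    arbitrary(first_name) AS first_name,
    arbitrary(last_name)  AS last_name
FROM my_table
GROUP BY lower(email);
```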
I spun up a Spark cluster with 2 workers (4 CPUs and 16 GB memory each, plus the same master and data nodes), and the query finished after 30 minutes without any errors.
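For comparison, a hypothetical Spark SQL equivalent of the same dedup (names assumed as above) would look like this; one plausible reason it finishes on 16 GB workers is that Spark can spill the per-partition sort to disk, at the cost of the ~30 minute runtime:

```sql
-- Hypothetical Spark SQL equivalent (table/column names assumed, not from the post).
SELECT * FROM (
    SELECT
        *,
        row_number() OVER (PARTITION BY lower(email) ORDER BY email) AS rn
    FROM my_table
) ranked
WHERE rn = 1;
```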
Also note that `select count(*) from deduplicated` works in Trino.

Can my query somehow be run in Trino on cheap hardware, or is Spark, with its optimized joins, the only way?
Many thanks in advance!