Can a deduplication query be optimized in Trino, or is Spark the only solution? #26850
nikita-sheremet-java-developer asked this question in Q&A
Here is a table:
- Parquet size: about 14 GB
- Row count: about 505 million
- Duplicated rows (same emails with different cases): about 14,000
And the SQL:
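The query itself is not shown above, so here is only a minimal sketch of what a case-insensitive email dedup typically looks like in Trino. The table name `my_table`, the column names, and the CTE name `deduplicated` (inferred from the `count(*)` query mentioned further down) are all assumptions, not the actual query:

```sql
-- Sketch only: my_table, email, and the CTE name deduplicated are assumed,
-- not taken from the original post.
WITH deduplicated AS (
    SELECT *
    FROM (
        SELECT
            t.*,
            row_number() OVER (PARTITION BY lower(email) ORDER BY email) AS rn
        FROM my_table AS t
    ) AS ranked
    WHERE rn = 1  -- keep one row per lowercased email
)
SELECT * FROM deduplicated;
```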
I tried several cluster setups (the Trino coordinator had 64 GB RAM):
So the query concentrates on a single worker and eats all of the query memory. In production I have seen about 71680 MB of usage and errors like `The node may have crashed or be under too much load.` Locally I have seen `Query exceeded per-node memory limit of ...`. So even with 256 GB of memory I cannot finish this query with Trino.
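For context, an aggregation-based variant of the same dedup (again a sketch with the same assumed names, not the original query) avoids materializing per-partition row numbers and may have a lower per-node memory footprint:

```sql
-- Aggregation-based variant (sketch; my_table and the column names are assumed):
-- one output row per lowercased email, picking arbitrary values for the
-- remaining columns instead of ranking rows with a window function.
SELECT
    lower(email)          AS email,
    arbitrary(first_name) AS first_name,
    arbitrary(last_name)  AS last_name
FROM my_table
GROUP BY lower(email);
```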
I spun up a Spark cluster with 2 workers (4 CPUs and 16 GB memory each, plus the same master and data nodes), and the query finished after 30 minutes without any errors.
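For comparison, a hypothetical Spark SQL equivalent of the same dedup (names assumed as above) would look like this; one plausible reason it finishes on 16 GB workers is that Spark can spill the per-partition sort to disk, at the cost of the ~30 minute runtime:

```sql
-- Hypothetical Spark SQL equivalent (table/column names assumed, not from the post).
SELECT * FROM (
    SELECT
        *,
        row_number() OVER (PARTITION BY lower(email) ORDER BY email) AS rn
    FROM my_table
) ranked
WHERE rn = 1;
```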
Also note that `select count(*) from deduplicated` works in Trino.

Can my query somehow be run in Trino on cheap hardware, or is Spark, with its optimized joins, the only way?
Many thanks in advance!