Skip to content

bug: Slow run deletion #2105

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
rharkor opened this issue May 24, 2025 · 1 comment
Open

bug: Slow run deletion #2105

rharkor opened this issue May 24, 2025 · 1 comment

Comments

@rharkor
Copy link
Contributor

rharkor commented May 24, 2025

Provide environment information

System:
OS: Linux 6.8 Ubuntu 24.04.2 LTS 24.04.2 LTS (Noble Numbat)
CPU: (4) x64 AMD EPYC-Milan Processor
Memory: 10.10 GB / 15.24 GB
Container: Yes
Shell: 5.2.21 - /bin/bash
Binaries:
Node: 18.19.1 - /usr/bin/node
npm: 9.2.0 - /usr/bin/npm

Describe the bug

I self hosted trigger for a long time. I have t many runs stored in my db so I recently decided to clean the older ones. But my script seems to took a very long time to only delete 1500 runs. When I tried to describe the query plan to delete one row it appears to me that two relations where missing some key indexes.

Find slow foreign keys

postgres=# begin;
explain (analyze,buffers,timing)
delete from "TaskRun" where id='cmawag7nafpp8o0281yebzmdz';
rollback;
BEGIN
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Delete on "TaskRun"  (cost=0.42..8.44 rows=0 width=0) (actual time=3.989..3.990 rows=0 loops=1)
   Buffers: shared hit=7 read=5 dirtied=1
   ->  Index Scan using "TaskRun_pkey" on "TaskRun"  (cost=0.42..8.44 rows=1 width=6) (actual time=1.936..1.938 rows=1 loops=1)
         Index Cond: (id = 'cmawag7nafpp8o0281yebzmdz'::text)
         Buffers: shared hit=2 read=2
 Planning Time: 0.136 ms
 Trigger for constraint TaskRunAttempt_taskRunId_fkey on TaskRun: time=2.418 calls=1
 Trigger for constraint TaskRunDependency_taskRunId_fkey on TaskRun: time=2.381 calls=1
 Trigger for constraint BatchTaskRunItem_taskRunId_fkey on TaskRun: time=2.630 calls=1
 Trigger for constraint Checkpoint_runId_fkey on TaskRun: time=0.861 calls=1
 Trigger for constraint CheckpointRestoreEvent_runId_fkey on TaskRun: time=0.742 calls=1
 Trigger for constraint BulkActionItem_sourceRunId_fkey on TaskRun: time=0.032 calls=1
 Trigger for constraint BulkActionItem_destinationRunId_fkey on TaskRun: time=0.015 calls=1
 Trigger for constraint _TaskRunToTaskRunTag_A_fkey on TaskRun: time=2.354 calls=1
 Trigger for constraint ProjectAlert_taskRunId_fkey on TaskRun: time=0.015 calls=1
 Trigger for constraint TaskRun_rootTaskRunId_fkey on TaskRun: time=1.790 calls=1
 Trigger for constraint TaskRun_parentTaskRunId_fkey on TaskRun: time=1.418 calls=1
 Trigger for constraint TaskRunDependency_dependentAttemptId_fkey on TaskRunAttempt: time=0.022 calls=1
 Trigger for constraint BatchTaskRun_dependentTaskAttemptId_fkey on TaskRunAttempt: time=0.638 calls=1
 Trigger for constraint Checkpoint_attemptId_fkey on TaskRunAttempt: time=0.814 calls=1
 Trigger for constraint BatchTaskRunItem_taskRunAttemptId_fkey on TaskRunAttempt: time=1.489 calls=1
 Trigger for constraint CheckpointRestoreEvent_attemptId_fkey on TaskRunAttempt: time=40.039 calls=1
 Trigger for constraint ProjectAlert_taskRunAttemptId_fkey on TaskRunAttempt: time=0.033 calls=1
 Trigger for constraint TaskRun_parentTaskRunAttemptId_fkey on TaskRunAttempt: time=336.831 calls=1
 Execution Time: 398.553 ms
(25 rows)

ROLLBACK

Fixing the indexes

postgres=# -- does not block writers while it builds
CREATE INDEX CONCURRENTLY IF NOT EXISTS
       "TaskRun_parentTaskRunAttemptId_idx"
    ON "TaskRun" ("parentTaskRunAttemptId");

CREATE INDEX CONCURRENTLY IF NOT EXISTS
       "CheckpointRestoreEvent_attemptId_idx"
    ON "CheckpointRestoreEvent" ("attemptId");
CREATE INDEX
CREATE INDEX
postgres=# begin;                                   
explain (analyze,buffers,timing)
delete from "TaskRun" where id='cmawag7nafpp8o0281yebzmdz';
rollback;
BEGIN
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Delete on "TaskRun"  (cost=0.42..8.44 rows=0 width=0) (actual time=0.132..0.133 rows=0 loops=1)
   Buffers: shared hit=11 dirtied=1
   ->  Index Scan using "TaskRun_pkey" on "TaskRun"  (cost=0.42..8.44 rows=1 width=6) (actual time=0.046..0.047 rows=1 loops=1)
         Index Cond: (id = 'cmawag7nafpp8o0281yebzmdz'::text)
         Buffers: shared hit=4
 Planning:
   Buffers: shared hit=17 read=1
 Planning Time: 0.301 ms
 Trigger for constraint TaskRunAttempt_taskRunId_fkey on TaskRun: time=0.134 calls=1
 Trigger for constraint TaskRunDependency_taskRunId_fkey on TaskRun: time=0.092 calls=1
 Trigger for constraint BatchTaskRunItem_taskRunId_fkey on TaskRun: time=0.087 calls=1
 Trigger for constraint Checkpoint_runId_fkey on TaskRun: time=0.138 calls=1
 Trigger for constraint CheckpointRestoreEvent_runId_fkey on TaskRun: time=0.379 calls=1
 Trigger for constraint BulkActionItem_sourceRunId_fkey on TaskRun: time=0.109 calls=1
 Trigger for constraint BulkActionItem_destinationRunId_fkey on TaskRun: time=0.061 calls=1
 Trigger for constraint _TaskRunToTaskRunTag_A_fkey on TaskRun: time=0.079 calls=1
 Trigger for constraint ProjectAlert_taskRunId_fkey on TaskRun: time=0.050 calls=1
 Trigger for constraint TaskRun_rootTaskRunId_fkey on TaskRun: time=0.189 calls=1
 Trigger for constraint TaskRun_parentTaskRunId_fkey on TaskRun: time=0.245 calls=1
 Trigger for constraint TaskRunDependency_dependentAttemptId_fkey on TaskRunAttempt: time=0.067 calls=1
 Trigger for constraint BatchTaskRun_dependentTaskAttemptId_fkey on TaskRunAttempt: time=0.081 calls=1
 Trigger for constraint Checkpoint_attemptId_fkey on TaskRunAttempt: time=0.059 calls=1
 Trigger for constraint BatchTaskRunItem_taskRunAttemptId_fkey on TaskRunAttempt: time=0.063 calls=1
 Trigger for constraint CheckpointRestoreEvent_attemptId_fkey on TaskRunAttempt: time=0.135 calls=1
 Trigger for constraint ProjectAlert_taskRunAttemptId_fkey on TaskRunAttempt: time=0.043 calls=1
 Trigger for constraint TaskRun_parentTaskRunAttemptId_fkey on TaskRunAttempt: time=0.154 calls=1
 Execution Time: 2.353 ms
(27 rows)

ROLLBACK

It could be usefull to add those by default.

If you are okay with that I could maybe create a pr for this.

Thanks

Reproduction repo

https://trigger.dev/docs/open-source-self-hosting#self-hosting

To reproduce

Just selfhost trigger and try having more than >500k task run in your db.

Additional information

No response

@matt-aitken
Copy link
Member

@rharkor unfortunately we won’t be able to merge this into main. Each index slows insert performance and we already have too many on this table (we’ll be removing some soon when we make Clickhouse the default way we show run lists and metrics).

The real solution for this is to get rid of foreign key constraints completely for TaskRun then there won’t be any cascades on delete. That requires some changes to the application logic to deal with nulls but would be better.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants