A bug about a zombie logical replication slot that cannot be terminated after citus_rebalance_wait() with statement_timeout on version 13.0.1 #7896
The concrete test log with timestamps, after inserting the test data:
citus test: 2025-02-09 21:41:20.00116
NOTICE: Moving shard 102029 from host.docker.internal:5437 to host.docker.internal:5434 ...
--- the following 77 lines are omitted.
NOTICE: Moving shard 102318 from host.docker.internal:5437 to host.docker.internal:5435 ...
SET statement_timeout = '240s';SELECT citus_drain_node('host.docker.internal', 5437);
citus test: 2025-02-09 21:43:08.00850
select * from citus_add_node('host.docker.internal', 5437);
citus test: 2025-02-09 21:43:08.00852
select * from citus_remove_node('host.docker.internal', 5435);
ERROR: cannot remove or disable the node host.docker.internal:5435 because because it contains the only shard placement for shard 102010
DETAIL: One of the table(s) that prevents the operation complete successfully is public.t5
HINT: To proceed, either drop the tables or use undistribute_table() function to convert them to local tables
Loading citus tables...
citus test: 2025-02-09 21:43:08.00918
NOTICE: creating a new table for public.t0
NOTICE: moving the data of public.t0
NOTICE: dropping the old public.t0
NOTICE: renaming the new table to public.t0
SELECT alter_distributed_table('t0', 'c3', shard_count:=100);
Loading citus tables...
No more tables to distribute, in rebalancing.
citus test: 2025-02-09 21:43:09.00395
select * from citus_add_secondary_node('host.docker.internal', 5438, 'host.docker.internal', 5437);
Loading citus tables...
citus test: 2025-02-09 21:43:09.00465
SELECT update_distributed_table_colocation('t3', colocate_with:='t3');
Loading citus tables...
citus test: 2025-02-09 21:43:09.00532
SELECT update_distributed_table_colocation('t3', colocate_with:='t3');
citus test: 2025-02-09 21:43:09.00545
SET statement_timeout = '240s';SELECT citus_drain_node('host.docker.internal', 5437);
Loading citus tables...
citus test: 2025-02-09 21:43:09.00637
SELECT pg_dist_shard.shardid::text, * FROM pg_dist_shard join citus_shards on pg_dist_shard.shardid = citus_shards.shardid where citus_table_type = 'distributed';
citus test: 2025-02-09 21:43:09.00675
SELECT citus_move_shard_placement(102107, 'host.docker.internal', 5435, 'host.docker.internal', 5438);
ERROR: Moving shards to a secondary (e.g., replica) node is not supported
Loading citus tables...
citus test: 2025-02-09 21:43:09.00732
NOTICE: creating a new table for public.t5
NOTICE: moving the data of public.t5
NOTICE: dropping the old public.t5
NOTICE: renaming the new table to public.t5
SELECT undistribute_table('t5');
Loading citus tables...
citus test: 2025-02-09 21:43:09.00988
NOTICE: creating a new table for public.t0
NOTICE: moving the data of public.t0
NOTICE: dropping the old public.t0
NOTICE: renaming the new table to public.t0
SELECT alter_distributed_table('t0', 'c2');
citus test: 2025-02-09 21:43:10.00562
NOTICE: Moving shard 102120 from host.docker.internal:5434 to host.docker.internal:5435 ...
--- the following 164 lines are omitted.
NOTICE: Moving shard 102518 from host.docker.internal:5434 to host.docker.internal:5435 ...
SET statement_timeout = '240s';SELECT citus_drain_node('host.docker.internal', 5434);
Loading citus tables...
citus test: 2025-02-09 21:46:54.00405
SELECT pg_dist_shard.shardid::text, * FROM pg_dist_shard join citus_shards on pg_dist_shard.shardid = citus_shards.shardid where citus_table_type = 'distributed';
citus test: 2025-02-09 21:46:54.00435
SELECT citus_move_shard_placement(102095, 'host.docker.internal', 5435, 'host.docker.internal', 5438);
ERROR: Moving shards to a secondary (e.g., replica) node is not supported
citus test: 2025-02-09 21:46:54.00437
SET statement_timeout = '240s';SELECT citus_drain_node('host.docker.internal', 5435);
ERROR: Shard replication factor (1) cannot be greater than number of nodes with should_have_shards=true (0).
Loading citus tables...
citus test: 2025-02-09 21:46:54.00533
SELECT truncate_local_data_after_distributing_table('t6');
Loading citus tables...
citus test: 2025-02-09 21:46:54.00599
NOTICE: creating a new table for public.t0
NOTICE: moving the data of public.t0
NOTICE: dropping the old public.t0
NOTICE: renaming the new table to public.t0
SELECT undistribute_table('t0');
Loading citus tables...
citus test: 2025-02-09 21:46:54.00942
NOTICE: creating a new table for public.t7
SELECT alter_distributed_table('t7', 'c41');
ERROR: replication_factor (1) exceeds number of worker nodes (0)
HINT: Add more worker nodes or try again with a lower replication factor.
citus test: 2025-02-09 21:46:55.00203
SELECT update_distributed_table_colocation('t3', colocate_with:='t7');
ERROR: cannot colocate tables t7 and t3
DETAIL: Distribution column types don't match for t7 and t3.
Loading citus tables...
citus test: 2025-02-09 21:46:55.00261
SELECT alter_distributed_table('t3', 'pkey', colocate_with:='t3');
ERROR: cannot colocate with t3 and change distribution column to pkey because data type of column pkey is different then the distribution column of the t3
citus test: 2025-02-09 21:46:55.00263
select * from citus_remove_node('host.docker.internal', 5438);
citus test: 2025-02-09 21:46:55.00292
select * from citus_add_secondary_node('host.docker.internal', 5438, 'host.docker.internal', 5435);
Loading citus tables...
citus test: 2025-02-09 21:46:55.00361
SELECT alter_distributed_table('t5', 'pkey', shard_count:=40);
ERROR: cannot alter table because the table is not distributed
citus test: 2025-02-09 21:46:55.00363
select * from citus_add_node('host.docker.internal', 5436);
Loading citus tables...
citus test: 2025-02-09 21:46:55.00523
SELECT update_distributed_table_colocation('t3', colocate_with:='t7');
ERROR: cannot colocate tables t7 and t3
DETAIL: Distribution column types don't match for t7 and t3.
citus test: 2025-02-09 21:46:55.00524
update pg_dist_rebalance_strategy set improvement_threshold = 0.21 where name = 'by_disk_size';
citus test: 2025-02-09 21:46:55.00528
NOTICE: Scheduled 231 moves as job 1
DETAIL: Rebalance scheduled as background job
HINT: To monitor progress, run: SELECT * FROM citus_rebalance_status();
SELECT citus_rebalance_start(rebalance_strategy:='by_disk_size', shard_transfer_mode:='force_logical');
citus test: 2025-02-09 21:46:55.00831
SET statement_timeout = '20s';SELECT citus_rebalance_wait();
ERROR: canceling statement due to statement timeout
(Written by me) --- ^C or PQfinish() here;
citus reset
NOTICE: Citus partially supports CREATE DATABASE for distributed databases
DETAIL: Citus does not propagate CREATE DATABASE command to workers
HINT: You can manually create a database and its extensions on workers.
CREATE EXTENSION IF NOT EXISTS citus;
citus reset
NOTICE: Citus partially supports CREATE DATABASE for distributed databases
DETAIL: Citus does not propagate CREATE DATABASE command to workers
HINT: You can manually create a database and its extensions on workers.
CREATE EXTENSION IF NOT EXISTS citus;
citus reset
ERROR: database "testdb" is used by an active logical replication slot
DETAIL: There is 1 active slot.
retry drop database in host.docker.internal: 5435
citus reset
ERROR: database "testdb" is used by an active logical replication slot
DETAIL: There is 1 active slot.
retry drop database in host.docker.internal: 5435
citus reset
ERROR: database "testdb" is used by an active logical replication slot
DETAIL: There is 1 active slot. |
The log of the master (coordinator) node:
2025-02-09 21:46:55 UTC [57156] ERROR: cannot colocate tables t7 and t3
2025-02-09 21:46:55 UTC [57156] DETAIL: Distribution column types don't match for t7 and t3.
2025-02-09 21:46:55 UTC [57156] STATEMENT: SELECT update_distributed_table_colocation('t3', colocate_with:='t7');
2025-02-09 21:46:55 UTC [57156] ERROR: cannot colocate with t3 and change distribution column to pkey because data type of column pkey is different then the distribution column of the t3
2025-02-09 21:46:55 UTC [57156] STATEMENT: SELECT alter_distributed_table('t3', 'pkey', colocate_with:='t3');
2025-02-09 21:46:55 UTC [57156] ERROR: cannot alter table because the table is not distributed
2025-02-09 21:46:55 UTC [57156] STATEMENT: SELECT alter_distributed_table('t5', 'pkey', shard_count:=40);
2025-02-09 21:46:55 UTC [57156] ERROR: cannot colocate tables t7 and t3
2025-02-09 21:46:55 UTC [57156] DETAIL: Distribution column types don't match for t7 and t3.
2025-02-09 21:46:55 UTC [57156] STATEMENT: SELECT update_distributed_table_colocation('t3', colocate_with:='t7');
2025-02-09 21:46:55 UTC [57165] LOG: deferred drop of orphaned resource public.t0_102518 on host.docker.internal:5434 completed
2025-02-09 21:46:55 UTC [57165] STATEMENT: CALL citus_cleanup_orphaned_resources()
2025-02-09 21:47:01 UTC [55574] LOG: found scheduled background tasks, starting new background task queue monitor
2025-02-09 21:47:01 UTC [55574] CONTEXT: Citus maintenance daemon for database 5023555 user 10
2025-02-09 21:47:01 UTC [57181] LOG: task jobid/taskid started: 1/1
2025-02-09 21:47:01 UTC [57181] CONTEXT: Citus Background Task Queue Monitor: testdb
2025-02-09 21:47:01 UTC [57187] LOG: logical decoding found consistent point at 8/A6879488
2025-02-09 21:47:01 UTC [57187] DETAIL: There are no running transactions.
2025-02-09 21:47:01 UTC [57187] STATEMENT: CREATE_REPLICATION_SLOT citus_shard_move_slot_9_10_248 LOGICAL pgoutput EXPORT_SNAPSHOT;
2025-02-09 21:47:01 UTC [57187] LOG: exported logical decoding snapshot: "00000076-00000297-1" with 0 transaction IDs
2025-02-09 21:47:01 UTC [57187] STATEMENT: CREATE_REPLICATION_SLOT citus_shard_move_slot_9_10_248 LOGICAL pgoutput EXPORT_SNAPSHOT;
2025-02-09 21:47:01 UTC [57191] LOG: starting logical decoding for slot "citus_shard_move_slot_9_10_248"
2025-02-09 21:47:01 UTC [57191] DETAIL: Streaming transactions committing after 8/A68794C0, reading WAL from 8/A6879488.
2025-02-09 21:47:01 UTC [57191] STATEMENT: START_REPLICATION SLOT "citus_shard_move_slot_9_10_248" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names '"citus_shard_move_publication_9_10_248"', binary 'true')
2025-02-09 21:47:01 UTC [57191] LOG: logical decoding found consistent point at 8/A6879488
2025-02-09 21:47:01 UTC [57191] DETAIL: There are no running transactions.
2025-02-09 21:47:01 UTC [57191] STATEMENT: START_REPLICATION SLOT "citus_shard_move_slot_9_10_248" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names '"citus_shard_move_publication_9_10_248"', binary 'true')
2025-02-09 21:47:02 UTC [57183] LOG: The LSN of the target subscriptions on node host.docker.internal:5436 have caught up with the source LSN
2025-02-09 21:47:02 UTC [57183] STATEMENT: SELECT pg_catalog.citus_copy_shard_placement(102008, 6, 9, transfer_mode := 'force_logical')
2025-02-09 21:47:02 UTC [57183] LOG: The LSN of the target subscriptions on node host.docker.internal:5436 have caught up with the source LSN
2025-02-09 21:47:02 UTC [57183] STATEMENT: SELECT pg_catalog.citus_copy_shard_placement(102008, 6, 9, transfer_mode := 'force_logical')
2025-02-09 21:47:02 UTC [57183] LOG: The LSN of the target subscriptions on node host.docker.internal:5436 have caught up with the source LSN
2025-02-09 21:47:02 UTC [57183] STATEMENT: SELECT pg_catalog.citus_copy_shard_placement(102008, 6, 9, transfer_mode := 'force_logical')
2025-02-09 21:47:02 UTC [57181] LOG: task jobid/taskid succeeded: 1/1
2025-02-09 21:47:02 UTC [57181] CONTEXT: Citus Background Task Queue Monitor: testdb
2025-02-09 21:47:02 UTC [57181] LOG: task jobid/taskid started: 1/2
2025-02-09 21:47:02 UTC [57181] CONTEXT: Citus Background Task Queue Monitor: testdb
2025-02-09 21:47:03 UTC [57192] LOG: The LSN of the target subscriptions on node host.docker.internal:5436 have caught up with the source LSN
2025-02-09 21:47:03 UTC [57192] CONTEXT: Citus Background Task Queue Executor: testdb/postgres for (1/2)
2025-02-09 21:47:03 UTC [57192] LOG: The LSN of the target subscriptions on node host.docker.internal:5436 have caught up with the source LSN
2025-02-09 21:47:03 UTC [57192] CONTEXT: Citus Background Task Queue Executor: testdb/postgres for (1/2)
2025-02-09 21:47:03 UTC [57192] LOG: The LSN of the target subscriptions on node host.docker.internal:5436 have caught up with the source LSN
2025-02-09 21:47:03 UTC [57192] CONTEXT: Citus Background Task Queue Executor: testdb/postgres for (1/2)
2025-02-09 21:47:03 UTC [57181] LOG: task jobid/taskid succeeded: 1/2
2025-02-09 21:47:03 UTC [57181] CONTEXT: Citus Background Task Queue Monitor: testdb
2025-02-09 21:47:03 UTC [57181] LOG: task jobid/taskid started: 1/3
--- the following 1034 lines are omitted; they are all about task jobs, and all succeeded.
2025-02-09 21:48:58 UTC [57181] LOG: task jobid/taskid started: 1/89
2025-02-09 21:48:58 UTC [57181] CONTEXT: Citus Background Task Queue Monitor: testdb
2025-02-09 21:48:59 UTC [57662] LOG: deferred drop of orphaned resource public.t2_102219 on host.docker.internal:5435 completed
2025-02-09 21:48:59 UTC [57662] STATEMENT: CALL citus_cleanup_orphaned_resources()
2025-02-09 21:49:00 UTC [57661] LOG: The LSN of the target subscriptions on node host.docker.internal:5436 have caught up with the source LSN
2025-02-09 21:49:00 UTC [57661] CONTEXT: Citus Background Task Queue Executor: testdb/postgres for (1/89)
2025-02-09 21:49:00 UTC [57661] LOG: The LSN of the target subscriptions on node host.docker.internal:5436 have caught up with the source LSN
2025-02-09 21:49:00 UTC [57661] CONTEXT: Citus Background Task Queue Executor: testdb/postgres for (1/89)
2025-02-09 21:49:00 UTC [57661] LOG: The LSN of the target subscriptions on node host.docker.internal:5436 have caught up with the source LSN
2025-02-09 21:49:00 UTC [57661] CONTEXT: Citus Background Task Queue Executor: testdb/postgres for (1/89)
2025-02-09 21:49:00 UTC [57181] LOG: task jobid/taskid succeeded: 1/89
2025-02-09 21:49:00 UTC [57181] CONTEXT: Citus Background Task Queue Monitor: testdb
2025-02-09 21:49:00 UTC [57181] LOG: task jobid/taskid started: 1/90
2025-02-09 21:49:00 UTC [57181] CONTEXT: Citus Background Task Queue Monitor: testdb
2025-02-09 21:49:00 UTC [57666] LOG: deferred drop of orphaned resource public.t2_102220 on host.docker.internal:5435 completed
2025-02-09 21:49:00 UTC [57666] STATEMENT: CALL citus_cleanup_orphaned_resources()
2025-02-09 21:49:01 UTC [57665] LOG: The LSN of the target subscriptions on node host.docker.internal:5436 have caught up with the source LSN
2025-02-09 21:49:01 UTC [57665] CONTEXT: Citus Background Task Queue Executor: testdb/postgres for (1/90)
2025-02-09 21:49:01 UTC [57665] LOG: The LSN of the target subscriptions on node host.docker.internal:5436 have caught up with the source LSN
2025-02-09 21:49:01 UTC [57665] CONTEXT: Citus Background Task Queue Executor: testdb/postgres for (1/90)
2025-02-09 21:49:01 UTC [57665] LOG: The LSN of the target subscriptions on node host.docker.internal:5436 have caught up with the source LSN
2025-02-09 21:49:01 UTC [57665] CONTEXT: Citus Background Task Queue Executor: testdb/postgres for (1/90)
2025-02-09 21:49:01 UTC [57181] LOG: task jobid/taskid succeeded: 1/90
2025-02-09 21:49:01 UTC [57181] CONTEXT: Citus Background Task Queue Monitor: testdb
2025-02-09 21:49:01 UTC [57181] LOG: task jobid/taskid started: 1/91
2025-02-09 21:49:01 UTC [57181] CONTEXT: Citus Background Task Queue Monitor: testdb
2025-02-09 21:49:01 UTC [57676] LOG: deferred drop of orphaned resource public.t2_102221 on host.docker.internal:5435 completed
2025-02-09 21:49:01 UTC [57676] STATEMENT: CALL citus_cleanup_orphaned_resources()
2025-02-09 21:49:02 UTC [55973] FATAL: terminating connection due to administrator command
2025-02-09 21:49:02 UTC [55594] FATAL: terminating connection due to administrator command
2025-02-09 21:49:02 UTC [55587] FATAL: terminating connection due to administrator command
2025-02-09 21:49:02 UTC [57181] LOG: handling termination signal
2025-02-09 21:49:02 UTC [57181] CONTEXT: Citus Background Task Queue Monitor: testdb
2025-02-09 21:49:02 UTC [57675] FATAL: terminating background worker "Citus Background Task Queue Executor: testdb/postgres for (1/91)" due to administrator command
2025-02-09 21:49:02 UTC [57675] CONTEXT: Citus Background Task Queue Executor: testdb/postgres for (1/91)
2025-02-09 21:49:02 UTC [55589] FATAL: terminating connection due to administrator command
2025-02-09 21:49:02 UTC [57156] FATAL: terminating connection due to administrator command
2025-02-09 21:49:02 UTC [57156] STATEMENT: SET statement_timeout = '240s';SELECT citus_rebalance_wait();
2025-02-09 21:49:02 UTC [57181] LOG: task jobid/taskid failed: 1/91
2025-02-09 21:49:02 UTC [57181] CONTEXT: Citus Background Task Queue Monitor: testdb
2025-02-09 21:49:02 UTC [55593] FATAL: terminating connection due to administrator command
2025-02-09 21:49:02 UTC [55591] FATAL: terminating connection due to administrator command
2025-02-09 21:49:02 UTC [1] LOG: background worker "Citus Background Task Queue Executor: testdb/postgres for (1/91)" (PID 57675) exited with exit code 1
2025-02-09 21:49:02 UTC [57677] FATAL: terminating connection due to administrator command
2025-02-09 21:49:03 UTC [57680] LOG: PID 55589 in cancel request did not match any process
2025-02-09 21:49:03 UTC [57681] LOG: PID 55587 in cancel request did not match any process
2025-02-09 21:49:03 UTC [57682] LOG: PID 55591 in cancel request did not match any process
2025-02-09 21:49:03 UTC [27] LOG: checkpoint starting: immediate force wait
2025-02-09 21:49:03 UTC [27] LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.007 s, sync=0.004 s, total=0.024 s; sync files=2, longest=0.003 s, average=0.002 s; distance=1336 kB, estimate=7817 kB; lsn=8/A69012C0, redo lsn=8/A6901268
2025-02-09 21:49:03 UTC [57684] LOG: PID 55594 in cancel request did not match any process
2025-02-09 21:49:03 UTC [57685] LOG: PID 55593 in cancel request did not match any process
2025-02-09 21:49:05 UTC [57687] LOG: starting maintenance daemon on database 5024507 user 10
2025-02-09 21:49:05 UTC [57687] CONTEXT: Citus maintenance daemon for database 5024507 user 10
2025-02-09 21:49:23 UTC [57683] LOG: could not receive data from client: Connection reset by peer
2025-02-09 21:54:03 UTC [27] LOG: checkpoint starting: time
2025-02-09 21:55:55 UTC [27] LOG: checkpoint complete: wrote 1117 buffers (6.8%); 0 WAL file(s) added, 0 removed, 0 recycled; write=111.661 s, sync=0.075 s, total=111.757 s; sync files=400, longest=0.003 s, average=0.001 s; distance=6806 kB, estimate=7716 kB; lsn=8/A6FA6CC8, redo lsn=8/A6FA6C38
The log of the worker node:
2025-02-09 21:47:02 UTC [70304] LOG: logical decoding found consistent point at 9/C5A55C38
2025-02-09 21:47:02 UTC [70304] DETAIL: There are no running transactions.
2025-02-09 21:47:02 UTC [70304] STATEMENT: CREATE_REPLICATION_SLOT citus_shard_move_slot_9_10_249 LOGICAL pgoutput EXPORT_SNAPSHOT;
2025-02-09 21:47:02 UTC [70304] LOG: exported logical decoding snapshot: "00000073-00001961-1" with 0 transaction IDs
2025-02-09 21:47:02 UTC [70304] STATEMENT: CREATE_REPLICATION_SLOT citus_shard_move_slot_9_10_249 LOGICAL pgoutput EXPORT_SNAPSHOT;
2025-02-09 21:47:02 UTC [70306] LOG: starting logical decoding for slot "citus_shard_move_slot_9_10_249"
2025-02-09 21:47:02 UTC [70306] DETAIL: Streaming transactions committing after 9/C5A55C70, reading WAL from 9/C5A55C38.
2025-02-09 21:47:02 UTC [70306] STATEMENT: START_REPLICATION SLOT "citus_shard_move_slot_9_10_249" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names '"citus_shard_move_publication_9_10_249"', binary 'true')
2025-02-09 21:47:02 UTC [70306] LOG: logical decoding found consistent point at 9/C5A55C38
2025-02-09 21:47:02 UTC [70306] DETAIL: There are no running transactions.
2025-02-09 21:47:02 UTC [70306] STATEMENT: START_REPLICATION SLOT "citus_shard_move_slot_9_10_249" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names '"citus_shard_move_publication_9_10_249"', binary 'true')
2025-02-09 21:47:04 UTC [70318] LOG: logical decoding found consistent point at 9/C5A6A288
--- the following 960 lines are omitted; they are all about replication slots, and all succeeded.
2025-02-09 21:49:00 UTC [71134] STATEMENT: CREATE_REPLICATION_SLOT citus_shard_move_slot_9_10_337 LOGICAL pgoutput EXPORT_SNAPSHOT;
2025-02-09 21:49:00 UTC [71134] LOG: exported logical decoding snapshot: "00000075-0000198D-1" with 0 transaction IDs
2025-02-09 21:49:00 UTC [71134] STATEMENT: CREATE_REPLICATION_SLOT citus_shard_move_slot_9_10_337 LOGICAL pgoutput EXPORT_SNAPSHOT;
2025-02-09 21:49:00 UTC [71136] LOG: starting logical decoding for slot "citus_shard_move_slot_9_10_337"
2025-02-09 21:49:00 UTC [71136] DETAIL: Streaming transactions committing after 9/C5C50EE8, reading WAL from 9/C5C50EB0.
2025-02-09 21:49:00 UTC [71136] STATEMENT: START_REPLICATION SLOT "citus_shard_move_slot_9_10_337" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names '"citus_shard_move_publication_9_10_337"', binary 'true')
2025-02-09 21:49:00 UTC [71136] LOG: logical decoding found consistent point at 9/C5C50EB0
2025-02-09 21:49:00 UTC [71136] DETAIL: There are no running transactions.
2025-02-09 21:49:00 UTC [71136] STATEMENT: START_REPLICATION SLOT "citus_shard_move_slot_9_10_337" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names '"citus_shard_move_publication_9_10_337"', binary 'true')
2025-02-09 21:49:01 UTC [71141] LOG: logical decoding found consistent point at 9/C5C53548
2025-02-09 21:49:01 UTC [71141] DETAIL: There are no running transactions.
2025-02-09 21:49:01 UTC [71141] STATEMENT: CREATE_REPLICATION_SLOT citus_shard_move_slot_9_10_338 LOGICAL pgoutput EXPORT_SNAPSHOT;
2025-02-09 21:49:01 UTC [71141] LOG: exported logical decoding snapshot: "00000078-000019D6-1" with 0 transaction IDs
2025-02-09 21:49:01 UTC [71141] STATEMENT: CREATE_REPLICATION_SLOT citus_shard_move_slot_9_10_338 LOGICAL pgoutput EXPORT_SNAPSHOT;
2025-02-09 21:49:01 UTC [71143] LOG: starting logical decoding for slot "citus_shard_move_slot_9_10_338"
2025-02-09 21:49:01 UTC [71143] DETAIL: Streaming transactions committing after 9/C5C53580, reading WAL from 9/C5C53548.
2025-02-09 21:49:01 UTC [71143] STATEMENT: START_REPLICATION SLOT "citus_shard_move_slot_9_10_338" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names '"citus_shard_move_publication_9_10_338"', binary 'true')
2025-02-09 21:49:01 UTC [71143] LOG: logical decoding found consistent point at 9/C5C53548
2025-02-09 21:49:01 UTC [71143] DETAIL: There are no running transactions.
2025-02-09 21:49:01 UTC [71143] STATEMENT: START_REPLICATION SLOT "citus_shard_move_slot_9_10_338" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names '"citus_shard_move_publication_9_10_338"', binary 'true')
2025-02-09 21:49:03 UTC [68165] WARNING: terminating connection due to administrator command
2025-02-09 21:49:03 UTC [68165] CONTEXT: while executing command on host.docker.internal:5433
2025-02-09 21:49:03 Citus maintenance daemon for database 6287007 user 10
2025-02-09 21:49:04 UTC [71152] ERROR: database "testdb" is used by an active logical replication slot
2025-02-09 21:49:04 UTC [71152] DETAIL: There is 1 active slot.
2025-02-09 21:49:04 UTC [71152] STATEMENT: drop database if exists testdb with (force);
2025-02-09 21:49:10 UTC [71160] ERROR: database "testdb" is used by an active logical replication slot
2025-02-09 21:49:10 UTC [71160] DETAIL: There is 1 active slot.
2025-02-09 21:49:10 UTC [71160] STATEMENT: drop database if exists testdb with (force);
2025-02-09 21:49:16 UTC [71175] ERROR: database "testdb" is used by an active logical replication slot
2025-02-09 21:49:16 UTC [71175] DETAIL: There is 1 active slot.
2025-02-09 21:49:16 UTC [71175] STATEMENT: drop database if exists testdb with (force);
2025-02-09 21:49:22 UTC [71183] ERROR: database "testdb" is used by an active logical replication slot
2025-02-09 21:49:22 UTC [71183] DETAIL: There is 1 active slot.
2025-02-09 21:49:22 UTC [71183] STATEMENT: drop database if exists testdb with (force);
2025-02-09 21:49:23 UTC [71183] LOG: could not receive data from client: Connection reset by peer
2025-02-09 21:50:15 UTC [71279] LOG: starting maintenance daemon on database 6287007 user 10
2025-02-09 21:50:15 UTC [71279] CONTEXT: Citus maintenance daemon for database 6287007 user 10
2025-02-09 21:50:34 UTC [71320] ERROR: database "testdb" is used by an active logical replication slot
2025-02-09 21:50:34 UTC [71320] DETAIL: There is 1 active slot.
Symptom
Citus version:
Citus 13.0.1 on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
Postgres version:
PostgreSQL 17.2 (Debian 17.2-1.pgdg120+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
This bug is related to SELECT citus_rebalance_wait(). Based on the logs generated by our testing tool, the issue arises because we pre-set a timeout for every statement using SET statement_timeout. Once citus_rebalance_wait() hits this timeout, it fails with ERROR: canceling statement due to statement timeout, and we then close the session. The same situation can be reproduced by manually pressing Ctrl+C to cancel the query and closing the session early, or by calling PQfinish() from C code. I suspect that at this point a "zombie" logical replication slot, created by Citus but no longer removable by normal means, is left behind on the worker node.
After reconnecting, we can still query the coordinator normally and even run DROP DATABASE ... WITH (FORCE), successfully dropping the database on the coordinator. However, because of the zombie logical replication slot on the worker node, the database cannot be dropped there. We could not find any normal way to recover the worker node from the zombie logical replication slot.
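For reference, the SQL below sketches how the leftover slot could be inspected, and an attempt made to remove it, directly on the affected worker. This is only a sketch, not a verified recovery procedure: pg_replication_slots, pg_terminate_backend() and pg_drop_replication_slot() are standard PostgreSQL, but the slot-name pattern is an assumption taken from the worker log (citus_shard_move_slot_*), and as the repeated "There is 1 active slot" errors above show, in our case the slot keeps being reported as active.

-- Run while connected to testdb on the affected worker (port 5435 in the logs above).
-- Assumption: the leftover slot follows the citus_shard_move_slot_* naming seen in the worker log.
SELECT slot_name, database, active, active_pid
FROM pg_replication_slots
WHERE slot_name LIKE 'citus_shard_move_slot_%';

-- If a slot is still marked active, terminate the walsender process that holds it ...
SELECT pg_terminate_backend(active_pid)
FROM pg_replication_slots
WHERE slot_name LIKE 'citus_shard_move_slot_%' AND active;

-- ... and then drop the slot itself, which should unblock DROP DATABASE on the worker.
SELECT pg_drop_replication_slot(slot_name)
FROM pg_replication_slots
WHERE slot_name LIKE 'citus_shard_move_slot_%' AND NOT active;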
How to reproduce
I believe this is a potential robustness-related bug. It leaves the cluster in a partially dropped state, where the coordinator sees the database as gone, but the worker still has remnants. Since this bug was discovered by an automated testing tool, I will provide logs from multiple occurrences of this issue, including logs from the testing tool, the master node, and the affected worker node. (Continuous updates will follow.)
The test environment is based on Docker; every Citus node is reachable at host.docker.internal. The coordinator (master) node runs on port 5433, and five worker nodes run on ports 5434-5438.
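For context, a cluster roughly matching that environment could be registered from the coordinator as sketched below. This is an approximation, not the tester's exact setup script (in the actual run, for example, port 5438 is later re-registered as a secondary of another worker via citus_add_secondary_node, as the test log shows).

-- Run on the coordinator (port 5433); host and ports follow the description above.
SELECT citus_set_coordinator_host('host.docker.internal', 5433);
SELECT citus_add_node('host.docker.internal', 5434);
SELECT citus_add_node('host.docker.internal', 5435);
SELECT citus_add_node('host.docker.internal', 5436);
SELECT citus_add_node('host.docker.internal', 5437);
SELECT citus_add_node('host.docker.internal', 5438);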
Action list produced by the tester: see the test log at the top of this issue. A condensed manual reproduction sketch follows below.
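The sketch is distilled from the action list above; the distributed tables, shard counts, and the exact sequence of moves are whatever the tester generated. The essential ingredients are force_logical shard moves scheduled by the rebalancer, a citus_rebalance_wait() that hits statement_timeout, and a session that is then closed.

-- On the coordinator, with several distributed tables already created and loaded:
UPDATE pg_dist_rebalance_strategy SET improvement_threshold = 0.21 WHERE name = 'by_disk_size';
SELECT citus_rebalance_start(rebalance_strategy := 'by_disk_size',
                             shard_transfer_mode := 'force_logical');
SET statement_timeout = '20s';
SELECT citus_rebalance_wait();
-- ERROR: canceling statement due to statement timeout
-- Now cancel with Ctrl+C (or call PQfinish()) and close the session.
-- Afterwards, DROP DATABASE testdb WITH (FORCE) succeeds on the coordinator, but on the
-- worker it keeps failing with: database "testdb" is used by an active logical replication slot.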