A bug about a zombie logical replication slot that cannot be terminated after citus_rebalance_wait() with statement_timeout on version 13.0.1 #7896

Open
duerwuyi opened this issue Feb 10, 2025 · 3 comments



duerwuyi commented Feb 10, 2025

Symptom

Citus version: Citus 13.0.1 on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
Postgres version: PostgreSQL 17.2 (Debian 17.2-1.pgdg120+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit

This bug is related to SELECT citus_rebalance_wait(). Based on the logs generated by our testing tool, I found that the issue arises because we pre-set a timeout for each statement using SET statement_timeout. Once citus_rebalance_wait() hits this timeout, it raises ERROR: canceling statement due to statement timeout, and we then close the session. The same effect can be reproduced by manually pressing Ctrl+C to cancel the query and closing the session early, or by calling PQfinish() in C code. I suspect that at this point a “zombie” logical replication slot, created by Citus but not properly removable, is left behind on the worker node.
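
A condensed sketch of the triggering sequence, as I understand it (the full action list produced by our tester appears under "How to reproduce" below):

-- Condensed repro sketch; assumes the cluster and tables from the action list below already exist.
SELECT citus_rebalance_start(rebalance_strategy:='by_disk_size', shard_transfer_mode:='force_logical');
SET statement_timeout = '240s';
SELECT citus_rebalance_wait();
-- ERROR:  canceling statement due to statement timeout
-- The session is then closed (Ctrl+C in psql, or PQfinish() from a libpq client)
-- while the background shard moves are still creating replication slots on the workers.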

After reconnecting, we can still query the coordinator normally and can even run DROP DATABASE WITH (FORCE), successfully dropping the database on the coordinator. However, because of the zombie logical replication slot on the worker node, the database cannot be dropped there.

DROP DATABASE testdb;
ERROR:  database "testdb" is used by an active logical replication slot
DETAIL:  There is 1 active slot.

SELECT * FROM pg_replication_slots;
slot_name            |  plugin  | slot_type | datoid  | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase | inactive_since | conflicting | invalidation_reason | failover | synced 
--------------------------------+----------+-----------+---------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+-----------+----------------+-------------+---------------------+----------+--------
 citus_shard_move_slot_9_10_338 | pgoutput | logical   | 6287007 | testdb   | f         | t      |      71143 |      |       328096 | 9/C5CEE298  | 9/C5CEE2D0          | reserved   |               | f         |                | f           |                     | f        | f
(1 row)


SELECT * FROM pg_drop_replication_slot('citus_shard_move_slot_9_10_338');
ERROR: replication slot "citus_shard_move_slot_9_10_338" is active for PID 71143

SELECT pg_terminate_backend(71143);
 pg_terminate_backend
----------------------
 true
(1 row)

SELECT * FROM pg_drop_replication_slot('citus_shard_move_slot_9_10_338');
ERROR: replication slot "citus_shard_move_slot_9_10_338" is active for PID 80178

I could not find any normal way to recover the worker node from the zombie logical replication slot.
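
As a diagnostic (my own addition, not part of the tester's output), a query like the following can be run on the affected worker to see which backend currently holds the slot; as shown above, terminating the reported PID (71143) only resulted in a new walsender (80178) taking over the slot:

-- On the affected worker: which backend is currently holding the zombie slot?
SELECT s.slot_name, s.active_pid, a.backend_type, a.application_name, a.state
FROM pg_replication_slots AS s
LEFT JOIN pg_stat_activity AS a ON a.pid = s.active_pid
WHERE s.slot_name LIKE 'citus_shard_move_slot%';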

How to reproduce

I believe this is a potential robustness-related bug. It leaves the cluster in a partially dropped state, where the coordinator sees the database as gone but the worker still has remnants. Since this bug was discovered by an automated testing tool, I will provide logs from multiple occurrences of the issue, including logs from the testing tool, the master node, and the affected worker node. (I will keep updating this issue.)

The test environment is based on Docker, where each Citus node is reachable via host.docker.internal. The master node runs on port 5433, while the five worker nodes are hosted on ports 5434-5438.
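
For completeness, the node topology can be checked from the coordinator with a query along these lines (pg_dist_node is the Citus metadata table; citus_get_active_worker_nodes() in the action list below serves a similar purpose):

-- Optional sanity check on the coordinator: list the nodes Citus knows about.
SELECT nodename, nodeport, noderole, isactive
FROM pg_dist_node
ORDER BY nodeport;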

Action list produced by the tester:

SET statement_timeout = '6s';
SET citus.shard_replication_factor TO 1; SET citus.enable_repartition_joins to ON;
CREATE EXTENSION IF NOT EXISTS citus;
;
select * from citus_add_node('host.docker.internal', 5434);
;
select * from citus_add_node('host.docker.internal', 5435);
;
select * from citus_add_node('host.docker.internal', 5436);
;
select * from citus_add_node('host.docker.internal', 5437);
;
select * from citus_add_node('host.docker.internal', 5438);
;
SELECT citus_set_coordinator_host('host.docker.internal', '5433');
;
select * from citus_remove_node('host.docker.internal', 5438);
;
select * from citus_remove_node('host.docker.internal', 5436);
;
select * from citus_get_active_worker_nodes();
;
DROP TABLE IF EXISTS t0;
create table t0 ( 
vkey int4 ,
pkey int4 ,
c0 numeric ,
c1 int4 ,
c2 int4 ,
c3 int4 ,
c4 int4 ,
c5 numeric ,
c6 text ,
c7 int4 ,
c8 numeric 

);
;
DROP TABLE IF EXISTS t1;
create table t1 ( 
vkey int4 ,
pkey int4 ,
c9 text ,
c10 timestamp ,
c11 int4 ,
c12 timestamp ,
c13 text ,
c14 numeric ,
c15 text ,
c16 timestamp ,
c17 timestamp 

);
;
DROP TABLE IF EXISTS t2;
create table t2 ( 
vkey int4 ,
pkey int4 ,
c18 text ,
c19 int4 

);
;
DROP TABLE IF EXISTS t3;
create table t3 ( 
vkey int4 ,
pkey int4 ,
c20 text ,
c21 int4 ,
c22 int4 ,
c23 numeric ,
c24 int4 ,
c25 timestamp 

);
;
DROP TABLE IF EXISTS t4;
create table t4 ( 
vkey int4 ,
pkey int4 ,
c26 text 

);
;
DROP TABLE IF EXISTS t5;
create table t5 ( 
vkey int4 ,
pkey int4 ,
c27 text ,
c28 numeric ,
c29 numeric ,
c30 text ,
c31 int4 

);
;
DROP TABLE IF EXISTS t6;
create table t6 ( 
vkey int4 ,
pkey int4 ,
c32 numeric ,
c33 timestamp ,
c34 timestamp ,
c35 int4 ,
c36 int4 ,
c37 numeric ,
c38 timestamp ,
c39 text ,
c40 timestamp 

);
;
DROP TABLE IF EXISTS t7;
create table t7 ( 
vkey int4 ,
pkey int4 ,
c41 text ,
c42 numeric ,
c43 text 

);
;
;
SELECT create_reference_table('t6');
ALTER TABLE t6 REPLICA IDENTITY FULL;  
SELECT create_distributed_table('t5', 'c30', shard_count:=64);
ALTER TABLE t5 ADD CONSTRAINT t5_pkey PRIMARY KEY (c30);  
SELECT create_distributed_table('t2', 'pkey', colocate_with:='t5');  
SELECT create_distributed_table('t3', 'c25', shard_count:=44);
ALTER TABLE t3 ADD CONSTRAINT t3_pkey PRIMARY KEY (c25);  
SELECT create_distributed_table('t1', 'c17');
ALTER TABLE t1 ADD CONSTRAINT t1_pkey PRIMARY KEY (c17);  
SELECT create_distributed_table('t7', 'vkey', colocate_with:='t3');  
SELECT create_distributed_table('t7', 'vkey', colocate_with:='t1');  
SELECT create_reference_table('t4');
ALTER TABLE t4 REPLICA IDENTITY FULL;  
SELECT create_distributed_table('t0', 'vkey', shard_count:=15);
ALTER TABLE t0 REPLICA IDENTITY FULL;  
SELECT create_distributed_table('t2', 'pkey', colocate_with:='t5');  
SELECT create_distributed_table('t2', 'c18', colocate_with:='t5');
ALTER TABLE t2 REPLICA IDENTITY FULL;  
SELECT create_distributed_table('t7', 'vkey', shard_count:=91);
ALTER TABLE t7 ADD CONSTRAINT t7_pkey PRIMARY KEY (vkey);  
SELECT * FROM citus_tables;
insert into t5 (vkey, pkey, c27, c28, c29, c30, c31) values 
(1, 11000, '', 0.0, 33.75, '', 0);
insert into t5 (vkey, pkey, c27, c28, c29, c30, c31) values 
(2, 12000, 'c>A', -70.18, -53.21, 'W>G', -0);
insert into t7 (vkey, pkey, c41, c42, c43) values 
(3, 13000, '-', 0.0, '#]');
insert into t4 (vkey, pkey, c26) values 
(4, 14000, 'R/Y,{');
insert into t7 (vkey, pkey, c41, c42, c43) values 
(5, 15000, 'S', -0.0, 'QYY8b');
insert into t6 (vkey, pkey, c32, c33, c34, c35, c36, c37, c38, c39, c40) values 
(6, 16000, 0.0, make_timestamp(2060, 11, 19, 5, 46, 4), make_timestamp(2011, 5, 22, 2, 42, 36), -27, 81, 0.0, make_timestamp(2083, 3, 20, 9, 9, 9), 'X', make_timestamp(2071, 3, 20, 5, 21, 33));
insert into t4 (vkey, pkey, c26) values 
(7, 17000, '3r');
insert into t1 (vkey, pkey, c9, c10, c11, c12, c13, c14, c15, c16, c17) values 
(8, 18000, '#(Q5', make_timestamp(2064, 12, 7, 11, 2, 43), -32, make_timestamp(2054, 1, 28, 23, 57, 20), 'KE`', -0.0, '.P', make_timestamp(2101, 6, 13, 22, 24, 49), make_timestamp(2052, 4, 25, 4, 45, 0));
insert into t5 (vkey, pkey, c27, c28, c29, c30, c31) values 
(9, 19000, 'ud9', 63.35, -0.0, 'R?ZJA', 79);
insert into t3 (vkey, pkey, c20, c21, c22, c23, c24, c25) values 
(10, 20000, '0', -66, 37, -95.92, 19, make_timestamp(2051, 9, 13, 20, 54, 13));
insert into t0 (vkey, pkey, c0, c1, c2, c3, c4, c5, c6, c7, c8) values
(11, 21000, -24.44, -18, -0, -37, 65, -62.3, 'X}M', -9, 28.18);
insert into t6 (vkey, pkey, c32, c33, c34, c35, c36, c37, c38, c39, c40) values
(12, 22000, -53.69, make_timestamp(2033, 2, 16, 9, 18, 28), make_timestamp(2104, 9, 3, 21, 10, 33), 0, 4, 57.24, make_timestamp(2033, 8, 2, 22, 51, 25), 'P', make_timestamp(2091, 7, 5, 9, 42, 4));
insert into t6 (vkey, pkey, c32, c33, c34, c35, c36, c37, c38, c39, c40) values
(13, 23000, 0.0, make_timestamp(2101, 7, 1, 3, 50, 2), make_timestamp(2049, 3, 22, 8, 33, 29), 71, 88, -41.11, make_timestamp(2072, 7, 6, 22, 46, 18), 'B!', make_timestamp(2093, 11, 25, 3, 11, 32));
insert into t5 (vkey, pkey, c27, c28, c29, c30, c31) values
(14, 24000, 'J|S', -77.8, -0.0, 'Gd7h', 29);
insert into t5 (vkey, pkey, c27, c28, c29, c30, c31) values
(15, 25000, '`N-2>', 1.5, 0.0, 'P', -84);
insert into t4 (vkey, pkey, c26) values
(16, 26000, '');
insert into t6 (vkey, pkey, c32, c33, c34, c35, c36, c37, c38, c39, c40) values
(17, 27000, -76.20, make_timestamp(2090, 12, 27, 19, 34, 12), make_timestamp(1990, 11, 5, 14, 3, 37), -0, 25, 2.29, make_timestamp(2072, 2, 24, 20, 24, 27), 'Uc', make_timestamp(1984, 8, 18, 17, 25, 4));
insert into t5 (vkey, pkey, c27, c28, c29, c30, c31) values
(18, 28000, '[', -0.0, -0.0, '', 7);
insert into t4 (vkey, pkey, c26) values
(19, 29000, '$7Fj');
insert into t4 (vkey, pkey, c26) values
(20, 30000, '7}?E');
insert into t4 (vkey, pkey, c26) values
(21, 31000, '8]u');
insert into t3 (vkey, pkey, c20, c21, c22, c23, c24, c25) values
(22, 32000, '', -82, 57, 0.0, -0, make_timestamp(1981, 10, 21, 14, 11, 40));
insert into t4 (vkey, pkey, c26) values
(23, 33000, 'EXM1');
insert into t7 (vkey, pkey, c41, c42, c43) values
(24, 34000, 'yZi*', 0.0, 'i');
insert into t0 (vkey, pkey, c0, c1, c2, c3, c4, c5, c6, c7, c8) values
(25, 35000, -85.49, -1, -0, -74, -58, -42.6, '', 25, -17.54);
insert into t1 (vkey, pkey, c9, c10, c11, c12, c13, c14, c15, c16, c17) values
(26, 36000, '7z', make_timestamp(2029, 6, 19, 23, 17, 34), 0, make_timestamp(2090, 3, 28, 19, 4, 7), '', 77.64, '', make_timestamp(2002, 5, 27, 7, 5, 27), make_timestamp(2106, 2, 6, 7, 9, 21));
insert into t1 (vkey, pkey, c9, c10, c11, c12, c13, c14, c15, c16, c17) values
(27, 37000, 'Z$', make_timestamp(1975, 4, 15, 9, 24, 26), -100, make_timestamp(2043, 9, 8, 20, 37, 16), 'Ws', -62.39, 'f@', make_timestamp(2041, 1, 8, 5, 48, 40), make_timestamp(2009, 11, 6, 18, 46, 48));
insert into t0 (vkey, pkey, c0, c1, c2, c3, c4, c5, c6, c7, c8) values
(28, 38000, -7.31, -90, 95, 0, -35, 41.5, '', -38, 0.0);
insert into t6 (vkey, pkey, c32, c33, c34, c35, c36, c37, c38, c39, c40) values
(29, 39000, 99.30, make_timestamp(2005, 11, 27, 20, 55, 50), make_timestamp(2035, 4, 13, 21, 37, 16), -0, 87, 77.91, make_timestamp(2002, 8, 25, 18, 44, 10), 'GX', make_timestamp(2023, 7, 5, 2, 13, 47));
insert into t1 (vkey, pkey, c9, c10, c11, c12, c13, c14, c15, c16, c17) values
(30, 40000, '-}s', make_timestamp(2015, 3, 11, 18, 55, 50), 63, make_timestamp(2091, 3, 28, 2, 42, 22), '', -55.30, '', make_timestamp(2031, 11, 26, 18, 41, 28), make_timestamp(2079, 10, 25, 18, 55, 48));
insert into t1 (vkey, pkey, c9, c10, c11, c12, c13, c14, c15, c16, c17) values
(31, 41000, 'H573', make_timestamp(2094, 8, 2, 19, 25, 9), -49, make_timestamp(2093, 1, 22, 6, 30, 30), 'D*G', -27.58, 'nw3B', make_timestamp(2044, 12, 3, 7, 22, 20), make_timestamp(2073, 11, 16, 14, 44, 13));
insert into t5 (vkey, pkey, c27, c28, c29, c30, c31) values
(32, 42000, 'B}Tlc', 0.0, -0.0, 'm]$zb', -62);
insert into t4 (vkey, pkey, c26) values
(33, 43000, '');
insert into t4 (vkey, pkey, c26) values
(34, 44000, '78+');
insert into t1 (vkey, pkey, c9, c10, c11, c12, c13, c14, c15, c16, c17) values
(35, 45000, 'JMZ7s', make_timestamp(2012, 9, 24, 4, 42, 28), 19, make_timestamp(1996, 9, 14, 9, 23, 33), 'b[', -0.0, '', make_timestamp(2014, 11, 10, 23, 15, 37), make_timestamp(2101, 6, 4, 20, 2, 48));
insert into t0 (vkey, pkey, c0, c1, c2, c3, c4, c5, c6, c7, c8) values
(36, 46000, -86.39, 2, -81, -55, 0, -61.11, '7-', -50, 0.0);
insert into t6 (vkey, pkey, c32, c33, c34, c35, c36, c37, c38, c39, c40) values
(37, 47000, 0.0, make_timestamp(2102, 7, 6, 7, 50, 1), make_timestamp(2057, 8, 1, 23, 53, 26), 91, 41, -0.0, make_timestamp(2035, 7, 25, 2, 23, 24), '~QeY', make_timestamp(2079, 4, 7, 7, 36, 24));
insert into t6 (vkey, pkey, c32, c33, c34, c35, c36, c37, c38, c39, c40) values
(38, 48000, -81.14, make_timestamp(2067, 5, 9, 4, 21, 13), make_timestamp(1996, 5, 17, 7, 59, 46), 27, -0, -8.86, make_timestamp(2017, 3, 9, 5, 6, 48), 'R8iqX', make_timestamp(2026, 12, 5, 5, 51, 11));
insert into t6 (vkey, pkey, c32, c33, c34, c35, c36, c37, c38, c39, c40) values
(39, 49000, 4.72, make_timestamp(1976, 10, 8, 0, 30, 41), make_timestamp(2087, 12, 11, 6, 42, 19), -0, -21, -21.93, make_timestamp(2059, 3, 21, 2, 51, 33), '', make_timestamp(2106, 6, 20, 20, 37, 32));
insert into t2 (vkey, pkey, c18, c19) values
(40, 50000, 'W', 1);
insert into t1 (vkey, pkey, c9, c10, c11, c12, c13, c14, c15, c16, c17) values
(41, 51000, '6e', make_timestamp(2096, 8, 1, 18, 18, 15), -0, make_timestamp(2066, 7, 5, 14, 40, 31), '$&', 83.76, 'Z[7Ma', make_timestamp(2099, 5, 10, 16, 28, 57), make_timestamp(2029, 1, 18, 21, 2, 13));
insert into t4 (vkey, pkey, c26) values
(42, 52000, 'OW7');
insert into t7 (vkey, pkey, c41, c42, c43) values
(43, 53000, '', 24.11, '[S');
insert into t4 (vkey, pkey, c26) values
(44, 54000, 'P8.fB');
insert into t0 (vkey, pkey, c0, c1, c2, c3, c4, c5, c6, c7, c8) values
(45, 55000, 62.84, -79, -12, 0, 10, 95.27, '!TE', 0, 0.0);
insert into t6 (vkey, pkey, c32, c33, c34, c35, c36, c37, c38, c39, c40) values
(46, 56000, -0.0, make_timestamp(2091, 7, 23, 15, 34, 35), make_timestamp(2057, 10, 6, 23, 6, 16), 46, -1, 48.95, make_timestamp(2050, 4, 13, 11, 27, 30), 'A=', make_timestamp(2069, 9, 3, 4, 42, 2));
insert into t3 (vkey, pkey, c20, c21, c22, c23, c24, c25) values
(47, 57000, ';', 36, 27, -15.82, 0, make_timestamp(2095, 3, 26, 22, 1, 10));
insert into t3 (vkey, pkey, c20, c21, c22, c23, c24, c25) values
(48, 58000, 'FX', -0, 0, 84.44, 77, make_timestamp(1972, 1, 22, 19, 59, 29));
insert into t3 (vkey, pkey, c20, c21, c22, c23, c24, c25) values
(49, 59000, 'og^', 65, 35, -0.0, -58, make_timestamp(2031, 10, 28, 0, 56, 54));
insert into t3 (vkey, pkey, c20, c21, c22, c23, c24, c25) values
(50, 60000, 'Y?1', -88, -66, 95.20, -88, make_timestamp(2038, 9, 18, 0, 50, 32));
insert into t5 (vkey, pkey, c27, c28, c29, c30, c31) values
(51, 61000, '/xG&4', -0.0, 0.0, 'MfHl', 0);
insert into t5 (vkey, pkey, c27, c28, c29, c30, c31) values
(52, 62000, '-o#6', -96.13, -49.15, 'n?', 3);
insert into t2 (vkey, pkey, c18, c19) values
(53, 63000, '', 59);
insert into t7 (vkey, pkey, c41, c42, c43) values
(54, 64000, 'bd', -54.31, '');
insert into t6 (vkey, pkey, c32, c33, c34, c35, c36, c37, c38, c39, c40) values
(55, 65000, 0.0, make_timestamp(2070, 1, 11, 20, 37, 54), make_timestamp(2097, 2, 19, 13, 58, 39), -100, 85, -83.92, make_timestamp(2041, 12, 23, 16, 23, 32), '/[;f', make_timestamp(2046, 7, 4, 18, 0, 27));
insert into t7 (vkey, pkey, c41, c42, c43) values
(56, 66000, '', 0.0, 'i');
insert into t1 (vkey, pkey, c9, c10, c11, c12, c13, c14, c15, c16, c17) values
(57, 67000, '(fl', make_timestamp(2049, 9, 14, 0, 10, 51), 93, make_timestamp(2071, 8, 9, 22, 37, 32), 'Ea', -83.62, '', make_timestamp(2031, 6, 11, 12, 25, 10), make_timestamp(2100, 5, 18, 4, 34, 27));
insert into t5 (vkey, pkey, c27, c28, c29, c30, c31) values
(58, 68000, 'FA?=t', 99.7, 4.48, 'O', -0);
insert into t1 (vkey, pkey, c9, c10, c11, c12, c13, c14, c15, c16, c17) values
(60, 70000, '', make_timestamp(2060, 1, 17, 18, 21, 54), -63, make_timestamp(2091, 12, 2, 7, 36, 39), '', 97.98, 'x1[n', make_timestamp(1981, 7, 24, 6, 7, 45), make_timestamp(2103, 8, 25, 2, 21, 18));
insert into t3 (vkey, pkey, c20, c21, c22, c23, c24, c25) values
(61, 71000, 'V$X', -34, -70, -0.0, 93, make_timestamp(2002, 10, 28, 4, 51, 3));
insert into t3 (vkey, pkey, c20, c21, c22, c23, c24, c25) values
(62, 72000, 'M1h:', 21, -39, 42.1, 0, make_timestamp(2086, 5, 9, 3, 53, 24));
insert into t0 (vkey, pkey, c0, c1, c2, c3, c4, c5, c6, c7, c8) values
(63, 73000, 29.54, 24, -58, -9, -0, -85.60, 'N&03i', 28, -4.7);
insert into t0 (vkey, pkey, c0, c1, c2, c3, c4, c5, c6, c7, c8) values
(64, 74000, 0.0, 9, 79, 0, -0, -0.0, '', -75, 0.0);
create index i0 on t5 (c30 asc, c29  , c27 asc, vkey desc, pkey  );
insert into t4 (vkey, pkey, c26) values
(65, 75000, '};*TV');
insert into t7 (vkey, pkey, c41, c42, c43) values
(66, 76000, '', -3.11, '?');
insert into t4 (vkey, pkey, c26) values
(67, 77000, 'y|');
insert into t5 (vkey, pkey, c27, c28, c29, c30, c31) values
(68, 78000, 'mh~M', 0.0, 5.17, '?-=v3', 99);
insert into t4 (vkey, pkey, c26) values
(69, 79000, 'O(>c');
insert into t4 (vkey, pkey, c26) values
(70, 80000, 'KP');
insert into t5 (vkey, pkey, c27, c28, c29, c30, c31) values
(71, 81000, '65', -22.34, -67.85, '', 15);
insert into t6 (vkey, pkey, c32, c33, c34, c35, c36, c37, c38, c39, c40) values
(72, 82000, 89.66, make_timestamp(2054, 12, 17, 8, 12, 20), make_timestamp(2027, 2, 2, 1, 42, 18), -0, -0, 16.57, make_timestamp(2062, 12, 26, 16, 32, 4), '@R5(', make_timestamp(2022, 4, 9, 19, 19, 4));
insert into t2 (vkey, pkey, c18, c19) values
(73, 83000, 'A0OS@', 34);
insert into t6 (vkey, pkey, c32, c33, c34, c35, c36, c37, c38, c39, c40) values
(74, 84000, 0.0, make_timestamp(2023, 2, 15, 13, 58, 42), make_timestamp(2075, 1, 25, 15, 7, 31), 0, 92, -0.0, make_timestamp(2042, 11, 14, 11, 59, 23), 'DCi', make_timestamp(2082, 4, 10, 2, 59, 5));
insert into t6 (vkey, pkey, c32, c33, c34, c35, c36, c37, c38, c39, c40) values
(75, 85000, -72.34, make_timestamp(2047, 9, 21, 20, 16, 53), make_timestamp(2000, 3, 5, 19, 19, 33), -75, 0, 86.38, make_timestamp(2034, 10, 2, 23, 10, 7), 'R70:', make_timestamp(2015, 6, 16, 17, 17, 34));
insert into t5 (vkey, pkey, c27, c28, c29, c30, c31) values
(76, 86000, '', -0.0, 20.82, '*', -12);
insert into t0 (vkey, pkey, c0, c1, c2, c3, c4, c5, c6, c7, c8) values
(77, 87000, 69.90, 60, -54, 88, -56, 45.48, 'FcOSj', -43, -82.11);
insert into t4 (vkey, pkey, c26) values
(78, 88000, '{EPZ');
insert into t1 (vkey, pkey, c9, c10, c11, c12, c13, c14, c15, c16, c17) values
(79, 89000, 'oSy', make_timestamp(2014, 7, 14, 10, 6, 53), 52, make_timestamp(2006, 5, 5, 14, 58, 27), '7', -0.0, 'HAdc', make_timestamp(2013, 11, 6, 7, 12, 39), make_timestamp(2102, 3, 16, 13, 5, 53));
insert into t1 (vkey, pkey, c9, c10, c11, c12, c13, c14, c15, c16, c17) values
(80, 90000, 'Cl ', make_timestamp(2030, 5, 3, 23, 7, 29), -1, make_timestamp(2006, 10, 17, 18, 3, 42), '', -84.4, 'M`', make_timestamp(2074, 1, 6, 4, 27, 29), make_timestamp(2030, 2, 15, 10, 58, 8));
insert into t4 (vkey, pkey, c26) values
(81, 91000, 'q0');
insert into t3 (vkey, pkey, c20, c21, c22, c23, c24, c25) values
(82, 92000, '6.', -0, -0, 65.9, 81, make_timestamp(2075, 10, 22, 8, 39, 10));
insert into t6 (vkey, pkey, c32, c33, c34, c35, c36, c37, c38, c39, c40) values
(83, 93000, -35.42, make_timestamp(2050, 8, 16, 8, 12, 14), make_timestamp(2083, 5, 22, 1, 22, 32), -27, 10, -21.44, make_timestamp(1989, 5, 10, 22, 46, 48), '1&', make_timestamp(2050, 5, 4, 9, 59, 34));
insert into t0 (vkey, pkey, c0, c1, c2, c3, c4, c5, c6, c7, c8) values
(84, 94000, -60.30, -68, 68, -0, 0, -0.0, 'XH3Sr', -65, -0.0);
insert into t4 (vkey, pkey, c26) values
(85, 95000, '');
insert into t7 (vkey, pkey, c41, c42, c43) values
(86, 96000, 'vH', -0.0, 'Z1i~');
insert into t5 (vkey, pkey, c27, c28, c29, c30, c31) values
(87, 97000, '<8', -97.15, 85.92, '[;:', 0);
insert into t2 (vkey, pkey, c18, c19) values
(88, 98000, 'KHe', -63);
insert into t5 (vkey, pkey, c27, c28, c29, c30, c31) values
(89, 99000, '>=U4.', -0.0, 35.48, '=', -90);
insert into t3 (vkey, pkey, c20, c21, c22, c23, c24, c25) values
(90, 100000, 'X^x', -0, 0, 0.0, -0, make_timestamp(2072, 4, 22, 7, 8, 53));
insert into t4 (vkey, pkey, c26) values
(91, 101000, 'p');
insert into t2 (vkey, pkey, c18, c19) values
(92, 102000, '*', 0);
insert into t4 (vkey, pkey, c26) values
(93, 103000, ']}{P');
insert into t4 (vkey, pkey, c26) values
(94, 104000, '!l$z7');
insert into t6 (vkey, pkey, c32, c33, c34, c35, c36, c37, c38, c39, c40) values
(95, 105000, -82.71, make_timestamp(2042, 9, 17, 6, 32, 8), make_timestamp(2029, 8, 14, 2, 46, 2), 55, -15, 74.4, make_timestamp(2045, 12, 5, 0, 2, 59), 'P', make_timestamp(2059, 5, 2, 23, 56, 53));
insert into t0 (vkey, pkey, c0, c1, c2, c3, c4, c5, c6, c7, c8) values
(96, 106000, -59.94, 0, -85, 0, 46, 95.96, '', 45, -71.47);
insert into t0 (vkey, pkey, c0, c1, c2, c3, c4, c5, c6, c7, c8) values
(97, 107000, 0.0, -55, -94, 55, 89, 6.8, '20Ti', -0, -43.74);
insert into t1 (vkey, pkey, c9, c10, c11, c12, c13, c14, c15, c16, c17) values
(98, 108000, 'U.', make_timestamp(2041, 4, 3, 7, 39, 12), -37, make_timestamp(2040, 9, 15, 15, 28, 44), '', 87.52, 'wi19', make_timestamp(1998, 11, 19, 1, 37, 27), make_timestamp(2099, 8, 10, 15, 40, 14));
insert into t1 (vkey, pkey, c9, c10, c11, c12, c13, c14, c15, c16, c17) values
(99, 109000, '>X', make_timestamp(1970, 9, 1, 16, 50, 6), -54, make_timestamp(1989, 1, 5, 15, 25, 34), 'pEA#@', -0.0, 'y7l<', make_timestamp(2063, 9, 3, 18, 5, 41), make_timestamp(2081, 10, 3, 5, 52, 4));
insert into t0 (vkey, pkey, c0, c1, c2, c3, c4, c5, c6, c7, c8) values
(100, 110000, -17.34, 5, 7, 0, 85, 0.0, 'W$', -4, -0.0);
insert into t1 (vkey, pkey, c9, c10, c11, c12, c13, c14, c15, c16, c17) values
(101, 111000, 'ov#', make_timestamp(2094, 4, 5, 5, 57, 46), 26, make_timestamp(1991, 10, 14, 8, 35, 45), 'BA@F(', -9.30, 'y O.', make_timestamp(2031, 5, 17, 15, 50, 55), make_timestamp(1977, 8, 27, 15, 14, 40));
insert into t1 (vkey, pkey, c9, c10, c11, c12, c13, c14, c15, c16, c17) values
(102, 112000, '', make_timestamp(2021, 12, 14, 20, 28, 9), -11, make_timestamp(2044, 9, 20, 4, 10, 14), 'M*a', 0.0, 'eE', make_timestamp(1977, 6, 18, 10, 45, 46), make_timestamp(2049, 1, 16, 10, 12, 53));
insert into t2 (vkey, pkey, c18, c19) values
(103, 113000, 'SX:', 15);
insert into t4 (vkey, pkey, c26) values
(104, 114000, 'aZh(');
insert into t1 (vkey, pkey, c9, c10, c11, c12, c13, c14, c15, c16, c17) values
(105, 115000, '4g', make_timestamp(2040, 10, 19, 14, 14, 18), 3, make_timestamp(2006, 9, 26, 0, 31, 8), 'URb', 0.0, 'HJPK9', make_timestamp(2065, 5, 4, 22, 2, 48), make_timestamp(2047, 8, 18, 11, 10, 2));
insert into t2 (vkey, pkey, c18, c19) values
(106, 116000, '', 27);
insert into t7 (vkey, pkey, c41, c42, c43) values
(107, 117000, 'BGJ:@', 66.86, '');
insert into t6 (vkey, pkey, c32, c33, c34, c35, c36, c37, c38, c39, c40) values
(108, 118000, -0.0, make_timestamp(2077, 1, 9, 19, 20, 26), make_timestamp(2026, 4, 17, 21, 37, 59), -12, -58, 65.64, make_timestamp(2009, 4, 23, 12, 50, 6), 'Zp3', make_timestamp(2053, 12, 13, 10, 30, 19));
insert into t2 (vkey, pkey, c18, c19) values
(109, 119000, ' [K', 0);
insert into t0 (vkey, pkey, c0, c1, c2, c3, c4, c5, c6, c7, c8) values
(110, 120000, 0.2, -94, 0, 44, -48, -74.96, 'X/', 95, -32.60);
insert into t2 (vkey, pkey, c18, c19) values
(111, 121000, 'E$:MQ', -22);
insert into t1 (vkey, pkey, c9, c10, c11, c12, c13, c14, c15, c16, c17) values
(112, 122000, '=&', make_timestamp(2060, 10, 24, 14, 17, 58), -0, make_timestamp(2019, 7, 18, 22, 23, 22), 'BOF', -98.6, '@U', make_timestamp(1987, 2, 17, 13, 58, 46), make_timestamp(2089, 5, 7, 16, 55, 37));
insert into t2 (vkey, pkey, c18, c19) values
(113, 123000, '', 0);
insert into t7 (vkey, pkey, c41, c42, c43) values
(114, 124000, 'odQ2', 86.65, '');
insert into t2 (vkey, pkey, c18, c19) values
(115, 125000, ',|qoc', 0);
insert into t4 (vkey, pkey, c26) values
(116, 126000, 'd}c');
insert into t0 (vkey, pkey, c0, c1, c2, c3, c4, c5, c6, c7, c8) values
(117, 127000, 59.88, -0, 0, -0, 80, -39.73, 'b)[^~', -83, -14.52);
insert into t2 (vkey, pkey, c18, c19) values
(118, 128000, 'g0', -0);
insert into t3 (vkey, pkey, c20, c21, c22, c23, c24, c25) values
(119, 129000, 'D', -91, -0, -21.78, -78, make_timestamp(2054, 8, 21, 22, 3, 58));
insert into t7 (vkey, pkey, c41, c42, c43) values
(120, 130000, '/U', -27.13, 'v');
insert into t6 (vkey, pkey, c32, c33, c34, c35, c36, c37, c38, c39, c40) values
(121, 131000, 0.0, make_timestamp(1973, 9, 21, 20, 13, 10), make_timestamp(2009, 6, 20, 10, 18, 55), 60, 67, 28.86, make_timestamp(2084, 8, 12, 14, 51, 39), 'Xx(q', make_timestamp(2050, 3, 27, 10, 49, 57));
insert into t1 (vkey, pkey, c9, c10, c11, c12, c13, c14, c15, c16, c17) values
(122, 132000, 'w<V', make_timestamp(2030, 1, 15, 12, 0, 12), 77, make_timestamp(2078, 1, 5, 11, 10, 43), '', -39.6, 'J]N]?', make_timestamp(1993, 2, 18, 20, 12, 19), make_timestamp(2066, 5, 2, 4, 32, 58));
insert into t5 (vkey, pkey, c27, c28, c29, c30, c31) values
(124, 134000, '3^^d', 90.90, -0.0, '3', 25);
insert into t3 (vkey, pkey, c20, c21, c22, c23, c24, c25) values
(125, 135000, 'o', 80, 70, 80.63, 27, make_timestamp(2075, 2, 28, 18, 54, 27));
insert into t3 (vkey, pkey, c20, c21, c22, c23, c24, c25) values
(126, 136000, '2SZ9[', 65, 6, 8.84, 99, make_timestamp(2088, 7, 17, 20, 24, 41));

SET statement_timeout = '240s';SELECT citus_drain_node('host.docker.internal', 5437);
select * from citus_add_node('host.docker.internal', 5437);
select * from citus_remove_node('host.docker.internal', 5435);  
SELECT alter_distributed_table('t0', 'c3', shard_count:=100);  
select * from citus_add_secondary_node('host.docker.internal', 5438, 'host.docker.internal', 5437);  
SELECT update_distributed_table_colocation('t3', colocate_with:='t3');  
SELECT update_distributed_table_colocation('t3', colocate_with:='t3');
SET statement_timeout = '240s';SELECT citus_drain_node('host.docker.internal', 5437);  
SELECT pg_dist_shard.shardid::text, *     FROM pg_dist_shard join citus_shards     on pg_dist_shard.shardid = citus_shards.shardid     where citus_table_type = 'distributed';
SELECT citus_move_shard_placement(102107, 'host.docker.internal', 5435, 'host.docker.internal', 5438);  
SELECT undistribute_table('t5');  
SELECT alter_distributed_table('t0', 'c2');
SET statement_timeout = '240s';SELECT citus_drain_node('host.docker.internal', 5434);  
SELECT pg_dist_shard.shardid::text, *     FROM pg_dist_shard join citus_shards     on pg_dist_shard.shardid = citus_shards.shardid     where citus_table_type = 'distributed';
SELECT citus_move_shard_placement(102095, 'host.docker.internal', 5435, 'host.docker.internal', 5438);
SET statement_timeout = '240s';SELECT citus_drain_node('host.docker.internal', 5435);  
SELECT truncate_local_data_after_distributing_table('t6');  
SELECT undistribute_table('t0');  
SELECT alter_distributed_table('t7', 'c41');
SET citus.task_assignment_policy = 'round-robin'
SELECT update_distributed_table_colocation('t3', colocate_with:='t7');  
SELECT alter_distributed_table('t3', 'pkey', colocate_with:='t3');
select * from citus_remove_node('host.docker.internal', 5438);
select * from citus_add_secondary_node('host.docker.internal', 5438, 'host.docker.internal', 5435);  
SELECT alter_distributed_table('t5', 'pkey', shard_count:=40);
select * from citus_add_node('host.docker.internal', 5436);  
SELECT update_distributed_table_colocation('t3', colocate_with:='t7');
update pg_dist_rebalance_strategy set improvement_threshold = 0.21 where name = 'by_disk_size';

--- the following statements lead to the bug if citus_rebalance_wait() cannot complete within 240s (pressing Ctrl+C before the 240s elapse also seems to reproduce it; if your CPU is fast enough, adjust the timeout).
SELECT citus_rebalance_start(rebalance_strategy:='by_disk_size', shard_transfer_mode:='force_logical');
SET statement_timeout = '240s';
SELECT citus_rebalance_wait(); 

--- press Ctrl+C or close the connection with PQfinish()

SET statement_timeout = '6s';
SET citus.shard_replication_factor TO 1; SET citus.enable_repartition_joins to ON;
drop database if exists testdb with (force); --- on master
create database testdb; 
drop database if exists testdb with (force); --- on worker1
create database testdb; 
CREATE EXTENSION IF NOT EXISTS citus;
drop database if exists testdb with (force); --- on worker2
create database testdb; 
CREATE EXTENSION IF NOT EXISTS citus;
drop database if exists testdb with (force); --- on worker3, failed

duerwuyi commented Feb 10, 2025

The concrete test log with timestamps, recorded after inserting the data above:

 citus test: 2025-02-09 21:41:20.00116
NOTICE:  Moving shard 102029 from host.docker.internal:5437 to host.docker.internal:5434 ...
--- the following 77 lines are omitted.
NOTICE:  Moving shard 102318 from host.docker.internal:5437 to host.docker.internal:5435 ...
SET statement_timeout = '240s';SELECT citus_drain_node('host.docker.internal', 5437);
 citus test: 2025-02-09 21:43:08.00850
select * from citus_add_node('host.docker.internal', 5437);
 citus test: 2025-02-09 21:43:08.00852
select * from citus_remove_node('host.docker.internal', 5435);
ERROR:  cannot remove or disable the node host.docker.internal:5435 because because it contains the only shard placement for shard 102010
DETAIL:  One of the table(s) that prevents the operation complete successfully is public.t5
HINT:  To proceed, either drop the tables or use undistribute_table() function to convert them to local tables

Loading citus tables...
 citus test: 2025-02-09 21:43:08.00918
NOTICE:  creating a new table for public.t0
NOTICE:  moving the data of public.t0
NOTICE:  dropping the old public.t0
NOTICE:  renaming the new table to public.t0
SELECT alter_distributed_table('t0', 'c3', shard_count:=100);
Loading citus tables...
No more tables to distribute, in rebalancing.
 citus test: 2025-02-09 21:43:09.00395
select * from citus_add_secondary_node('host.docker.internal', 5438, 'host.docker.internal', 5437);
Loading citus tables...
 citus test: 2025-02-09 21:43:09.00465
SELECT update_distributed_table_colocation('t3', colocate_with:='t3');
Loading citus tables...
 citus test: 2025-02-09 21:43:09.00532
SELECT update_distributed_table_colocation('t3', colocate_with:='t3');
 citus test: 2025-02-09 21:43:09.00545
SET statement_timeout = '240s';SELECT citus_drain_node('host.docker.internal', 5437);
Loading citus tables...
 citus test: 2025-02-09 21:43:09.00637
SELECT pg_dist_shard.shardid::text, *     FROM pg_dist_shard join citus_shards     on pg_dist_shard.shardid = citus_shards.shardid     where citus_table_type = 'distributed';
 citus test: 2025-02-09 21:43:09.00675
SELECT citus_move_shard_placement(102107, 'host.docker.internal', 5435, 'host.docker.internal', 5438);
ERROR:  Moving shards to a secondary (e.g., replica) node is not supported
Loading citus tables...
 citus test: 2025-02-09 21:43:09.00732
NOTICE:  creating a new table for public.t5
NOTICE:  moving the data of public.t5
NOTICE:  dropping the old public.t5
NOTICE:  renaming the new table to public.t5
SELECT undistribute_table('t5');
Loading citus tables...
 citus test: 2025-02-09 21:43:09.00988
NOTICE:  creating a new table for public.t0
NOTICE:  moving the data of public.t0
NOTICE:  dropping the old public.t0
NOTICE:  renaming the new table to public.t0
SELECT alter_distributed_table('t0', 'c2');
 citus test: 2025-02-09 21:43:10.00562
NOTICE:  Moving shard 102120 from host.docker.internal:5434 to host.docker.internal:5435 ...
--- the following 164 lines are omitted.
NOTICE:  Moving shard 102518 from host.docker.internal:5434 to host.docker.internal:5435 ...
SET statement_timeout = '240s';SELECT citus_drain_node('host.docker.internal', 5434);
Loading citus tables...
 citus test: 2025-02-09 21:46:54.00405
SELECT pg_dist_shard.shardid::text, *     FROM pg_dist_shard join citus_shards     on pg_dist_shard.shardid = citus_shards.shardid     where citus_table_type = 'distributed';
 citus test: 2025-02-09 21:46:54.00435
SELECT citus_move_shard_placement(102095, 'host.docker.internal', 5435, 'host.docker.internal', 5438);
ERROR:  Moving shards to a secondary (e.g., replica) node is not supported

 citus test: 2025-02-09 21:46:54.00437
SET statement_timeout = '240s';SELECT citus_drain_node('host.docker.internal', 5435);
ERROR:  Shard replication factor (1) cannot be greater than number of nodes with should_have_shards=true (0).

Loading citus tables...
 citus test: 2025-02-09 21:46:54.00533
SELECT truncate_local_data_after_distributing_table('t6');
Loading citus tables...
 citus test: 2025-02-09 21:46:54.00599
NOTICE:  creating a new table for public.t0
NOTICE:  moving the data of public.t0
NOTICE:  dropping the old public.t0
NOTICE:  renaming the new table to public.t0
SELECT undistribute_table('t0');
Loading citus tables...
 citus test: 2025-02-09 21:46:54.00942
NOTICE:  creating a new table for public.t7
SELECT alter_distributed_table('t7', 'c41');
ERROR:  replication_factor (1) exceeds number of worker nodes (0)
HINT:  Add more worker nodes or try again with a lower replication factor.

 citus test: 2025-02-09 21:46:55.00203
SELECT update_distributed_table_colocation('t3', colocate_with:='t7');
ERROR:  cannot colocate tables t7 and t3
DETAIL:  Distribution column types don't match for t7 and t3.

Loading citus tables...
 citus test: 2025-02-09 21:46:55.00261
SELECT alter_distributed_table('t3', 'pkey', colocate_with:='t3');
ERROR:  cannot colocate with t3 and change distribution column to pkey because data type of column pkey is different then the distribution column of the t3

 citus test: 2025-02-09 21:46:55.00263
select * from citus_remove_node('host.docker.internal', 5438);
 citus test: 2025-02-09 21:46:55.00292
select * from citus_add_secondary_node('host.docker.internal', 5438, 'host.docker.internal', 5435);
Loading citus tables...
 citus test: 2025-02-09 21:46:55.00361
SELECT alter_distributed_table('t5', 'pkey', shard_count:=40);
ERROR:  cannot alter table because the table is not distributed

 citus test: 2025-02-09 21:46:55.00363
select * from citus_add_node('host.docker.internal', 5436);
Loading citus tables...
 citus test: 2025-02-09 21:46:55.00523
SELECT update_distributed_table_colocation('t3', colocate_with:='t7');
ERROR:  cannot colocate tables t7 and t3
DETAIL:  Distribution column types don't match for t7 and t3.

 citus test: 2025-02-09 21:46:55.00524
update pg_dist_rebalance_strategy set improvement_threshold = 0.21 where name = 'by_disk_size';
 citus test: 2025-02-09 21:46:55.00528
NOTICE:  Scheduled 231 moves as job 1
DETAIL:  Rebalance scheduled as background job
HINT:  To monitor progress, run: SELECT * FROM citus_rebalance_status();
SELECT citus_rebalance_start(rebalance_strategy:='by_disk_size', shard_transfer_mode:='force_logical');
 citus test: 2025-02-09 21:46:55.00831
SET statement_timeout = '20s';SELECT citus_rebalance_wait();
ERROR:  canceling statement due to statement timeout
(Written by me) --- ^C or PQfinish() here
citus reset
NOTICE:  Citus partially supports CREATE DATABASE for distributed databases
DETAIL:  Citus does not propagate CREATE DATABASE command to workers
HINT:  You can manually create a database and its extensions on workers.
CREATE EXTENSION IF NOT EXISTS citus;
citus reset
NOTICE:  Citus partially supports CREATE DATABASE for distributed databases
DETAIL:  Citus does not propagate CREATE DATABASE command to workers
HINT:  You can manually create a database and its extensions on workers.
CREATE EXTENSION IF NOT EXISTS citus;
citus reset
ERROR:  database "testdb" is used by an active logical replication slot
DETAIL:  There is 1 active slot.
retry drop database in host.docker.internal: 5435
citus reset
ERROR:  database "testdb" is used by an active logical replication slot
DETAIL:  There is 1 active slot.
retry drop database in host.docker.internal: 5435
citus reset
ERROR:  database "testdb" is used by an active logical replication slot
DETAIL:  There is 1 active slot.


duerwuyi commented Feb 10, 2025

The log of the master node:

2025-02-09 21:46:55   UTC [57156] ERROR:  cannot colocate tables t7 and t3
2025-02-09 21:46:55   UTC [57156] DETAIL:  Distribution column types don't match for t7 and t3.
2025-02-09 21:46:55   UTC [57156] STATEMENT:  SELECT update_distributed_table_colocation('t3', colocate_with:='t7');
2025-02-09 21:46:55   UTC [57156] ERROR:  cannot colocate with t3 and change distribution column to pkey because data type of column pkey is different then the distribution column of the t3
2025-02-09 21:46:55   UTC [57156] STATEMENT:  SELECT alter_distributed_table('t3', 'pkey', colocate_with:='t3');
2025-02-09 21:46:55   UTC [57156] ERROR:  cannot alter table because the table is not distributed
2025-02-09 21:46:55   UTC [57156] STATEMENT:  SELECT alter_distributed_table('t5', 'pkey', shard_count:=40);
2025-02-09 21:46:55   UTC [57156] ERROR:  cannot colocate tables t7 and t3
2025-02-09 21:46:55   UTC [57156] DETAIL:  Distribution column types don't match for t7 and t3.
2025-02-09 21:46:55   UTC [57156] STATEMENT:  SELECT update_distributed_table_colocation('t3', colocate_with:='t7');
2025-02-09 21:46:55   UTC [57165] LOG:  deferred drop of orphaned resource public.t0_102518 on host.docker.internal:5434 completed
2025-02-09 21:46:55   UTC [57165] STATEMENT:  CALL citus_cleanup_orphaned_resources()
2025-02-09 21:47:01   UTC [55574] LOG:  found scheduled background tasks, starting new background task queue monitor
2025-02-09 21:47:01   UTC [55574] CONTEXT:  Citus maintenance daemon for database 5023555 user 10
2025-02-09 21:47:01   UTC [57181] LOG:  task jobid/taskid started: 1/1
2025-02-09 21:47:01   UTC [57181] CONTEXT:  Citus Background Task Queue Monitor: testdb
2025-02-09 21:47:01   UTC [57187] LOG:  logical decoding found consistent point at 8/A6879488
2025-02-09 21:47:01   UTC [57187] DETAIL:  There are no running transactions.
2025-02-09 21:47:01   UTC [57187] STATEMENT:  CREATE_REPLICATION_SLOT citus_shard_move_slot_9_10_248 LOGICAL pgoutput EXPORT_SNAPSHOT;
2025-02-09 21:47:01   UTC [57187] LOG:  exported logical decoding snapshot: "00000076-00000297-1" with 0 transaction IDs
2025-02-09 21:47:01   UTC [57187] STATEMENT:  CREATE_REPLICATION_SLOT citus_shard_move_slot_9_10_248 LOGICAL pgoutput EXPORT_SNAPSHOT;
2025-02-09 21:47:01   UTC [57191] LOG:  starting logical decoding for slot "citus_shard_move_slot_9_10_248"
2025-02-09 21:47:01   UTC [57191] DETAIL:  Streaming transactions committing after 8/A68794C0, reading WAL from 8/A6879488.
2025-02-09 21:47:01   UTC [57191] STATEMENT:  START_REPLICATION SLOT "citus_shard_move_slot_9_10_248" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names '"citus_shard_move_publication_9_10_248"', binary 'true')
2025-02-09 21:47:01   UTC [57191] LOG:  logical decoding found consistent point at 8/A6879488
2025-02-09 21:47:01   UTC [57191] DETAIL:  There are no running transactions.
2025-02-09 21:47:01   UTC [57191] STATEMENT:  START_REPLICATION SLOT "citus_shard_move_slot_9_10_248" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names '"citus_shard_move_publication_9_10_248"', binary 'true')
2025-02-09 21:47:02   UTC [57183] LOG:  The LSN of the target subscriptions on node host.docker.internal:5436 have caught up with the source LSN 
2025-02-09 21:47:02   UTC [57183] STATEMENT:  SELECT pg_catalog.citus_copy_shard_placement(102008, 6, 9, transfer_mode := 'force_logical')
2025-02-09 21:47:02   UTC [57183] LOG:  The LSN of the target subscriptions on node host.docker.internal:5436 have caught up with the source LSN 
2025-02-09 21:47:02   UTC [57183] STATEMENT:  SELECT pg_catalog.citus_copy_shard_placement(102008, 6, 9, transfer_mode := 'force_logical')
2025-02-09 21:47:02   UTC [57183] LOG:  The LSN of the target subscriptions on node host.docker.internal:5436 have caught up with the source LSN 
2025-02-09 21:47:02   UTC [57183] STATEMENT:  SELECT pg_catalog.citus_copy_shard_placement(102008, 6, 9, transfer_mode := 'force_logical')
2025-02-09 21:47:02   UTC [57181] LOG:  task jobid/taskid succeeded: 1/1
2025-02-09 21:47:02   UTC [57181] CONTEXT:  Citus Background Task Queue Monitor: testdb
2025-02-09 21:47:02   UTC [57181] LOG:  task jobid/taskid started: 1/2
2025-02-09 21:47:02   UTC [57181] CONTEXT:  Citus Background Task Queue Monitor: testdb
2025-02-09 21:47:03   UTC [57192] LOG:  The LSN of the target subscriptions on node host.docker.internal:5436 have caught up with the source LSN 
2025-02-09 21:47:03   UTC [57192] CONTEXT:  Citus Background Task Queue Executor: testdb/postgres for (1/2)
2025-02-09 21:47:03   UTC [57192] LOG:  The LSN of the target subscriptions on node host.docker.internal:5436 have caught up with the source LSN 
2025-02-09 21:47:03   UTC [57192] CONTEXT:  Citus Background Task Queue Executor: testdb/postgres for (1/2)
2025-02-09 21:47:03   UTC [57192] LOG:  The LSN of the target subscriptions on node host.docker.internal:5436 have caught up with the source LSN 
2025-02-09 21:47:03   UTC [57192] CONTEXT:  Citus Background Task Queue Executor: testdb/postgres for (1/2)
2025-02-09 21:47:03   UTC [57181] LOG:  task jobid/taskid succeeded: 1/2
2025-02-09 21:47:03   UTC [57181] CONTEXT:  Citus Background Task Queue Monitor: testdb
2025-02-09 21:47:03   UTC [57181] LOG:  task jobid/taskid started: 1/3
--- the following 1034 lines are omitted; they are all about task jobs, and all succeeded.
2025-02-09 21:48:58   UTC [57181] LOG:  task jobid/taskid started: 1/89
2025-02-09 21:48:58   UTC [57181] CONTEXT:  Citus Background Task Queue Monitor: testdb
2025-02-09 21:48:59   UTC [57662] LOG:  deferred drop of orphaned resource public.t2_102219 on host.docker.internal:5435 completed
2025-02-09 21:48:59   UTC [57662] STATEMENT:  CALL citus_cleanup_orphaned_resources()
2025-02-09 21:49:00   UTC [57661] LOG:  The LSN of the target subscriptions on node host.docker.internal:5436 have caught up with the source LSN 
2025-02-09 21:49:00   UTC [57661] CONTEXT:  Citus Background Task Queue Executor: testdb/postgres for (1/89)
2025-02-09 21:49:00   UTC [57661] LOG:  The LSN of the target subscriptions on node host.docker.internal:5436 have caught up with the source LSN 
2025-02-09 21:49:00   UTC [57661] CONTEXT:  Citus Background Task Queue Executor: testdb/postgres for (1/89)
2025-02-09 21:49:00   UTC [57661] LOG:  The LSN of the target subscriptions on node host.docker.internal:5436 have caught up with the source LSN 
2025-02-09 21:49:00   UTC [57661] CONTEXT:  Citus Background Task Queue Executor: testdb/postgres for (1/89)
2025-02-09 21:49:00   UTC [57181] LOG:  task jobid/taskid succeeded: 1/89
2025-02-09 21:49:00   UTC [57181] CONTEXT:  Citus Background Task Queue Monitor: testdb
2025-02-09 21:49:00   UTC [57181] LOG:  task jobid/taskid started: 1/90
2025-02-09 21:49:00   UTC [57181] CONTEXT:  Citus Background Task Queue Monitor: testdb
2025-02-09 21:49:00   UTC [57666] LOG:  deferred drop of orphaned resource public.t2_102220 on host.docker.internal:5435 completed
2025-02-09 21:49:00   UTC [57666] STATEMENT:  CALL citus_cleanup_orphaned_resources()
2025-02-09 21:49:01   UTC [57665] LOG:  The LSN of the target subscriptions on node host.docker.internal:5436 have caught up with the source LSN 
2025-02-09 21:49:01   UTC [57665] CONTEXT:  Citus Background Task Queue Executor: testdb/postgres for (1/90)
2025-02-09 21:49:01   UTC [57665] LOG:  The LSN of the target subscriptions on node host.docker.internal:5436 have caught up with the source LSN 
2025-02-09 21:49:01   UTC [57665] CONTEXT:  Citus Background Task Queue Executor: testdb/postgres for (1/90)
2025-02-09 21:49:01   UTC [57665] LOG:  The LSN of the target subscriptions on node host.docker.internal:5436 have caught up with the source LSN 
2025-02-09 21:49:01   UTC [57665] CONTEXT:  Citus Background Task Queue Executor: testdb/postgres for (1/90)
2025-02-09 21:49:01   UTC [57181] LOG:  task jobid/taskid succeeded: 1/90
2025-02-09 21:49:01   UTC [57181] CONTEXT:  Citus Background Task Queue Monitor: testdb
2025-02-09 21:49:01   UTC [57181] LOG:  task jobid/taskid started: 1/91
2025-02-09 21:49:01   UTC [57181] CONTEXT:  Citus Background Task Queue Monitor: testdb
2025-02-09 21:49:01   UTC [57676] LOG:  deferred drop of orphaned resource public.t2_102221 on host.docker.internal:5435 completed
2025-02-09 21:49:01   UTC [57676] STATEMENT:  CALL citus_cleanup_orphaned_resources()
2025-02-09 21:49:02   UTC [55973] FATAL:  terminating connection due to administrator command
2025-02-09 21:49:02   UTC [55594] FATAL:  terminating connection due to administrator command
2025-02-09 21:49:02   UTC [55587] FATAL:  terminating connection due to administrator command
2025-02-09 21:49:02   UTC [57181] LOG:  handling termination signal
2025-02-09 21:49:02   UTC [57181] CONTEXT:  Citus Background Task Queue Monitor: testdb
2025-02-09 21:49:02   UTC [57675] FATAL:  terminating background worker "Citus Background Task Queue Executor: testdb/postgres for (1/91)" due to administrator command
2025-02-09 21:49:02   UTC [57675] CONTEXT:  Citus Background Task Queue Executor: testdb/postgres for (1/91)
2025-02-09 21:49:02   UTC [55589] FATAL:  terminating connection due to administrator command
2025-02-09 21:49:02   UTC [57156] FATAL:  terminating connection due to administrator command
2025-02-09 21:49:02   UTC [57156] STATEMENT:  SET statement_timeout = '240s';SELECT citus_rebalance_wait();
2025-02-09 21:49:02   UTC [57181] LOG:  task jobid/taskid failed: 1/91
2025-02-09 21:49:02   UTC [57181] CONTEXT:  Citus Background Task Queue Monitor: testdb
2025-02-09 21:49:02   UTC [55593] FATAL:  terminating connection due to administrator command
2025-02-09 21:49:02   UTC [55591] FATAL:  terminating connection due to administrator command
2025-02-09 21:49:02   UTC [1] LOG:  background worker "Citus Background Task Queue Executor: testdb/postgres for (1/91)" (PID 57675) exited with exit code 1
2025-02-09 21:49:02   UTC [57677] FATAL:  terminating connection due to administrator command
2025-02-09 21:49:03   UTC [57680] LOG:  PID 55589 in cancel request did not match any process
2025-02-09 21:49:03   UTC [57681] LOG:  PID 55587 in cancel request did not match any process
2025-02-09 21:49:03   UTC [57682] LOG:  PID 55591 in cancel request did not match any process
2025-02-09 21:49:03   UTC [27] LOG:  checkpoint starting: immediate force wait
2025-02-09 21:49:03   UTC [27] LOG:  checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.007 s, sync=0.004 s, total=0.024 s; sync files=2, longest=0.003 s, average=0.002 s; distance=1336 kB, estimate=7817 kB; lsn=8/A69012C0, redo lsn=8/A6901268
2025-02-09 21:49:03   UTC [57684] LOG:  PID 55594 in cancel request did not match any process
2025-02-09 21:49:03   UTC [57685] LOG:  PID 55593 in cancel request did not match any process
2025-02-09 21:49:05   UTC [57687] LOG:  starting maintenance daemon on database 5024507 user 10
2025-02-09 21:49:05   UTC [57687] CONTEXT:  Citus maintenance daemon for database 5024507 user 10
2025-02-09 21:49:23   UTC [57683] LOG:  could not receive data from client: Connection reset by peer
2025-02-09 21:54:03   UTC [27] LOG:  checkpoint starting: time
2025-02-09 21:55:55   UTC [27] LOG:  checkpoint complete: wrote 1117 buffers (6.8%); 0 WAL file(s) added, 0 removed, 0 recycled; write=111.661 s, sync=0.075 s, total=111.757 s; sync files=400, longest=0.003 s, average=0.001 s; distance=6806 kB, estimate=7716 kB; lsn=8/A6FA6CC8, redo lsn=8/A6FA6C38

@duerwuyi

The log of the worker node:

2025-02-09 21:47:02   UTC [70304] LOG:  logical decoding found consistent point at 9/C5A55C38
2025-02-09 21:47:02   UTC [70304] DETAIL:  There are no running transactions.
2025-02-09 21:47:02   UTC [70304] STATEMENT:  CREATE_REPLICATION_SLOT citus_shard_move_slot_9_10_249 LOGICAL pgoutput EXPORT_SNAPSHOT;
2025-02-09 21:47:02   UTC [70304] LOG:  exported logical decoding snapshot: "00000073-00001961-1" with 0 transaction IDs
2025-02-09 21:47:02   UTC [70304] STATEMENT:  CREATE_REPLICATION_SLOT citus_shard_move_slot_9_10_249 LOGICAL pgoutput EXPORT_SNAPSHOT;
2025-02-09 21:47:02   UTC [70306] LOG:  starting logical decoding for slot "citus_shard_move_slot_9_10_249"
2025-02-09 21:47:02   UTC [70306] DETAIL:  Streaming transactions committing after 9/C5A55C70, reading WAL from 9/C5A55C38.
2025-02-09 21:47:02   UTC [70306] STATEMENT:  START_REPLICATION SLOT "citus_shard_move_slot_9_10_249" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names '"citus_shard_move_publication_9_10_249"', binary 'true')
2025-02-09 21:47:02   UTC [70306] LOG:  logical decoding found consistent point at 9/C5A55C38
2025-02-09 21:47:02   UTC [70306] DETAIL:  There are no running transactions.
2025-02-09 21:47:02   UTC [70306] STATEMENT:  START_REPLICATION SLOT "citus_shard_move_slot_9_10_249" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names '"citus_shard_move_publication_9_10_249"', binary 'true')
2025-02-09 21:47:04   UTC [70318] LOG:  logical decoding found consistent point at 9/C5A6A288
--- the following 960 lines are omitted; they are all about replication slots, and all succeeded.
2025-02-09 21:49:00   UTC [71134] STATEMENT:  CREATE_REPLICATION_SLOT citus_shard_move_slot_9_10_337 LOGICAL pgoutput EXPORT_SNAPSHOT;
2025-02-09 21:49:00   UTC [71134] LOG:  exported logical decoding snapshot: "00000075-0000198D-1" with 0 transaction IDs
2025-02-09 21:49:00   UTC [71134] STATEMENT:  CREATE_REPLICATION_SLOT citus_shard_move_slot_9_10_337 LOGICAL pgoutput EXPORT_SNAPSHOT;
2025-02-09 21:49:00   UTC [71136] LOG:  starting logical decoding for slot "citus_shard_move_slot_9_10_337"
2025-02-09 21:49:00   UTC [71136] DETAIL:  Streaming transactions committing after 9/C5C50EE8, reading WAL from 9/C5C50EB0.
2025-02-09 21:49:00   UTC [71136] STATEMENT:  START_REPLICATION SLOT "citus_shard_move_slot_9_10_337" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names '"citus_shard_move_publication_9_10_337"', binary 'true')
2025-02-09 21:49:00   UTC [71136] LOG:  logical decoding found consistent point at 9/C5C50EB0
2025-02-09 21:49:00   UTC [71136] DETAIL:  There are no running transactions.
2025-02-09 21:49:00   UTC [71136] STATEMENT:  START_REPLICATION SLOT "citus_shard_move_slot_9_10_337" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names '"citus_shard_move_publication_9_10_337"', binary 'true')
2025-02-09 21:49:01   UTC [71141] LOG:  logical decoding found consistent point at 9/C5C53548
2025-02-09 21:49:01   UTC [71141] DETAIL:  There are no running transactions.
2025-02-09 21:49:01   UTC [71141] STATEMENT:  CREATE_REPLICATION_SLOT citus_shard_move_slot_9_10_338 LOGICAL pgoutput EXPORT_SNAPSHOT;
2025-02-09 21:49:01   UTC [71141] LOG:  exported logical decoding snapshot: "00000078-000019D6-1" with 0 transaction IDs
2025-02-09 21:49:01   UTC [71141] STATEMENT:  CREATE_REPLICATION_SLOT citus_shard_move_slot_9_10_338 LOGICAL pgoutput EXPORT_SNAPSHOT;
2025-02-09 21:49:01   UTC [71143] LOG:  starting logical decoding for slot "citus_shard_move_slot_9_10_338"
2025-02-09 21:49:01   UTC [71143] DETAIL:  Streaming transactions committing after 9/C5C53580, reading WAL from 9/C5C53548.
2025-02-09 21:49:01   UTC [71143] STATEMENT:  START_REPLICATION SLOT "citus_shard_move_slot_9_10_338" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names '"citus_shard_move_publication_9_10_338"', binary 'true')
2025-02-09 21:49:01   UTC [71143] LOG:  logical decoding found consistent point at 9/C5C53548
2025-02-09 21:49:01   UTC [71143] DETAIL:  There are no running transactions.
2025-02-09 21:49:01   UTC [71143] STATEMENT:  START_REPLICATION SLOT "citus_shard_move_slot_9_10_338" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names '"citus_shard_move_publication_9_10_338"', binary 'true')
2025-02-09 21:49:03   UTC [68165] WARNING:  terminating connection due to administrator command
2025-02-09 21:49:03   UTC [68165] CONTEXT:  while executing command on host.docker.internal:5433
2025-02-09 21:49:03     Citus maintenance daemon for database 6287007 user 10
2025-02-09 21:49:04   UTC [71152] ERROR:  database "testdb" is used by an active logical replication slot
2025-02-09 21:49:04   UTC [71152] DETAIL:  There is 1 active slot.
2025-02-09 21:49:04   UTC [71152] STATEMENT:  drop database if exists testdb with (force);
2025-02-09 21:49:10   UTC [71160] ERROR:  database "testdb" is used by an active logical replication slot
2025-02-09 21:49:10   UTC [71160] DETAIL:  There is 1 active slot.
2025-02-09 21:49:10   UTC [71160] STATEMENT:  drop database if exists testdb with (force);
2025-02-09 21:49:16   UTC [71175] ERROR:  database "testdb" is used by an active logical replication slot
2025-02-09 21:49:16   UTC [71175] DETAIL:  There is 1 active slot.
2025-02-09 21:49:16   UTC [71175] STATEMENT:  drop database if exists testdb with (force);
2025-02-09 21:49:22   UTC [71183] ERROR:  database "testdb" is used by an active logical replication slot
2025-02-09 21:49:22   UTC [71183] DETAIL:  There is 1 active slot.
2025-02-09 21:49:22   UTC [71183] STATEMENT:  drop database if exists testdb with (force);
2025-02-09 21:49:23   UTC [71183] LOG:  could not receive data from client: Connection reset by peer
2025-02-09 21:50:15   UTC [71279] LOG:  starting maintenance daemon on database 6287007 user 10
2025-02-09 21:50:15   UTC [71279] CONTEXT:  Citus maintenance daemon for database 6287007 user 10
2025-02-09 21:50:34   UTC [71320] ERROR:  database "testdb" is used by an active logical replication slot
2025-02-09 21:50:34   UTC [71320] DETAIL:  There is 1 active slot.
