standby node stuck in catchingup state when replication slot lost on primary node #1031

rhicks0614 · 2024-04-03T18:31:58Z

Hi,

I'm trying out pg_auto_failover version 2.1.2, using two nodes (postgresql-14).

I'm also setting max_slot_wal_keep_size because of disk space limitations.

After stopping the standby node, the primary node went to "wait_primary" as expected.

Eventually the replication slot's wal_status changes to "lost" because max_slot_wal_keep_size is set:

postgres=# select * from pg_replication_slots;
        slot_name         | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase
--------------------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+-----------
 pgautofailover_standby_2 |        | physical  |        |          | f         | f      |            |  749 |              |             |                     | lost       |               | f
(1 row)
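
For anyone who wants to watch for this condition, here is a minimal sketch in Python. It assumes the rows of pg_replication_slots have already been fetched as dictionaries by whatever driver is in use; the helper name lost_slots is hypothetical and not part of pg_auto_failover.

```python
from typing import Iterable, Mapping


def lost_slots(rows: Iterable[Mapping[str, object]]) -> list[str]:
    """Return the names of replication slots whose wal_status is 'lost'."""
    return [str(row["slot_name"]) for row in rows if row.get("wal_status") == "lost"]


# Sample row shaped like the query output above (subset of columns only).
rows = [
    {
        "slot_name": "pgautofailover_standby_2",
        "slot_type": "physical",
        "active": False,
        "wal_status": "lost",
    },
]

print(lost_slots(rows))
```

A slot reported here can no longer serve the WAL its standby needs, which is the situation described in this post.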

After I restart the standby, it stays stuck in the "catchingup" state according to "pg_autoctl show state":

                    Name |  Node |                                Host:Port |        TLI: LSN |   Connection |      Reported State |      Assigned State
-------------------------+-------+------------------------------------------+-----------------+--------------+---------------------+--------------------
appliance_apm00232407584 |     4 | fded:88e7:d92c:0:201:4427:b7b2:9093:5435 |   1: 1/3B521D68 |   read-write |        wait_primary |        wait_primary
appliance_apm00232407729 |     5 | fded:88e7:d92c:0:201:4493:23d1:6c9d:5435 |   1: 0/9E000000 |    read-only |          catchingup |          catchingup

In the PostgreSQL logs, the standby keeps trying to resume streaming, even though the WAL segment it needs has been removed (lost):

...
2024-04-03 17:46:54.633 UTC [5128] LOG:  started streaming WAL from primary at 0/9E000000 on timeline 1
2024-04-03 17:46:54.633 UTC [5128] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 00000001000000000000009E has already been removed
2024-04-03 17:46:59.637 UTC [5135] LOG:  started streaming WAL from primary at 0/9E000000 on timeline 1
2024-04-03 17:46:59.637 UTC [5135] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 00000001000000000000009E has already been removed
2024-04-03 17:47:04.641 UTC [5142] LOG:  started streaming WAL from primary at 0/9E000000 on timeline 1
2024-04-03 17:47:04.641 UTC [5142] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 00000001000000000000009E has already been removed
...
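
To confirm that the segment named in the error is the one containing the standby's LSN, the mapping from an LSN to a WAL file name can be sketched as below. This assumes the default 16 MB WAL segment size (a cluster initialized with a different --wal-segsize would need another value); the function name wal_segment_name is hypothetical.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16 MB WAL segments


def wal_segment_name(timeline: int, lsn: str, seg_size: int = WAL_SEGMENT_SIZE) -> str:
    """Compute the WAL file name that contains the given LSN on a timeline."""
    # An LSN such as "0/9E000000" is two hex numbers: high and low 32 bits.
    high, low = (int(part, 16) for part in lsn.split("/"))
    position = (high << 32) | low
    segno = position // seg_size
    # Number of segments that fit in one 4 GB "xlogid" range.
    segs_per_xlogid = 0x100000000 // seg_size
    return f"{timeline:08X}{segno // segs_per_xlogid:08X}{segno % segs_per_xlogid:08X}"


# Standby LSN and timeline taken from the "pg_autoctl show state" output.
print(wal_segment_name(1, "0/9E000000"))  # prints 00000001000000000000009E
```

The result matches the segment in the FATAL message, so the standby is requesting exactly the WAL that the primary has already recycled.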

Is there a way to make the standby give up on streaming and fall back to pg_basebackup to recover?

Thanks!

Replies: 0 comments