Deadlock with transaction recovery is possible during Citus upgrades #7875

onurctirtir · 2025-02-04T07:52:44Z

This is especially possible for the upgrade paths where we are taking strong locks, as in citus--11.2-2--11.3-1.sql.

An example on how this upgrade path can cause a deadlock with transaction recovery:

RecoverTwoPhaseCommits() acquires RowExclusiveLock on pg_dist_transaction via RecoverWorkerTransactions() and it doesn't release the lock until the end of the transaction as close the table with "NoLock".
Upgrade script acquires AccessExclusiveLock on pg_dist_authinfo because of the following DDL:
"ALTER TABLE pg_catalog.pg_dist_authinfo REPLICA IDENTITY USING INDEX pg_dist_authinfo_identification_index;"
RecoverTwoPhaseCommits() continues to process the next worker via RecoverWorkerTransactions() and this implicitly tries to acquire AccessShareLock on pg_dist_authinfo via GetNodeConnection().
Upgrade script tries to acquire AccessExclusiveLock on pg_dist_transaction because of the following DDL:
"ALTER TABLE pg_catalog.pg_dist_transaction REPLICA IDENTITY USING INDEX pg_dist_transaction_unique_constraint;"

Now, while the process that is executing "ALTER EXTENSION citus UPDATE" is waiting to acquire AccessExclusiveLock on pg_dist_transaction while holding AccessExclusiveLock on pg_dist_authinfo; maintenance daemon is waiting to acquire AccessShareLock on pg_dist_authinfo while holding AccessExclusiveLock on pg_dist_authinfo.

As a result, the processes involved in the deadlock are cancelled by Postgres to resolve the deadlock.

Luckily, this doesn't leave the database in a bad state or such because Postgres implicitly executes the upgrade scripts within an implicit transaction block and a retry mostly helps.

But rather than retrying, until we properly fix this issue, disabling 2PC recovery during Citus upgrade seems like a more reliable workaround, as in;

ALTER SYSTEM SET citus.recover_2pc_interval TO -1;
SELECT pg_reload_conf();

ALTER EXTENSION citus UPDATE;

ALTER SYSTEM RESET citus.recover_2pc_interval;
SELECT pg_reload_conf();

onurctirtir · 2025-02-04T14:00:06Z

Note to future dev:

Once we come up with a fix for that, we should also revert b6b73e2 from the branches where it presents while backporting the fix.

As of today, only main & release-13.0 has b6b73e2.

…des (citusdata#7875) Currently, RecoverWorkerTransactions() creates a new connection for each worker node and then performs transaction recovery by reading and locking the pg_dist_transaction catalog table until the end of the transaction. When RecoverTwoPhaseCommits() calls RecoverWorkerTransactions() for each worker node, the lock acquisition order between pg_dist_authinfo and pg_dist_transaction can reverse on alternate iterations. This reversal can lead to a deadlock if any concurrent process requires locks on these catalog tables—a situation that has surfaced during the Citus upgrade workflow. To resolve this, we now pre-establish all worker node connections upfront. This change ensures that RecoverWorkerTransactions() operates with a single, consistent distributed catalog table connection, thereby always acquiring locks on pg_dist_authinfo and pg_dist_transaction in the correct order and preventing potential deadlocks during extension updates or similar operations.

onurctirtir added the bug label Feb 4, 2025

onurctirtir mentioned this issue Feb 4, 2025

Revert "Release RowExclusiveLock on pg_dist_transaction as soon as remote xacts are recovered" #7877

Merged

onurctirtir added the cherry-pick-13.0 label Feb 11, 2025

codeforall mentioned this issue Feb 24, 2025

Fix 7875 #7910

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deadlock with transaction recovery is possible during Citus upgrades #7875

Deadlock with transaction recovery is possible during Citus upgrades #7875

onurctirtir commented Feb 4, 2025 •

edited

Loading

onurctirtir commented Feb 4, 2025

Deadlock with transaction recovery is possible during Citus upgrades #7875

Deadlock with transaction recovery is possible during Citus upgrades #7875

Comments

onurctirtir commented Feb 4, 2025 • edited Loading

onurctirtir commented Feb 4, 2025

onurctirtir commented Feb 4, 2025 •

edited

Loading