Fix 7875 #7910
base: release-13.0
Codecov Report
❌ Your patch check has failed because the patch coverage (0.00%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files
@@             Coverage Diff              @@
##          release-13.0    #7910      +/-   ##
================================================
- Coverage        89.48%    1.22%   -88.26%
================================================
  Files              276      276
  Lines            60063    59391      -672
  Branches          7524     7412      -112
================================================
- Hits             53747      729    -53018
- Misses            4166    58581    +54415
+ Partials          2150       81     -2069
…ng Citus upgrade tests" This reverts commit b6b73e2.
…des (citusdata#7875)
Prepared transactions on the local node are initiated and managed by a remote worker in the cluster. As such, we now only call RecoverWorkerTransactions for remote nodes, ensuring that transaction recovery is handled by the appropriate node.
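The node selection described in this commit can be modelled roughly as follows. This is an illustrative Python sketch, not the Citus C code: `WorkerNode`, `group_id`, `local_group_id`, and `nodes_to_recover` are hypothetical names, and identifying the local node by a group id is an assumption made only for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerNode:
    # Hypothetical stand-in for a worker node entry; not a Citus type.
    name: str
    group_id: int


def nodes_to_recover(nodes, local_group_id):
    """Return only the remote nodes, skipping the node we are running on.

    Assumption: the local node is the one whose group_id equals
    local_group_id. Recovery is then attempted for remote nodes only.
    """
    return [node for node in nodes if node.group_id != local_group_id]


nodes = [
    WorkerNode("coordinator", 0),
    WorkerNode("worker-1", 1),
    WorkerNode("worker-2", 2),
]
remote = nodes_to_recover(nodes, local_group_id=0)
```

Filtering before the recovery loop keeps the loop body unchanged; only the set of nodes it iterates over shrinks.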
Fix Deadlock with transaction recovery is possible during Citus upgrades (#7875)
Currently, RecoverWorkerTransactions() creates a new connection for each worker node and then performs transaction recovery by reading and locking the pg_dist_transaction catalog table until the end of the transaction. When RecoverTwoPhaseCommits() calls RecoverWorkerTransactions() for each worker node, the lock acquisition order between pg_dist_authinfo and pg_dist_transaction can reverse on alternate iterations. This reversal can lead to a deadlock if any concurrent process requires locks on these catalog tables, a situation that has surfaced during the Citus upgrade workflow. To resolve this, we now pre-establish all worker node connections upfront.
This change ensures that RecoverWorkerTransactions() operates with a single, consistent distributed catalog table connection, thereby always acquiring locks on pg_dist_authinfo and pg_dist_transaction in the correct order and preventing potential deadlocks during extension updates or similar operations.
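The ordering argument above can be checked with a small model. This is a simplified Python sketch, not the actual implementation: it assumes that opening a worker connection touches pg_dist_authinfo and that recovery then touches pg_dist_transaction and holds it until the end of the transaction. The function names `lock_trace` and `has_order_reversal` are hypothetical.

```python
AUTHINFO = "pg_dist_authinfo"
TRANSACTION = "pg_dist_transaction"


def lock_trace(worker_count, preconnect):
    """Return the order in which the two catalog tables are touched.

    preconnect=False models the old flow: connect, then recover, per worker.
    preconnect=True models the new flow: connect to all workers first,
    then run recovery for each of them.
    """
    trace = []
    if preconnect:
        trace.extend([AUTHINFO] * worker_count)
        trace.extend([TRANSACTION] * worker_count)
    else:
        for _ in range(worker_count):
            trace.append(AUTHINFO)      # assumed: connection setup
            trace.append(TRANSACTION)   # recovery locks the catalog table
    return trace


def has_order_reversal(trace):
    """True if AUTHINFO is touched while TRANSACTION is already held."""
    holding_transaction = False
    for table in trace:
        if table == TRANSACTION:
            holding_transaction = True
        elif table == AUTHINFO and holding_transaction:
            return True
    return False


old_flow = lock_trace(2, preconnect=False)
new_flow = lock_trace(2, preconnect=True)
```

In this model the old flow with two or more workers touches pg_dist_authinfo while pg_dist_transaction is still held, which is the reversed order a concurrent process could deadlock against. The pre-connected flow never does.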
The PR also reverts the commit that disabled the relevant test cases,