Avoid re-assigning the global pid for client backends and bg workers when the application_name changes #7791

Merged: 12 commits, Dec 23, 2024
10 changes: 10 additions & 0 deletions src/backend/distributed/metadata/metadata_cache.c
@@ -4545,6 +4545,16 @@ GetLocalNodeId(void)
}


/*
* CachedLocalNodeIdIsValid returns true if the cached local node id is valid.
*/
bool
CachedLocalNodeIdIsValid(void)
{
return LocalNodeId != -1;
}


/*
* RegisterLocalGroupIdCacheCallbacks registers the callbacks required to
* maintain LocalGroupId at a consistent value. It's separate from
16 changes: 15 additions & 1 deletion src/backend/distributed/shared_library_init.c
@@ -2899,6 +2899,18 @@
* So we set the FinishedStartupCitusBackend flag in StartupCitusBackend to
* indicate when this responsibility handoff has happened.
*
* On the other hand, even though it's now this hook's responsibility to
* update the global pid, we cannot do so if we might need to read from the
* catalog while it's unsafe to do so. For Citus internal backends this can
* never happen, because AssignGlobalPID() just extracts their global pid
* from the application_name, and extracting doesn't require catalog access.
* But for external client backends, we either need to guarantee that we
* won't read from catalog tables or that it's safe to do so. The only case
* where AssignGlobalPID() could read from catalog tables is when the cached
* local node id is invalidated. For this reason, for external client
* backends, we require that either the cached local node id is valid or
* that we are in a transaction block, because in that case it's safe to
* read from the catalog.
*
* Another solution to the catalog table access problem would be to update
* global pid lazily, like we do for HideShards. But that's not possible
* for the global pid, since it is stored in shared memory instead of in a
@@ -2907,7 +2919,9 @@
* as reasonably possible, which is also why we extract global pids in the
* AuthHook already (extracting doesn't require catalog access).
*/
if (FinishedStartupCitusBackend)
if (FinishedStartupCitusBackend &&
(!IsExternalClientBackend() || CachedLocalNodeIdIsValid() ||
IsTransactionState()))

Codecov (codecov/patch): added line src/backend/distributed/shared_library_init.c#L2924 was not covered by tests
{
AssignGlobalPID(newval);
}
1 change: 1 addition & 0 deletions src/include/distributed/metadata_cache.h
@@ -181,6 +181,7 @@ extern CitusTableCacheEntry * LookupCitusTableCacheEntry(Oid relationId);
extern DistObjectCacheEntry * LookupDistObjectCacheEntry(Oid classid, Oid objid, int32
objsubid);
extern int32 GetLocalGroupId(void);
extern bool CachedLocalNodeIdIsValid(void);
extern int32 GetLocalNodeId(void);
extern void CitusTableCacheFlushInvalidatedEntries(void);
extern Oid LookupShardRelationFromCatalog(int64 shardId, bool missing_ok);
29 changes: 29 additions & 0 deletions src/test/regress/expected/remove_coordinator.out
@@ -5,6 +5,35 @@ SELECT master_remove_node('localhost', :master_port);

(1 row)

-- to silence the potentially flaky "could not establish connection after" warnings in the test below
SET client_min_messages TO ERROR;
-- to fail fast when the hostname is not resolvable, as will be the case below
SET citus.node_connection_timeout to '1s';
BEGIN;
SET application_name TO 'new_app_name';
-- that should fail because of bad hostname & port
SELECT citus_add_node('200.200.200.200', 1, 200);
ERROR: connection to the remote node [email protected]:1 failed
-- Since the above command failed, Postgres will need to revert the
-- application_name change made in this transaction and this will
-- happen within abort-transaction callback, so we won't be in a
-- transaction block while Postgres does that.
--
-- And when the application_name changes, Citus tries to re-assign
-- the global pid and doing so for Citus internal backends doesn't
-- require being in a transaction block and is safe.
--
-- However, for external client backends (like us here), Citus
-- doesn't try to re-assign the global pid if doing so requires catalog
-- access and we're outside of a transaction block. Note that in that
-- case the catalog access may only be needed to retrieve the local node
-- id when the cached local node id is invalidated, like what just happened
-- here because of the failed citus_add_node() call made above.
--
-- So by failing here (rather than crashing), we ensure this behavior.
ROLLBACK;
RESET client_min_messages;
RESET citus.node_connection_timeout;
-- restore coordinator for the rest of the tests
SELECT citus_set_coordinator_host('localhost', :master_port);
citus_set_coordinator_host
34 changes: 34 additions & 0 deletions src/test/regress/sql/remove_coordinator.sql
@@ -1,5 +1,39 @@
-- removing coordinator from pg_dist_node should update pg_dist_colocation
SELECT master_remove_node('localhost', :master_port);

-- to silence the potentially flaky "could not establish connection after" warnings in the test below
SET client_min_messages TO ERROR;

-- to fail fast when the hostname is not resolvable, as will be the case below
SET citus.node_connection_timeout to '1s';

BEGIN;
SET application_name TO 'new_app_name';

-- that should fail because of bad hostname & port
SELECT citus_add_node('200.200.200.200', 1, 200);

-- Since the above command failed, Postgres will need to revert the
-- application_name change made in this transaction and this will
-- happen within abort-transaction callback, so we won't be in a
-- transaction block while Postgres does that.
--
-- And when the application_name changes, Citus tries to re-assign
-- the global pid and doing so for Citus internal backends doesn't
-- require being in a transaction block and is safe.
--
-- However, for external client backends (like us here), Citus
-- doesn't try to re-assign the global pid if doing so requires catalog
-- access and we're outside of a transaction block. Note that in that
-- case the catalog access may only be needed to retrieve the local node
-- id when the cached local node id is invalidated, like what just happened
-- here because of the failed citus_add_node() call made above.
--
-- So by failing here (rather than crashing), we ensure this behavior.
ROLLBACK;

RESET client_min_messages;
RESET citus.node_connection_timeout;

-- restore coordinator for the rest of the tests
SELECT citus_set_coordinator_host('localhost', :master_port);