Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid re-assigning the global pid for client backends and bg workers when the application_name changes #7791

Merged
merged 12 commits into from
Dec 23, 2024
Merged
Show file tree
Hide file tree
Changes from 11 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
31 changes: 22 additions & 9 deletions src/backend/distributed/shared_library_init.c
Original file line number Diff line number Diff line change
Expand Up @@ -2890,14 +2890,27 @@ ApplicationNameAssignHook(const char *newval, void *extra)
DetermineCitusBackendType(newval);

/*
* AssignGlobalPID might read from catalog tables to get the the local
* nodeid. But ApplicationNameAssignHook might be called before catalog
* access is available to the backend (such as in early stages of
* authentication). We use StartupCitusBackend to initialize the global pid
* after catalogs are available. After that happens this hook becomes
* responsible to update the global pid on later application_name changes.
* So we set the FinishedStartupCitusBackend flag in StartupCitusBackend to
* indicate when this responsibility handoff has happened.
* We use StartupCitusBackend to initialize the global pid after catalogs
* are available. After that happens this hook becomes responsible to update
* the global pid on later application_name changes. So we set the
* FinishedStartupCitusBackend flag in StartupCitusBackend to indicate when
* this responsibility handoff has happened.
*
* Also note that when application_name changes, we don't actually need to
* try re-assigning the global pid for external client backends and
* background workers because application_name doesn't affect the global
* pid for such backends - note that !IsExternalClientBackend() check covers
* both types of backends. Plus,
* trying to re-assign the global pid for such backends would unnecessarily
* cause performing a catalog access when the cached local node id is
* invalidated. However, accessing to the catalog tables is dangerous in
* certain situations like when we're not in a transaction block. And for
* the other types of backends, i.e., the Citus internal backends, we need
* to re-assign the global pid when the application_name changes because for
* such backends we simply extract the global pid inherited from the
* originating backend from the application_name -that's specified by
* originating backend when openning that connection- and this doesn't require
* catalog access.
*
* Another solution to the catalog table acccess problem would be to update
* global pid lazily, like we do for HideShards. But that's not possible
Expand All @@ -2907,7 +2920,7 @@ ApplicationNameAssignHook(const char *newval, void *extra)
* as reasonably possible, which is also why we extract global pids in the
* AuthHook already (extracting doesn't require catalog access).
*/
if (FinishedStartupCitusBackend)
if (FinishedStartupCitusBackend && !IsExternalClientBackend())
{
AssignGlobalPID(newval);
}
Expand Down
4 changes: 4 additions & 0 deletions src/backend/distributed/test/run_from_same_connection.c
Original file line number Diff line number Diff line change
Expand Up @@ -190,6 +190,9 @@ run_commands_on_session_level_connection_to_node(PG_FUNCTION_ARGS)

/*
* override_backend_data_gpid is a wrapper around SetBackendDataGpid().
* Also sets distributedCommandOriginator to true since the only caller of
* this method calls this function actually wants this backend to
* be treated as a distributed command originator with the given global pid.
*/
Datum
override_backend_data_gpid(PG_FUNCTION_ARGS)
Expand All @@ -199,6 +202,7 @@ override_backend_data_gpid(PG_FUNCTION_ARGS)
uint64 gpid = PG_GETARG_INT64(0);

SetBackendDataGlobalPID(gpid);
SetBackendDataDistributedCommandOriginator(true);

PG_RETURN_VOID();
}
Expand Down
17 changes: 17 additions & 0 deletions src/backend/distributed/transaction/backend_data.c
Original file line number Diff line number Diff line change
Expand Up @@ -964,6 +964,23 @@
}


/*
* SetBackendDataDistributedCommandOriginator sets the distributedCommandOriginator
* field on MyBackendData.
*/
void
SetBackendDataDistributedCommandOriginator(bool distributedCommandOriginator)
{
if (!MyBackendData)
{
return;
}
SpinLockAcquire(&MyBackendData->mutex);

Check warning on line 978 in src/backend/distributed/transaction/backend_data.c

View check run for this annotation

Codecov / codecov/patch

src/backend/distributed/transaction/backend_data.c#L978

Added line #L978 was not covered by tests
MyBackendData->distributedCommandOriginator = distributedCommandOriginator;
SpinLockRelease(&MyBackendData->mutex);
}


/*
* GetGlobalPID returns the global process id of the current backend.
*/
Expand Down
1 change: 1 addition & 0 deletions src/include/distributed/backend_data.h
Original file line number Diff line number Diff line change
Expand Up @@ -61,6 +61,7 @@ extern void AssignGlobalPID(const char *applicationName);
extern uint64 GetGlobalPID(void);
extern void SetBackendDataDatabaseId(void);
extern void SetBackendDataGlobalPID(uint64 gpid);
extern void SetBackendDataDistributedCommandOriginator(bool distributedCommandOriginator);
extern uint64 ExtractGlobalPID(const char *applicationName);
extern int ExtractNodeIdFromGlobalPID(uint64 globalPID, bool missingOk);
extern int ExtractProcessIdFromGlobalPID(uint64 globalPID);
Expand Down
31 changes: 31 additions & 0 deletions src/test/regress/expected/remove_coordinator.out
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,37 @@ SELECT master_remove_node('localhost', :master_port);

(1 row)

-- to silence -potentially flaky- "could not establish connection after" warnings in below test
SET client_min_messages TO ERROR;
-- to fail fast when the hostname is not resolvable, as it will be the case below
SET citus.node_connection_timeout to '1s';
BEGIN;
SET application_name TO 'new_app_name';
-- that should fail because of bad hostname & port
SELECT citus_add_node('200.200.200.200', 1, 200);
ERROR: connection to the remote node [email protected]:1 failed
-- Since above command failed, now Postgres will need to revert the
-- application_name change made in this transaction and this will
-- happen within abort-transaction callback, so we won't be in a
-- transaction block while Postgres does that.
--
-- And when the application_name changes, Citus tries to re-assign
-- the global pid but it does so only for Citus internal backends,
-- and doing so for Citus internal backends doesn't require being
-- in a transaction block and is safe.
--
-- However, for the client external backends (like us here), Citus
-- doesn't re-assign the global pid because it's not needed and it's
-- not safe to do so outside of a transaction block. This is because,
-- it would require performing a catalog access to retrive the local
-- node id when the cached local node is invalidated like what just
-- happened here because of the failed citus_add_node() call made
-- above.
--
-- So by failing here (rather than crashing), we ensure this behavior.
ROLLBACK;
RESET client_min_messages;
RESET citus.node_connection_timeout;
-- restore coordinator for the rest of the tests
SELECT citus_set_coordinator_host('localhost', :master_port);
citus_set_coordinator_host
Expand Down
36 changes: 36 additions & 0 deletions src/test/regress/sql/remove_coordinator.sql
Original file line number Diff line number Diff line change
@@ -1,5 +1,41 @@
-- removing coordinator from pg_dist_node should update pg_dist_colocation
SELECT master_remove_node('localhost', :master_port);

-- to silence -potentially flaky- "could not establish connection after" warnings in below test
SET client_min_messages TO ERROR;

-- to fail fast when the hostname is not resolvable, as it will be the case below
SET citus.node_connection_timeout to '1s';

BEGIN;
SET application_name TO 'new_app_name';

-- that should fail because of bad hostname & port
SELECT citus_add_node('200.200.200.200', 1, 200);

-- Since above command failed, now Postgres will need to revert the
-- application_name change made in this transaction and this will
-- happen within abort-transaction callback, so we won't be in a
-- transaction block while Postgres does that.
--
-- And when the application_name changes, Citus tries to re-assign
-- the global pid but it does so only for Citus internal backends,
-- and doing so for Citus internal backends doesn't require being
-- in a transaction block and is safe.
--
-- However, for the client external backends (like us here), Citus
-- doesn't re-assign the global pid because it's not needed and it's
-- not safe to do so outside of a transaction block. This is because,
-- it would require performing a catalog access to retrive the local
-- node id when the cached local node is invalidated like what just
-- happened here because of the failed citus_add_node() call made
-- above.
--
-- So by failing here (rather than crashing), we ensure this behavior.
ROLLBACK;

RESET client_min_messages;
RESET citus.node_connection_timeout;

-- restore coordinator for the rest of the tests
SELECT citus_set_coordinator_host('localhost', :master_port);
Loading