You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Citus MX and (in general Citus) requires the nodes to be able to connect to each other. Citus now provides a simple UDF that provides the connectivity information across the whole cluster.
Possible improvements in data planes
We can check for connectivity issues regularly in our managed database offerings. This can help detect network issues before they are reported by the customer via a ticket.
What are the typical use cases?
Routine controls in Control planes
Database Operators working on troubleshooting in a cluster
Communication goals (e.g. detailed howto vs orientation)
Good locations for content in docs structure
A new section under Useful Diagnostic Queries
Cluster Management -> Security -> Connection management
We can have a note section here indicating that the users can use the new UDF to check for broken ssl setups, or other connection management configuration problems.
How does this work? (devs)
citus_check_cluster_node_health() returns setof records with names and ports for the source and target nodes, and the connectivity check result.
There are 3 possible result column values:
true : connection attempt from source to target succeeded.
false : connection attempt from source to target failed.
NULL : connection attempt from the current node to source node failed.
Example sql
This UDF can return many rows on large clusters. We suggest operators to run the following query that returns a single boolean first. If this one is true, all the checks were successful. Otherwise, the operator should look into the connectivity report on all the worker pairs.
SELECT bool_and(coalesce(result, false)) FROM citus_check_cluster_node_health();
bool_and
---------------------------------------------------------------------
f
(1 row)
For a quick summary, one can also use the following query to get the distinct counts of connectivity results in the cluster.
SELECT result, count(*) FROM citus_check_cluster_node_health() GROUP BY result ORDER BY 1;
result | count
---------------------------------------------------------------------
t | 2
| 2
(2 rows)
To check all the results for the connectivity checks, run the following:
SELECT * FROM citus_check_cluster_node_health() ORDER BY 1,2,3,4;
from_nodename | from_nodeport | to_nodename | to_nodeport | result
---------------------------------------------------------------------
localhost | 57637 | localhost | 57637 | t
localhost | 57637 | localhost | 57638 | t
localhost | 57638 | localhost | 57637 |
localhost | 57638 | localhost | 57638 |
Corner cases, gotchas
It is risky to use the following query to check for an aggregate summary: SELECT bool_and(result) FROM citus_check_cluster_node_health();
because bool_and aggregate returns true even if some of the rows were NULL. Using coalesce(result, false) is crucial here.
Are there relevant blog posts or outside documentation about the concept/feature?
Link to relevant commits and regression tests if applicable
#1038 includes docs for citus_check_cluster_node_health() itself, but we're still lacking content in useful diagnostic queries and the cluster management page like you suggsted.
Why are we implementing it? (sales eng)
Citus MX and (in general Citus) requires the nodes to be able to connect to each other. Citus now provides a simple UDF that provides the connectivity information across the whole cluster.
We can check for connectivity issues regularly in our managed database offerings. This can help detect network issues before they are reported by the customer via a ticket.
What are the typical use cases?
Communication goals (e.g. detailed howto vs orientation)
Good locations for content in docs structure
Useful Diagnostic Queries
We can have a note section here indicating that the users can use the new UDF to check for broken ssl setups, or other connection management configuration problems.
How does this work? (devs)
citus_check_cluster_node_health()
returns setof records with names and ports for the source and target nodes, and the connectivity check result.There are 3 possible result column values:
true
: connection attempt from source to target succeeded.false
: connection attempt from source to target failed.NULL
: connection attempt from the current node to source node failed.Example sql
This UDF can return many rows on large clusters. We suggest operators to run the following query that returns a single boolean first. If this one is true, all the checks were successful. Otherwise, the operator should look into the connectivity report on all the worker pairs.
For a quick summary, one can also use the following query to get the distinct counts of connectivity results in the cluster.
To check all the results for the connectivity checks, run the following:
Corner cases, gotchas
It is risky to use the following query to check for an aggregate summary:
SELECT bool_and(result) FROM citus_check_cluster_node_health();
because bool_and aggregate returns true even if some of the rows were NULL. Using
coalesce(result, false)
is crucial here.Are there relevant blog posts or outside documentation about the concept/feature?
Link to relevant commits and regression tests if applicable
citusdata/citus#5509
The text was updated successfully, but these errors were encountered: