transport: log network reconnects with same peer process #128415
base: main
Conversation
ClusterConnectionManager now caches the previous ephemeralId (created on process-start) of peer nodes on disconnect in a connection history table. On reconnect, when a peer has the same ephemeralId as it did previously, this is logged to indicate a network failure. The connectionHistory is trimmed to the current set of peers by NodeConnectionsService.
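A minimal sketch of the mechanism described above (the method names and the exact record shape here are illustrative, not the PR's literal code):

// Sketch only: remember the peer's ephemeralId and close cause when a connection drops.
record NodeConnectionHistory(String ephemeralId, long disconnectTimeMillis, Exception disconnectCause) {}

void onDisconnect(DiscoveryNode node, Exception closeException) {
    nodeHistory.put(node.getId(),
        new NodeConnectionHistory(node.getEphemeralId(), System.currentTimeMillis(), closeException));
}

void onReconnect(DiscoveryNode node) {
    final NodeConnectionHistory previous = nodeHistory.remove(node.getId());
    if (previous != null && previous.ephemeralId().equals(node.getEphemeralId())) {
        // same ephemeralId means the peer process never restarted, so the earlier
        // disconnect was a network failure rather than a node restart: log it here
    }
}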
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
I wasn't able to find a way to test the ClusterConnectionManager's connectionHistory table when integrated through the NodeConnectionsService.
Looking good, just a few questions and minor comments.
server/src/main/java/org/elasticsearch/transport/ClusterConnectionManager.java (outdated comment, resolved)
/**
 * Keep the connection history for the nodes listed
 */
void retainConnectionHistory(List<DiscoveryNode> nodes);
In the javadoc I think we should mention that we discard history for nodes not in the list? If you know the Set API then it's suggested by the name retain, but if you don't it might not be obvious.
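For example, something along these lines (wording is only a suggestion):

/**
 * Keep the connection history only for the nodes listed; history entries for
 * any node not in the list are discarded.
 */
void retainConnectionHistory(List<DiscoveryNode> nodes);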
@@ -120,6 +122,7 @@ public void connectToNodes(DiscoveryNodes discoveryNodes, Runnable onCompletion)
                runnables.add(connectionTarget.connect(null));
            }
        }
        transportService.retainConnectionHistory(nodes);
We might be able to use DiscoveryNodes#getAllNodes() rather than building up an auxiliary collection; that might be marginally more efficient? Set#retainAll seems to take a Collection, but we'd need to change the ConnectionManager#retainConnectionHistory interface to accommodate.
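Roughly what that could look like if the interface took DiscoveryNodes directly (a sketch, assuming nodeHistory is keyed by node id as elsewhere in this PR):

void retainConnectionHistory(DiscoveryNodes discoveryNodes) {
    // collect the ids of the nodes we still care about
    final Set<String> retainedIds = discoveryNodes.getAllNodes().stream()
        .map(DiscoveryNode::getId)
        .collect(Collectors.toSet());
    // drop history entries for any node that has left the cluster
    nodeHistory.keySet().retainAll(retainedIds);
}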
Do we need a separate collection here at all? We could just pass discoveryNodes around I think.
But also, really this is cleaning out the nodes about which we no longer care, so I think we should be doing this in disconnectFromNodesExcept instead.
Nick raised an important point about the race between the connection history table and the close callback.
A connection's close callback will always put an entry in the history table. If the close is a consequence of a cluster state change and a disconnect in NodeConnectionsService, then it will add a node history entry right after it was supposed to be cleaned out.
Cleaning out the node history table whenever we disconnect from some nodes or connect to new ones works fine, but it means the history table always lags one version behind in what it holds onto.
I came up with a concurrency scheme that keeps the node history current in NodeConnectionsService, but it's more complicated.
public void onFailure(Exception e) {
    final NodeConnectionHistory hist = new NodeConnectionHistory(node.getEphemeralId(), e);
    nodeHistory.put(conn.getNode().getId(), hist);
}
Do we want to store the connection history even when conn.hasReferences() == false? I'm not 100% familiar with this code, but I wonder if we might get the occasional ungraceful disconnect after we've released all our references? I guess in that case we would eventually discard the entry via retainConnectionHistory anyway.
Do we need to be careful with the timing of calls to retainConnectionHistory versus these close handlers firing? I guess any entries that are added after a purge would not survive subsequent purges.
node.descriptionWithoutAttributes(),
e,
ReferenceDocs.NETWORK_DISCONNECT_TROUBLESHOOTING
);
It looks like previously we would only have logged at debug level in this scenario, unless I'm reading it wrong. I'm not sure how interesting this case is (as we were disconnecting from the node anyway)?
assertTrue("recent disconnects should be listed", connectionManager.connectionHistorySize() == 2); | ||
|
||
connectionManager.retainConnectionHistory(Collections.emptyList()); | ||
assertTrue("connection history should be emptied", connectionManager.connectionHistorySize() == 0); |
I wonder if it would be better to expose a read-only copy of the map for testing this, that would allow us to assert that the correct IDs were present?
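For instance (accessor name hypothetical), an immutable snapshot would let the test assert on the exact ids being tracked:

// test-only visibility: an unmodifiable snapshot of the history table
Map<String, NodeConnectionHistory> connectionHistoryForTesting() {
    return Map.copyOf(nodeHistory);
}

// in the test:
// assertEquals(Set.of(node1.getId(), node2.getId()), connectionManager.connectionHistoryForTesting().keySet());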
I think ClusterConnectionManager isn't quite the right place to do this - the job of this connection manager is to look after all node-to-node connections, including ones used for discovery and remote cluster connections too. There are situations where we might close and re-establish these kinds of connection without either end restarting, and without that being a problem worthy of logging.
NodeConnectionsService is the class that knows about connections to nodes in the cluster. I'd rather we implemented the logging about unexpected reconnects there. That does raise some difficulties about how to expose the exception that closed the connection, if such an exception exists. I did say that this bit would be tricky 😁 Nonetheless I'd rather we got the logging to happen in the right place first and then we can think about the plumbing needed to achieve this extra detail.
value = "org.elasticsearch.transport.ClusterConnectionManager:WARN", | ||
reason = "to ensure we log cluster manager disconnect events on WARN level" | ||
) | ||
public void testExceptionalDisconnectLoggingInClusterConnectionManager() throws Exception { |
Could we put this into its own test suite? This suite is supposed to be about ESLoggingHandler which is unrelated to the logging in ClusterConnectionManager. I think this test should work fine in the :server test suite, no need to hide it in the transport-netty4 module.
Also could you open a separate PR to move testConnectionLogging and testExceptionalDisconnectLogging out of this test suite - they're testing the logging in TcpTransport which is similarly unrelated to ESLoggingHandler. IIRC they were added here for historical reasons, but these days we use the Netty transport everywhere so these should work in :server too.
server/src/main/java/org/elasticsearch/transport/ClusterConnectionManager.java (outdated comment, resolved)
NodeConnectionHistory hist = nodeHistory.remove(connNode.getId());
if (hist != null && hist.ephemeralId.equals(connNode.getEphemeralId())) {
Could we extract this to a separate method rather than adding to this already over-long and over-nested code directly?
Also I'd rather use nodeConnectionHistory instead of hist. Abbreviated variable names are a hindrance to readers, particularly if they don't have English as a first language, and there's no disadvantage to using the full type name here.
(nit: also it can be final)
if (hist.disconnectCause != null) {
    logger.warn(
        () -> format(
            "transport connection reopened to node with same ephemeralId [%s], close exception:",
Users don't really know what ephemeralId is, so I think they will find this message confusing. Could we say something like reopened transport connection to node [%s] which disconnected exceptionally [%s/%dms] ago but did not restart, so the disconnection is unexpected? NB also tracking the disconnection duration here.
Similarly disconnected gracefully in the other branch.
Also can we link ReferenceDocs.NETWORK_DISCONNECT_TROUBLESHOOTING?
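Sketch of what that message could look like with the duration and the docs link wired in (placeholder wording and arguments, not a final version):

logger.warn(
    () -> format(
        "reopened transport connection to node [%s] which disconnected exceptionally [%dms] ago "
            + "but did not restart, so the disconnection is unexpected; see [%s] for troubleshooting guidance",
        node.descriptionWithoutAttributes(),
        currentTimeMillis - nodeConnectionHistory.disconnectTime, // elapsed time since the disconnect
        ReferenceDocs.NETWORK_DISCONNECT_TROUBLESHOOTING
    ),
    nodeConnectionHistory.disconnectCause
);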
// that's a bug.
} else {
    logger.debug("closing unused transport connection to [{}]", node);
    conn.addCloseListener(new ActionListener<Void>() {
nit: reduce duplication a bit here:
conn.addCloseListener(new ActionListener<>() {
    @Override
    public void onResponse(Void ignored) {
        addNewNodeConnectionHistory(null);
    }

    @Override
    public void onFailure(Exception e) {
        addNewNodeConnectionHistory(e);
    }

    private void addNewNodeConnectionHistory(@Nullable Exception e) {
        nodeHistory.put(node.getId(), new NodeConnectionHistory(node.getEphemeralId(), e));
    }
});
Also consider extracting this out to the top level to try and keep this method length/nesting depth from getting too much further out of hand.
Thanks for the feedback everyone. It looks like I can repurpose the …
- moved test out of ESLoggingHandlerIT into a separate ClusterConnectionManagerIntegTests file
- moved connection history into NodeConnectionsService, and adopted a consistency scheme
- rewrote re-connection log message to include duration
- changed log level of local disconnect with exception to debug
logger.warn(
    """
        transport connection to [{}] closed by remote with exception [{}]; \
        if unexpected, see [{}] for troubleshooting guidance""",
I think this isn't guaranteed to be a WARN-worthy event - if the node shut down then we might get a Connection reset or similar, but that's not something that needs action, and we do log those exceptions elsewhere. On reflection I'd rather leave the logging in ClusterConnectionManager alone in this PR and just look at the new logs from the NodeConnectionsService.
import org.elasticsearch.test.junit.annotations.TestLogging;

@ESIntegTestCase.ClusterScope(numDataNodes = 2, scope = ESIntegTestCase.Scope.TEST)
public class ClusterConnectionManagerIntegTests extends ESIntegTestCase {
nit: ESIntegTestCase tests should have names ending in IT and be in the internalClusterTest source set. But as mentioned in my previous comment we probably don't want to change this here.
@@ -347,4 +357,113 @@ public String toString() {
        }
    }
}

private class ConnectionHistory {
Yeah I like the look of this. Maybe ConnectionHistory implements TransportConnectionListener rather than having another layer of indirection?
Also this needs to be covered in NodeConnectionsServiceTests.
 * Each node in the cluster always has a nodeHistory entry that is either the dummy value or a connection history record. This
 * allows node disconnect callbacks to discard their entry if the disconnect occurred because of a change in cluster state.
 */
private final NodeConnectionHistory dummy = new NodeConnectionHistory("", 0, null);
Can be static I think, it's a global constant. We tend to name global constants in SHOUTY_SNAKE_CASE reflecting their meaning, so here I'd suggest CONNECTED or CONNECTED_MARKER or something like that. This way you get to say nodeConnectionHistory != CONNECTED_MARKER below, which makes it clearer to the reader what this predicate means.
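i.e. something like (using the record shape already in this PR):

// placeholder meaning "currently connected, no interesting disconnect recorded"
private static final NodeConnectionHistory CONNECTED_MARKER = new NodeConnectionHistory("", 0, null);

// ...and the check below then reads as:
// if (nodeConnectionHistory != CONNECTED_MARKER) { /* we have a real disconnect record */ }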
nit: also looks like the javadoc is for the nodeHistory field
"reopened transport connection to node [%s] " | ||
+ "which disconnected exceptionally [%dms] ago but did not " | ||
+ "restart, so the disconnection is unexpected; " | ||
+ "if unexpected, see [{}] for troubleshooting guidance", |
No need for if unexpected here, I think the point is that this situation is always unexpected.
+ "restart, so the disconnection is unexpected; " | ||
+ "if unexpected, see [{}] for troubleshooting guidance", | ||
node.descriptionWithoutAttributes(), | ||
nodeConnectionHistory.disconnectTime, |
This'll show the absolute disconnect time in milliseconds (i.e. since 1970) whereas I think we want to see the duration between the disconnect and the current time.
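i.e. roughly this, assuming disconnectTime was recorded from the same monotonic clock (the exact time source, e.g. threadPool.relativeTimeInMillis(), is an assumption here):

// log the elapsed time since the disconnect, not the absolute timestamp
final long disconnectDurationMillis = threadPool.relativeTimeInMillis() - nodeConnectionHistory.disconnectTime;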
Thanks for the feedback David -- this was definitely a light pass on everything other than the concurrency scheme, and I wanted to get notes on that before adding complete testing and getting everything else just right. In hindsight, I probably should have held off on the rest rather than committing first-draft versions.
void reserveConnectionHistoryForNodes(DiscoveryNodes nodes) {
    for (DiscoveryNode node : nodes) {
        nodeHistory.put(node.getId(), dummy);
This might need to be putIfAbsent so we don't overwrite any actual current NodeConnectionHistory entries, right?
I'm not sure. My read was that these two calls would come from cluster state changes adding or removing nodes from this table. Inclusion is controlled by these calls, which unconditionally add or remove entries. The close callback has to be careful to check whether it has a valid entry: this protects against long-running callbacks inserting garbage into the table.
The DiscoveryNodes passed to connectToNodes contains all the nodes in the cluster, including any existing ones. So if there's a node which already exists in the cluster and is currently disconnected, then it will have an entry in nodeHistory which isn't dummy, and this line will overwrite it on any cluster state update. So yeah, I think putIfAbsent is what we want here.
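i.e. the variant under discussion (keeping the PR's dummy placeholder name):

void reserveConnectionHistoryForNodes(DiscoveryNodes nodes) {
    for (DiscoveryNode node : nodes) {
        // a node that is already tracked (possibly with a real disconnect record)
        // keeps its existing entry; only genuinely new nodes get the placeholder
        nodeHistory.putIfAbsent(node.getId(), dummy);
    }
}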
});
}

void reserveConnectionHistoryForNodes(DiscoveryNodes nodes) {
nit: I wonder if this should be called something like startTrackingConnectionHistory (and the other method stop...), the "reserving" language seems like an implementation detail leaking?
I do like the implementation though, nice approach to fixing the race.
NodeConnectionHistory nodeConnectionHistory = nodeHistory.get(node.getId());
if (nodeConnectionHistory != null) {
    nodeHistory.replace(node.getId(), nodeConnectionHistory, dummy);
}
This looks a little racy, although in practice I think it's fine because ClusterConnectionManager protects against opening multiple connections to the same node concurrently. Still, if we did all this (including the logging) within a nodeHistory.compute(node.getId(), ...) then there'd obviously be no races.
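A sketch of the compute-based version (assuming nodeHistory is a ConcurrentHashMap, so the function runs atomically for that key):

nodeHistory.compute(node.getId(), (nodeId, nodeConnectionHistory) -> {
    if (nodeConnectionHistory == null) {
        return null; // not a tracked node: leave the map unchanged
    }
    if (nodeConnectionHistory != dummy && nodeConnectionHistory.ephemeralId.equals(node.getEphemeralId())) {
        // reconnected to the same process: log the unexpected network disconnect here
    }
    return dummy; // the node is connected again, so reset its entry to the placeholder
});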
void removeConnectionHistoryForNodes(Set<DiscoveryNode> nodes) {
    final int startSize = nodeHistory.size();
    for (DiscoveryNode node : nodes) {
        nodeHistory.remove(node.getId());
There's kind of an implicit invariant here that org.elasticsearch.cluster.NodeConnectionsService.ConnectionHistory#nodeHistory and org.elasticsearch.cluster.NodeConnectionsService#targetsByNode have the same keys. At the very least we should be able to assert this. I also wonder if we should be calling nodeHistory.retainAll() to make it super-clear that we are keeping these keysets aligned.
But then that got me thinking, maybe we should be tracking the connection history of each target node in ConnectionTarget rather than trying to maintain two parallel maps. Could that work?
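For the first part, an assertion along these lines could make the invariant explicit (illustrative only, and assuming both maps are only mutated under the same mutex so the snapshot is consistent):

assert nodeHistory.keySet().equals(
    targetsByNode.keySet().stream().map(DiscoveryNode::getId).collect(Collectors.toSet())
) : "connection history and connection targets must track the same nodes";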
}

@Override
public void onNodeDisconnected(DiscoveryNode node, Transport.Connection connection) {
I just spotted we're already executing this in a close-listener, but one that runs under ActionListener.running(...) so it drops the exception. I think it'd be nicer to adjust this callback to take a @Nullable Exception e parameter rather than having to add a second close listener just to pick up the exception as done here.
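i.e. a hypothetical shape for the adjusted callback (a suggested signature change, not the current TransportConnectionListener API; the time source is likewise illustrative):

public void onNodeDisconnected(DiscoveryNode node, Transport.Connection connection, @Nullable Exception closeException) {
    // closeException is null for a graceful close; non-null if the connection closed exceptionally
    nodeHistory.put(node.getId(),
        new NodeConnectionHistory(node.getEphemeralId(), threadPool.relativeTimeInMillis(), closeException));
}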