Leaked RPC authentication error in polykey agent #376

aryanjassal · 2025-03-24T00:41:19Z

Describe the bug

After running my agent for a while, it eventually crashed with the following log message.

Mar 24 00:38:30 matrix-dell-34xx-zenith polykey[4792]: ErrorNodeConnectionManagerRPCDenied: RPC call was denied due to being unauthenticated

This indicates that errors are leaking from their context and crashing the entire agent, which shouldn't happen.

To Reproduce

Run the agent for a while
Observe error

Expected behavior

The agent shouldn't crash. It should either be handled internally, or printed to stdout/stderr as a log message.

Screenshots

Platform

Device: Dell Precision 3480
OS: NixOS
Version: [e.g. 22]

Additional context

Notify maintainers

@tegefaulkes

The text was updated successfully, but these errors were encountered:

linear · 2025-03-24T00:41:23Z

ENG-560

tegefaulkes · 2025-03-24T00:46:10Z

ErrorNodeConnectionManagerRPCDenied is a new error from the network segregation changes. If you try to make a RPC call before authenticating then this will be thrown through the stream. this shouldn't happen since you shouldn't have access to the connection before it has authenticated.

I'm guessing its a race condition where both sides has authenticated but one hasn't updated it's state yet. This is very likely crashing the seed nodes right now. Though the seed nodes should auto restart when they crash like this.

@aryanjassal work with brynley to go over the seed node logs and see if this error is being thrown before crashing. Also work with her to work out why the seed nodes don't auto restart when they fail.

aryanjassal · 2025-03-24T02:30:46Z

Could these leaked errors be the cause of the agent shutting down and not restarting properly? The container logs might shed more light to my conjecture.

aryanjassal · 2025-03-25T02:47:04Z

Oops, didn't see the previous message before posting mine.

After investigating this with brynley, the seednodes were working yesterday, and attempting to find this one line from the logs might be a needle in a haystack kind of a problem, so I will go ahead and work on fixing this issue with the assumption that this is a cause of the seednode failure.

tegefaulkes · 2025-03-25T20:32:12Z

The seed nodes are not working. One of them is still down and I had to restart one of them on Monday.

aryanjassal · 2025-04-01T05:20:51Z

After a quick discussion with Brian, the error originates from the RPC middleware layer. At the time of authentication, two calls are made individually by both the agents. After authenticating from both sides, the restricted commands can be used. If a RPC call is made while being unauthenticated, then this error is thrown, which crashes the program.

We have a few options to deal with this.

Use a grace timer. We will give some time before RPC communication will be performed after authentication to allow the other side to also do the authentication.
Use a third step. Currently, there are two steps to network authentication. First, Node A makes a RPC call to Node B for authentication. This triggers the two steps of handling the forward and reverse stream. After handling the streams, the connection gets authenticated. We can add an additional step here to confirm both sides agree on the authentication being successful before allowing any other RPC communication.
Rework network authentication. Instead of making a unary call from each side, we can make a single duplex call allowing both sides to kind-of negotiate or handshake the authentication. This would ensure that after the RPC, both nodes have confirmed the authentication.

I have discarded the first option, and will investigate the viability of the second and third option. The third option is more elegant, but would take more work to implement. The second option is easier to implement as it leverages existing work, but is not as efficient as the third option.

After finalising a solution, I will start its implementation.

aryanjassal added the bug Something isn't working label Mar 24, 2025

aryanjassal assigned tegefaulkes Mar 24, 2025

aryanjassal assigned aryanjassal and unassigned tegefaulkes Apr 1, 2025

aryanjassal mentioned this issue Apr 1, 2025

Fixing leaked RPC authentication error MatrixAI/Polykey#883

Draft

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Leaked RPC authentication error in polykey agent #376

Leaked RPC authentication error in polykey agent #376

aryanjassal commented Mar 24, 2025

linear bot commented Mar 24, 2025

tegefaulkes commented Mar 24, 2025

aryanjassal commented Mar 24, 2025

aryanjassal commented Mar 25, 2025

tegefaulkes commented Mar 25, 2025

aryanjassal commented Apr 1, 2025

Leaked RPC authentication error in polykey agent #376

Leaked RPC authentication error in polykey agent #376

Comments

aryanjassal commented Mar 24, 2025

Describe the bug

To Reproduce

Expected behavior

Screenshots

Platform

Additional context

Notify maintainers

linear bot commented Mar 24, 2025

tegefaulkes commented Mar 24, 2025

aryanjassal commented Mar 24, 2025

aryanjassal commented Mar 25, 2025

tegefaulkes commented Mar 25, 2025

aryanjassal commented Apr 1, 2025