Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Leaked RPC authentication error in polykey agent #376

Open
aryanjassal opened this issue Mar 24, 2025 · 6 comments
Open

Leaked RPC authentication error in polykey agent #376

aryanjassal opened this issue Mar 24, 2025 · 6 comments
Assignees
Labels
bug Something isn't working

Comments

@aryanjassal
Copy link
Member

Describe the bug

After running my agent for a while, it eventually crashed with the following log message.

Mar 24 00:38:30 matrix-dell-34xx-zenith polykey[4792]: ErrorNodeConnectionManagerRPCDenied: RPC call was denied due to being unauthenticated

This indicates that errors are leaking from their context and crashing the entire agent, which shouldn't happen.

To Reproduce

  1. Run the agent for a while
  2. Observe error

Expected behavior

The agent shouldn't crash. It should either be handled internally, or printed to stdout/stderr as a log message.

Screenshots

Platform

  • Device: Dell Precision 3480
  • OS: NixOS
  • Version: [e.g. 22]

Additional context

Notify maintainers

@tegefaulkes

@aryanjassal aryanjassal added the bug Something isn't working label Mar 24, 2025
Copy link

linear bot commented Mar 24, 2025

ENG-560

@tegefaulkes
Copy link
Contributor

ErrorNodeConnectionManagerRPCDenied is a new error from the network segregation changes. If you try to make a RPC call before authenticating then this will be thrown through the stream. this shouldn't happen since you shouldn't have access to the connection before it has authenticated.

I'm guessing its a race condition where both sides has authenticated but one hasn't updated it's state yet. This is very likely crashing the seed nodes right now. Though the seed nodes should auto restart when they crash like this.

@aryanjassal work with brynley to go over the seed node logs and see if this error is being thrown before crashing. Also work with her to work out why the seed nodes don't auto restart when they fail.

@aryanjassal
Copy link
Member Author

Could these leaked errors be the cause of the agent shutting down and not restarting properly? The container logs might shed more light to my conjecture.

@aryanjassal
Copy link
Member Author

Oops, didn't see the previous message before posting mine.

After investigating this with brynley, the seednodes were working yesterday, and attempting to find this one line from the logs might be a needle in a haystack kind of a problem, so I will go ahead and work on fixing this issue with the assumption that this is a cause of the seednode failure.

@tegefaulkes
Copy link
Contributor

The seed nodes are not working. One of them is still down and I had to restart one of them on Monday.

Copy link
Member Author

After a quick discussion with Brian, the error originates from the RPC middleware layer. At the time of authentication, two calls are made individually by both the agents. After authenticating from both sides, the restricted commands can be used. If a RPC call is made while being unauthenticated, then this error is thrown, which crashes the program.

We have a few options to deal with this.

  1. Use a grace timer. We will give some time before RPC communication will be performed after authentication to allow the other side to also do the authentication.
  2. Use a third step. Currently, there are two steps to network authentication. First, Node A makes a RPC call to Node B for authentication. This triggers the two steps of handling the forward and reverse stream. After handling the streams, the connection gets authenticated. We can add an additional step here to confirm both sides agree on the authentication being successful before allowing any other RPC communication.
  3. Rework network authentication. Instead of making a unary call from each side, we can make a single duplex call allowing both sides to kind-of negotiate or handshake the authentication. This would ensure that after the RPC, both nodes have confirmed the authentication.

I have discarded the first option, and will investigate the viability of the second and third option. The third option is more elegant, but would take more work to implement. The second option is easier to implement as it leverages existing work, but is not as efficient as the third option.

After finalising a solution, I will start its implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Development

No branches or pull requests

2 participants