khepri_db: function_clause in rabbit_federation_exchange_link_sup_sup on network disconnect #12274

Open
lukebakken opened this issue Sep 11, 2024 · 3 comments
Labels: bug, khepri (Khepri as the metadata store)

Comments

@lukebakken
Collaborator

Describe the bug

Disconnecting the network to one node of a 3-node Khepri-enabled cluster eventually results in a strange function_clause error:

rmq0-function_clause-stack.txt

The same error also originates from the rabbit_federation_queue_link_sup_sup process. My test project enables the rabbitmq_federation plugin but does not create any federation links.

Reproduction steps

  • Start cluster
    git clone git@github.com:lukebakken/docker-rabbitmq-cluster.git
    cd docker-rabbitmq-cluster
    git checkout khepri
    make DOCKER_FRESH=true clean up
    
  • Disconnect node rmq0
    docker network disconnect rabbitnet docker-rabbitmq-cluster-rmq0-1
    
  • Watch logs until function_clause error happens
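
The last step can be scripted so the wait does not have to be done by eye. This is a minimal sketch, not part of the test project: the helper names `first_function_clause` and `watch_node` are hypothetical, and the default container name is assumed to be the same one used in the disconnect step.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of the test project.

# Print the first line on stdin that mentions function_clause, then stop reading.
first_function_clause() {
    grep -m 1 'function_clause'
}

# Follow one container's logs until the error shows up.
# Assumes the container name used in the disconnect step above.
watch_node() {
    local container="${1:-docker-rabbitmq-cluster-rmq0-1}"
    docker logs --follow "$container" 2>&1 | first_function_clause
}
```

Calling `watch_node docker-rabbitmq-cluster-rmq0-1` after the disconnect should print the first matching log line. The same call can be pointed at the other nodes' containers by passing their names.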

Expected behavior

No error.

Additional context

This does not appear to affect the normal operation of PerfTest.

In addition, the following log lines appear:

rmq2-1       | 2024-09-11 00:29:10.084227+00:00 [error] <0.181.0>
rmq2-1       | 2024-09-11 00:29:10.084227+00:00 [error] <0.181.0> ** Cannot get connection id for node '[email protected]'
rmq2-1       | 2024-09-11 00:29:10.084227+00:00 [error] <0.181.0>
rmq1-1       | 2024-09-11 00:29:10.096091+00:00 [error] <0.181.0>
rmq1-1       | 2024-09-11 00:29:10.096091+00:00 [error] <0.181.0> ** Cannot get connection id for node '[email protected]'
rmq1-1       | 2024-09-11 00:29:10.096091+00:00 [error] <0.181.0>

These log lines originate in OTP itself:

lbakken@shostakovich ~/development/erlang/otp (master =)
$ git grep -i 'cannot get connection'
lib/kernel/src/net_kernel.erl:1051:            error_logger:error_msg("~n** Cannot get connection id for node ~w~n",
lib/kernel/src/net_kernel.erl:1156:                error_logger:error_msg("~n** Cannot get connection id for node ~w~n",
lib/kernel/src/net_kernel.erl:1545:                    error_logger:error_msg("~n** Cannot get connection id for node ~w~n",

What's odd is that the error messages originate from the node to which the error message refers 🤔

@the-mikedavis
Member

The rabbit_db_msup module and its callers will need some updates to handle potential timeouts when interacting with Khepri, like in #11785.

The changes will probably be trickier for this module: the commands don't come from a user, so it's not a simple matter of bubbling up and returning an error.

@mkuratczyk
Contributor

I just hit that with rabbit_shovel_dyn_worker_sup_sup, which makes sense, since it's also a mirrored supervisor.

the-mikedavis added the khepri (Khepri as the metadata store) label Sep 12, 2024
@michaelklishin
Member

@the-mikedavis no, we can still bubble up an error. Shovel will then log it and restart. With Shovels, these "failure loops" are how errors are communicated, since this is a non-interactive client by definition.
