CASSGO-72 Connection trouble with Amazon Keyspaces #1873


Open
yomipq opened this issue Mar 27, 2025 · 19 comments · May be fixed by #1886


@yomipq

yomipq commented Mar 27, 2025

I am having trouble connecting to Amazon Keyspaces (for Apache Cassandra). My program on EC2 connects to Amazon Keyspaces without any issues for a while, but after a few days or weeks it loses the connection and every query fails with the error below.

gocql: no hosts available in the pool

Go version: 1.23.4
GoCQL version: 1.7.0

I built the program with gocql_debug enabled and got the following logs.

2025/03/25 03:20:29 gocql: Session.handleNodeConnected: 172.16.1.14:9142
2025/03/25 03:20:29 gocql: conns of pool after stopped "172.16.1.14": 2
2025/03/25 03:20:29 gocql: Session.handleNodeConnected: 172.16.1.28:9142
2025/03/25 03:20:29 gocql: conns of pool after stopped "172.16.1.28": 2
2025/03/25 03:21:29 Session.ring:[172.16.1.14:UP][172.16.1.28:UP]

...

2025/03/26 15:11:11 gocql: unable to dial "[HostInfo hostname=\"\" connectAddress=\"127.0.0.1\" peer=\"<nil>\" rpc_address=\"127.0.0.1\" broadcast_address=\"127.0.0.1\" preferred_ip=\"<nil>\" connect_addr=\"127.0.0.1\" connect_addr_source=\"connect_address\" port=9142 data_centre=\"ap-northeast-1\" rack=\"ap-northeast-1\" host_id=\"be0f3a14-e107-3fee-a5e5-415c10539abd\" version=\"v3.11.2\" state=UP num_tokens=0]": dial tcp 127.0.0.1:9142: connect: connection refused
2025/03/26 15:11:11 gocql: filling stopped "127.0.0.1": dial tcp 127.0.0.1:9142: connect: connection refused
2025/03/26 15:11:11 gocql: conns of pool after stopped "127.0.0.1": 0
2025/03/26 15:11:11 gocql: Session.handleNodeDown: 127.0.0.1:9142
2025/03/26 15:11:11 gocql: unable to refresh ring: get existing host=[HostInfo hostname="" connectAddress="172.16.1.14" peer="172.16.1.14" rpc_address="172.16.1.14" broadcast_address="<nil>" preferred_ip="172.16.1.14" connect_addr="172.16.1.14" connect_addr_source="connect_address" port=9142 data_centre="ap-northeast-1" rack="ap-northeast-1" host_id="be0f3a14-e107-3fee-a5e5-415c10539abd" version="v3.11.2" state=UP num_tokens=1] from prevHosts: cannot find host
2025/03/26 15:11:29 Session.ring:[127.0.0.1:DOWN][172.16.1.28:UP]

...

2025/03/26 22:43:35 gocql: unable to dial "[HostInfo hostname=\"\" connectAddress=\"127.0.0.1\" peer=\"<nil>\" rpc_address=\"127.0.0.1\" broadcast_address=\"127.0.0.1\" preferred_ip=\"<nil>\" connect_addr=\"127.0.0.1\" connect_addr_source=\"connect_address\" port=9142 data_centre=\"ap-northeast-1\" rack=\"ap-northeast-1\" host_id=\"b666465e-cb85-3efa-b3ab-f6cf139e5a39\" version=\"v3.11.2\" state=UP num_tokens=0]": dial tcp 127.0.0.1:9142: connect: connection refused
2025/03/26 22:43:35 gocql: filling stopped "127.0.0.1": dial tcp 127.0.0.1:9142: connect: connection refused
2025/03/26 22:43:35 gocql: conns of pool after stopped "127.0.0.1": 0
2025/03/26 22:43:35 gocql: Session.handleNodeDown: 127.0.0.1:9142
2025/03/26 22:43:35 gocql: unable to refresh ring: get existing host=[HostInfo hostname="" connectAddress="172.16.1.28" peer="172.16.1.28" rpc_address="172.16.1.28" broadcast_address="<nil>" preferred_ip="172.16.1.28" connect_addr="172.16.1.28" connect_addr_source="connect_address" port=9142 data_centre="ap-northeast-1" rack="ap-northeast-1" host_id="b666465e-cb85-3efa-b3ab-f6cf139e5a39" version="v3.11.2" state=UP num_tokens=1] from prevHosts: cannot find host
2025/03/26 22:44:29 Session.ring:[127.0.0.1:DOWN][127.0.0.1:DOWN]

On startup, it has two hosts, 172.16.1.14 and 172.16.1.28. After a while, the connection to 172.16.1.14 is lost with the error "cannot find host", and the driver tries to reconnect to 127.0.0.1 instead of 172.16.1.14. Some time later, the other connection is lost with the same error, and the driver again tries to reconnect to 127.0.0.1 instead of 172.16.1.28. As a result, all connections are lost.

So here are my questions:

First, in what situation does the error "cannot find host" occur? Is this an expected error? I read the source code, but I couldn't understand it well.
Second, what makes it reconnect to 127.0.0.1 instead of the original address? Is this expected behavior?

If anyone has any idea, please let me know.

@joao-r-reis
Contributor

joao-r-reis commented Mar 27, 2025

I've seen this issue on other projects. Basically, if you use AWS Keyspaces with peering, then system.local will return 127.0.0.1 and system.peers will return all the correct IPs (including the local one, I think?).

I'm not sure if there's any driver option you can explore to work around this issue. It's possible that this will require some code changes in gocql in order to support AWS Keyspaces with peering.

@joao-r-reis
Contributor

You can play around with IgnorePeerAddr and DisableInitialHostLookup to see if you can find a workaround.

@yomipq
Author

yomipq commented Mar 28, 2025

Thank you for the information.

I will look into IgnorePeerAddr and DisableInitialHostLookup.

@yomipq
Author

yomipq commented Mar 28, 2025

I checked the code and found that IgnorePeerAddr is defined in the ClusterConfig struct but is not used anywhere. It seems that IgnorePeerAddr has no effect.

I will try DisableInitialHostLookup = true for now.

@yomipq
Author

yomipq commented Mar 31, 2025

I'm trying DisableInitialHostLookup. Although there are some parts I'm not sure about, it's working fine.

2025/03/28 08:28:50 gocql: Session.handleNodeConnected: 172.16.1.28:9142
2025/03/28 08:28:50 gocql: Session.handleNodeConnected: 172.16.1.14:9142
2025/03/28 08:28:50 gocql: conns of pool after stopped "172.16.1.14": 2
2025/03/28 08:28:50 gocql: conns of pool after stopped "172.16.1.28": 2
2025/03/28 08:29:50 Session.ring:[172.16.1.28:UP][172.16.1.14:UP][172.16.1.28:UP]

...

2025/03/29 20:19:33 gocql: unable to refresh ring: get existing host=[HostInfo hostname="" connectAddress="172.16.1.28" peer="172.16.1.28" rpc_address="172.16.1.28" broadcast_address="<nil>" preferred_ip="172.16.1.28" connect_addr="172.16.1.28" connect_addr_source="connect_address" port=9142 data_centre="ap-northeast-1" rack="ap-northeast-1" host_id="b666465e-cb85-3efa-b3ab-f6cf139e5a39" version="v3.11.2" state=UP num_tokens=1] from prevHosts: cannot find host
2025/03/29 20:19:33 gocql: Session.handleNodeConnected: 172.16.1.14:9142
2025/03/29 20:19:33 gocql: conns of pool after stopped "172.16.1.28": 0
2025/03/29 20:19:33 gocql: conns of pool after stopped "172.16.1.14": 2
2025/03/29 20:19:50 Session.ring:[172.16.1.14:UP][172.16.1.28:UP][172.16.1.14:UP][127.0.0.1:DOWN]

On startup, it had 3 connections, including 2 connections to 172.16.1.28, for some reason. After the error cannot find host on the connection to 172.16.1.28, it had 4 connections, including 2 connections to 172.16.1.14 and 1 connection to 127.0.0.1 that was down.

I'm not sure whether this is normal behavior, but it's working.

By the way, I came up with another workaround. It might be a better solution to add a HostFilter to the cluster config to ignore 127.0.0.1:

cluster.HostFilter = gocql.HostFilterFunc(func(h *gocql.HostInfo) bool {
        // Reject the loopback address that Amazon Keyspaces reports in system.local.
        return h.ConnectAddress().String() != "127.0.0.1"
})

So far it's working fine after refreshRing. I will keep it running for a while longer.
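
A slightly more general variant of the filter above could reject any loopback address rather than only the literal string 127.0.0.1. The following is a minimal sketch using only the standard library; acceptAddr is a hypothetical helper name, and the wiring into gocql shown in the comment reuses only the calls already present in the snippet above.

```go
package main

import (
	"fmt"
	"net"
)

// acceptAddr is a hypothetical helper. It reports whether a host address
// should be kept: unparsable and loopback addresses (127.0.0.0/8, ::1)
// are rejected, everything else is accepted.
func acceptAddr(addr string) bool {
	ip := net.ParseIP(addr)
	return ip != nil && !ip.IsLoopback()
}

func main() {
	for _, addr := range []string{"172.16.1.14", "172.16.1.28", "127.0.0.1"} {
		fmt.Printf("%s accepted=%v\n", addr, acceptAddr(addr))
	}
	// Possible wiring into the cluster config (assumption, not tested here):
	//   cluster.HostFilter = gocql.HostFilterFunc(func(h *gocql.HostInfo) bool {
	//           return acceptAddr(h.ConnectAddress().String())
	//   })
}
```

Note that a filter like this only hides the loopback entry from the pool; it does not change what Amazon Keyspaces returns in system.local.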

@jameshartig
Contributor

First, in what situation does the error "cannot find host" occur? Is this an expected error? I read the source code, but I couldn't understand it well.

I think that could occur if the same hostID appears twice in the peers table, or the local host is in the peers table.

Can you paste the output for SELECT * FROM system.local WHERE key='local' and SELECT * FROM system.peers?

@yomipq
Author

yomipq commented Apr 1, 2025

I think that could occur if the same hostID appears twice in the peers table, or the local host is in the peers table.

Thank you for telling me that. You are right. I read the code again and now I understand what happened.

cqlsh> SELECT * FROM system.local WHERE key='local';

 key   | bootstrapped | broadcast_address | cluster_name     | cql_version | data_center    | gossip_generation | host_id                              | listen_address | native_protocol_version | partitioner                                 | rack           | release_version | rpc_address | schema_version                       | thrift_version | tokens | truncated_at
-------+--------------+-------------------+------------------+-------------+----------------+-------------------+--------------------------------------+----------------+-------------------------+---------------------------------------------+----------------+-----------------+-------------+--------------------------------------+----------------+--------+--------------
 local |    COMPLETED |         127.0.0.1 | Amazon Keyspaces |       3.4.4 | ap-northeast-1 |                42 | b666465e-cb85-3efa-b3ab-f6cf139e5a39 |      127.0.0.1 |                       4 | org.apache.cassandra.dht.Murmur3Partitioner | ap-northeast-1 |          3.11.2 |   127.0.0.1 | 05deae2d-6405-494d-a965-c0e5836bcb3c |         20.1.0 |   null |         null

(1 rows)
cqlsh> SELECT * FROM system.peers;

 peer        | data_center    | host_id                              | preferred_ip | rack           | release_version | rpc_address | schema_version                       | tokens
-------------+----------------+--------------------------------------+--------------+----------------+-----------------+-------------+--------------------------------------+-------------------------
 172.16.1.28 | ap-northeast-1 | b666465e-cb85-3efa-b3ab-f6cf139e5a39 |  172.16.1.28 | ap-northeast-1 |          3.11.2 | 172.16.1.28 | 05deae2d-6405-494d-a965-c0e5836bcb3c |                  {'-1'}
 172.16.1.14 | ap-northeast-1 | be0f3a14-e107-3fee-a5e5-415c10539abd |  172.16.1.14 | ap-northeast-1 |          3.11.2 | 172.16.1.14 | 05deae2d-6405-494d-a965-c0e5836bcb3c | {'9223372036854775806'}

(2 rows)

Both 127.0.0.1 and 172.16.1.28 have the same hostID.
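
The overlap in the two query results can be checked mechanically. The sketch below is a simplified model, not driver code: row and sharedHostIDs are hypothetical names, and the rows are copied by hand from the cqlsh output above.

```go
package main

import "fmt"

// row is a hypothetical, trimmed-down view of one row from
// system.local or system.peers.
type row struct {
	source string // "local" or "peers"
	addr   string // rpc_address column
	hostID string // host_id column
}

// sharedHostIDs returns every host ID that appears in more than one row,
// mapped to the "source:address" entries that carry it, in input order.
func sharedHostIDs(rows []row) map[string][]string {
	byID := make(map[string][]string)
	for _, r := range rows {
		byID[r.hostID] = append(byID[r.hostID], r.source+":"+r.addr)
	}
	for id, entries := range byID {
		if len(entries) < 2 {
			delete(byID, id) // deleting during range is safe in Go
		}
	}
	return byID
}

func main() {
	rows := []row{
		{"local", "127.0.0.1", "b666465e-cb85-3efa-b3ab-f6cf139e5a39"},
		{"peers", "172.16.1.28", "b666465e-cb85-3efa-b3ab-f6cf139e5a39"},
		{"peers", "172.16.1.14", "be0f3a14-e107-3fee-a5e5-415c10539abd"},
	}
	for id, entries := range sharedHostIDs(rows) {
		fmt.Println(id, entries)
	}
}
```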

@joao-r-reis
Contributor

joao-r-reis commented Apr 1, 2025

Yeah, AWS Keyspaces basically returns the full list of hosts (including the local host, but with the correct IP address) in the system.peers table when peering is enabled. Unfortunately, this behavior is not consistent with "regular" C*, so some drivers have an issue with it.

@jameshartig
Contributor

Yeah, AWS Keyspaces basically returns the full list of hosts (including the local host, but with the correct IP address) in the system.peers table when peering is enabled. Unfortunately, this behavior is not consistent with "regular" C*, so some drivers have an issue with it.

Do we need to handle this? We could separate the local from the peers and ignore a duplicate host_id when refreshing the ring.

@yomipq
Author

yomipq commented Apr 2, 2025

I would really appreciate it if GoCQL handled this issue. If there is anything I can help with, please let me know.

@joao-r-reis
Contributor

Do we need to handle this? We could separate the local from the peers and ignore a duplicate host_id when refreshing the ring.

Yeah, we could do this as long as there's no impact on users who are not using AWS Keyspaces. It should target gocql 2.1.0, though, since we're trying to wrap up 2.0.

@yomipq
Author

yomipq commented Apr 3, 2025

Thank you. I'm looking forward to the release.

If the release schedule is planned, could you please tell me approximately when 2.1.0 will be released?

@joao-r-reis
Contributor

No ETA on 2.1.0 for now. There's also no guarantee that this will go into 2.1.0; someone needs to volunteer to open a PR for this. Let's start by creating a JIRA, though. I'll do this.

joao-r-reis changed the title from "Connection trouble with Amazon Keyspaces" to "CASSGO-72 Connection trouble with Amazon Keyspaces" on Apr 3, 2025

@yomipq
Author

yomipq commented Apr 4, 2025

OK, so I will try to fix this issue. Do I need to do something in JIRA? Is it OK to just send a PR on GitHub?

First, I'd like to confirm how to handle this. If I understand correctly, duplicate hostIDs and the localhost address are handled properly when connections are initialized (maybe NewSession() in session.go), but not during the reconnection process (maybe refreshRing() in host_source.go). So I need to fix the reconnection process. Is that right?

@jameshartig
Contributor

If I understand correctly, duplicate hostIDs and the localhost address are handled properly when connections are initialized (maybe NewSession() in session.go), but not during the reconnection process (maybe refreshRing() in host_source.go). So I need to fix the reconnection process. Is that right?

I think inside func (r *ringDescriber) GetHosts() we shouldn't blindly just be doing append([]*HostInfo{localHost}, peerHosts...). We could instead check whether localHost is already in peerHosts (leaving a comment linking to this issue and probably mentioning Amazon Keyspaces). If it is, we shouldn't add it twice and should prefer the one from peers.
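
As a rough illustration of that idea, the sketch below merges a local host with a peer list and drops the local entry when a peer already carries the same host ID. It is a standalone model under stated assumptions: host and mergeHosts are hypothetical names, and the real HostInfo type and GetHosts() in gocql carry far more state than this.

```go
package main

import "fmt"

// host is a hypothetical stand-in for gocql's HostInfo.
type host struct {
	addr   string
	hostID string
}

// mergeHosts returns the local host followed by the peers, unless a peer
// already has the local host's ID. In that case the local entry is dropped
// and the peer entry (which has the routable address) is preferred.
func mergeHosts(local host, peers []host) []host {
	for _, p := range peers {
		if p.hostID == local.hostID {
			return append([]host(nil), peers...) // copy; prefer the peers entry
		}
	}
	return append([]host{local}, peers...)
}

func main() {
	local := host{"127.0.0.1", "b666465e-cb85-3efa-b3ab-f6cf139e5a39"}
	peers := []host{
		{"172.16.1.28", "b666465e-cb85-3efa-b3ab-f6cf139e5a39"},
		{"172.16.1.14", "be0f3a14-e107-3fee-a5e5-415c10539abd"},
	}
	for _, h := range mergeHosts(local, peers) {
		fmt.Println(h.addr, h.hostID)
	}
}
```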

@joao-r-reis you might have more experience with how other drivers handle this. Thoughts?

@joao-r-reis
Contributor

@joao-r-reis you might have more experience with how other drivers handle this. Thoughts?

Most drivers use maps with IP addresses as keys, but Java driver 4.x moved to using host IDs as the keys. Usually the behavior is to go through the new list of localhost + peer hosts one by one and add the host to the map if there is no entry, or update the entry if it already exists. The Java driver has special handling for cases where the same host ID already exists under a different IP, which happens in C* when a node changes its IP, and I believe this handling also fixes the AWS Keyspaces issue: https://github.com/apache/cassandra-java-driver/blob/529d56e1742dcd1df3ca55c00fd8e02c0e484c68/core/src/main/java/com/datastax/oss/driver/internal/core/metadata/AddNodeRefresh.java#L43-L66

GoCQL has both maps but it uses the map by host id for these checks.

I think inside func (r *ringDescriber) GetHosts() we shouldn't blindly just be doing append([]*HostInfo{localHost}, peerHosts...)

GetHosts() just returns all the hosts found in system tables without additional logic so I think it should be kept as is and changes should be made to func refreshRing(r *ringDescriber) error instead.

Reading through refreshRing function I'd think that the aws keyspaces case would not be an issue because it starts with the local host as it is iterating through the slice and then when it reaches the peer host with the same host id it should update the IPs... Maybe there's some bug here.

@joao-r-reis
Contributor

joao-r-reis commented Apr 4, 2025

Oh I think the issue is that the control connection queries system.local and updates that host's entry in the map when it connects/reconnects... Other drivers (except java driver 4.x I think) just trigger a full ring refresh when the control connection connects/reconnects and I think we should do this on gocql as well. Basically we can move the refreshRing call from reconnect to setupConn I think.

The issue is that refreshRing triggers r.session.startPoolFill(h), and we only want this to be triggered if it's a reconnection (i.e. c.session.initialized() == true). The way this is handled in the C# driver, for example, is by having event handlers that are only set at the end of the session initialization process, so the ring refresh that happens as part of the first control connection setup triggers events that do nothing. So first the control connection is initialized and a ring refresh is triggered (without pool fills), then event handlers for the ring refresh and topology events are set up, and only then is the pool filling triggered.
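
The ordering described above can be sketched as follows. This is a toy model only: session, onHostAdded, refreshRing and demo are hypothetical names chosen for illustration, and neither gocql nor the C# driver is structured exactly like this.

```go
package main

import "fmt"

// session is a hypothetical, heavily simplified model of a driver session.
type session struct {
	ring        map[string]string // host ID -> address
	onHostAdded func(addr string) // nil until initialization has finished
	filled      []string          // addresses for which a pool fill was started
}

// refreshRing adds unknown hosts to the ring. The handler only fires if it
// has been installed, so a refresh during initialization fills no pools.
func (s *session) refreshRing(hosts []string, ids []string) {
	for i, id := range ids {
		if _, ok := s.ring[id]; ok {
			continue
		}
		s.ring[id] = hosts[i]
		if s.onHostAdded != nil {
			s.onHostAdded(hosts[i])
		}
	}
}

// demo walks through the sequence and returns the pool fills in order.
func demo() []string {
	s := &session{ring: map[string]string{}}
	// 1. The control connection is set up; the first refresh fires no events.
	s.refreshRing([]string{"172.16.1.14"}, []string{"h1"})
	// 2. Event handlers are installed at the end of initialization.
	s.onHostAdded = func(addr string) { s.filled = append(s.filled, addr) }
	// 3. Pools are filled for the hosts discovered during initialization.
	for _, addr := range s.ring {
		s.onHostAdded(addr)
	}
	// 4. A later refresh (e.g. after a reconnect) fires the handler for new hosts.
	s.refreshRing([]string{"172.16.1.14", "172.16.1.28"}, []string{"h1", "h2"})
	return s.filled
}

func main() {
	fmt.Println(demo())
}
```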

@yomipq
Author

yomipq commented Apr 5, 2025

Reading through refreshRing function I'd think that the aws keyspaces case would not be an issue because it starts with the local host as it is iterating through the slice and then when it reaches the peer host with the same host id it should update the IPs... Maybe there's some bug here.

I might have found the cause. In the for-loop over hosts in refreshRing (https://github.com/apache/cassandra-gocql-driver/blob/v1.7.0/host_source.go#L727-L756), it first processes the localhost entry, overwrites the IP address with 127.0.0.1, and removes the hostID from prevHosts. Then it processes the peer host with the same hostID, but the hostID no longer exists in prevHosts, resulting in the ErrCannotFindHost error.

I think moving delete(prevHosts, h.HostID()) to after the for-loop would fix the issue. In other words, it would first process localhost, then the peer host (reverting the IP address from the local address to the peer address), and finally delete the existing hostIDs from prevHosts after all hosts are processed.

What do you think about this?
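
To make the two orderings concrete, here is a simplified standalone model of the loop. It is not the gocql implementation: host, refresh and errCannotFindHost are hypothetical names, the host IDs are shortened, and the ring is reduced to a map from host ID to address.

```go
package main

import (
	"errors"
	"fmt"
)

type host struct{ addr, hostID string }

var errCannotFindHost = errors.New("cannot find host")

// refresh models one ring refresh. known maps host ID -> current address and
// is updated in place; incoming is the local host followed by the peers.
// With deleteEarly, a host ID is removed from prev as soon as it has been
// processed (the behavior described above); otherwise removal is deferred
// until the loop has finished (the proposed change).
func refresh(known map[string]string, incoming []host, deleteEarly bool) error {
	prev := make(map[string]bool, len(known))
	for id := range known {
		prev[id] = true
	}
	seen := make(map[string]bool)
	for _, h := range incoming {
		if _, ok := known[h.hostID]; !ok {
			known[h.hostID] = h.addr // brand-new host
			continue
		}
		if !prev[h.hostID] {
			return fmt.Errorf("get existing host=%s from prevHosts: %w", h.addr, errCannotFindHost)
		}
		known[h.hostID] = h.addr // existing host, possibly with a new address
		seen[h.hostID] = true
		if deleteEarly {
			delete(prev, h.hostID)
		}
	}
	for id := range seen {
		delete(prev, id)
	}
	for id := range prev {
		delete(known, id) // hosts that are no longer reported
	}
	return nil
}

func main() {
	incoming := []host{
		{"127.0.0.1", "b666465e"},   // from system.local
		{"172.16.1.28", "b666465e"}, // from system.peers, same host ID
		{"172.16.1.14", "be0f3a14"},
	}
	for _, early := range []bool{true, false} {
		known := map[string]string{"b666465e": "172.16.1.28", "be0f3a14": "172.16.1.14"}
		err := refresh(known, incoming, early)
		fmt.Printf("deleteEarly=%v err=%v ring=%v\n", early, err, known)
	}
}
```

In this model the early delete leaves the ring pointing at 127.0.0.1 and returns the error, while the deferred delete lets the peer entry restore the routable address.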

@yomipq yomipq linked a pull request Apr 27, 2025 that will close this issue