CASSGO-72 Connection trouble with Amazon Keyspaces #1873

yomipq · 2025-03-27T07:28:46Z

I am having trouble connecting to Amazon Keyspaces (for Apache Cassandra). My program on EC2 connects to Amazon Keyspaces without any issues for a while, but after a few days or weeks it loses the connection and every query fails with the error below.

gocql: no hosts available in the pool

Go version: 1.23.4
GoCQL version: 1.7.0

I built the program with gocql_debug enabled and got the following logs.

2025/03/25 03:20:29 gocql: Session.handleNodeConnected: 172.16.1.14:9142
2025/03/25 03:20:29 gocql: conns of pool after stopped "172.16.1.14": 2
2025/03/25 03:20:29 gocql: Session.handleNodeConnected: 172.16.1.28:9142
2025/03/25 03:20:29 gocql: conns of pool after stopped "172.16.1.28": 2
2025/03/25 03:21:29 Session.ring:[172.16.1.14:UP][172.16.1.28:UP]

...

2025/03/26 15:11:11 gocql: unable to dial "[HostInfo hostname=\"\" connectAddress=\"127.0.0.1\" peer=\"<nil>\" rpc_address=\"127.0.0.1\" broadcast_address=\"127.0.0.1\" preferred_ip=\"<nil>\" connect_addr=\"127.0.0.1\" connect_addr_source=\"connect_address\" port=9142 data_centre=\"ap-northeast-1\" rack=\"ap-northeast-1\" host_id=\"be0f3a14-e107-3fee-a5e5-415c10539abd\" version=\"v3.11.2\" state=UP num_tokens=0]": dial tcp 127.0.0.1:9142: connect: connection refused
2025/03/26 15:11:11 gocql: filling stopped "127.0.0.1": dial tcp 127.0.0.1:9142: connect: connection refused
2025/03/26 15:11:11 gocql: conns of pool after stopped "127.0.0.1": 0
2025/03/26 15:11:11 gocql: Session.handleNodeDown: 127.0.0.1:9142
2025/03/26 15:11:11 gocql: unable to refresh ring: get existing host=[HostInfo hostname="" connectAddress="172.16.1.14" peer="172.16.1.14" rpc_address="172.16.1.14" broadcast_address="<nil>" preferred_ip="172.16.1.14" connect_addr="172.16.1.14" connect_addr_source="connect_address" port=9142 data_centre="ap-northeast-1" rack="ap-northeast-1" host_id="be0f3a14-e107-3fee-a5e5-415c10539abd" version="v3.11.2" state=UP num_tokens=1] from prevHosts: cannot find host
2025/03/26 15:11:29 Session.ring:[127.0.0.1:DOWN][172.16.1.28:UP]

...

2025/03/26 22:43:35 gocql: unable to dial "[HostInfo hostname=\"\" connectAddress=\"127.0.0.1\" peer=\"<nil>\" rpc_address=\"127.0.0.1\" broadcast_address=\"127.0.0.1\" preferred_ip=\"<nil>\" connect_addr=\"127.0.0.1\" connect_addr_source=\"connect_address\" port=9142 data_centre=\"ap-northeast-1\" rack=\"ap-northeast-1\" host_id=\"b666465e-cb85-3efa-b3ab-f6cf139e5a39\" version=\"v3.11.2\" state=UP num_tokens=0]": dial tcp 127.0.0.1:9142: connect: connection refused
2025/03/26 22:43:35 gocql: filling stopped "127.0.0.1": dial tcp 127.0.0.1:9142: connect: connection refused
2025/03/26 22:43:35 gocql: conns of pool after stopped "127.0.0.1": 0
2025/03/26 22:43:35 gocql: Session.handleNodeDown: 127.0.0.1:9142
2025/03/26 22:43:35 gocql: unable to refresh ring: get existing host=[HostInfo hostname="" connectAddress="172.16.1.28" peer="172.16.1.28" rpc_address="172.16.1.28" broadcast_address="<nil>" preferred_ip="172.16.1.28" connect_addr="172.16.1.28" connect_addr_source="connect_address" port=9142 data_centre="ap-northeast-1" rack="ap-northeast-1" host_id="b666465e-cb85-3efa-b3ab-f6cf139e5a39" version="v3.11.2" state=UP num_tokens=1] from prevHosts: cannot find host
2025/03/26 22:44:29 Session.ring:[127.0.0.1:DOWN][127.0.0.1:DOWN]

On startup, it has two hosts, 172.16.1.14 and 172.16.1.28. After a while, the connection to 172.16.1.14 was lost with the error cannot find host, and the driver tried to reconnect to 127.0.0.1 instead of 172.16.1.14. Some time later, the other connection was lost with the same error, and the driver again tried to reconnect to 127.0.0.1 instead of 172.16.1.28. As a result, all connections were lost.

So here are my questions:

First, in what situation does the error cannot find host occur? Is this an expected error? I read the source code, but I couldn't understand it well.
Second, what makes it reconnect to 127.0.0.1 instead of the original address? Is this expected behavior?

If anyone has any idea, please let me know.


joao-r-reis · 2025-03-27T16:46:42Z

I've seen this issue on other projects. Basically, if you use AWS Keyspaces with peering, system.local will return 127.0.0.1 and system.peers will return all the correct IPs (including the local one, I think?).

I'm not sure if there's any driver option you can explore to work around this issue. It's possible that this will require some code changes in gocql in order to support AWS Keyspaces with peering.

joao-r-reis · 2025-03-27T16:48:37Z

You can play around with IgnorePeerAddr and DisableInitialHostLookup to see if you can find a workaround.

yomipq · 2025-03-28T00:44:12Z

Thank you for the information.

I will look into IgnorePeerAddr and DisableInitialHostLookup.

yomipq · 2025-03-28T01:43:40Z

I checked the code and found that IgnorePeerAddr is defined in ClusterConfig struct but is not used anywhere. It seems that IgnorePeerAddr has no effect.

I will try DisableInitialHostLookup = true for now.
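For readers following along, a minimal sketch of the setting being tried here, assuming gocql 1.7.0 imported as github.com/gocql/gocql. The endpoint name is a placeholder, and the TLS and authentication settings that Keyspaces needs are left out. Note that, per the ClusterConfig documentation, disabling the lookup also means data centre, rack and token information are not available, so token-aware routing will not work.

```go
package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	// Placeholder endpoint (assumption): replace with your own Keyspaces endpoint.
	cluster := gocql.NewCluster("keyspaces.example.invalid")
	cluster.Port = 9142

	// Do not look up host information from the system tables at startup;
	// connect only to the hosts supplied above.
	cluster.DisableInitialHostLookup = true

	// TLS (SslOpts) and authentication are omitted from this sketch.

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("create session: %v", err)
	}
	defer session.Close()
}
```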

yomipq · 2025-03-31T00:49:37Z

I'm trying DisableInitialHostLookup. There are some parts I'm not sure about, but it's working fine.

2025/03/28 08:28:50 gocql: Session.handleNodeConnected: 172.16.1.28:9142
2025/03/28 08:28:50 gocql: Session.handleNodeConnected: 172.16.1.14:9142
2025/03/28 08:28:50 gocql: conns of pool after stopped "172.16.1.14": 2
2025/03/28 08:28:50 gocql: conns of pool after stopped "172.16.1.28": 2
2025/03/28 08:29:50 Session.ring:[172.16.1.28:UP][172.16.1.14:UP][172.16.1.28:UP]

...

2025/03/29 20:19:33 gocql: unable to refresh ring: get existing host=[HostInfo hostname="" connectAddress="172.16.1.28" peer="172.16.1.28" rpc_address="172.16.1.28" broadcast_address="<nil>" preferred_ip="172.16.1.28" connect_addr="172.16.1.28" connect_addr_source="connect_address" port=9142 data_centre="ap-northeast-1" rack="ap-northeast-1" host_id="b666465e-cb85-3efa-b3ab-f6cf139e5a39" version="v3.11.2" state=UP num_tokens=1] from prevHosts: cannot find host
2025/03/29 20:19:33 gocql: Session.handleNodeConnected: 172.16.1.14:9142
2025/03/29 20:19:33 gocql: conns of pool after stopped "172.16.1.28": 0
2025/03/29 20:19:33 gocql: conns of pool after stopped "172.16.1.14": 2
2025/03/29 20:19:50 Session.ring:[172.16.1.14:UP][172.16.1.28:UP][172.16.1.14:UP][127.0.0.1:DOWN]

On startup, it had 3 connections, including 2 connections to 172.16.1.28, for some reason. After the cannot find host error on the connection to 172.16.1.28, it had 4 connections, including 2 connections to 172.16.1.14 and 1 connection to 127.0.0.1 that was down.

I'm not sure whether this is normal behavior, but it's working.

By the way, I came up with another workaround. It might be a better solution to add a HostFilter to the cluster config to ignore 127.0.0.1:

cluster.HostFilter = gocql.HostFilterFunc(func(h *gocql.HostInfo) bool {
	return h.ConnectAddress().String() != "127.0.0.1"
})

So far it's working fine after refreshRing. I will keep it running for a while longer.
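The string comparison in the filter above only matches the literal 127.0.0.1. A slightly broader predicate can be sketched with the standard library alone; acceptAddr is a hypothetical helper name, not part of gocql, and it rejects anything in 127.0.0.0/8 as well as ::1.

```go
package main

import (
	"fmt"
	"net"
)

// acceptAddr (hypothetical helper) reports whether a host address should be
// kept. It rejects unparsable addresses and loopback addresses.
func acceptAddr(addr string) bool {
	ip := net.ParseIP(addr)
	return ip != nil && !ip.IsLoopback()
}

func main() {
	for _, a := range []string{"172.16.1.14", "127.0.0.1", "::1"} {
		fmt.Printf("%s accepted=%t\n", a, acceptAddr(a))
	}
}
```

Since HostInfo.ConnectAddress() returns a net.IP, the same IsLoopback check could be used directly inside the HostFilterFunc.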

jameshartig · 2025-03-31T13:58:46Z

> First, in what situation the error cannot find host occur? Is this an expected error? I read the source code, but I couldn't understand it well.

I think that could occur if the same hostID appears twice in the peers table, or the local host is in the peers table.

Can you paste the output for SELECT * FROM system.local WHERE key='local' and SELECT * FROM system.peers?

yomipq · 2025-04-01T00:25:32Z

> I think that could occur if the same hostID appears twice in the peers table, or the local host is in the peers table.

Thank you for telling me that. You are right. I read the code again and now I understand what happened.

cqlsh> SELECT * FROM system.local WHERE key='local';

 key   | bootstrapped | broadcast_address | cluster_name     | cql_version | data_center    | gossip_generation | host_id                              | listen_address | native_protocol_version | partitioner                                 | rack           | release_version | rpc_address | schema_version                       | thrift_version | tokens | truncated_at
-------+--------------+-------------------+------------------+-------------+----------------+-------------------+--------------------------------------+----------------+-------------------------+---------------------------------------------+----------------+-----------------+-------------+--------------------------------------+----------------+--------+--------------
 local |    COMPLETED |         127.0.0.1 | Amazon Keyspaces |       3.4.4 | ap-northeast-1 |                42 | b666465e-cb85-3efa-b3ab-f6cf139e5a39 |      127.0.0.1 |                       4 | org.apache.cassandra.dht.Murmur3Partitioner | ap-northeast-1 |          3.11.2 |   127.0.0.1 | 05deae2d-6405-494d-a965-c0e5836bcb3c |         20.1.0 |   null |         null

(1 rows)
cqlsh> SELECT * FROM system.peers;

 peer        | data_center    | host_id                              | preferred_ip | rack           | release_version | rpc_address | schema_version                       | tokens
-------------+----------------+--------------------------------------+--------------+----------------+-----------------+-------------+--------------------------------------+-------------------------
 172.16.1.28 | ap-northeast-1 | b666465e-cb85-3efa-b3ab-f6cf139e5a39 |  172.16.1.28 | ap-northeast-1 |          3.11.2 | 172.16.1.28 | 05deae2d-6405-494d-a965-c0e5836bcb3c |                  {'-1'}
 172.16.1.14 | ap-northeast-1 | be0f3a14-e107-3fee-a5e5-415c10539abd |  172.16.1.14 | ap-northeast-1 |          3.11.2 | 172.16.1.14 | 05deae2d-6405-494d-a965-c0e5836bcb3c | {'9223372036854775806'}

(2 rows)

Both 127.0.0.1 and 172.16.1.28 have the same hostID.
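The overlap visible in the two query results can be checked mechanically. The following is a rough sketch using hypothetical row and sharedWithLocal names and only the host_id and address columns from the output above; it is not driver code.

```go
package main

import "fmt"

// row is a hypothetical, trimmed-down view of a system.local or
// system.peers row: only the host_id and the address are kept.
type row struct {
	HostID string
	Addr   string
}

// sharedWithLocal returns the peers whose host_id equals the local row's.
func sharedWithLocal(local row, peers []row) []row {
	var dup []row
	for _, p := range peers {
		if p.HostID == local.HostID {
			dup = append(dup, p)
		}
	}
	return dup
}

func main() {
	local := row{"b666465e-cb85-3efa-b3ab-f6cf139e5a39", "127.0.0.1"}
	peers := []row{
		{"b666465e-cb85-3efa-b3ab-f6cf139e5a39", "172.16.1.28"},
		{"be0f3a14-e107-3fee-a5e5-415c10539abd", "172.16.1.14"},
	}
	for _, p := range sharedWithLocal(local, peers) {
		fmt.Printf("host_id %s: local=%s peer=%s\n", p.HostID, local.Addr, p.Addr)
	}
}
```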

joao-r-reis · 2025-04-01T14:17:12Z

Yeah AWS Keyspaces basically returns the full list of hosts (including the local host but with the correct IP address) in the system.peers table when peering is enabled which unfortunately is a behavior that is not consistent with "regular" C* so some drivers have an issue with it.

jameshartig · 2025-04-01T17:23:12Z

> Yeah AWS Keyspaces basically returns the full list of hosts (including the local host but with the correct IP address) in the system.peers table when peering is enabled which unfortunately is a behavior that is not consistent with "regular" C* so some drivers have an issue with it.

Do we need to handle this? We could separate the local from the peers and ignore a duplicate host_id when refreshing the ring.

yomipq · 2025-04-02T00:20:28Z

I would really appreciate it if GoCQL handled this issue. If there is anything I can help with, please let me know.

joao-r-reis · 2025-04-02T10:16:03Z

> Do we need to handle this? We could separate the local from the peers and ignore a duplicate host_id when refreshing the ring.

Yeah we could do this as long as there's no impact to users not using AWS keyspaces. It should target gocql 2.1.0 though since we're trying to wrap up 2.0

yomipq · 2025-04-03T07:42:36Z

Thank you. I'm looking forward to the release.

If a release schedule is planned, could you please tell me approximately when 2.1.0 will be released?

joao-r-reis · 2025-04-03T10:40:34Z

No ETA on 2.1.0 for now. Also no guarantee that this will go in 2.1.0. Someone needs to volunteer to open a PR for this, but let's start by creating a JIRA. I'll do this.

joao-r-reis · 2025-04-03T10:42:18Z

https://issues.apache.org/jira/browse/CASSGO-72

yomipq · 2025-04-04T00:54:58Z

OK, so I will try to fix this issue. Do I need to do something in JIRA? Is it OK to just send a PR on GitHub?

First, I'd like to confirm how to handle this. If I understand correctly, duplicate hostIDs and the localhost address are handled properly when connections are initialized (maybe NewSession() in session.go), but not in the reconnection process (maybe refreshRing() in host_source.go). So I need to fix the reconnection process. Is that right?

jameshartig · 2025-04-04T02:28:55Z

> If I understand correctly, the same hostIDs and localhost address are handled properly on initializing connections (maybe NewSession() in session.go), but not on reconnection process (maybe refreshRing() in host_source.go). So now, I need to fix the reconnection process. Is that right?

I think inside func (r *ringDescriber) GetHosts() we shouldn't blindly just be doing append([]*HostInfo{localHost}, peerHosts...). Maybe we should check whether localHost is already in peerHosts (leaving a comment linking to this issue and probably mentioning Amazon Keyspaces), and if it is, we shouldn't add it twice and should prefer the one from peers.

@joao-r-reis you might have more experience with how other drivers handle this. Thoughts?
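The deduplication suggested above for GetHosts() can be sketched as follows. This is only an illustration of the idea under simplified assumptions: hostInfo and mergeHosts are hypothetical names, hosts are reduced to a host ID and an address, and the short IDs are abbreviations of the ones in the logs.

```go
package main

import "fmt"

// hostInfo is a hypothetical stand-in for the driver's host record.
type hostInfo struct {
	HostID string
	Addr   string
}

// mergeHosts returns local followed by peers, unless a peer already carries
// local's host ID, in which case only the peers entries are returned.
func mergeHosts(local hostInfo, peers []hostInfo) []hostInfo {
	for _, p := range peers {
		if p.HostID == local.HostID {
			// The local host is already listed in peers (as seen with
			// Amazon Keyspaces), so prefer the peers entry.
			return append([]hostInfo(nil), peers...)
		}
	}
	return append([]hostInfo{local}, peers...)
}

func main() {
	local := hostInfo{"b666", "127.0.0.1"}
	peers := []hostInfo{{"b666", "172.16.1.28"}, {"be0f", "172.16.1.14"}}
	for _, h := range mergeHosts(local, peers) {
		fmt.Println(h.HostID, h.Addr)
	}
}
```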

joao-r-reis · 2025-04-04T11:46:12Z

> @joao-r-reis you might have more experience with how other drivers handle this. Thoughts?

Most drivers use maps with IP addresses as keys, but Java driver 4.x moved to using host IDs as the keys. Usually the behavior is to go through the new list of localhost + peer hosts one by one and add the host to the map if there is no entry, or update the entry if it already exists. The Java driver has special handling for cases where the same host ID already exists under a different IP, which happens in C* when a node changes its IP, and I believe this handling will also fix the AWS Keyspaces issue: https://github.com/apache/cassandra-java-driver/blob/529d56e1742dcd1df3ca55c00fd8e02c0e484c68/core/src/main/java/com/datastax/oss/driver/internal/core/metadata/AddNodeRefresh.java#L43-L66

GoCQL has both maps but it uses the map by host id for these checks.
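The add-or-update pattern described above, keyed by host ID, might look roughly like this. It is a simplified sketch with hypothetical node and upsert names, not code from any of the drivers mentioned; because the local host is processed first and the peers afterwards, the peers address is the one that remains.

```go
package main

import "fmt"

// node is a hypothetical host record reduced to an ID and an address.
type node struct {
	HostID string
	Addr   string
}

// upsert adds n to ring when its host ID is new, otherwise it overwrites the
// stored entry. It reports whether an existing entry changed its address.
func upsert(ring map[string]node, n node) (changed bool) {
	old, ok := ring[n.HostID]
	ring[n.HostID] = n
	return ok && old.Addr != n.Addr
}

func main() {
	ring := map[string]node{}
	// Local host first, then the peers, as read from the system tables.
	for _, n := range []node{
		{"b666", "127.0.0.1"},
		{"b666", "172.16.1.28"},
		{"be0f", "172.16.1.14"},
	} {
		if upsert(ring, n) {
			fmt.Printf("host %s changed address to %s\n", n.HostID, n.Addr)
		}
	}
	fmt.Println(len(ring), ring["b666"].Addr)
}
```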

> I think inside func (r *ringDescriber) GetHosts() we shouldn't blindly just be doing append([]*HostInfo{localHost}, peerHosts...)

GetHosts() just returns all the hosts found in system tables without additional logic so I think it should be kept as is and changes should be made to func refreshRing(r *ringDescriber) error instead.

Reading through refreshRing function I'd think that the aws keyspaces case would not be an issue because it starts with the local host as it is iterating through the slice and then when it reaches the peer host with the same host id it should update the IPs... Maybe there's some bug here.

joao-r-reis · 2025-04-04T12:09:46Z

Oh I think the issue is that the control connection queries system.local and updates that host's entry in the map when it connects/reconnects... Other drivers (except java driver 4.x I think) just trigger a full ring refresh when the control connection connects/reconnects and I think we should do this on gocql as well. Basically we can move the refreshRing call from reconnect to setupConn I think.

The issue is that refreshRing triggers r.session.startPoolFill(h) and we only want this to be triggered if it's a reconnection (i.e. c.session.initialized() == true). The way this is handled in the C# driver, for example, is by having event handlers that are only set at the end of the session initialization process, so the ring refresh that happens as part of the first control connection setup triggers events that do nothing. So first the control connection is initialized and a ring refresh is triggered (without pool fills), then event handlers for the ring refresh and topology events are set up, and only then is the pool filling triggered.
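The ordering described above (refresh first, install handlers later) can be illustrated with a small stand-alone sketch. The ring type and its onHostAdded field are hypothetical and do not correspond to gocql or C# driver internals; the explicit pool fill for the initial hosts that follows initialization is left out.

```go
package main

import "fmt"

// ring is a hypothetical host registry keyed by host ID.
type ring struct {
	hosts       map[string]string
	onHostAdded func(addr string) // nil until initialization has finished
}

// refresh records the discovered hosts and notifies the handler, if one is
// installed, about hosts that were not known before.
func (r *ring) refresh(discovered map[string]string) {
	for id, addr := range discovered {
		_, known := r.hosts[id]
		r.hosts[id] = addr
		if !known && r.onHostAdded != nil {
			r.onHostAdded(addr)
		}
	}
}

func main() {
	r := &ring{hosts: map[string]string{}}
	// 1. Initial refresh: no handler yet, so no pool fill is triggered.
	r.refresh(map[string]string{"be0f": "172.16.1.14"})
	// 2. Install the handler once initialization is done.
	r.onHostAdded = func(addr string) { fmt.Println("fill pool for", addr) }
	// 3. Later refreshes trigger pool fills for newly seen hosts only.
	r.refresh(map[string]string{"be0f": "172.16.1.14", "b666": "172.16.1.28"})
}
```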

yomipq · 2025-04-05T07:11:00Z

> Reading through refreshRing function I'd think that the aws keyspaces case would not be an issue because it starts with the local host as it is iterating through the slice and then when it reaches the peer host with the same host id it should update the IPs... Maybe there's some bug here.

I might have found the cause. In the for-loop over hosts in refreshRing (https://github.com/apache/cassandra-gocql-driver/blob/v1.7.0/host_source.go#L727-L756), it first processes the localhost, overwrites the IP address with 127.0.0.1 and removes the hostID from prevHosts. It then processes the peer host with the same hostID, but the hostID no longer exists in prevHosts, resulting in the ErrCannotFindHost error.

I think moving delete(prevHosts, h.HostID()) to after the for-loop would fix the issue. In other words, it first processes localhost, then the peer host, where it reverts the IP address from the local address to the peer address, and finally deletes the processed hostIDs from prevHosts after all hosts have been processed.

What do you think about this?
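The failure and the proposed fix can be reproduced with a simplified model of the loop. This is a sketch based only on the description above, not the actual refreshRing code: the ring is reduced to a map from host ID to address, the names host and refresh are hypothetical, and the deleteInLoop flag switches between deleting inside the loop (the behavior described for v1.7.0) and deleting after it (the proposal).

```go
package main

import (
	"errors"
	"fmt"
)

type host struct {
	ID   string
	Addr string
}

var errCannotFindHost = errors.New("cannot find host")

// refresh applies the discovered hosts to ring (host ID -> address).
// With deleteInLoop set, each host ID is dropped from the snapshot as soon
// as it is processed; otherwise processed IDs are dropped after the loop.
func refresh(ring map[string]string, discovered []host, deleteInLoop bool) error {
	prev := make(map[string]string, len(ring))
	for id, addr := range ring {
		prev[id] = addr
	}
	var seen []string
	for _, h := range discovered {
		if _, known := ring[h.ID]; known {
			if _, ok := prev[h.ID]; !ok {
				return fmt.Errorf("get existing host=%s from prevHosts: %w", h.Addr, errCannotFindHost)
			}
		}
		ring[h.ID] = h.Addr // add the host or update its address
		if deleteInLoop {
			delete(prev, h.ID)
		} else {
			seen = append(seen, h.ID)
		}
	}
	for _, id := range seen {
		delete(prev, id)
	}
	// Whatever is left in prev was not reported this time: remove it.
	for id := range prev {
		delete(ring, id)
	}
	return nil
}

func main() {
	discovered := []host{
		{"b666", "127.0.0.1"},   // from system.local
		{"b666", "172.16.1.28"}, // from system.peers
		{"be0f", "172.16.1.14"}, // from system.peers
	}
	for _, inLoop := range []bool{true, false} {
		ring := map[string]string{"b666": "172.16.1.28", "be0f": "172.16.1.14"}
		err := refresh(ring, discovered, inLoop)
		fmt.Printf("deleteInLoop=%t err=%v b666=%s\n", inLoop, err, ring["b666"])
	}
}
```

In this model, deleting inside the loop leaves the shared host ID pointing at 127.0.0.1 and returns the error, while deleting after the loop ends with the peers address restored.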

joao-r-reis changed the title ~~Connection trouble with Amazon Keyspaces~~ CASSGO-72 Connection trouble with Amazon Keyspaces Apr 3, 2025

yomipq linked a pull request Apr 27, 2025 that will close this issue

CASSGO-72 Fix connection issue in Amazon Keyspaces #1886

Open
