infinite ResolveNow() loop on net transients #8154

Open
lev-lb opened this issue Mar 7, 2025 · 19 comments
Labels: Area: Resolvers/Balancers (LB policy & NR APIs, resolver/balancer/picker wrappers, LB policy impls and utilities); Status: Requires Reporter Clarification; Type: Bug


lev-lb commented Mar 7, 2025

What version of gRPC are you using?

client: v1.70.0, server: v1.61.1, v1.63.2, v1.66.0.

What version of Go are you using (go version)?

go version go1.22.11 linux/amd64

What operating system (Linux, Windows, …) and version?

several Linux versions, including: kernel 4.15.0-112-generic on Ubuntu

What did you do?

gRPC client with a custom resolver (using pick_first LB), connecting to a multi-node server cluster. this codebase appeared to be working fine for about 6 years, up to and including gRPC v1.63.2 - or at least the issue below was never brought to our attention before. since upgrading to v1.70.0, the same code occasionally enters an infinite loop calling the custom resolver's ResolveNow() method. the issue is non-deterministic, but has been observed 4-5 times in the past month, AFAICT - always on network transients (impaired connectivity to the gRPC server cluster).

What did you expect to see?

the problem appears some time after the initial conn was established, lost, re-established, etc. after a net hiccup, i'd expect the lower gRPC layers to keep trying to dial the addresses returned by ResolveNow() as per the backoff and connect timeout config - which they normally do, and have been doing for years. an example of this is at the top of the attached log.

What did you see instead?

occasionally (but rarely, despite forcibly injecting various network faults), the gRPC code enters an infinite loop of calling ResolveNow() at a rate of >15 thousand calls/second, pegging several CPU cores (>210% CPU on an otherwise near-idle machine). the process does not recover. the issue appears to manifest somewhere around here:

2025-03-07T12:48:28.072	INFO	[core] [Channel #1 SubChannel #2]Subchannel Connectivity change to IDLE
2025-03-07T12:48:28.072	INFO	resolver.ResolveNow()	{"lbgrpc": "1i855hu", "targets": "192.168.19.217:443,192.168.19.218:443,192.168.19.220:443,192.168.25.23:443,192.168.26.25:443,192.168.29.142:443,192.168.29.144:443,192.168.16.233:443,192.168.17.106:443,192.168.17.107:443,192.168.17.30:443,192.168.19.213:443,192.168.19.214:443,192.168.19.215:443", "calls-last-sec": 0}
2025-03-07T12:48:28.072	INFO	[transport] [client-transport 0xc000bc46c8] loopyWriter exiting with error: received GOAWAY with no active streams
2025-03-07T12:48:28.072	INFO	[pick-first-lb] [pick-first-lb 0xc000a37d70] Received SubConn state update: 0xc0002bb5e0, {ConnectivityState:IDLE ConnectionError:<nil> connectedAddress:{Addr: ServerName: Attributes:<nil> BalancerAttributes:<nil> Metadata:<nil>}}
2025-03-07T12:48:28.073	INFO	[core] blockingPicker: the picked transport is not ready, loop back to repick
2025-03-07T12:48:28.073	INFO	[core] [Channel #1]Resolver state updated: { "Addresses": [ { "Addr": "192.168.19.217:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.218:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.220:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.25.23:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.26.25:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.29.142:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.29.144:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.16.233:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.106:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.107:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.30:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.213:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.214:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.215:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null } ], "Endpoints": null, "ServiceConfig": null, "Attributes": null} ()
2025-03-07T12:48:28.073	INFO	[core] [Channel #1]Channel Connectivity change to IDLE
2025-03-07T12:48:28.073	INFO	[transport] [client-transport 0xc000bc46c8] Closing: connection error: desc = "error reading from server: read tcp 10.212.1.6:32844->192.168.19.215:443: use of closed network connection"
2025-03-07T12:48:28.073	INFO	[pick-first-lb] [pick-first-lb 0xc000a37d70] Received new config { "shuffleAddressList": false}, resolver state { "Addresses": [ { "Addr": "192.168.19.217:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.218:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.220:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.25.23:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.26.25:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.29.142:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.29.144:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.16.233:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.106:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.107:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.30:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.213:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.214:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.215:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null } ], "Endpoints": null, "ServiceConfig": null, "Attributes": null}
2025-03-07T12:48:28.073	INFO	[core] [Channel #1 SubChannel #2]addrConn: updateAddrs addrs (5 of 14): [{Addr: "192.168.19.217:443", ServerName: "", } {Addr: "192.168.19.218:443", ServerName: "", } {Addr: "192.168.19.220:443", ServerName: "", } {Addr: "192.168.25.23:443", ServerName: "", } {Addr: "192.168.26.25:443", ServerName: "", }]
2025-03-07T12:48:28.073	INFO	[core] [Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING
2025-03-07T12:48:28.073	INFO	[core] [Channel #1 SubChannel #2]Subchannel picks a new address "192.168.19.215:443" to connect
2025-03-07T12:48:28.073	INFO	[pick-first-lb] [pick-first-lb 0xc000a37d70] Received SubConn state update: 0xc0002bb5e0, {ConnectivityState:CONNECTING ConnectionError:<nil> connectedAddress:{Addr: ServerName: Attributes:<nil> BalancerAttributes:<nil> Metadata:<nil>}}
2025-03-07T12:48:28.073	INFO	[core] [Channel #1]Channel Connectivity change to CONNECTING
2025-03-07T12:48:28.073	INFO	[core] [Channel #1 SubChannel #2]Subchannel picks a new address "192.168.19.217:443" to connect
2025-03-07T12:48:28.073	INFO	[core] Creating new client transport to "{Addr: \"192.168.19.215:443\", ServerName: \"lb-resolver\", }": connection error: desc = "transport: Error while dialing: dial tcp 192.168.19.215:443: operation was canceled"
2025-03-07T12:48:28.073	WARN	[core] [Channel #1 SubChannel #2]grpc: addrConn.createTransport failed to connect to {Addr: "192.168.19.215:443", ServerName: "lb-resolver", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 192.168.19.215:443: operation was canceled"
2025-03-07T12:48:28.073	INFO	resolver.ResolveNow()	{"lbgrpc": "1i855hu", "targets": "192.168.19.218:443,192.168.19.220:443,192.168.25.23:443,192.168.26.25:443,192.168.29.142:443,192.168.29.144:443,192.168.16.233:443,192.168.17.106:443,192.168.17.107:443,192.168.17.30:443,192.168.19.213:443,192.168.19.214:443,192.168.19.215:443,192.168.19.217:443", "calls-last-sec": 1}
2025-03-07T12:48:28.074	INFO	[core] [Channel #1]Resolver state updated: { "Addresses": [ { "Addr": "192.168.19.218:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.220:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.25.23:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.26.25:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.29.142:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.29.144:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.16.233:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.106:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.107:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.30:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.213:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.214:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.215:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.217:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null } ], "Endpoints": null, "ServiceConfig": null, "Attributes": null} ()
2025-03-07T12:48:28.074	INFO	[pick-first-lb] [pick-first-lb 0xc000a37d70] Received new config { "shuffleAddressList": false}, resolver state { "Addresses": [ { "Addr": "192.168.19.218:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.220:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.25.23:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.26.25:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.29.142:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.29.144:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.16.233:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.106:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.107:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.30:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.213:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.214:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.215:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.217:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null } ], "Endpoints": null, "ServiceConfig": null, "Attributes": null}
2025-03-07T12:48:28.074	INFO	[core] [Channel #1 SubChannel #2]addrConn: updateAddrs addrs (5 of 14): [{Addr: "192.168.19.218:443", ServerName: "", } {Addr: "192.168.19.220:443", ServerName: "", } {Addr: "192.168.25.23:443", ServerName: "", } {Addr: "192.168.26.25:443", ServerName: "", } {Addr: "192.168.29.142:443", ServerName: "", }]
2025-03-07T12:48:28.074	INFO	[core] Creating new client transport to "{Addr: \"192.168.19.217:443\", ServerName: \"lb-resolver\", }": connection error: desc = "transport: Error while dialing: dial tcp 192.168.19.217:443: operation was canceled"
2025-03-07T12:48:28.074	WARN	[core] [Channel #1 SubChannel #2]grpc: addrConn.createTransport failed to connect to {Addr: "192.168.19.217:443", ServerName: "lb-resolver", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 192.168.19.217:443: operation was canceled"
2025-03-07T12:48:28.074	INFO	resolver.ResolveNow()	{"lbgrpc": "1i855hu", "targets": "192.168.19.220:443,192.168.25.23:443,192.168.26.25:443,192.168.29.142:443,192.168.29.144:443,192.168.16.233:443,192.168.17.106:443,192.168.17.107:443,192.168.17.30:443,192.168.19.213:443,192.168.19.214:443,192.168.19.215:443,192.168.19.217:443,192.168.19.218:443", "calls-last-sec": 2}
2025-03-07T12:48:28.074	INFO	[core] [Channel #1 SubChannel #2]Subchannel picks a new address "192.168.19.218:443" to connect
2025-03-07T12:48:28.074	INFO	[core] [Channel #1]Resolver state updated: { "Addresses": [ { "Addr": "192.168.19.220:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.25.23:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.26.25:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.29.142:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.29.144:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.16.233:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.106:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.107:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.30:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.213:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.214:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.215:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.217:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.218:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null } ], "Endpoints": null, "ServiceConfig": null, "Attributes": null} ()
2025-03-07T12:48:28.074	INFO	[pick-first-lb] [pick-first-lb 0xc000a37d70] Received new config { "shuffleAddressList": false}, resolver state { "Addresses": [ { "Addr": "192.168.19.220:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.25.23:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.26.25:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.29.142:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.29.144:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.16.233:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.106:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.107:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.30:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.213:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.214:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.215:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.217:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.218:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null } ], "Endpoints": null, "ServiceConfig": null, "Attributes": null}
2025-03-07T12:48:28.074	INFO	[core] [Channel #1 SubChannel #2]addrConn: updateAddrs addrs (5 of 14): [{Addr: "192.168.19.220:443", ServerName: "", } {Addr: "192.168.25.23:443", ServerName: "", } {Addr: "192.168.26.25:443", ServerName: "", } {Addr: "192.168.29.142:443", ServerName: "", } {Addr: "192.168.29.144:443", ServerName: "", }]
2025-03-07T12:48:28.075	INFO	[core] Creating new client transport to "{Addr: \"192.168.19.218:443\", ServerName: \"lb-resolver\", }": connection error: desc = "transport: Error while dialing: dial tcp 192.168.19.218:443: operation was canceled"
2025-03-07T12:48:28.075	WARN	[core] [Channel #1 SubChannel #2]grpc: addrConn.createTransport failed to connect to {Addr: "192.168.19.218:443", ServerName: "lb-resolver", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 192.168.19.218:443: operation was canceled"
2025-03-07T12:48:28.075	INFO	resolver.ResolveNow()	{"lbgrpc": "1i855hu", "targets": "192.168.25.23:443,192.168.26.25:443,192.168.29.142:443,192.168.29.144:443,192.168.16.233:443,192.168.17.106:443,192.168.17.107:443,192.168.17.30:443,192.168.19.213:443,192.168.19.214:443,192.168.19.215:443,192.168.19.217:443,192.168.19.218:443,192.168.19.220:443", "calls-last-sec": 3}
2025-03-07T12:48:28.075	INFO	[core] [Channel #1]Resolver state updated: { "Addresses": [ { "Addr": "192.168.25.23:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.26.25:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.29.142:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.29.144:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.16.233:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.106:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.107:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.30:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.213:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.214:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.215:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.217:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.218:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.220:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null } ], "Endpoints": null, "ServiceConfig": null, "Attributes": null} ()
2025-03-07T12:48:28.075	INFO	[core] [Channel #1 SubChannel #2]Subchannel picks a new address "192.168.19.220:443" to connect
2025-03-07T12:48:28.075	INFO	[pick-first-lb] [pick-first-lb 0xc000a37d70] Received new config { "shuffleAddressList": false}, resolver state { "Addresses": [ { "Addr": "192.168.25.23:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.26.25:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.29.142:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.29.144:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.16.233:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.106:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.107:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.30:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.213:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.214:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.215:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.217:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.218:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.220:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null } ], "Endpoints": null, "ServiceConfig": null, "Attributes": null}
2025-03-07T12:48:28.075	INFO	[core] [Channel #1 SubChannel #2]addrConn: updateAddrs addrs (5 of 14): [{Addr: "192.168.25.23:443", ServerName: "", } {Addr: "192.168.26.25:443", ServerName: "", } {Addr: "192.168.29.142:443", ServerName: "", } {Addr: "192.168.29.144:443", ServerName: "", } {Addr: "192.168.16.233:443", ServerName: "", }]

see the attached log; the relevant portion (reproduced above) starts around the PROBLEM MANIFESTS AROUND HERE marker.

sample.2025-03-07.12-54-08.log.gz

the only unusual entry in the log that i see before this ResolveNow() infinite loop, but don't normally find in the logs on other net transients (which are handled gracefully), is this one (UPDATED): [core] blockingPicker: the picked transport is not ready, loop back to repick (originally i had quoted loopyWriter exiting with error: connection error: desc = "keepalive ping failed to receive ACK within timeout" here).

potentially relevant gRPC client options:

// keepalive: ping every 10s even without active streams; drop the conn if no ACK within 10s
kal := keepalive.ClientParameters{
    Time:                10 * time.Second,
    Timeout:             10 * time.Second,
    PermitWithoutStream: true,
}
// tight reconnect backoff: 10ms base, x5 multiplier, capped at 3s
dialBackoffConfig := backoff.Config{
    BaseDelay:  10 * time.Millisecond,
    Multiplier: 5,
    Jitter:     0.1,
    MaxDelay:   3 * time.Second,
}
cp := grpc.ConnectParams{
    Backoff:           dialBackoffConfig,
    MinConnectTimeout: 1 * time.Second,
}
opts := []grpc.DialOption{
    grpc.WithBlock(),
    grpc.WithDisableRetry(),
    grpc.WithDefaultCallOptions(grpc.WaitForReady(true)),
    grpc.WithKeepaliveParams(kal),
    grpc.WithConnectParams(cp),
    grpc.WithResolvers(customResolver),
}
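
for completeness, a hedged sketch of how these options are presumably wired into grpc.Dial (which, per the discussion below, is what the client still uses); the target string, the credentials placeholder and the error handling are assumptions, not the reporter's actual code:

// continues the snippet above; additionally assumes
//   import "google.golang.org/grpc/credentials/insecure"
// the real client presumably uses TLS credentials instead (the servers listen on :443).
opts = append(opts, grpc.WithTransportCredentials(insecure.NewCredentials()))

// "lb:///cluster" is a placeholder target handled by customResolver.
conn, err := grpc.Dial("lb:///cluster", opts...)
if err != nil {
    log.Fatalf("dialing cluster: %v", err) // real code would propagate the error
}
defer conn.Close()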
purnesh42H self-assigned this Mar 10, 2025
purnesh42H added the Area: Resolvers/Balancers label Mar 10, 2025

purnesh42H commented Mar 10, 2025

@lev-lb are you using grpc.Dial or grpc.NewClient? If you have switched to grpc.NewClient, which was released in 1.64, the WithBlock option will be ignored. See the release notes: https://github.com/grpc/grpc-go/releases/tag/v1.64.0

From the logs I am seeing "operation was canceled", which means context cancelation by the caller, either due to a short timeout or manual cancelation. This is likely why the backoff never gets a chance to execute and the connection attempt is immediately aborted. If the cancelation happens while dialing, all addresses are exhausted and the clientconn is reported as TRANSIENT_FAILURE, which triggers the reconnection process.

It's hard to debug further with just the logs. Could you provide your code showing how you are creating the client and how the cancelation is happening?

Also, do you have any back off implemented in your custom name resolver?


lev-lb commented Mar 11, 2025

hi, @purnesh42H , thanks for looking into it.

@lev-lb are you using grpc.Dial or grpc.NewClient? If you have switched to grpc.NewClient, which was released in 1.64, the WithBlock option will be ignored.

the code in question still uses grpc.Dial(), but that's neither here nor there: the infinite loop happens hours after the initial Dial(), long after the initial connection has been established, lost, re-established, etc.

From the logs I am seeing "operation was canceled", which means context cancelation by the caller, either due to a short timeout or manual cancelation. This is likely why the backoff never gets a chance to execute and the connection attempt is immediately aborted. If the cancelation happens while dialing, all addresses are exhausted and the clientconn is reported as TRANSIENT_FAILURE, which triggers the reconnection process.

all that is fine and by design: all the app-level calls are capped at ~10sec, and when the cluster is unreachable it's expected that they will time out, and that the CC will go into TRANSIENT_FAILURE (among other things). the fact that a reconnection process takes place is expected. the fact that gRPC core sporadically calls ResolveNow() to do this at a rate of over 15 thousand times per second is not.

i see nothing in the docs of the grpc.Dial() (or NewClient(), for that matter) options, including backoff and timeouts, to indicate that gRPC core will ever do that, nor do i see any logic in gRPC core that is meant to do that - especially since it has happened only a handful of times over months of 24/7 uptime, rather than every time.

It's hard to debug further with just the logs. Could you provide your code showing how you are creating the client and how the cancelation is happening?

see above: the cancellations are just plain vanilla gRPC call context timeouts. the gRPC client is created once, at app startup (i'm oversimplifying, but it's something that happens very rarely, say once a month). a client is expected to experience many CC state transients: some caused by app cluster server outages, some by network partitions, but also some by app cluster servers closing the connection, which they do after 60 sec of idle conn.
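
for reference, a hedged sketch of the per-call pattern described above (an illustration only; the stub, method and request type are placeholders, not the actual app code):

// hypothetical unary call with a ~10 sec per-call deadline; when the deadline
// expires, the context is canceled and the pending call fails with DeadlineExceeded.
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
resp, err := client.SomeMethod(ctx, &pb.SomeRequest{}) // placeholder stub call
_ = resp
_ = err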

it is the latter flow that i suspect at the moment. AFAICT, this infinite ResolveNow() loop appears to coincide with the upstream gRPC server closing the conn (GOAWAY with no error, also reflected in the loopyWriter transport trace i mentioned above) just at the same time as the app is issuing a request on the gRPC client. coincidentally, the two sides have 60sec timers, one for closing the idle conn, the other for re-fetching some data once a minute. due to variable response timing, these coincide very rarely.

Also, do you have any back off implemented in your custom name resolver?

no. AFAICT, that would make it impossible for me to control the reconnection process to the cluster. i'm specifying the backoff and timeout params to the gRPC client itself (reproduced in full above). as you can see, we need that reconnection to be relatively tight, time-wise, with rapid failover between the app cluster servers, one or more of which are extremely likely to be up at any given time. i have no control over when gRPC core will call into ResolveNow(), so if gRPC core follows its backoff/timeout config (which it does the overwhelming majority of the time) and then also calls into a resolver that introduces its own additional blocking delay, sometimes i'll end up with no delay, sometimes with a compound delay of both, and sometimes with a random mix of the two. i need a deterministic backoff there.


lev-lb commented Mar 11, 2025

Could you provide your code showing how you are creating the client and how the cancelation is happening?

@purnesh42H : actually, a very old version of the code is available online; client creation is here, and cancellation is just the ctx timeout on individual gRPC calls.

the app currently experiencing the issue is closed-source, but the resolver+LB+client design is the same for all practical purposes. the timings used in the app linked above are different; i have reproduced the real config currently in use in the ticket above.


lev-lb commented Mar 11, 2025

i had another repro today. so far, i can confirm that:

  • this ResolveNow() loop appears to be consistently happening when the client happens to submit a new gRPC request just as the server is closing the connection on its side due to idle timeout.
    • it appears to be always preceded by some loopyWriter exiting with error trace, the most recent ones i'm seeing are received GOAWAY with no active streams. i might have been mistaken about the keepalive ping failed to receive ACK within timeout trace i initially pasted above in the text (the log snippet and the attached log were still correct, however). i'm guessing that just reflects the above point.
  • not every loopyWriter exiting with error: received GOAWAY with no active streams trace leads to the death loop: in most cases gRPC core appears to recover nicely, as can be seen in the original log attached above.
  • however, so far, every instance of the [core] blockingPicker: the picked transport is not ready, loop back to repick trace i've seen was immediately followed by the ResolveNow() death loop, and every death loop that i have logs for immediately follows that trace.
  • incidentally, only the client version was upgraded to v1.70.0. the server gRPC versions this issue has been observed with so far are probably v1.61.1, v1.63.2 and v1.66.0. no tests against v1.70.0 servers were carried out (nor are possible at present).

HTH.


purnesh42H commented Mar 13, 2025

@lev-lb I am still trying to repro. Just to check, are you able to repro the same behavior when using the DNS resolver?


lev-lb commented Mar 13, 2025

@purnesh42H : probably not, for a couple of reasons. for starters, i can't use the DNS resolver for the app in question, which is why it has a custom resolver to begin with.

but also, you probably wouldn't be able to spot this bug with the DNS resolver, certainly not in its default config, as (AFAICT from the code) the DNS resolver has its own internal backoff mechanism. this makes deterministic backoff/retries impossible, of course, as now you have 2 backoff mechanisms playing off one another. and that, in turn, makes repro more difficult - assuming my theory about the problem occurring when the client app initiates a call just as the server is closing the connection is correct, that is...


purnesh42H commented Mar 24, 2025

I finally found some time to look into this further.

loopyWriter exiting with error: received GOAWAY with no active streams

this just means the connection to the server is being closed, which is expected since you mentioned the server closes idle connections.

[core] blockingPicker: the picked transport is not ready, loop back to repick

this is just a side effect of the connection to the address in the subchannel being closed, so the picker wrapper is going to wait for a new picker update.


lev-lb commented Mar 24, 2025

@purnesh42H : thanks for looking into it again.

[core] blockingPicker: the picked transport is not ready, loop back to repick

this is just a side effect of the subchannel being closed, so the picker wrapper is going to wait for a new picker update.

makes sense. but, as i mentioned above:

  • not every loopyWriter exiting with error: received GOAWAY with no active streams trace leads to the death loop: in most cases gRPC core appears to recover nicely, as can be seen in the original log attached above.
  • however, so far, every instance of the [core] blockingPicker: the picked transport is not ready, loop back to repick trace i've seen was immediately followed by the ResolveNow() death loop, and every death loop that i have logs for immediately follows that trace.

and this observation still stands, even as i keep hitting these issues almost daily now, always under the same specific circumstances.

so perhaps the flow that gets kickstarted by that picker wait is the one causing it - or that flow simply includes the picker wait (maybe it's just part of the problematic flow)...


purnesh42H commented Mar 24, 2025

Coming back to the ResolveNow loop. One thing I can see in the logs you posted above might explain why ResolveNow is re-triggered after the connection is closed: addrConn: updateAddrs addrs (5 of 14):.......

This means that a new address list was received from the resolver, and because of the rotation logic in your custom resolver, the list may be considered different from the previous one (even though all the addresses are the same), which starts the transport reset goroutine because the channel is in CONNECTING state: http://github.com/grpc/grpc-go/blob/a51009d1d7074ee1efcd323578064cbe44ef87e5/clientconn.go#L959-L1009

It is this resetTransportAndUnlock() that makes the re-resolution request after failing to connect to any of the addresses from the resolver list. The backoff timer does kick in after that, but because your resolver immediately produces a new list in a different order, I am wondering whether the whole process restarts immediately. In the happy case, where a connection to at least one address is made, this isn't a problem because re-resolution is not needed. However, if the server has closed the connection, re-resolution will happen.

I discussed with the other maintainers, and one thing you can try is the new pick_first policy, by setting the GRPC_EXPERIMENTAL_ENABLE_NEW_PICK_FIRST env var to true, if that is better suited to your custom resolver. It might also be good to have some backoff in your resolver itself; a rough illustration is sketched below.
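
a minimal sketch, for illustration only (an assumption, not the reporter's resolver nor an official recommendation), of what such resolver-side throttling could look like: ResolveNow() hints that arrive faster than minInterval apart are simply dropped.

package lbresolver

import (
    "sync"
    "time"

    "google.golang.org/grpc/resolver"
)

type throttledResolver struct {
    cc          resolver.ClientConn
    mu          sync.Mutex
    lastResolve time.Time
    minInterval time.Duration // e.g. 100 * time.Millisecond
}

// ResolveNow drops hints that arrive sooner than minInterval after the previous
// one; a fancier version could schedule a single deferred re-resolution instead.
func (r *throttledResolver) ResolveNow(resolver.ResolveNowOptions) {
    r.mu.Lock()
    if time.Since(r.lastResolve) < r.minInterval {
        r.mu.Unlock()
        return
    }
    r.lastResolve = time.Now()
    r.mu.Unlock()
    r.pushState()
}

// pushState recomputes the (rotated) address list and reports it to the channel.
func (r *throttledResolver) pushState() {
    addrs := []resolver.Address{ /* cluster addresses, rotated for stickiness */ }
    _ = r.cc.UpdateState(resolver.State{Addresses: addrs})
}

// Close implements resolver.Resolver.
func (r *throttledResolver) Close() {}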

Also, could you confirm if you are using proxy?


lev-lb commented Mar 24, 2025

well, the list of addresses is usually (but not always) "the same", just in a different order, since i need the server selection to be "sticky". i.e. i need the opposite of per-call load balancing from the same client - the load balancing happens across clients, not across calls of the same client. a given client should keep talking to the same server for as long as possible, with as many and as long pauses between gRPC calls as the app requires. which is why i went with pick_first. still, if the cluster geometry changes - addresses might come or go.

  • does GRPC_EXPERIMENTAL_ENABLE_NEW_PICK_FIRST policy support such a use case?
  • how is it different in terms of the observable behaviour from the original pick_first? is this documented somewhere? both the commit message and the PR description are very sparse on the details of the behavioural differences...
  • just how "experimental" is this?

i'm trying GRPC_EXPERIMENTAL_ENABLE_NEW_PICK_FIRST now. may i assume that if i see pick-first-leaf-lb in the gRPC log - it's using the right thing?

Also, could you confirm if you are using proxy?

nope, not using any proxies anywhere (presumably a mux on the server side doesn't qualify as a proxy?).

It might also be good to have some backoff in your resolver itself.

if i do that, AFAICT, i lose control over the backoff timings. my use cases are "very soft RT": if my client (which runs as part of a different service) can't give an answer to its own client within the bounds of the timeout, i'm in trouble, so within that very limited time budget i need to make sure to knock on several different cluster servers. 2 layers of backoff mean the timing gets random. and submitting requests concurrently to more than one server is a no-no for a number of reasons.

besides, even if i had a backoff, it seems there's still a bug in some gRPC channel state machine somewhere out there, as, AFAICT, the channel never recovers, regardless of the subsequent state of the network or servers...

@purnesh42H

The new pick_first was introduced in https://github.com/grpc/grpc-go/releases/tag/v1.69.0, which has more details. The new pick_first should still keep using the same address once chosen, until it receives a failure, so that doesn't change for your client. The main difference is that the new pick_first has one address per subchannel, so it retries with backoff per subchannel and consequently per address. The old pick_first had all the addresses in one subchannel, which results in retrying the entire list of addresses and then backing off. Since the new pick_first does not compare the new list of addresses against the old one (a comparison that relies on ordering), you might not see the infinite ResolveNow() loop with your custom resolver.

@purnesh42H

i'm trying GRPC_EXPERIMENTAL_ENABLE_NEW_PICK_FIRST now. may i assume that if i see pick-first-leaf-lb in the gRPC log - it's using the right thing?

yes, correct

just how "experimental" is this?

we will most likely stabilize it in the next 1-2 releases unless decided otherwise.


lev-lb commented Mar 25, 2025

here's another head scratcher from production with the original pick_first in the meantime...

with the client app adjusted to issue the next gRPC call(s) at 55 sec intervals (5 sec before the server closes the idle conn, so as to not collide), this still happens, but in a surprising fashion. a pair of gRPC calls enter the client code (4 msec apart), appear to get stuck for 5 seconds until the server closes the channel (60 sec idle limit), and then the gRPC client goes into the ResolveNow() loop, on schedule. huh?!? 🤔

no gRPC-level logs on this run, unfortunately - see 2 messages downthread for that.

UPD: and no, the fact that the app happens to issue a couple of racing calls to the gRPC client does not appear to be germane to the issue; many previous cases happened with just one outstanding gRPC call.


lev-lb commented Mar 25, 2025

@purnesh42H : BTW, can you clarify this bit, please?

It might also be good to have some backoff in your resolver itself.

based on what i see in the logs, it would appear that the ResolveNow() loop is actually infinite, and does not depend on the network conditions at all. the client transport appears to not even try to connect to the targets - it just immediately returns with transport: Error while dialing: dial tcp <whatever>: operation was canceled - but gRPC keeps doing this internally nonetheless.

how can a backoff in the resolver help overcome that?


lev-lb commented Mar 25, 2025

with the client app adjusted to issue the next gRPC call(s) at 55 sec intervals (5 sec before the server closes the idle conn, so as to not collide), this still happens, but in a surprising fashion. a pair of gRPC calls enter the client code (4 msec apart), appear to get stuck for 5 seconds until the server closes the channel (60 sec idle limit), and then the gRPC client goes into the ResolveNow() loop, on schedule. huh?!? 🤔

i dug up the logs of a repro with this 5 sec delay and gRPC core logging, original pick_first. i'm using a unary client interceptor to trace entry/exit from the gRPC client. between the trace upon invocation of a gRPC method from that interceptor and the time gRPC core goes into the resolver loop there's literally nothing, no traces from gRPC core whatsoever:

2025-03-25T23:04:28.090    DEBUG   gRPC-client-interceptor-about-to-call-invoker()   {"lbgrpc": "08sk5j0", "grpc.start_time": "2025-03-25T23:04:28.0903+02:00", "grpc.request.deadline": "2025-03-25T23:04:38.0902+02:00"}
    <----- 5 seconds of NO log entries from gRPC!
2025-03-25T23:04:33.075    INFO    [core] [Channel #7 SubChannel #8]Subchannel Connectivity change to IDLE
2025-03-25T23:04:33.075    INFO    resolver.ResolveNow()    {"lbgrpc": "08sk5j0", "targets": "192.168.24.194:443,192.168.24.195:443,192.168.24.193:443"}
2025-03-25T23:04:33.076    INFO    [core] blockingPicker: the picked transport is not ready, loop back to repick
2025-03-25T23:04:33.076    INFO    [core] blockingPicker: the picked transport is not ready, loop back to repick
2025-03-25T23:04:33.076    INFO    [transport] [client-transport 0xc000792fc8] loopyWriter exiting with error: received GOAWAY with no active streams
2025-03-25T23:04:33.076    INFO    [core] [Channel #7]Resolver state updated: { "Addresses": [ { "Addr": "192.168.24.194:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.24.195:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.24.193:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null } ], "Endpoints": null, "ServiceConfig": null, "Attributes": null } ()
2025-03-25T23:04:33.076    INFO    [pick-first-lb] [pick-first-lb 0xc000b348a0] Received SubConn state update: 0xc000b15540, {ConnectivityState:IDLE ConnectionError:<nil> connectedAddress:{Addr: ServerName: Attributes:<nil> BalancerAttributes:<nil> Metadata:<nil>}}
2025-03-25T23:04:33.076    INFO    [core] [Channel #7]Channel Connectivity change to IDLE
2025-03-25T23:04:33.076    INFO    [pick-first-lb] [pick-first-lb 0xc000b348a0] Received new config { "shuffleAddressList": false }, resolver state { "Addresses": [ { "Addr": "192.168.24.194:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.24.195:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.24.193:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null } ], "Endpoints": null, "ServiceConfig": null, "Attributes": null }
2025-03-25T23:04:33.076    INFO    [core] [Channel #7 SubChannel #8]addrConn: updateAddrs addrs (3 of 3): [{Addr: "192.168.24.194:443", ServerName: "", } {Addr: "192.168.24.195:443", ServerName: "", } {Addr: "192.168.24.193:443", ServerName: "", }]
2025-03-25T23:04:33.076    INFO    [core] [Channel #7 SubChannel #8]Subchannel Connectivity change to CONNECTING
2025-03-25T23:04:33.076    INFO    [transport] [client-transport 0xc000792fc8] Closing: connection error: desc = "error reading from server: read tcp 10.212.1.15:39530->192.168.24.193:443: use of closed network connection"
2025-03-25T23:04:33.076    INFO    [core] connect called on addrConn in non-idle state (CONNECTING); ignoring.
2025-03-25T23:04:33.076    INFO    [pick-first-lb] [pick-first-lb 0xc000b348a0] Received SubConn state update: 0xc000b15540, {ConnectivityState:CONNECTING ConnectionError:<nil> connectedAddress:{Addr: ServerName: Attributes:<nil> BalancerAttributes:<nil> Metadata:<nil>}}
2025-03-25T23:04:33.076    INFO    [core] [Channel #7]Channel Connectivity change to CONNECTING
2025-03-25T23:04:33.076    INFO    [core] [Channel #7 SubChannel #8]Subchannel picks a new address "192.168.24.194:443" to connect
2025-03-25T23:04:33.076    INFO    [core] [Channel #7 SubChannel #8]Subchannel picks a new address "192.168.24.193:443" to connect
2025-03-25T23:04:33.076    INFO    [core] Creating new client transport to "{Addr: \"192.168.24.193:443\", ServerName: \"lb-resolver\", }": connection error: desc = "transport: Error while dialing: dial tcp 192.168.24.193:443: operation was canceled"
2025-03-25T23:04:33.076    WARN    [core] [Channel #7 SubChannel #8]grpc: addrConn.createTransport failed to connect to {Addr: "192.168.24.193:443", ServerName: "lb-resolver", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 192.168.24.193:443: operation was canceled"
2025-03-25T23:04:33.076    INFO    resolver.ResolveNow()    {"lbgrpc": "08sk5j0", "targets": "192.168.24.195:443,192.168.24.193:443,192.168.24.194:443"}
    <...and so on, with ResolveNow()...>

off the top of my head, 5 sec does not ring a bell - IIRC, none of the client/server timeouts are around 5 sec, not even after backoff.


purnesh42H commented Mar 27, 2025

@lev-lb did you get to try the new pick_first? Let us know if the issue goes away with that. Thanks.

Will go through the new logs that you have provided when I have some time.


dfawley commented Mar 27, 2025

we will most likely stabilize it in the next 1-2 releases unless decided otherwise.

The option itself is what's experimental, really. It will become the default on the next release:

// NewPickFirstEnabled is set if the new pickfirst leaf policy is to be used
// instead of the exiting pickfirst implementation. This can be disabled by
// setting the environment variable "GRPC_EXPERIMENTAL_ENABLE_NEW_PICK_FIRST"
// to "false".
NewPickFirstEnabled = boolFromEnv("GRPC_EXPERIMENTAL_ENABLE_NEW_PICK_FIRST", true)


github-actions bot commented Apr 2, 2025

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

github-actions bot added the stale label Apr 2, 2025

lev-lb commented Apr 2, 2025

sorry for the delay. at a quick glance, AFAIC, the "new" pick_first appears to totally disregard its backoff config when it enters the same race condition. the below is one of multiple similar episodes i've seen with GRPC_EXPERIMENTAL_ENABLE_NEW_PICK_FIRST=true.

the resolver used in the log below injects a delay of 100 msec if it detects that gRPC core went into the death loop (a rough sketch of that guard follows). since-last-ms is the time elapsed (in msec) since ResolveNow() was last invoked by gRPC core, so, e.g., 0.281 is 281 microseconds. and i definitely see that gRPC connection hammering the dead servers at an even higher rate than that. that's with the same backoff config as listed at the top of this thread - which is a far cry from those numbers.
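
for context, a minimal sketch of that 100 msec guard (the field names, logger and storm-detection threshold are assumptions, not the actual resolver code):

// inside the custom resolver: if gRPC core re-invokes ResolveNow() nearly
// back-to-back, assume a resolution storm and sleep briefly before answering.
func (r *lbResolver) ResolveNow(resolver.ResolveNowOptions) {
    r.mu.Lock()
    sinceLast := time.Since(r.lastResolveNow)
    r.lastResolveNow = time.Now()
    r.mu.Unlock()

    if sinceLast < 10*time.Millisecond { // hypothetical storm-detection threshold
        log.Printf("WARN resolver.ResolveNow() since-last-ms=%.3f", float64(sinceLast.Microseconds())/1000.0)
        time.Sleep(100 * time.Millisecond) // the app-side damping visible in the log below
    }
    r.pushState() // recompute the rotated target list and report it via UpdateState
}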

2025-03-27T10:53:09.282    INFO    resolver.ResolveNow()    {"lbgrpc": "0stexsw", "targets": "192.168.24.193:443,192.168.24.194:443,192.168.24.195:443", "since-last-ms": 38495.582}
2025-03-27T10:53:18.721    INFO    resolver.ResolveNow()    {"lbgrpc": "0stexsw", "targets": "192.168.24.194:443,192.168.24.195:443,192.168.24.193:443", "since-last-ms": 9438.413}
2025-03-27T10:53:18.823    WARN    resolver.ResolveNow()    {"lbgrpc": "0stexsw", "targets": "192.168.24.195:443,192.168.24.193:443,192.168.24.194:443", "since-last-ms": 0.281}
2025-03-27T10:53:18.924    WARN    resolver.ResolveNow()    {"lbgrpc": "0stexsw", "targets": "192.168.24.193:443,192.168.24.194:443,192.168.24.195:443", "since-last-ms": 0.4}
2025-03-27T10:53:19.026    WARN    resolver.ResolveNow()    {"lbgrpc": "0stexsw", "targets": "192.168.24.194:443,192.168.24.195:443,192.168.24.193:443", "since-last-ms": 0.437}
2025-03-27T10:53:19.129    WARN    resolver.ResolveNow()    {"lbgrpc": "0stexsw", "targets": "192.168.24.195:443,192.168.24.193:443,192.168.24.194:443", "since-last-ms": 0.456}
2025-03-27T10:53:19.231    WARN    resolver.ResolveNow()    {"lbgrpc": "0stexsw", "targets": "192.168.24.193:443,192.168.24.194:443,192.168.24.195:443", "since-last-ms": 0.418}
2025-03-27T10:53:19.333    WARN    resolver.ResolveNow()    {"lbgrpc": "0stexsw", "targets": "192.168.24.194:443,192.168.24.195:443,192.168.24.193:443", "since-last-ms": 0.375}
2025-03-27T10:53:19.434    WARN    resolver.ResolveNow()    {"lbgrpc": "0stexsw", "targets": "192.168.24.195:443,192.168.24.193:443,192.168.24.194:443", "since-last-ms": 0.427}
2025-03-27T10:53:19.536    WARN    resolver.ResolveNow()    {"lbgrpc": "0stexsw", "targets": "192.168.24.193:443,192.168.24.194:443,192.168.24.195:443", "since-last-ms": 0.203}
2025-03-27T10:53:19.638    WARN    resolver.ResolveNow()    {"lbgrpc": "0stexsw", "targets": "192.168.24.194:443,192.168.24.195:443,192.168.24.193:443", "since-last-ms": 0.345}
2025-03-27T10:53:19.740    WARN    resolver.ResolveNow()    {"lbgrpc": "0stexsw", "targets": "192.168.24.195:443,192.168.24.193:443,192.168.24.194:443", "since-last-ms": 0.384}
2025-03-27T10:53:19.842    WARN    resolver.ResolveNow()    {"lbgrpc": "0stexsw", "targets": "192.168.24.193:443,192.168.24.194:443,192.168.24.195:443", "since-last-ms": 0.506}
2025-03-27T10:53:20.198    INFO    resolver.ResolveNow()    {"lbgrpc": "0stexsw", "targets": "192.168.24.194:443,192.168.24.195:443,192.168.24.193:443", "since-last-ms": 356.777}
2025-03-27T10:53:20.343    INFO    resolver.ResolveNow()    {"lbgrpc": "0stexsw", "targets": "192.168.24.195:443,192.168.24.193:443,192.168.24.194:443", "since-last-ms": 144.428}

for the record, these connection storms are triggered by one or two gRPC calls in total; nothing in the app code on the client invokes gRPC methods at a sustained sub-ms rate, ever.

so - no, AFAIC, the new pick_first suffers from a similar problem of occasionally (but rarely) completely disregarding its backoff config. the only reason it's not killing the client in the log above is the app-side sleep.

i do have the full logs (inc. gRPC core) for a couple of such incidents, inc. the one above, but it'll take time to process them before i can publish them externally. do tell if that's necessary.

github-actions bot removed the stale label Apr 2, 2025