infinite ResolveNow() loop on net transients #8154
@lev-lb are you using
From the logs I am seeing "operation was canceled", which means context cancelation by the caller, either due to a short timeout or manual cancelation. This is likely the cause of the backoff not getting a chance to execute and the connection attempt being immediately aborted. If the cancelation happens during dialing the connection, all addresses are exhausted and the clientconn is reported with
It's hard to debug further with just the logs. Could you provide your code showing how you are creating the client and how the cancelation is happening? Also, do you have any backoff implemented in your custom name resolver? |
hi, @purnesh42H , thanks for looking into it.
the code in question still uses
all that is fine and by design: all the app-level calls are capped at ~10sec, and when the cluster is unreachable - it's expected that they will time out, and that the CC will go into
i see nothing in the docs of the
see above: the cancellations are just plain vanilla gRPC call context timeouts. the gRPC client is created once, at app startup (i'm oversimplifying, but it's something that happens very rarely, say once a month). a client is expected to experience many CC state transients: some caused by app cluster server outages, some by network partitions. but also: some by app cluster servers closing connection, which they do after 60 sec of idle conn. it is the latter flow that i suspect at the moment. AFAICT, this infinite
no. AFAICT - that would make it impossible for me to control the reconnection process to the cluster. i'm specifying the backoff and timeout params to the gRPC client itself (see reproduced in full above). as you can see, we need that reconnection to be relatively tight, time-wise, with rapid failover between the app cluster servers, one or more of which are extremely likely to be up at any given time. i have no control over when the gRPC core will call into |
@purnesh42H : actually, a very old version of the code is available online, client creation is here, cancellation is just the call context timeout. the app currently experiencing the issue is closed-source, but the resolver+LB+client design is the same for all practical purposes. the timings used in the app linked above are different, i have reproduced the real config used now in the ticket above. |
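Since the linked client-creation code is not reproduced in the thread, here is a minimal sketch of the general shape being described - a ClientConn-level backoff/connect-timeout policy plus ~10 sec per-call timeouts. The scheme, target, and all numbers are placeholders, not the reporter's actual configuration:

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/backoff"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Reconnection pacing lives on the ClientConn, not in the resolver.
	// All values below are placeholders.
	cc, err := grpc.NewClient(
		"myscheme:///my-cluster", // hypothetical custom-resolver target
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithConnectParams(grpc.ConnectParams{
			Backoff: backoff.Config{
				BaseDelay:  100 * time.Millisecond,
				Multiplier: 1.6,
				Jitter:     0.2,
				MaxDelay:   2 * time.Second,
			},
			MinConnectTimeout: 3 * time.Second,
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer cc.Close()

	// Per-call cancelation: each app-level RPC is capped at ~10 sec. When the
	// cluster is unreachable, this is what produces "operation was canceled".
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	// A real client would now call a generated stub method with ctx, e.g.
	// resp, err := client.DoSomething(ctx, req)
	_ = ctx
}
```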
i had another repro today. so far, i can confirm that:
HTH. |
@lev-lb i am still trying to repro. Just to check, are you able to repro the same behavior when using dns resolver? |
@purnesh42H : probably not, for a couple of reasons. for starters, i can't use the DNS resolver for the app in question, which is why it has a custom resolver to begin with. but also, you probably wouldn't be able to spot this bug with the DNS resolver, certainly not in its default config, as (AFAICT from the code) the DNS resolver has its own internal backoff mechanism. this makes deterministic backoff/retries impossible, of course, as now you have 2 backoff mechanisms playing off one another. and that, in turn, makes repro more difficult - assuming my theory about the problem occurring when the client app is initiating a call just as the server is closing the connection is correct, that is... |
I finally found some time to look into this further.
so this just means the connection to the server is being closed, which is expected because you mentioned the server is closing
this is just a side effect of the connection to the address in the subchannel being closed, so the picker wrapper is going to wait for a new picker update. |
@purnesh42H : thanks for looking into it again.
makes sense. but, as i mentioned above:
and this observation stands, even though i keep having these issues almost once daily now, always under the same specific circumstances. so, perhaps the flow that gets kickstarted by that picker waiting is the one that's causing it - or that includes that picker waiting (maybe it's just part of the problematic flow)... |
Coming back to
This means that new addresses are received by the resolver, and because you have that rotation logic in your custom resolver, the list might be considered different from the previous one (even though all the addresses are the same), and that is starting the transport reset goroutine because the channel is in CONNECTING state: http://github.com/grpc/grpc-go/blob/a51009d1d7074ee1efcd323578064cbe44ef87e5/clientconn.go#L959-L1009
Now, I discussed this with other maintainers, and one thing you can try is the new pick first policy by setting
Also, could you confirm if you are using a proxy? |
well, the list of addresses is usually (but not always) "the same" but in a different order, since i need the server selections to be "sticky" (a sketch of this rotation follows after this comment). i.e. i need the opposite of load balancing on a per-call basis from the same client - the load balancing is happening across clients, not across calls of the same client. a given client should keep talking to the same server for as long as possible, with as many and as long pauses between gRPC calls as the app requires. which is why i went with
i'm trying
nope, not using any proxies anywhere (presumably a mux on the server side doesn't qualify as a proxy?).
if i do that, AFAICT - i lose control over the backoff timings. my use cases are "very soft RT": if my client (which runs as part of a different service) can't give an answer to its own client within the bounds of the timeout - i'm in trouble, so within that very limited time budget i need to make sure to knock on several different cluster servers. 2 layers of backoff mean it gets random. and submitting requests concurrently to more than one server is a no-no for a number of reasons. besides, even if i had a backoff, it seems there's still a bug in some gRPC channel state machine somewhere out there, as, AFAICT, the channel never recovers, regardless of the subsequent state of the network or servers... |
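To make the rotation-for-stickiness point concrete, here is a hedged sketch of a ResolveNow() that reports a fixed address set rotated so the preferred server comes first; type and field names are hypothetical, not the reporter's code. Under the old pick_first, whose address-list comparison is order-sensitive, each such rotated update can look like a brand-new list:

```go
package clusterres

import (
	"sync"

	"google.golang.org/grpc/resolver"
)

// clusterResolver is a hypothetical "sticky" resolver: it always reports the
// same address set, rotated so the preferred server comes first.
type clusterResolver struct {
	mu        sync.Mutex
	cc        resolver.ClientConn
	servers   []string // cluster nodes, discovered out of band
	preferred int      // index of the server this client should stick to
}

func (r *clusterResolver) ResolveNow(resolver.ResolveNowOptions) {
	r.mu.Lock()
	defer r.mu.Unlock()

	// Rotate the list so servers[preferred] is first. The *set* of addresses
	// never changes, only the order; an order-sensitive comparison therefore
	// treats every rotation as a different list.
	addrs := make([]resolver.Address, 0, len(r.servers))
	for i := range r.servers {
		addrs = append(addrs, resolver.Address{
			Addr: r.servers[(r.preferred+i)%len(r.servers)],
		})
	}
	// Errors from UpdateState (e.g. a bad-resolver-state signal) are ignored
	// in this sketch.
	_ = r.cc.UpdateState(resolver.State{Addresses: addrs})
}

func (r *clusterResolver) Close() {}
```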
https://github.com/grpc/grpc-go/releases/tag/v1.69.0 introduced the new pick first and has more details. The new pick first should still continue to use the same address once chosen, until it receives a failure, so that doesn't change for your client. The main difference is that the new pick first has one address per subchannel, so it retries with backoff for each subchannel and consequently for each address. The old pick first had all the addresses in one subchannel, which resulted in retrying the entire list of addresses and then backing off. Since, with the new pick first, we won't be comparing the new list of addresses against the old list (which relies on ordering), you might not see the infinite ResolveNow() loop for your custom resolver. |
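For anyone wanting to try this, the toggle appears to be an environment-variable gate; the exact name below is an assumption taken from the envconfig.go lines referenced later in this thread, so double-check it against the grpc-go version in use:

```go
package main

import (
	"fmt"
	"os"
)

// Assumed variable name for enabling the new pick_first; verify against the
// internal/envconfig package of the grpc-go release you are running.
const newPickFirstEnv = "GRPC_EXPERIMENTAL_ENABLE_NEW_PICK_FIRST"

func main() {
	// grpc-go evaluates its env-config during package initialization, so the
	// variable has to be set in the process environment before the client
	// binary starts (e.g. GRPC_EXPERIMENTAL_ENABLE_NEW_PICK_FIRST=true ./client);
	// calling os.Setenv from main() after grpc is imported is too late.
	fmt.Printf("%s=%q\n", newPickFirstEnv, os.Getenv(newPickFirstEnv))
}
```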
yes, correct
we will most likely stabilise it in the next 1-2 releases unless decided otherwise. |
here's another head scratcher from production with the original
with client app adjusted to issue the next gRPC call(s) at 55 sec intervals (5 sec before the server will close the idle conn, so as to not collide), this still happens, but in a surprising fashion. a pair of gRPC calls enter the client code (4 msec apart), appear to get stuck for 5 seconds until the server closes the channel (60 sec idle limit), then the gRPC client goes into the
no gRPC-level logs on this run, unfortunately - see 2 messages downthread for that. UPD: and no, the fact that the app happens to issue a couple of racing calls to the gRPC client does not appear to be germane to the issue, many previous cases happened with just one outstanding gRPC call. |
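Not something from the thread itself, but a small debugging aid that can help pin down when the channel leaves READY in runs like the one above: a connectivity-state watcher on the *grpc.ClientConn, assuming access to the conn created at startup:

```go
package grpcdebug

import (
	"context"
	"log"

	"google.golang.org/grpc"
)

// watchConnState logs every connectivity-state transition of the ClientConn
// (IDLE, CONNECTING, READY, TRANSIENT_FAILURE, SHUTDOWN), which helps
// correlate app-level call stalls with channel-level events.
func watchConnState(ctx context.Context, cc *grpc.ClientConn) {
	state := cc.GetState()
	log.Printf("grpc channel state: %v", state)
	for {
		// WaitForStateChange blocks until the state changes or ctx is done.
		if !cc.WaitForStateChange(ctx, state) {
			return // context canceled; stop watching
		}
		state = cc.GetState()
		log.Printf("grpc channel state: %v", state)
	}
}
```

It would typically be started as `go watchConnState(ctx, cc)` right after the client is created.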
@purnesh42H : BTW, can you clarify this bit, please?
based on what i see in the logs, it would appear that the
how can a backoff in the resolver help overcome that?
i dug up the logs of a repro with this 5 sec delay and gRPC core logging, original
off the top of my head, 5 sec does not ring a bell - IIRC, none of the client/server timeouts are around 5 sec, not even after backoff. |
@lev-lb did you get to try the new pick first? Let us know if this issue goes away with that. Thanks. Will go through the new logs that you have provided when I have some time. |
The option itself is what's experimental, really. It will become the default in the next release: grpc-go/internal/envconfig/envconfig.go, lines 53 to 57 at 6819ed7. |
This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed. |
sorry for the delay. at a quick glance, AFAIC - the "new"
the resolver used in the log below injects a delay of 100msec if it detects that core gRPC went into the death loop.
for the record, these connection storms are triggered by one or two gRPC calls total, nothing in the app code on the client invokes gRPC methods at sustained sub-ms rate, ever. so - no, AFAIC, the new resolver suffers from a similar problem of occasionally (but rarely) completely disregarding its backoff config. the only reason it's not killing the client in the log above is app-side sleep. i do have the full logs (inc. gRPC core) for a couple of such incidents, inc. the one above, but it'll take time to process them before i can publish them externally. do tell if that's necessary. |
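The 100 msec delay mentioned above amounts to a resolver-side throttle. A sketch of that idea (interval and structure are illustrative, not the reporter's code) might look like the following, with wait() called at the top of ResolveNow():

```go
package clusterres

import (
	"sync"
	"time"
)

// resolveThrottle enforces a minimum interval between consecutive ResolveNow
// executions, so a ResolveNow storm from the channel degrades into a slow,
// bounded trickle instead of pegging a CPU core.
type resolveThrottle struct {
	mu       sync.Mutex
	last     time.Time
	minDelay time.Duration // e.g. 100 * time.Millisecond
}

// wait sleeps just long enough to keep at least minDelay between calls.
// Holding the mutex during the sleep intentionally serializes callers.
func (t *resolveThrottle) wait() {
	t.mu.Lock()
	defer t.mu.Unlock()
	if d := t.minDelay - time.Since(t.last); d > 0 {
		time.Sleep(d)
	}
	t.last = time.Now()
}
```

Note that ResolveNow() is generally expected to return quickly, so a real implementation might prefer coalescing or dropping redundant calls over sleeping; the sketch only bounds how fast a ResolveNow() storm can spin.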
What version of gRPC are you using?
client: v1.70.0, server: v1.61.1, v1.63.2, v1.66.0.
What version of Go are you using (go version)?
go version go1.22.11 linux/amd64
What operating system (Linux, Windows, …) and version?
several Linux versions, including: kernel 4.15.0-112-generic on Ubuntu
What did you do?
gRPC client with a custom resolver (using pick_first LB), connecting to a multi-node server cluster. this codebase appeared to be working fine for about 6 years, up to and including gRPC v1.63.2 - or at least the issue below was never brought to our attention before. since upgrading to v1.70.0, the same code appears to occasionally enter an infinite loop calling the custom resolver's ResolveNow() method. the issue is non-deterministic, but has been observed 4-5 times in the past month, AFAICT - always on network transients (impaired connectivity to the gRPC servers cluster).
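As an illustration of the overall shape of such a setup (scheme, addresses, and names are placeholders, not the actual code), registering a custom resolver and pinning the channel to pick_first might look roughly like this:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/resolver"
)

// clusterBuilder is a hypothetical resolver.Builder for a "cluster" scheme
// that knows how to enumerate the multi-node server cluster.
type clusterBuilder struct{}

func (clusterBuilder) Scheme() string { return "cluster" }

func (clusterBuilder) Build(target resolver.Target, cc resolver.ClientConn, _ resolver.BuildOptions) (resolver.Resolver, error) {
	// Push an initial (placeholder) address list; a real implementation would
	// discover the cluster nodes and keep cc updated from ResolveNow.
	_ = cc.UpdateState(resolver.State{Addresses: []resolver.Address{
		{Addr: "10.0.0.1:7000"}, {Addr: "10.0.0.2:7000"},
	}})
	return nopResolver{}, nil
}

type nopResolver struct{}

func (nopResolver) ResolveNow(resolver.ResolveNowOptions) {}
func (nopResolver) Close()                                {}

func main() {
	resolver.Register(clusterBuilder{})

	cc, err := grpc.NewClient(
		"cluster:///lb-cluster", // hypothetical target for the custom scheme
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		// Explicitly select pick_first as the LB policy.
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig":[{"pick_first":{}}]}`),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer cc.Close()
}
```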
What did you expect to see?
the problem appears some time after the initial conn was established, lost, re-established, etc. after a net hiccup, i'd expect the lower gRPC layers to keep trying to dial the addresses returned by ResolveNow() as per the backoff and connect timeout config - which it normally does, and has been doing for years. an example of this is at the top of the attached log.
What did you see instead?
occasionally (but rarely, despite forcibly injecting various network faults), the gRPC code enters an infinite loop of calling ResolveNow() at a rate of >15 thousand calls/second, pegging several CPU cores (>210% CPU on an otherwise near-idle machine). the process does not recover. the issue appears to manifest somewhere around here:
see attached log, the relevant portion (reproduced above) starts at around
PROBLEM MANIFESTS AROUND HERE
text.sample.2025-03-07.12-54-08.log.gz
the only unusual entry in the log that i see before this ResolveNow() infinite loop but don't normally find in the logs on any other net transients (which are handled gracefully) is this one (UPDATED):
[core] blockingPicker: the picked transport is not ready, loop back to repick
loopyWriter exiting with error: connection error: desc = "keepalive ping failed to receive ACK within timeout".
potentially relevant gRPC client options: