-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Xds fallback #11254
Xds fallback #11254
Conversation
96cecb2
to
72d9ac9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good overall. I finally start to understand the spirit of it.
709026a
to
0c1eef8
Compare
af33b25
to
d8f7094
Compare
Fallback is ready for review again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sending what I have, mostly minor, I haven't looked deep enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sending what I have.
This comment was marked as resolved.
This comment was marked as resolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Things done in commit 003348b
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sending what I have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sending what I have.
f186005
to
f304264
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sending what I have.
subscriber.controlPlaneClient.adjustResourceSubscription(type); | ||
|
||
CpcWithFallbackState cpcToUse = manageControlPlaneClient(subscriber); | ||
if (cpcToUse.cpc != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If cpc is null
here because all cpcs are failing, then the subscriber won't be notified with an error. Although, is that preexisting? If it is preexisting, we can resolve that as a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the retry timer goes off and the CPC tries to reconnect, if it fails, then handleStreamClosed will call onError (the same as previously done). It seems like we shouldn't continue sending errors for every backoff, but that should probably be addressed in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I had considered that, but retries can take 2 minutes. It is known to be failing at this very moment, so we shouldn't delay an error.
I think it is proper to update the error each backoff; the error can change over time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't care about an extra call to onError
, then it trivially becomes adding a call to subscriber.onError
in the catch (IOException)
of manageControlPlaneClient
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why manageControlPlaneClient()
, which I've already said is broken to pass a subscriber, when we could just call onError()
directly here? Nothing actually knows the error more than "no working cpcs." I was most interested in having an actionable error message, but having a call to onError() here would be fine except...
... we'd have to be careful about it to avoid #11672 . Let's not worry about it in this PR, because it is pre-existing, broken in multiple ways, and fallback is enough trouble by itself.
subscriber.controlPlaneClient.adjustResourceSubscription(type); | ||
|
||
CpcWithFallbackState cpcToUse = manageControlPlaneClient(subscriber); | ||
if (cpcToUse.cpc != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I had considered that, but retries can take 2 minutes. It is known to be failing at this very moment, so we shouldn't delay an error.
I think it is proper to update the error each backoff; the error can change over time.
The `FailingXdsTransport` seems like the cleanest approach, though I think
we need a `FailingXdsStreamingCall` to go with it.
…On Wed, Nov 27, 2024 at 2:43 PM Eric Anderson ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In xds/src/main/java/io/grpc/xds/client/XdsClientImpl.java
<#11254 (comment)>:
> + cleanUpResourceTimers(cpcClosed);
+
+ if (status.isOk()) {
+ return; // Not considered an error
+ }
+
+ Collection<String> authoritiesForClosedCpc = getActiveAuthorities(cpcClosed);
+ for (Map<String, ResourceSubscriber<? extends ResourceUpdate>> subscriberMap :
+ resourceSubscribers.values()) {
+ for (ResourceSubscriber<? extends ResourceUpdate> subscriber : subscriberMap.values()) {
+ if (subscriber.hasResult() || !authoritiesForClosedCpc.contains(subscriber.authority)) {
+ continue;
+ }
+
+ // try to fallback to lower priority control plane client
+ if (shouldTryFallback && manageControlPlaneClient(subscriber).didFallback) {
In getOrCreateControlPlaneClient(String) we need to catch
IllegalArgumentException from getOrCreateControlPlaneClient(ServerInfo),
because we wouldn't want failure to create one transport to prevent us from
trying other servers.
But handling that error is tricky. If we log and keep going then
addCpcToAuthority() will add null to the activateCpClients authority
list. We'd need to put protections everywhere it is used. I think we'd be
better off making FailingXdsTransport (I know, I know. Hide your surprise).
The main issue I see is we probably shouldn't call eventHandler until
StreamingCall.start() returns, which requires a separate thread. If we
move the IllegalStateException handling into GrpcXdsTransport then we could
use GrpcUtil.SHARED_CHANNEL_EXECUTOR.
Ideas?
—
Reply to this email directly, view it on GitHub
<#11254 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AZQMCXTJJTCVPTBJHCVVC2L2CZDKJAVCNFSM6AAAAABIXKY5WKVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDINRWGA2TAMJYGQ>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
Yeah, it implies a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree we don't need an executor in the FailingXdsTransport because the synchronization context will prevent reentrancy.
I'm still not all that comfortable with manageControlPlaneClient()
being passed a Subscriber
. But it looks like it will behave okay, even if it is misleading, so this can go in without that changed.
f13ebe9
to
3ee25db
Compare
Implementation of https://github.com/grpc/proposal/blob/master/A71-xds-fallback.md in Java