reverseproxy: Health as a ratio of successful requests #5398
Conversation
Force-pushed from 497bfd0 to 31cfebc
Thanks for working on this. My main concern after a first pass through the code (which I know is still a draft) is that it requires 1 goroutine per successful request. This is similar to failure counting, but if we assume the majority of requests to be successful, then this will use significantly more resources than counting failures. What I've typically leaned on instead is a ring buffer or a rolling/online algorithm, depending on the needs. For example, when I need to compute the standard deviation over a sliding window, I've used Welford's online algorithm, which means I don't have to iterate the whole data set each time a new point arrives. I wonder if we could use something similar for this calculation, where we instead keep the last N data points (if needed) or, heck, maybe just a moving average? https://en.wikipedia.org/wiki/Moving_average
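To make the moving-average idea concrete, here is a minimal Go sketch (not Caddy's code; the `successRate` type and the `Record`/`Healthy` names are made up for illustration) of an exponentially weighted success ratio that updates in O(1) per request and needs no goroutines or timers to age out old samples:

```go
// Sketch of an exponential moving average of per-upstream success,
// as one possible O(1) alternative to a goroutine per request.
package main

import (
	"fmt"
	"sync"
)

// successRate keeps a running estimate of the fraction of recent
// requests that succeeded, weighting newer requests more heavily.
type successRate struct {
	mu    sync.Mutex
	alpha float64 // smoothing factor in (0, 1]; higher = forgets faster
	rate  float64 // current estimate of the success ratio
	n     int     // total observations, used to gate early decisions
}

// Record folds one request outcome into the estimate in O(1) time,
// with no timers needed to "expire" old samples: old outcomes simply
// decay out of the estimate as new ones arrive.
func (s *successRate) Record(success bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	x := 0.0
	if success {
		x = 1.0
	}
	if s.n == 0 {
		s.rate = x // seed with the first observation
	} else {
		s.rate = s.alpha*x + (1-s.alpha)*s.rate
	}
	s.n++
}

// Healthy reports whether the estimated ratio is above minRatio,
// but only after minSamples observations, so one early failure
// can't mark a fresh upstream as unhealthy.
func (s *successRate) Healthy(minRatio float64, minSamples int) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.n < minSamples || s.rate >= minRatio
}

func main() {
	sr := &successRate{alpha: 0.1}
	for i := 0; i < 50; i++ {
		sr.Record(i%10 != 0) // roughly 90% success
	}
	fmt.Printf("estimated success ratio: %.2f, healthy: %v\n",
		sr.rate, sr.Healthy(0.9, 5))
}
```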
Force-pushed from 31cfebc to 6d01018
Yeah, I'm definitely concerned about goroutine spam with this approach. How would a rolling-average type of setup "forget" about old requests, though? If there's no mechanism to remove old counts/entries, then once you fall below the ratio there's no way for the upstream to become healthy again.
This is a great question. One way is by configuring a window size, e.g. the last N requests. One way to do that is to have an array of size N that you fill round-robin style; but that requires iterating the entire array on each request to compute an average. An online average computation shouldn't require that extra pass through the data, IIRC. I need to find a little time to double-check / look this up, but I can probably recommend something along those lines.
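As a rough illustration of that fixed-window idea (again a sketch with made-up names, not a proposal for the actual implementation), the extra pass can be avoided by pairing the ring buffer with a running success count, so both recording an outcome and reading the ratio are O(1), and old requests are "forgotten" as their slots are overwritten:

```go
// Fixed window of the last N request outcomes, tracked with a ring
// buffer plus a running success count so no full-array scan is needed.
package main

import "fmt"

type window struct {
	outcomes  []bool // ring buffer of the last len(outcomes) results
	next      int    // index that the next outcome will overwrite
	filled    int    // how many slots hold real data (<= len(outcomes))
	successes int    // running count of true values in the buffer
}

func newWindow(n int) *window {
	return &window{outcomes: make([]bool, n)}
}

// Record overwrites the oldest slot with the newest outcome and
// adjusts the running success count in O(1).
func (w *window) Record(success bool) {
	if w.filled == len(w.outcomes) && w.outcomes[w.next] {
		w.successes-- // the outcome being evicted was a success
	}
	w.outcomes[w.next] = success
	if success {
		w.successes++
	}
	w.next = (w.next + 1) % len(w.outcomes)
	if w.filled < len(w.outcomes) {
		w.filled++
	}
}

// Ratio returns the success ratio over the current window.
func (w *window) Ratio() float64 {
	if w.filled == 0 {
		return 1.0 // no data yet; treat as healthy
	}
	return float64(w.successes) / float64(w.filled)
}

func main() {
	w := newWindow(100)
	for i := 0; i < 250; i++ {
		w.Record(i%4 != 0) // 75% success
	}
	fmt.Printf("success ratio over last %d requests: %.2f\n",
		len(w.outcomes), w.Ratio())
}
```

A real version shared across request goroutines would also need a mutex or atomics around Record and Ratio, and this windows by request count rather than by time, which is a design choice in itself.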
To clarify: I think in order for this to be more comfortable for me to merge, we'd want to replace the goroutine-based approach to counting with some other structure that is less complicated and less likely to use a lot of memory.
I have no idea how you're thinking that can be done. This is the only idea I can come up with. How else are we meant to decrement the counter after the timeout?
Closes #4949
This makes it possible to keep track of successful requests over time for each upstream. It uses the same mechanism as failure duration: the count is incremented right after the response is written by the proxy, and a goroutine is spawned to decrement the counter after a certain delay.
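Stripped down to a standalone sketch (not the actual diff; the `upstream` type and `countSuccess` name are illustrative), the mechanism described above looks roughly like this:

```go
// Sketch of the goroutine-per-request counting pattern under
// discussion: each success bumps an atomic counter, and a goroutine
// is spawned to undo that bump once the configured window elapses.
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

type upstream struct {
	successes int64 // successes observed within the rolling window
}

// countSuccess records a success now and forgets it after d.
// Simple, but it costs one sleeping goroutine per successful request.
func (u *upstream) countSuccess(d time.Duration) {
	atomic.AddInt64(&u.successes, 1)
	go func() {
		time.Sleep(d)
		atomic.AddInt64(&u.successes, -1)
	}()
}

func main() {
	u := &upstream{}
	for i := 0; i < 5; i++ {
		u.countSuccess(200 * time.Millisecond)
	}
	fmt.Println("in window:", atomic.LoadInt64(&u.successes))    // 5
	time.Sleep(300 * time.Millisecond)
	fmt.Println("after window:", atomic.LoadInt64(&u.successes)) // 0 once the window has elapsed
}
```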
This also adds a minimum successful-request ratio, so if there are too many failures compared to successes, the upstream is marked unhealthy. This is paired with a minimum success count (defaulting to 5 as a reasonable lower bound) so that the ratio only applies once there are at least N successes in memory; otherwise a single failure at the start might take out the server.
Example config:
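Something along these lines (a hypothetical Caddyfile sketch of the behavior described below: fail_duration, max_fails, and unhealthy_status are existing passive health check options, while success_duration, min_success_ratio, and min_successes are placeholder names, not necessarily what this PR actually calls its new options):

```caddyfile
example.com {
	reverse_proxy localhost:8080 {
		# existing passive health check options
		fail_duration 10s
		max_fails 10
		unhealthy_status 500

		# placeholder names for the options proposed by this PR
		success_duration 10s
		min_success_ratio 0.9
		min_successes 5
	}
}
```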
Essentially, the above means that within the past 10s there must have been fewer than 10 failures or more than 90% success (unless there are fewer than 5 successes); otherwise the upstream is marked unhealthy. (For testing, I used `unhealthy_status 500` as a reliable way to increment failures intentionally.) This is obviously more useful with multiple upstreams, but the above example uses just one to show the effect of the config more quickly and reliably.