
reverseproxy: Health as a ratio of successful requests #5398

Draft · wants to merge 5 commits into master from proxy-success-ratio

Conversation

francislavoie
Member

@francislavoie francislavoie commented Feb 26, 2023

Closes #4949

This makes it possible to keep track of successful requests over time for each upstream. It uses the same mechanism as fail duration: the count is incremented right after the proxy writes the response, and a goroutine is spawned to decrement the counter after a configured delay.

This also adds a minimum success ratio, so if there are too many failures relative to successes, the upstream is marked unhealthy. This is paired with a minimum success count (defaulting to 5 as a reasonable lower bound) so that the ratio only applies once at least N successes are in memory; otherwise a single failure right at the start could take the server out of rotation.

Example config:

reverse_proxy :8002 {
	success_duration 10s
	fail_duration 10s
	max_fails 10
	unhealthy_status 500
	min_success_ratio 9/10
	min_successes 5
}

Essentially, the above means that within the past 10s there must have been fewer than 10 fails or more than 90% success (unless there are fewer than 5 successes); otherwise the upstream is marked unhealthy. (For testing, I used unhealthy_status 500 as a reliable way to increment fails intentionally.)

This is obviously more useful with multiple upstreams, but the above example uses just one to more quickly and reliably show the effect of the config.
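The health decision implied by that config can be sketched as follows (a hypothetical helper illustrating the rules above, not the PR's actual code):

```go
package main

import "fmt"

// healthy reports whether an upstream passes the passive checks: it must
// not exceed maxFails, and once at least minSuccesses successes are in
// memory, its success ratio must not fall below minRatio.
func healthy(fails, successes, maxFails, minSuccesses int, minRatio float64) bool {
	if maxFails > 0 && fails >= maxFails {
		return false
	}
	if minRatio > 0 && successes >= minSuccesses {
		total := successes + fails
		if float64(successes)/float64(total) < minRatio {
			return false
		}
	}
	return true
}

func main() {
	// min_success_ratio 9/10, min_successes 5, max_fails 10, as in the example
	fmt.Println(healthy(1, 20, 10, 5, 0.9))   // true: 20/21 ≈ 0.95
	fmt.Println(healthy(5, 20, 10, 5, 0.9))   // false: 20/25 = 0.80
	fmt.Println(healthy(1, 3, 10, 5, 0.9))    // true: ratio skipped below min_successes
	fmt.Println(healthy(12, 100, 10, 5, 0.9)) // false: max_fails exceeded
}
```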

@francislavoie francislavoie added the feature ⚙️ New feature or request label Feb 26, 2023
@francislavoie francislavoie added this to the v2.7.0 milestone Feb 26, 2023
@francislavoie francislavoie force-pushed the proxy-success-ratio branch 2 times, most recently from 497bfd0 to 31cfebc Compare April 15, 2023 14:52
@mholt
Member

mholt commented Apr 15, 2023

Thanks for working on this. My main concern after a first pass through the code (which I know is still a draft) is that it requires 1 goroutine per successful request. This is similar to failure counting, but if we assume the majority of requests to be successful, then this will use significantly more resources than counting failures.

What I've typically leaned on instead is to have a ring buffer or a rolling/online algorithm depending on what the needs are. So for example, when I need to compute the standard deviation over a sliding window, I've used Welford's Online Algorithm which means I don't have to iterate the whole data set each time a new point is discovered.
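For reference, Welford's online algorithm updates a running mean and variance in O(1) per data point, with no pass over the stored data. A minimal sketch (hypothetical names, unrelated to Caddy's code):

```go
package main

import "fmt"

// welford maintains a running mean and variance using Welford's online
// algorithm: each new point updates the state in constant time.
type welford struct {
	n    int
	mean float64
	m2   float64 // sum of squared deviations from the current mean
}

func (w *welford) add(x float64) {
	w.n++
	delta := x - w.mean
	w.mean += delta / float64(w.n)
	w.m2 += delta * (x - w.mean)
}

func (w *welford) variance() float64 {
	if w.n < 2 {
		return 0
	}
	return w.m2 / float64(w.n-1) // sample variance
}

func main() {
	w := &welford{}
	for _, x := range []float64{2, 4, 4, 4, 5, 5, 7, 9} {
		w.add(x)
	}
	fmt.Println(w.mean) // 5
	fmt.Println(w.variance())
}
```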

I wonder if we could use something similar for this calculation, where we instead maybe keep the last N data points (if needed) or heck, maybe just a moving average? https://en.wikipedia.org/wiki/Moving_average
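A moving average "forgets" old data by construction; an exponentially weighted moving average of outcomes, for example, lets old requests decay in weight. A sketch of the idea (hypothetical, not anything in the PR):

```go
package main

import "fmt"

// ewma holds an exponentially weighted moving average of request outcomes
// (1.0 = success, 0.0 = failure). alpha controls how quickly old outcomes
// are forgotten; higher alpha weights recent requests more heavily.
type ewma struct {
	alpha float64
	value float64
	seen  bool
}

func (e *ewma) observe(x float64) {
	if !e.seen {
		e.value, e.seen = x, true
		return
	}
	e.value = e.alpha*x + (1-e.alpha)*e.value
}

func main() {
	e := &ewma{alpha: 0.2}
	for i := 0; i < 20; i++ {
		e.observe(1) // a run of successes
	}
	fmt.Printf("%.2f\n", e.value) // 1.00
	for i := 0; i < 5; i++ {
		e.observe(0) // a burst of failures drags the ratio down
	}
	fmt.Printf("%.2f\n", e.value) // 0.33
}
```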

@francislavoie
Member Author

Yeah, I'm definitely concerned about goroutine spam with this approach.

How would a rolling average type situation "forget" about old requests though? If there's no mechanism to remove old counts/entries, then once you fall below the ratio there's no way for the upstream to become healthy again.

@mholt
Member

mholt commented Apr 19, 2023

How would a rolling average type situation "forget" about old requests though?

This is a great question. One way is by configuring a window size, e.g. the last N requests. You could keep an array of size N and write into it round-robin style, but computing an average that way requires iterating the entire array on each request. An online average computation shouldn't require that extra pass through the data IIRC.
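The extra pass can in fact be avoided even with a fixed-size array, by maintaining a running success count alongside the ring buffer. A sketch under assumed names (not actual Caddy code):

```go
package main

import "fmt"

// outcomeWindow keeps the last N request outcomes in a fixed-size ring
// buffer while maintaining a running success count, so the success ratio
// over the window is available in O(1) per request, with no full-array
// iteration.
type outcomeWindow struct {
	buf       []bool
	pos       int
	filled    int
	successes int
}

func newOutcomeWindow(n int) *outcomeWindow {
	return &outcomeWindow{buf: make([]bool, n)}
}

func (w *outcomeWindow) record(success bool) {
	if w.filled == len(w.buf) {
		// evict the oldest outcome before overwriting its slot
		if w.buf[w.pos] {
			w.successes--
		}
	} else {
		w.filled++
	}
	w.buf[w.pos] = success
	if success {
		w.successes++
	}
	w.pos = (w.pos + 1) % len(w.buf)
}

func (w *outcomeWindow) ratio() float64 {
	if w.filled == 0 {
		return 0
	}
	return float64(w.successes) / float64(w.filled)
}

func main() {
	w := newOutcomeWindow(4)
	w.record(true)
	w.record(true)
	w.record(false)
	w.record(true)
	fmt.Println(w.ratio()) // 0.75
	w.record(true)         // evicts the oldest outcome (a success)
	fmt.Println(w.ratio()) // 0.75: window is now F T T T
}
```

Note this forgets by request count rather than by wall-clock time, which changes the semantics slightly compared to fail_duration/success_duration.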

I need to find a little time to double-check / look this up, but I can probably recommend something along those lines.

@francislavoie francislavoie modified the milestones: v2.7.0, 2.x Apr 19, 2023
@mholt
Member

mholt commented Sep 26, 2024

To clarify: I think in order for this to be more comfortable for me to merge, we'd want to replace the goroutine-based approach for counting with some sort of other structure that will be less complicated and less likely to use a lot of memory.

@francislavoie
Member Author

francislavoie commented Sep 26, 2024

I have no idea how you're thinking that can be done. This is the only idea I can come up with. How else are we meant to decrement the counter after the timeout?

Successfully merging this pull request may close these issues.

dynamic passive health check