reverseproxy: Health as a ratio of successful requests #5398
Conversation
Force-pushed from 497bfd0 to 31cfebc
Thanks for working on this. My main concern after a first pass through the code (which I know is still a draft) is that it requires 1 goroutine per successful request. This is similar to failure counting, but if we assume the majority of requests to be successful, then this will use significantly more resources than counting failures. What I've typically leaned on instead is a ring buffer or a rolling/online algorithm, depending on the needs. For example, when I need to compute the standard deviation over a sliding window, I've used Welford's online algorithm, which means I don't have to iterate the whole data set each time a new point arrives. I wonder if we could use something similar for this calculation, where we instead keep the last N data points (if needed) or, heck, maybe just a moving average? https://en.wikipedia.org/wiki/Moving_average
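To make the moving-average idea concrete, here is a minimal Go sketch (not Caddy's code; the `successRate` type and the `Record`/`Healthy` names are made up for illustration) of an exponentially weighted success ratio that updates in O(1) per request and needs no goroutines or timers to age out old samples:

```go
// Sketch of an exponential moving average of per-upstream success,
// as one possible O(1) alternative to a goroutine per request.
package main

import (
	"fmt"
	"sync"
)

// successRate keeps a running estimate of the fraction of recent
// requests that succeeded, weighting newer requests more heavily.
type successRate struct {
	mu    sync.Mutex
	alpha float64 // smoothing factor in (0, 1]; higher = forgets faster
	rate  float64 // current estimate of the success ratio
	n     int     // total observations, used to gate early decisions
}

// Record folds one request outcome into the estimate in O(1) time,
// with no timers needed to "expire" old samples: old outcomes simply
// decay out of the estimate as new ones arrive.
func (s *successRate) Record(success bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	x := 0.0
	if success {
		x = 1.0
	}
	if s.n == 0 {
		s.rate = x // seed with the first observation
	} else {
		s.rate = s.alpha*x + (1-s.alpha)*s.rate
	}
	s.n++
}

// Healthy reports whether the estimated ratio is above minRatio,
// but only after minSamples observations, so one early failure
// can't mark a fresh upstream as unhealthy.
func (s *successRate) Healthy(minRatio float64, minSamples int) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.n < minSamples || s.rate >= minRatio
}

func main() {
	sr := &successRate{alpha: 0.1}
	for i := 0; i < 50; i++ {
		sr.Record(i%10 != 0) // roughly 90% success
	}
	fmt.Printf("estimated success ratio: %.2f, healthy: %v\n",
		sr.rate, sr.Healthy(0.9, 5))
}
```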
Force-pushed from 31cfebc to 6d01018
Yeah, I'm definitely concerned about goroutine spam with this approach. How would a rolling-average type of setup "forget" about old requests, though? If there's no mechanism to remove old counts/entries, then once you fall below the ratio there's no way for the upstream to become healthy again.
This is a great question. One way is by configuring a window size, e.g. the last N requests. One way to do that is to have an array of size N that you fill round-robin style; but that requires iterating the entire array on each request to compute an average. An online average computation shouldn't require that extra pass through the data, IIRC. I need to find a little time to double-check / look this up, but I can probably recommend something along those lines.
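As a rough illustration of that fixed-window idea (again a sketch with made-up names, not a proposal for the actual implementation), the extra pass can be avoided by pairing the ring buffer with a running success count, so both recording an outcome and reading the ratio are O(1), and old requests are "forgotten" as their slots are overwritten:

```go
// Fixed window of the last N request outcomes, tracked with a ring
// buffer plus a running success count so no full-array scan is needed.
package main

import "fmt"

type window struct {
	outcomes  []bool // ring buffer of the last len(outcomes) results
	next      int    // index that the next outcome will overwrite
	filled    int    // how many slots hold real data (<= len(outcomes))
	successes int    // running count of true values in the buffer
}

func newWindow(n int) *window {
	return &window{outcomes: make([]bool, n)}
}

// Record overwrites the oldest slot with the newest outcome and
// adjusts the running success count in O(1).
func (w *window) Record(success bool) {
	if w.filled == len(w.outcomes) && w.outcomes[w.next] {
		w.successes-- // the outcome being evicted was a success
	}
	w.outcomes[w.next] = success
	if success {
		w.successes++
	}
	w.next = (w.next + 1) % len(w.outcomes)
	if w.filled < len(w.outcomes) {
		w.filled++
	}
}

// Ratio returns the success ratio over the current window.
func (w *window) Ratio() float64 {
	if w.filled == 0 {
		return 1.0 // no data yet; treat as healthy
	}
	return float64(w.successes) / float64(w.filled)
}

func main() {
	w := newWindow(100)
	for i := 0; i < 250; i++ {
		w.Record(i%4 != 0) // 75% success
	}
	fmt.Printf("success ratio over last %d requests: %.2f\n",
		len(w.outcomes), w.Ratio())
}
```

A real version shared across request goroutines would also need a mutex or atomics around Record and Ratio, and this windows by request count rather than by time, which is a design choice in itself.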
To clarify: I think in order for this to be more comfortable for me to merge, we'd want to replace the goroutine-based approach to counting with some other structure that is less complicated and less likely to use a lot of memory.
I have no idea how you're thinking that can be done. This is the only idea I can come up with. How else are we meant to decrement the counter after the timeout?
Closes #4949
This makes it possible to keep track of successful requests over time for each upstream. It uses the same mechanism as failure duration: the count is incremented right after the response is written by the proxy, and a goroutine is spawned to decrement the counter after a certain delay.
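Stripped down to a standalone sketch (not the actual diff; the `upstream` type and `countSuccess` name are illustrative), the mechanism described above looks roughly like this:

```go
// Sketch of the goroutine-per-request counting pattern under
// discussion: each success bumps an atomic counter, and a goroutine
// is spawned to undo that bump once the configured window elapses.
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

type upstream struct {
	successes int64 // successes observed within the rolling window
}

// countSuccess records a success now and forgets it after d.
// Simple, but it costs one sleeping goroutine per successful request.
func (u *upstream) countSuccess(d time.Duration) {
	atomic.AddInt64(&u.successes, 1)
	go func() {
		time.Sleep(d)
		atomic.AddInt64(&u.successes, -1)
	}()
}

func main() {
	u := &upstream{}
	for i := 0; i < 5; i++ {
		u.countSuccess(200 * time.Millisecond)
	}
	fmt.Println("in window:", atomic.LoadInt64(&u.successes))    // 5
	time.Sleep(300 * time.Millisecond)
	fmt.Println("after window:", atomic.LoadInt64(&u.successes)) // 0 once the window has elapsed
}
```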
This also adds a minimum successful-request ratio, so if there are too many failures compared to successes, the upstream is marked unhealthy. This is paired with a minimum success count (defaulting to 5 as a reasonable lower bound) so that the ratio only applies once there are at least N successes in memory; otherwise a single failure at the start might take out the server.
Example config:
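Something along these lines (a hypothetical Caddyfile sketch of the behavior described below: fail_duration, max_fails, and unhealthy_status are existing passive health check options, while success_duration, min_success_ratio, and min_successes are placeholder names, not necessarily what this PR actually calls its new options):

```caddyfile
example.com {
	reverse_proxy localhost:8080 {
		# existing passive health check options
		fail_duration 10s
		max_fails 10
		unhealthy_status 500

		# placeholder names for the options proposed by this PR
		success_duration 10s
		min_success_ratio 0.9
		min_successes 5
	}
}
```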
Essentially, the above means that within the past 10s there must have been fewer than 10 failures or more than 90% success (unless there are fewer than 5 successes); otherwise the upstream is marked unhealthy. (For testing, I used `unhealthy_status 500` as a reliable way to increment failures intentionally.) This is obviously more useful with multiple upstreams, but the above example uses just one to show the effect of the config more quickly and reliably.