rate-limiting on bandwidth #336
Comments
I have thought about this being a potential issue but have never been able to confirm it. You would need to start a lot of Pods with new images, considering that image pulling is done from multiple nodes during Pod startup. Have you observed network traffic being an issue in your deployments? If so, I would want to try to replicate it.
I created a cluster where all images are on a single master node. Then, when I added three or more new nodes simultaneously, the new nodes only had the spegel images, and this greatly affected the master node's network.
Just to be clear, that means we are talking about HTTP rate limits, right? A socket-level rate limit would still hog all the TCP connections until most pods on all fresh nodes are done, whereas an HTTP 429 would have the clients retry, and over time spegel would at least start routing requests to the other nodes that managed to get layers from the master node.
@bittrance if we were to implement rate limiting it would be HTTP rate limits. Returning a 429 response would result in a fallback to the original registry. If the original registry is not accessible, the mirror would be attempted again on the next image pull attempt after the backoff. @wenchao5211 what type of VMs are you using for your control plane? They seem very slim if serving a couple of megabytes of data from disk greatly affects their networking. Or are the image layers that you are pulling very large? It would be great to see some metrics from this control plane node to understand what is going on.
I have both physical and virtual machines, both with gigabit links, and the images on our first master node are indeed very large, adding up to over 10 GB. What kind of metrics are you looking for? Machine network and compute metrics, or Pro Monitor data?
That may make sense in that case if the individual layers are very large. I am not sure right now which metrics are relevant; there are a lot of factors at play here, from disk IO to networking, so I need some time to think about this. I have had a look at other registries to see whether they have similar features. Zot seems to support only HTTP rate limiting on individual requests, while Harbor supports throttling IO during image replication. It would be interesting to see if any other projects do similar things. If there is a way forward, it would be to limit the io.Copy speed when writing blobs.
I think this is fine to implement after having a look at the configuration options in Dragonfly. https://d7y.io/docs/reference/configuration/dfdaemon @bittrance do you have any opinion about this? |
I think the best tool to address this problem is kernel traffic control, which would de-emphasize spegel traffic. However, there are various scenarios where you don't have control over kernel-level config, and the bottleneck may be upstream (think edge nodes). Thus, implementing a per-node max bandwidth consumption for spegel may indeed be a good idea. This should address network-interface saturation on the "large image" node. The docs should probably mention setting the limit so there is bandwidth left for gossiping. I'm more skeptical about a per-peer/client bandwidth limit, since in most cases it will be hard to estimate the number of concurrent clients. Crazy idea: perhaps a spegel node could be configured to 429 requests above some limit only if it sees that the layer is available from other peers?
Describe the problem to be solved
I use k3s. I want to limit spegel's bandwidth to a configurable range; if there is no upper limit, it may saturate the cluster network and affect cluster health.
Proposed solution to the problem
No response