
rate-limiting on bandwidth #336

Closed
wenchao5211 opened this issue Feb 1, 2024 · 9 comments · Fixed by #365
Labels
enhancement New feature or request

Comments

@wenchao5211

Describe the problem to be solved

I use k3s. I want to cap the bandwidth Spegel can use; without an upper limit it may saturate the cluster network and hurt cluster health.

Proposed solution to the problem

No response

wenchao5211 added the enhancement label on Feb 1, 2024
@phillebaba
Member

I have thought about this as a potential issue but have never been able to observe that it is one. You would need to start a lot of Pods with new images, considering that image pulling is spread across multiple nodes during Pod startup. Have you observed network traffic being an issue in your deployments? If that is the case, I would want to try to replicate it.

@wenchao5211
Author

I created a cluster in which all images are on a single master node. When I then added three or more new nodes simultaneously, the new nodes only had the Spegel image locally, so all other pulls were served by the master node. This greatly affected the master node's network.

@bittrance
Contributor

Just to be clear, that means we are talking about HTTP rate limits, right? A socket-level rate limit would still hog all the TCP connections until most pods on all fresh nodes are done, whereas an HTTP 429 would have the clients retry, and Spegel would at least, over time, start sending requests to the other nodes that actually managed to get layers from the master node?
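
For concreteness, here is a minimal sketch of what such an HTTP-level limit could look like, assuming a plain net/http handler chain and golang.org/x/time/rate; the middleware, port, and limits are illustrative and not Spegel's actual code:

```go
package main

import (
	"log"
	"net/http"

	"golang.org/x/time/rate"
)

// withRateLimit wraps a handler and answers 429 Too Many Requests once
// the node-wide request budget is exhausted, so clients back off and
// retry elsewhere instead of piling onto this node.
func withRateLimit(next http.Handler, limiter *rate.Limiter) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !limiter.Allow() {
			w.Header().Set("Retry-After", "1")
			http.Error(w, "too many requests", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	// Illustrative numbers: 100 requests/second with bursts of 20.
	limiter := rate.NewLimiter(rate.Limit(100), 20)
	blobs := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("blob data"))
	})
	log.Fatal(http.ListenAndServe(":5000", withRateLimit(blobs, limiter)))
}
```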

@phillebaba
Member

@bittrance if we were to implement rate limiting it would be HTTP rate limits. Returning a 429 response would result in a fallback to the original registry. If the original registry is not accessible, the mirror would be attempted again on the next image pull attempt after the backoff.

@wenchao5211 what type of VMs are you using for your control plane? They seem very slim if serving a couple of megabytes of data from disk greatly affects their networking. Or are the image layers that you are pulling very large? It would be great to see some metrics from this control plane node to understand what is going on.

@wenchao5211
Author

I have both physical and virtual appliances, both with gigabit links. The images on our first master node are indeed very large, adding up to over 10 GB. What kind of metrics are you looking for: machine network and compute metrics, or Prometheus monitoring data?

@phillebaba
Member

In that case this may make sense, if the individual layers are very large. I am not sure right now which metrics are relevant. There are a lot of factors at play here, from disk IO to networking. I will probably need some time to think about this.

I have had a look at other registries to see if they have any similar features.

Zot seems to only support HTTP rate limiting on individual requests.

project-zot/zot#380

Harbor has support for throttling IO during image replication.

goharbor/harbor#13194

It would be interesting to see if there are any other projects doing similar things. If there is a way forward, it would be to limit the IO copy speed when writing blobs.
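
As a rough illustration of that direction, the reader feeding the blob copy can be wrapped so that each chunk waits on a token bucket, capping the effective bytes per second. A minimal sketch using golang.org/x/time/rate, assuming nothing about Spegel's actual blob-serving code:

```go
package main

import (
	"context"
	"fmt"
	"io"
	"strings"

	"golang.org/x/time/rate"
)

// throttledReader delays each Read until the limiter grants enough
// tokens, capping the effective copy speed in bytes per second.
type throttledReader struct {
	r       io.Reader
	limiter *rate.Limiter
	ctx     context.Context
}

func (t *throttledReader) Read(p []byte) (int, error) {
	// Never read more than the burst size, so WaitN cannot fail
	// with a request that exceeds the bucket capacity.
	if len(p) > t.limiter.Burst() {
		p = p[:t.limiter.Burst()]
	}
	n, err := t.r.Read(p)
	if n > 0 {
		if werr := t.limiter.WaitN(t.ctx, n); werr != nil {
			return n, werr
		}
	}
	return n, err
}

// copyBlob streams src to dst at no more than bytesPerSec.
func copyBlob(ctx context.Context, dst io.Writer, src io.Reader, bytesPerSec int) (int64, error) {
	limiter := rate.NewLimiter(rate.Limit(bytesPerSec), 64*1024)
	return io.Copy(dst, &throttledReader{r: src, limiter: limiter, ctx: ctx})
}

func main() {
	// Example: copy 1 MiB at no more than 256 KiB/s.
	src := strings.NewReader(strings.Repeat("x", 1<<20))
	n, err := copyBlob(context.Background(), io.Discard, src, 256*1024)
	fmt.Println(n, err)
}
```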

@wenchao5211
Author

ref: dragonflyoss/Dragonfly#1427

@phillebaba
Member

I think this is fine to implement after having a look at the configuration options in Dragonfly.

https://d7y.io/docs/reference/configuration/dfdaemon

@bittrance do you have any opinion about this?

@bittrance
Contributor

I think the best tool to address this problem is kernel traffic control, which could be used to de-emphasize Spegel traffic. However, there are various scenarios where you don't have control over kernel-level configuration. Plus, the bottleneck may be upstream (think edge nodes).

Thus, implementing a per-node maximum bandwidth consumption for Spegel may indeed be a good idea. This should address network interface saturation on the "large image" node. The docs should probably mention setting the limit so that there is bandwidth left for gossiping.

I'm more skeptical about a per-peer/client bandwidth limit, since I think in most cases it will be hard to estimate the number of concurrent clients. Crazy idea: perhaps a Spegel node could be configured to 429 requests above some limit only if it sees that the layer is available from other peers (sketched below)?
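
That idea could look roughly like the following; hasOtherProviders is a hypothetical stand-in for a lookup against the P2P routing table, not a real Spegel function:

```go
package main

import (
	"log"
	"net/http"

	"golang.org/x/time/rate"
)

// serveBlob sketches the "only reject when someone else can serve it"
// idea. hasOtherProviders is hypothetical: it stands in for asking the
// P2P routing table whether other peers advertise the requested digest.
func serveBlob(limiter *rate.Limiter, hasOtherProviders func(digest string) bool) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		digest := r.URL.Query().Get("digest")
		if !limiter.Allow() && hasOtherProviders(digest) {
			// Overloaded, but the client has somewhere else to go.
			http.Error(w, "try another peer", http.StatusTooManyRequests)
			return
		}
		// Over the limit with no alternative peer: serve anyway, since
		// rejecting would only push the pull to the upstream registry.
		w.Write([]byte("blob data"))
	})
}

func main() {
	limiter := rate.NewLimiter(rate.Limit(100), 20)
	always := func(string) bool { return true } // pretend peers always have the layer
	log.Fatal(http.ListenAndServe(":5000", serveBlob(limiter, always)))
}
```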
