Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add model_load_time metric #2311

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

Edwinhr716
Copy link
Contributor

What does this PR do?

Adds metric that measures the time spent in downloading the model, loading into GPU memory, and time it takes for the server to be ready to receive a request.

Because the router is the component that emits the metrics, but the launcher is the one who downloads the model and thus is where the download time is tracked, it is necessary to pass the duration from the launcher to the router. I considered passing it as a CLI argument, but to minimize the number of changes required opted to use an environment variable to do it instead. Open to suggestions as to how to better pass the value to the router.

To make it easier to work with Rust's Instant type I opted to measure two different values and then add them together: the time it takes to download to the model to launching the router; and the time it takes from launching the router to the router being ready to receive requests.

This PR is part of the metrics standardization effort.

Fixes # (issue)

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Narsil
Copy link
Collaborator

Narsil commented Aug 9, 2024

I'm still confused what's the interest to measure this one huge metric that encompasses so many different things:

  • Model downloads, from where
  • Potential model conversion
  • Actual model load time which can be coming from disk or CPU RAM.

As a model runner, getting ready times can be interesting, but without more context it's quite useless, no ?
What are users supposed to do with that metric ? It's also never being modified during the server's lifetime, so it's not really probing the system to do monitoring no ?
In our monitoring systems, we only care about the logs showing insights as to what's happening, which contain every step, why they are occurring and how long each step takes.

I have nothing against the PR itself, but adding code without clear reasons for clear benefits is always a bit strange.

@Edwinhr716
Copy link
Contributor Author

What are users supposed to do with that metric ?

because it is essentially the startup latency of the model, it is useful in determining the pod autoscaling threshold and frequency. So if model-load-time is above say 40 seconds, the user might not want to decrease the number of pods too often as it takes long to create a new pod once demand rises again.

cc @achandrasekar

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants