[Bug]: LLM initialization time increases significantly with larger tensor parallel size and Ray #10283
Comments
Someone correct me if I'm wrong, but the workers are initialized sequentially on the main process, which can be seen in the function linked below: vllm/vllm/executor/ray_gpu_executor.py Line 109 in bbd3e86
Ray adds extra overhead because the whole worker config has to be sent through Ray, which is a slower process.
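As a rough illustration of that pattern (a simplified sketch, not vLLM's actual executor code): creating each Ray worker and blocking on its initialization from the driver, one worker after another, makes startup grow roughly linearly with the number of workers. The `Worker` class, config dict, and sleep below are placeholders for the real per-worker work.

```python
import time

import ray


@ray.remote
class Worker:
    def init_model(self, config: dict) -> str:
        # Stand-in for the real per-worker work (CUDA init, loading weights).
        time.sleep(1.0)
        return f"initialized with {len(config)} config entries"


def init_sequentially(num_workers: int, config: dict) -> list:
    workers = [Worker.remote() for _ in range(num_workers)]
    results = []
    for worker in workers:
        # Blocking with ray.get() inside the loop means each worker must
        # finish before the next one even starts, so total startup time
        # scales with the number of workers.
        results.append(ray.get(worker.init_model.remote(config)))
    return results


if __name__ == "__main__":
    ray.init()
    start = time.time()
    init_sequentially(num_workers=4, config={"model": "some-7b-model", "tp": 4})
    print(f"sequential init: {time.time() - start:.1f}s")  # roughly 4s for 4 workers
    ray.shutdown()
```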
Thank you for your answer! However, I still have some concerns about the initialization overhead for a 7B model: the overhead seems disproportionately large. Is this level of overhead expected? It seems excessive for a 7B model. Could there be potential optimization opportunities to reduce these initialization costs?
I don't find that overhead too strange, and there is definitely room for optimization (parallelizing the process), but engine startup time is not really a metric that people worry about. (Model reloading, which is currently not implemented, would probably be the solution more people are interested in.) Is there a reason you're looking for faster initialization?
Thanks a lot for your reply! We want to improve the startup speed. IMHO, 34s is too long to wait, especially when we are developing new features and want to run some tests to verify them.
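As a hedged sketch of the parallelization idea mentioned above (not an implemented vLLM feature): launching every worker's initialization call before waiting lets the per-worker work overlap, so total startup approaches the time of the slowest worker instead of the sum. The `Worker` class and sleep are the same placeholders as in the previous sketch.

```python
import time

import ray


@ray.remote
class Worker:
    def init_model(self, config: dict) -> str:
        time.sleep(1.0)  # stand-in for CUDA init / weight loading
        return "ready"


def init_in_parallel(num_workers: int, config: dict) -> list:
    workers = [Worker.remote() for _ in range(num_workers)]
    # Fire off all remote calls first; a single ray.get() on the list of
    # futures then waits for all workers collectively.
    futures = [worker.init_model.remote(config) for worker in workers]
    return ray.get(futures)


if __name__ == "__main__":
    ray.init()
    start = time.time()
    init_in_parallel(num_workers=4, config={"model": "some-7b-model"})
    print(f"parallel init: {time.time() - start:.1f}s")  # roughly 1s for 4 workers
    ray.shutdown()
```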
Your current environment
vllm 0.5.2
The output of `python collect_env.py`
Model Input Dumps
Just testing the vLLM init time.
🐛 Describe the bug
Issue Description
We observed significant and unexpected increases in vLLM initialization time when scaling tensor parallelism (TP), especially with Ray enabled.
Observed Behavior
Expected Behavior
Initialization time should remain relatively constant, or increase only minimally, when scaling tensor parallelism and using Ray.
Environment
Additional Context
The initialization time increase appears disproportionate to the tensor parallel size, suggesting a potential bottleneck in the initialization process, particularly when Ray is involved.
Reproducible Steps
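A minimal measurement sketch, assuming the `tensor_parallel_size` and `distributed_executor_backend` arguments of `vllm.LLM` (the latter may vary across vLLM versions); the model path is a placeholder, and one configuration is timed per run to avoid reusing a distributed setup in the same process:

```python
import sys
import time

from vllm import LLM

if __name__ == "__main__":
    tp = int(sys.argv[1]) if len(sys.argv) > 1 else 1
    backend = sys.argv[2] if len(sys.argv) > 2 else "mp"  # "mp" or "ray"

    start = time.time()
    llm = LLM(
        model="/path/to/a-7b-model",  # hypothetical local checkpoint path
        tensor_parallel_size=tp,
        distributed_executor_backend=backend,
    )
    print(f"tp={tp}, backend={backend}: init took {time.time() - start:.1f}s")
```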
vllm start time
vllm ray start time