-
-
Notifications
You must be signed in to change notification settings - Fork 4.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Roadmap] vLLM Roadmap Q3 2024 #5805
Comments
Does vLLM need the multi-model support similar like what FastChat does or something else? |
#2809 hello,how about this? |
Hi, the issues were mentioned in #5036 and should be taken into account. |
Will vLLM use Triton more to optimize operators' performance in future, or will it consider using the torch.compile mechanism more? And are there any plans for this? |
Hi! Is there or will there be support for the OpenAI Batch API ? |
I am doing for Whisper, my fork at https://github.com/mesolitica/vllm-whisper, the frontend later should compatible with OpenAI API plus able to stream output tokens, few hiccups, still trying to figure out based on T5 branch,
|
Able to load and infer, https://github.com/mesolitica/vllm-whisper/blob/main/examples/whisper_example.py, but the output is still trash, might be bugs related to weights or the attention, still debugging |
Do you have plans to support Ascend 910B in the future? |
Please consider prioritizing dynamic / just-in-time 8-bit quantization like EETQ which don't require offline quantization step. Previous mention in issues: #3261 (comment) |
deepseek-v2 and deepseek-coder-v2 are supported now. but awq or gptq version are not supported so these model are still not usable due to their huge 236B. also MLA(Multihead Latent Attention) of there model is not supported yet. |
Support for DoLa would be great! |
|
Please consider supporting transformer-based value models such as in the vllm fork https://github.com/MARIO-Math-Reasoning/vllm and the huggingface implementation https://huggingface.co/docs/trl/models#trl.AutoModelForCausalLMWithValueHead. The only thing that changes is adding a head to the end of the model to predict a value instead of logits. This would be a powerful addition to support very fast generation search and open up the possibility of more effective methods such as MCTS compared to traditional prompt based approaches such as self-consistency, CoT, ToT, etc. |
Thank you for your nice contribution! I wonder whether it is possible for you to fork a branch from vllm instead of creating new one so that anyone can see what changes in new contribution? |
yes thanks @robertgshaw2-neuralmagic, was trying it in recent days and it does look promising. happy to hear you believe it's more accurate than EETQ. I can confirm that Llama-70B-Instruct got almost same MMLU score with Would be great if it could load and quant the layers iteratively, as now if the 16bit model can't fit in the GPU, we have to quant it offline first. But the fact there is an option to do "dynamic" quant without calibration data is great. thanks for this |
It should be more accurate and much much faster - so I think we will not prioritizing adding Iterative quantization is on my list, ideally this week. |
vLLM currently has partial support for this (#4794). |
This requires a completely new instance of vLLM, It would be nice if we could just call an existing API with a batch request like you do with the OpenAI Batch API. |
Exactly my thoughts. I could help with the build. I already have a nano-library that does interface with OpenAI directly at ashim-mahara/odbg. The primary problem I have identified is with tracking the request origins in-case of dynamic batching by VLLM. The first one is easier if batches are executed sequentially but they would still need to be saved on the disk somewhere for retrieval later. |
@w013nad (or others), please feel free to open an RFC for this to discuss the ideal API. The main challenge is around file storage I believe. |
Hopefully, the function_call and tool_choice features will be implemented faster and will additionally support models like Qwen2 |
Hi all, CPU Optimizations to support GGUF models !!My thoughts are, Adding CPU optimizations to the vLLM makes it more robust.
If anyone already looking into this please let me know, I want to work on this part, I'm open to help/contribute to this Thanks |
ollama already support tool use in from version 0.3.0 |
Any chance that you guys can implement Dry Repetition Penalty? I sorely miss it from backends like Oobabooga or Kobold. |
We want to see more improvement on compiler since this is the major gap between vLLM and TRT-LLM (with meylin compiler) support. B.t.w, what's your opinion with SGLang (they extensively use torch.compile to optimize the ML workload) and their released benchmark? @simon-mo |
@akhilreddy0703 #5191 has just been merged, providing support for GGUF models. |
Hi, I would like to contribute to the Reward model API, do you have any suggestions or ideas in mind for this feature? |
A good start point might be some API similar to this https://github.com/OpenRLHF/OpenRLHF/pull/391/files |
Up for this, support multiple models or models at different version had good use case in the era of synthetic data. But I would suggest expose this feature in Engine level. My current recipe is using LangChain to abstract a layer on top of Ray, Ray is in charge of distributed model loading and inference. |
Is there a way to pass in custom decoding config in offline inference mode for different prompts i.e. use outlines to generate custom json output per prompt? It seems that currently, it is only possible to pass in a single decoding config to use for all prompts so would be great to have this feature! |
For offline inference mode will it be more efficient to organize data and create engine backend for each type of the prompts ? I am more interested in online decision of the decoding config for different type of coming inputs. Instead of using a chain of inference , one to make such judgement one to do inference, it is worthy of trying to do it before prefill or with a few round of generations. |
Though you can accelerate generation of reward/critic from limited hands experiences with our MegatronPPOTrainerEngine, Reward model is exclusive to alignment of LLM, which is out of the scope of vLLM. The challenge is huge memory required both for host cpu and its co-processor. The memory pressure comes from the fact that shards of optimizers of actor (finetuned GPT head), critic model (initialized with reward model parameters) co-exist with the shards of model parameters (no DDP copies on other gpu parallel groups). And in the last stage of pipeline of model, we need a full copy of an actor and a reward, which achieves the peak memory usage of whole PPO training PP stages. It is very complex situation; you cannot simply tackle this by hosting the frozen model outside of training gpus. vLLM does provide serving mode and you can make use of it. So my suggestion is, keep the relevant alignment features solely in the relevant repositories. |
The trouble in my use case is that each prompt requires a slightly different schema for the json depending on input to the prompt. Would be great if this could be treated similar to online inference in that sense. |
Hi what happened to "ARM aarch-64 support for AWS Graviton based instances and GH200" from the Q2 2024 roadmap? #3861 |
Hi, I wanted to contribute to Here is what I have got: #5683. What kind of LLM Class can be a good starting point for this? |
Hey can this be looked at please. I'm not able to run any mixture of experts models on L4 gpus (EC2 G6) instances due to the Triton issue mentioned |
More and more speech model is using a LLM to predict non-text tokens. Like ChatTTS or FishTTS, they are all using a llama to predict speech tokens. |
I think this is the difference in implementation at different granularities. |
Hi, it would be really great to have DRY implemented in vLLM, DRY has been a game changer for all the small models, since they tend to repeat much more. It's a really effective sampling method. It would be really useful to have it here as well |
Do we have plans to support #5540? We are having a production level use case and would really appreciate if someone can look into it for Q4 onwards. |
Any chance to support Ascend NPU as vLLM backend in 2024 Q4 roadmap?
|
Update:
This document includes the features in vLLM's roadmap for Q3 2024. Please feel free to discuss and contribute, as this roadmap is shaped by the vLLM community.
Themes.
As before, we categorized our roadmap into 6 broad themes:
Broad Model Support
Help wanted:
Hardware Support
Performance Optimizations
Production Features
Help wanted
OSS Community
Help wanted
Extensible Architecture
If any of the item you wanted is not on the roadmap, your suggestion and contribution is still welcomed! Please feel free to comment in this thread, open feature request, or create an RFC.
The text was updated successfully, but these errors were encountered: