Motivation
Background
To provide more control over the model inputs, we currently define two methods for multi-modal models in vLLM:
The input processor is called inside LLMEngine to extend the prompt with placeholder tokens which are reserved for vLLM features such as KV cache and chunked prefill.
The input mapper is called inside ModelRunner to transform multi-modal inputs (e.g. PIL images) into tensor inputs, usually via the modality-specific processor (e.g. AutoImageProcessor) from HuggingFace.
Issues with the current design
The input processor accepts the output of HF AutoTokenizer, a list of token IDs, instead of the text prompt. Since HF AutoProcessor doesn’t accept token IDs, we have to write custom code to edit the list of token IDs based on the multi-modal inputs. For some models (such as Phi-3-vision), this means re-implementing code from their HF AutoProcessor, complicating the process of porting the model to vLLM.
The input mapper, being inside ModelRunner, lies on the critical path of vLLM’s model execution. Even when the input mapper is fast, tail TTFT and TPOT suffer because of this. As the input mapper takes up more time, our overall throughput decreases proportionally, which we can avoid by moving it off the critical path. Nevertheless, we can do little if the AutoProcessor inside the input mapper is itself very slow, as in #9238. Hopefully huggingface/transformers#33810 can help with that!
This abstraction results in redundant processing for models (such as Qwen2-VL and Molmo) whose HF AutoProcessor already performs most of the work of calculating the number of placeholder tokens.
Proposed Change
Unified multi-modal processor
We plan to merge our input processor and input mapper into a unified multi-modal processor and call it inside the LLMEngine (and thus benefit from #8779), taking the role of the existing tokenizer. After this change, each input type will be processed as follows:
Text-only prompt: Pass to vLLM tokenizer (wraps HF AutoTokenizer) [Unchanged]
List of token IDs: Skip vLLM tokenizer [Unchanged]
Text prompt with multi-modal input: Pass to vLLM multi-modal processor (wraps HF AutoProcessor) [NEW]
List of token IDs with multi-modal input: [DEPRECATED, see below]
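The four cases above can be sketched as a small routing function. This is only an illustration under stated assumptions: `route_prompt`, the returned labels, and the plain-dict prompt shape (keys `prompt`, `prompt_token_ids`, `multi_modal_data`) are hypothetical stand-ins chosen for this sketch, not the actual vLLM implementation.

```python
from typing import Any, Mapping, Union

# Assumed prompt shape for this sketch: either a plain string, or a mapping
# with "prompt" or "prompt_token_ids", plus optional "multi_modal_data".
Prompt = Union[str, Mapping[str, Any]]


def route_prompt(prompt: Prompt) -> str:
    """Return a label naming the preprocessing path for one prompt (hypothetical)."""
    if isinstance(prompt, str):
        return "tokenizer"  # text-only prompt [Unchanged]

    has_multi_modal = bool(prompt.get("multi_modal_data"))
    if "prompt_token_ids" in prompt:
        # Token IDs skip the tokenizer; combined with multi-modal data
        # they fall into the deprecated case.
        return "deprecated" if has_multi_modal else "skip_tokenizer"
    if "prompt" in prompt:
        return "multi_modal_processor" if has_multi_modal else "tokenizer"
    raise ValueError("prompt must contain 'prompt' or 'prompt_token_ids'")


print(route_prompt("Hello"))
print(route_prompt({"prompt": "<image> Describe it.", "multi_modal_data": {"image": "img0"}}))
```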
This multi-modal processor will first call HF AutoProcessor, and then modify the processed token IDs by inserting placeholder tokens. (These processed token IDs are not to be confused with the deprecated “list of token IDs with multi-modal input”, in which “list of token IDs” represents the tokenized text before processing with multi-modal input.) The number of placeholder tokens to assign can be determined by the existing feature size calculations for each model.
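A minimal sketch of the placeholder-insertion step, assuming the processed token IDs contain exactly one placeholder token per multi-modal item and that the per-item feature sizes are already known. The function name `expand_placeholders` and the token ID `32000` are hypothetical; some HF processors may already expand placeholders themselves, in which case this step would differ.

```python
from typing import Sequence


def expand_placeholders(
    token_ids: Sequence[int],
    placeholder_id: int,
    feature_sizes: Sequence[int],
) -> list[int]:
    """Replace the i-th placeholder token with feature_sizes[i] copies of itself."""
    expanded: list[int] = []
    sizes = iter(feature_sizes)
    for token_id in token_ids:
        if token_id == placeholder_id:
            try:
                size = next(sizes)
            except StopIteration:
                raise ValueError("more placeholder tokens than multi-modal items") from None
            expanded.extend([placeholder_id] * size)
        else:
            expanded.append(token_id)
    # Any size left over means an item had no placeholder in the prompt.
    if next(sizes, None) is not None:
        raise ValueError("more multi-modal items than placeholder tokens")
    return expanded


# Hypothetical values: 32000 stands in for an image token ID, 3 for its feature size.
print(expand_placeholders([1, 32000, 15, 16], placeholder_id=32000, feature_sizes=[3]))
# prints [1, 32000, 32000, 32000, 15, 16]
```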
Deprecate token IDs with multi-modal input
To be compatible with OpenAI’s (legacy) Completions API, we currently support passing token IDs directly to both the LLM class and the OpenAI-compatible server. However, the Completions API doesn’t support multi-modal inputs, so we will deprecate passing token IDs alongside multi-modal inputs to simplify model implementation (see Issue 1 above). Please tell us if you have a use case for this and don’t want to see it removed!
[2/N] Convert LLaVA-1.5, Phi-3-Vision, Qwen2-VL and Ultravox to multi-modal processor as POC and add tests
[3/N] Deprecate the old code for input processor/mapper so external developers have time to convert
[4/N] Convert the rest of the built-in vLLM models to multi-modal processor
[5/N] Remove the old code for input processor/mapper
The majority of our code will be called inside the existing InputPreprocessor which is separated from the vLLM engine, making it easy to integrate with #8779.
Any Other Things
Multi-modal plugins remain supported
You can define additional modalities in MultiModalProcessingMetadata to handle your custom multi-modal plugins. If the names of those modalities are not valid keyword arguments to HF AutoProcessor, you can override the default multi-modal processor (similar to how you currently need to define _default_input_mapper for multi-modal plugins).
Some users currently use multi-modal plugins to directly pass custom model inputs (#6260). We can implement an alternative process_multimodal to help them migrate to the new processing framework.
No batched preprocessing for now
Currently, preprocessing is performed per prompt in vLLM. While we can call the HF tokenizer and the modality-specific processor on batched inputs separately, calling the wrapping HF AutoProcessor with both a list of texts and a list of multi-modal data results in the processed multi-modal data (e.g. image) being assigned to every text in the list, rather than the more intuitive zip-like behavior (e.g. the ith image only assigned to the ith text). To support batched preprocessing, we would have to write custom code for each model to combine the outputs of the HF tokenizer and modality-specific processors. Given that this can significantly complicate model implementation (see Issue 1 above), we will not consider batched preprocessing at this stage, even with this change.
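The difference between the two pairing behaviors can be shown with a toy model. This does not call HF AutoProcessor; `pair_broadcast` and `pair_zip` are hypothetical helpers that only mimic the assignment behavior described above, with strings standing in for images.

```python
from typing import Sequence


def pair_broadcast(texts: Sequence[str], images: Sequence[str]) -> list[tuple[str, list[str]]]:
    """Every text receives all images (the behavior described for batched calls)."""
    return [(text, list(images)) for text in texts]


def pair_zip(texts: Sequence[str], images: Sequence[str]) -> list[tuple[str, list[str]]]:
    """The i-th text receives only the i-th image (the more intuitive behavior)."""
    if len(texts) != len(images):
        raise ValueError("zip-like pairing needs one image per text")
    return [(text, [image]) for text, image in zip(texts, images)]


texts = ["describe A", "describe B"]
images = ["image_a", "image_b"]  # stand-ins for real image objects

print(pair_broadcast(texts, images))
print(pair_zip(texts, images))
```

Processing each prompt on its own, as vLLM does today, sidesteps the question entirely because each call sees exactly one text and its own multi-modal data.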
This is great. In the EngineCore/AsyncLLM refactor (#9826), we introduced the concept of a Processor. I think that this code should sit inside there.
Your initiative here will fit very well with the EngineCore/AsyncLLM refactor, since the Processor runs in process 0 while the EngineCore runs in process 1. This means that we can overlap the input processing with the model execution (which is not currently possible, since the input processing runs in ModelRunner, which is part of the EngineCore).
One other note. The Processor in the linked PR currently runs inside process 0. However, we made the APIs such that we can adjust the Processor to run N background processes if needed. So, if you can work within this class, we can have a nice separation of concerns, which will enable us to offload more things to background processes as needed.
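To make the overlap idea concrete, here is a rough producer/consumer sketch. It uses a background thread and a bounded queue purely to stay self-contained; the comment above describes separate processes, which is what you would want for CPU-bound preprocessing in Python. `preprocess`, `execute`, and `run_overlapped` are hypothetical stand-ins, not names from the refactor.

```python
import queue
import threading
from typing import Sequence


def preprocess(prompt: str) -> list[int]:
    """Stand-in for input processing (tokenization, multi-modal mapping)."""
    return [ord(ch) for ch in prompt]


def execute(token_ids: list[int]) -> int:
    """Stand-in for model execution; returns the number of tokens consumed."""
    return len(token_ids)


def run_overlapped(prompts: Sequence[str]) -> list[int]:
    """Preprocess in the background while the main loop executes ready inputs."""
    ready: queue.Queue = queue.Queue(maxsize=2)  # bounded, so the producer cannot run far ahead
    sentinel = object()

    def producer() -> None:
        for prompt in prompts:
            ready.put(preprocess(prompt))
        ready.put(sentinel)  # signal that no more inputs will arrive

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    results: list[int] = []
    while True:
        item = ready.get()
        if item is sentinel:
            break
        results.append(execute(item))
    worker.join()
    return results


print(run_overlapped(["hi", "hello"]))
```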
Feedback Period
Feel free to comment as the effort progresses!
Timeline
MultiModalInputs to MultiModalKwargs #10040
CC List
@ywang96 @Isotr0py @WoosukKwon @robertgshaw2-neuralmagic