[RFC]: Merge input processor and input mapper for multi-modal models

## Motivation

### Background

To provide more control over the model inputs, we currently define two methods for multi-modal models in vLLM:

- The **input processor** is called inside `LLMEngine` to extend the prompt with placeholder tokens which are reserved for vLLM features such as KV cache and chunked prefill.
- The **input mapper** is called inside `ModelRunner` to transform multi-modal inputs (e.g. `PIL` images) into tensor inputs, usually via the modality-specific processor (e.g. `AutoImageProcessor`) from HuggingFace.

### Issues with the current design

1. The input processor accepts the output of HF `AutoTokenizer`, a list of token IDs, instead of the text prompt. Since HF `AutoProcessor` doesn’t accept token IDs, we have to write custom code to edit the list of token IDs based on the multi-modal inputs. For some models (such as Phi-3-vision), this means re-implementing code from their HF `AutoProcessor`, complicating the process of porting the model to vLLM.
2. The input mapper, being inside `ModelRunner`, lies on the critical path of vLLM’s model execution. Even when the input mapper is fast, the tail TTFT and TPOT suffer because of this. As the input mapper takes up more time, our overall throughput decreases proportionally, which can be avoided by moving it off the critical path. Nevertheless, we can do little if the `AutoProcessor` inside the input mapper is very slow, like in [#9238](https://github.com/vllm-project/vllm/issues/9238). Hope that [huggingface/transformers#33810](https://github.com/huggingface/transformers/issues/33810) can help with that!
3. This abstraction results in redundant processing for models (such as Qwen2-VL and Molmo) whose HF `AutoProcessor` already performs most of the work of calculating the number of placeholder tokens.

## Proposed Change

### Unified multi-modal processor

We plan to merge our input processor and input mapper into a unified multi-modal processor (`BaseMultiModalProcessor`) that wraps HF `AutoProcessor`, and call it inside the `LLMEngine` (and thus benefit from #8779), taking the role of the existing tokenizer. After this change, each input type will be processed as follows:

- Text-only prompt: Pass to vLLM tokenizer (wraps HF `AutoTokenizer`) [Unchanged]
- List of token IDs: Skip vLLM tokenizer [Unchanged]
- Text prompt with multi-modal input: Pass to vLLM multi-modal processor [NEW]
- List of token IDs with multi-modal input: ~[Deprecated]~ Pass to vLLM multi-modal processor [NEW]
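
The routing above can be sketched as a small dispatch function. This is only an illustration of the four cases: `route_prompt` and the returned labels are hypothetical names and are not part of vLLM.

```python
from typing import Any, Dict, Optional, Sequence, Union

# A prompt is either raw text or an already-tokenized list of token IDs.
Prompt = Union[str, Sequence[int]]


def route_prompt(prompt: Prompt, mm_data: Optional[Dict[str, Any]] = None) -> str:
    """Return a label for the component that would handle this input.

    Hypothetical helper: the labels are illustrative, not vLLM identifiers.
    """
    if mm_data:
        # Text or token IDs with multi-modal input: multi-modal processor [NEW]
        return "multi_modal_processor"
    if isinstance(prompt, str):
        # Text-only prompt: vLLM tokenizer [Unchanged]
        return "tokenizer"
    # List of token IDs without multi-modal input: nothing to tokenize [Unchanged]
    return "skip_tokenizer"
```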

### Automatic prompt replacement

`BaseMultiModalProcessor._get_prompt_replacements` specifies HF's logic of replacing input placeholder tokens (e.g. `<image>` for a single image) with feature placeholder tokens (e.g. `<image><image>...<image>`, the number of which equals the feature size). Given this specification, we can automatically detect whether HF has replaced the input placeholder tokens by checking whether the feature placeholder tokens exist in the prompt.

`BaseMultiModalProcessor._apply_prompt_replacements` provides model-agnostic code for automatically replacing input placeholder tokens with feature placeholder tokens. This is only called if we find that HF hasn't done so yet.

This enables the multi-modal processor to accept text/token prompts and process them separately from the multi-modal data. The detailed logic is shown in `BaseMultiModalProcessor._apply_hf_processor_main`.
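
A minimal sketch of the detect-then-replace idea on token IDs, assuming one input placeholder per item and a known feature size for each item. The helper names and the token ID `32000` are made up for illustration, and the detection here is a simplified heuristic; the actual logic lives in `BaseMultiModalProcessor` and is driven by the replacement specification.

```python
from typing import List, Sequence


def find_run(tokens: List[int], run: List[int]) -> int:
    """Return the index of the first occurrence of `run` in `tokens`, or -1."""
    n = len(run)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == run:
            return i
    return -1


def expand_placeholders(tokens: List[int], placeholder: int, feature: int,
                        sizes: Sequence[int]) -> List[int]:
    """Replace the i-th input placeholder with sizes[i] feature placeholders."""
    out: List[int] = []
    item = 0
    for tok in tokens:
        if tok == placeholder and item < len(sizes):
            out.extend([feature] * sizes[item])
            item += 1
        else:
            out.append(tok)
    return out


def ensure_feature_placeholders(tokens: Sequence[int], placeholder: int,
                                feature: int, sizes: Sequence[int]) -> List[int]:
    """Expand placeholders only if the expected feature runs are not present yet."""
    toks = list(tokens)
    already_replaced = all(find_run(toks, [feature] * s) >= 0 for s in sizes)
    if already_replaced:
        return toks
    return expand_placeholders(toks, placeholder, feature, sizes)
```

For example, with `<image>` mapped to the hypothetical ID `32000` and a feature size of 3, `ensure_feature_placeholders([1, 32000, 5], 32000, 32000, [3])` expands the single placeholder into three, and calling it again on the result leaves the prompt unchanged.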

### Processor caching

#11396 caches each item in the multi-modal output of the HF processor and links it back to the corresponding item in the input data.

When new data is passed in, we first check which items are in the cache, and which ones are missing. The missing items are passed into the HF processor in a single batch and cached, before being merged with the existing items in the cache.

Note that the text/token prompt must be processed separately from the multi-modal data because HF processors expect the input placeholders in the text to correspond to each multi-modal data item, but we only want to process the items that are missing. We can handle this elegantly using automatic prompt replacement (see above).
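
The check-then-batch flow can be sketched as follows. This is a toy under stated assumptions, not the implementation in #11396: items are raw `bytes` keyed by their SHA-256 hash, `ProcessorCache` is a hypothetical class, and `process_batch` stands in for a call to the HF processor on multi-modal data only.

```python
import hashlib
from typing import Callable, Dict, List


def item_key(item: bytes) -> str:
    """Content hash used as the cache key (an assumption of this sketch)."""
    return hashlib.sha256(item).hexdigest()


class ProcessorCache:
    """Toy per-item cache placed in front of a batched processor callable."""

    def __init__(self, process_batch: Callable[[List[bytes]], List[object]]):
        self._process_batch = process_batch
        self._cache: Dict[str, object] = {}

    def get_or_process(self, items: List[bytes]) -> List[object]:
        keys = [item_key(it) for it in items]

        # Collect items that are not cached yet, de-duplicated, in input order.
        missing: Dict[str, bytes] = {}
        for key, it in zip(keys, items):
            if key not in self._cache and key not in missing:
                missing[key] = it

        # Process all missing items in a single batch, then cache the outputs.
        if missing:
            outputs = self._process_batch(list(missing.values()))
            self._cache.update(zip(missing.keys(), outputs))

        # Merge cached and freshly processed outputs back into input order.
        return [self._cache[key] for key in keys]
```

Because only the missing items reach the processor, the prompt cannot be passed alongside them, which is why this sketch takes multi-modal items alone and leaves placeholder handling to the prompt replacement step.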

### ~~Deprecate token IDs with multi-modal input~~

~~To be compatible with OpenAI’s (legacy) Completions API, we currently support passing token IDs directly to both `LLM` class and OpenAI-compatible server. However, Completions API doesn’t support multi-modal inputs, so we will deprecate passing token IDs alongside multi-modal inputs to simplify model implementation (see Issue 1 above). **Please tell us if you have a use case for this and don’t want to see it removed!**~~

## Feedback Period

Feel free to comment as the effort progresses!

### Timeline

- [x] #10040
- [x] #10044
- [x] #10485
- [x] [3/N] Develop and refine POC
  - [x] #10676
  - [x] #10711 
  - [x] #10977
  - [x] #11199
  - [x] #11198
  - [x] #11258
  - [x] #11303
  - [x] #11396
  - [x] #11620
  - [x] #11661
  - [x] #11669
  - [x] #11674
  - [x] #11746
  - [x] #11812
  - [x] #11900
  - [x] #12244
  - [x] #12269
  - [x] #11427
  - [x] #13215
  - [x] #13380
  - [x] #13516
  - [x] #13964
  - [x] #14038
  - [x] #15712
  - [x] #16408
  - [x] #16416
- [x] [4/N] Deprecate the old code for input processor/mapper so external developers have time to convert
  - [x] Update documentation on how to implement multi-modal models
    - [x] #11925
    - [x] #13331
    - [x] #14278
    - [ ] #15405
    - [x] #16915
  - [x] Activate deprecation logic
    - [x] #13979
- [x] [5/N] Convert the rest of the built-in vLLM models to multi-modal processor
  - [x] #11632 (Aria, BLIP-2, Chameleon, Fuyu)
  - [x] #11682
  - [x] #11717
  - [x] #12504
  - [x] #12069
  - [x] #12553
  - [x] #12660
  - [x] #12449
  - [x] #12966
  - [x] #13278
  - [x] #14015
  - [x] #12211
  - [x] #15477
- [x] [6/N] Remove the old code for input processor/mapper
  - [x] #14864
  - [x] #15673
  - [x] #15686

The majority of our code will be called inside the existing `InputPreprocessor`, which is separate from the vLLM engine, making it easy to integrate with #8779.

## CC List

@ywang96 @Isotr0py @WoosukKwon @robertgshaw2-neuralmagic 

## Any Other Things

### ~~Multi-modal plugins remain supported~~ Migrating multi-modal plugins

You can define additional input modalities (`ModalityDataItems`) and parse them in subclasses of `MultiModalDataParser` on a per-model basis. Afterwards, override `BaseMultiModalProcessor._get_data_parser` to construct your newly-defined parser.

Some users currently use multi-modal plugins to directly pass custom model inputs ([#6260](https://github.com/vllm-project/vllm/pull/6260)). Those inputs can be excluded from HF processing by returning them in `ModalityDataItems.get_passthrough_data` instead of `ModalityDataItems.get_processor_data`.
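
The split between processed and passthrough inputs can be illustrated with stand-in classes. Everything below is a toy: `ToyImageItems`, `ToyEmbeddingItems` and `split_inputs` are hypothetical and do not subclass the real `ModalityDataItems`. Only the two method names mirror the ones mentioned above, and the dictionary keys are assumptions.

```python
from typing import Any, Dict, Iterable, List, Tuple


class ToyImageItems:
    """Stand-in for items that should go through the HF processor."""

    def __init__(self, data: List[Any]):
        self.data = list(data)

    def get_processor_data(self) -> Dict[str, Any]:
        return {"images": self.data}

    def get_passthrough_data(self) -> Dict[str, Any]:
        return {}


class ToyEmbeddingItems:
    """Stand-in for custom inputs that should skip HF processing."""

    def __init__(self, modality: str, data: List[Any]):
        self.modality = modality
        self.data = list(data)

    def get_processor_data(self) -> Dict[str, Any]:
        return {}

    def get_passthrough_data(self) -> Dict[str, Any]:
        return {f"{self.modality}_embeds": self.data}


def split_inputs(all_items: Iterable[Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Separate what is sent to the HF processor from what bypasses it."""
    processor_data: Dict[str, Any] = {}
    passthrough_data: Dict[str, Any] = {}
    for items in all_items:
        processor_data.update(items.get_processor_data())
        passthrough_data.update(items.get_passthrough_data())
    return processor_data, passthrough_data
```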

### ~~No batched preprocessing for now~~

~~Currently, preprocessing is performed per prompt in vLLM. While we can call HF tokenizer and modality-specific processor on batched inputs separately, calling the wrapping HF `AutoProcessor` with both list of texts and list of multi-modal data results in the processed multi-modal data (e.g. image) being assigned to every text in the list, rather than the more intuitive `zip`-like behavior (e.g. the `i`th image only assigned to the `i`th text). To support batched preprocessing, we would have to write custom code for each model to combine the outputs of HF tokenizer and modality-specific processors. Given that this can significantly complicate model implementation (see Issue 1 above), we will not consider batched preprocessing at this stage, even with this change.~~

