Conversation
748039b to 29fc130
Pull request overview
This PR refactors multimodal media I/O and model preprocessing to add native video input support (notably for Qwen3-VL / Qwen3.5 / InternS1 Pro), while also unifying handling of images and time-series through shared “media IO” abstractions.
Changes:
- Introduces `MediaIO` abstractions plus URL/data/file loading helpers, and adds video/time-series/image encode/decode utilities.
- Updates server-side OpenAI-style multimodal parsing to emit unified `{type, data, ...params}` items and threads `media_io_kwargs` through the OpenAI request path.
- Refactors many VL model preprocessors to consume `collect_multimodal_items()` and adds video token handling paths for the Qwen3/InternS1Pro families; updates the PyTorch multimodal data container to include modality.
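To make the unified item shape concrete, here is a hypothetical sketch; the names are illustrative only, and the real parsing lives in lmdeploy/serve/processors/multimodal.py:

```python
def make_item(media_type: str, data, **params) -> dict:
    """Build a unified {type, data, ...params} multimodal item."""
    return {'type': media_type, 'data': data, **params}


# Illustrative items: an image payload and a video payload with
# per-item sampling parameters carried alongside the data.
image_item = make_item('image', b'\x89PNG...')
video_item = make_item('video', b'...', num_frames=8, fps=2.0)
```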
Reviewed changes
Copilot reviewed 69 out of 69 changed files in this pull request and generated 18 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_lmdeploy/test_vl/test_vl_encode.py | Adds encode/decode tests for image/video/time-series and invalid inputs. |
| tests/test_lmdeploy/test_vl/test_qwen3vl_processor.py | Updates test message schema to new {type,data} format; import path update. |
| requirements/runtime_cuda.txt | Adds OpenCV headless runtime dependency for video loading. |
| lmdeploy/vl/utils.py | Replaces ad-hoc loaders with MediaIO-based load_* and encode_*_base64 APIs. |
| lmdeploy/vl/time_series_utils.py | Removes legacy time-series utilities (migrated into MediaIO). |
| lmdeploy/vl/model/yi.py | Switches preprocessing to unified multimodal collection API. |
| lmdeploy/vl/model/xcomposer2.py | Switches preprocessing to unified multimodal collection API. |
| lmdeploy/vl/model/qwen3_5.py | Pulls video token attributes from HF processor; prepares for video inputs. |
| lmdeploy/vl/model/qwen3.py | Adds video preprocessing and PyTorch packing path for Qwen3-VL. |
| lmdeploy/vl/model/qwen2.py | Switches preprocessing to unified multimodal collection API. |
| lmdeploy/vl/model/qwen.py | Switches preprocessing to unified multimodal collection API. |
| lmdeploy/vl/model/phi3_vision.py | Switches preprocessing to unified multimodal collection API. |
| lmdeploy/vl/model/mllama.py | Switches preprocessing to unified multimodal collection API. |
| lmdeploy/vl/model/llava_next.py | Switches preprocessing to unified multimodal collection API. |
| lmdeploy/vl/model/llava_hf.py | Switches preprocessing to unified multimodal collection API. |
| lmdeploy/vl/model/llava.py | Switches preprocessing to unified multimodal collection API. |
| lmdeploy/vl/model/llama4.py | Switches preprocessing to unified multimodal collection API. |
| lmdeploy/vl/model/internvl3_hf.py | Switches preprocessing to unified multimodal collection API. |
| lmdeploy/vl/model/internvl.py | Switches preprocessing to unified multimodal collection API. |
| lmdeploy/vl/model/interns1_pro.py | Adds unified multimodal preprocessing for image/video/time-series, plus packing paths. |
| lmdeploy/vl/model/glm4_1v.py | Switches preprocessing to unified multimodal collection API. |
| lmdeploy/vl/model/gemma3_vl.py | Switches preprocessing to unified multimodal collection API. |
| lmdeploy/vl/model/deepseek_vl2.py | Switches preprocessing to unified multimodal collection API. |
| lmdeploy/vl/model/deepseek.py | Switches preprocessing to unified multimodal collection API. |
| lmdeploy/vl/model/cogvlm.py | Switches preprocessing to unified multimodal collection API. |
| lmdeploy/vl/model/base.py | Replaces collect_images/collect_time_series with collect_multimodal_items(). |
| lmdeploy/vl/media/base.py | Introduces abstract MediaIO interface. |
| lmdeploy/vl/media/connection.py | Adds load_from_url() supporting http/data/file/path. |
| lmdeploy/vl/media/image.py | Adds ImageMediaIO for image load/encode. |
| lmdeploy/vl/media/time_series.py | Adds TimeSeriesMediaIO for time-series load/encode. |
| lmdeploy/vl/media/video_loader.py | Adds multi-backend video decoding loaders (OpenCV/Decord/TorchCodec/TorchVision). |
| lmdeploy/vl/media/video.py | Adds VideoMediaIO for video load/encode (JPEG-frame and raw). |
| lmdeploy/vl/constants.py | Adds Modality enum to standardize modality tags. |
| lmdeploy/vl/init.py | Re-exports load_* and encode_*_base64 APIs. |
| lmdeploy/serve/processors/multimodal.py | Refactors OpenAI multimodal parsing; adds media_io_kwargs support. |
| lmdeploy/serve/openai/protocol.py | Adds media_io_kwargs to OpenAI request schema. |
| lmdeploy/serve/openai/api_server.py | Threads media_io_kwargs through chat/generate endpoints. |
| lmdeploy/serve/core/async_engine.py | Threads media_io_kwargs into engine generation flow. |
| lmdeploy/pytorch/multimodal/image_type.py | Removes legacy ImageData wrapper type. |
| lmdeploy/pytorch/multimodal/data_type.py | Renames/reshapes multimodal container to MultiModalData + adds modality. |
| lmdeploy/pytorch/multimodal/init.py | Updates exports for new multimodal data container. |
| lmdeploy/pytorch/models/utils/multimodal.py | Removes unused multimodal mixin helper. |
| lmdeploy/pytorch/models/qwen3_vl_moe.py | Docstring cleanup while aligning with other forward signatures. |
| lmdeploy/pytorch/models/qwen3_vl.py | Updates multimodal plumbing and adds Qwen3VLInputProcessor for image/video. |
| lmdeploy/pytorch/models/qwen3_next.py | Minor cleanup (removes TODO comment). |
| lmdeploy/pytorch/models/qwen3_moe.py | Type-annotation updates / minor cleanup. |
| lmdeploy/pytorch/models/qwen3_5_moe.py | Switches to new Qwen3VL input processor for multimodal input. |
| lmdeploy/pytorch/models/qwen3_5.py | Switches to new Qwen3VL input processor and adds modality-based token selection. |
| lmdeploy/pytorch/models/qwen2_vl.py | Replaces MultiModalTensor with MultiModalData. |
| lmdeploy/pytorch/models/qwen2_5_vl.py | Replaces MultiModalTensor with MultiModalData. |
| lmdeploy/pytorch/models/phi3_v.py | Replaces MultiModalTensor with MultiModalData. |
| lmdeploy/pytorch/models/llava.py | Replaces MultiModalTensor with MultiModalData. |
| lmdeploy/pytorch/models/llama4.py | Replaces MultiModalTensor with MultiModalData. |
| lmdeploy/pytorch/models/internvl3_hf.py | Replaces MultiModalTensor with MultiModalData. |
| lmdeploy/pytorch/models/internvl.py | Replaces MultiModalTensor with MultiModalData. |
| lmdeploy/pytorch/models/interns1_pro.py | Updates InternS1Pro multimodal handling to new mm_data container. |
| lmdeploy/pytorch/models/glm4_1v.py | Replaces MultiModalTensor with MultiModalData. |
| lmdeploy/pytorch/models/gemma3_vl.py | Replaces MultiModalTensor with MultiModalData. |
| lmdeploy/pytorch/models/deepseek_vl2.py | Replaces MultiModalTensor with MultiModalData. |
| lmdeploy/pytorch/models/cogvlm.py | Replaces MultiModalTensor with MultiModalData. |
| lmdeploy/pytorch/models/chatglm2.py | Replaces MultiModalTensor with MultiModalData. |
| lmdeploy/pytorch/model_inputs.py | Updates step context types to use MultiModalData. |
| docs/zh_cn/multi_modal/qwen2_5_vl.md | Updates docs import path for encode_image_base64. |
| docs/zh_cn/multi_modal/minicpmv.md | Updates docs import path for encode_image_base64. |
| docs/zh_cn/multi_modal/internvl.md | Updates docs import path for encode_image_base64. |
| docs/en/multi_modal/qwen2_5_vl.md | Updates docs import path for encode_image_base64. |
| docs/en/multi_modal/minicpmv.md | Updates docs import path for encode_image_base64. |
| docs/en/multi_modal/internvl.md | Updates docs import path for encode_image_base64. |
| autotest/tools/pipeline/mllm_case.py | Updates imports to use re-exported encode_image_base64. |
Comments suppressed due to low confidence (3)
lmdeploy/vl/model/interns1_pro.py:296
`to_pytorch_aux_ts()` validates `len(segs) == len(preps) + 1`, but `preps` contains all modalities from `preprocess()`. If the request includes time series plus any other modality (e.g., an image), this will assert/crash. Filter `preps` to only time-series entries (or validate that only time-series inputs are allowed for this model).
lmdeploy/vl/media/video_loader.py:282
This `load_bytes()` also creates a `NamedTemporaryFile(delete=False, ...)` and never removes it. If torchvision is the chosen backend, repeated video loads can leak temp files on disk. Use a `try/finally` cleanup or a context manager that deletes the file after decoding.
lmdeploy/vl/media/video_loader.py:282
`TorchVisionVideoLoader.load_bytes()` calls `self.load_file(...)` but does not return its `(video, metadata)` result, so callers will get `None` and crash. Return the value from `load_file()` here.
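Both loader issues above could be addressed in one place; a minimal sketch, with the decode call passed in as a stand-in for the real torchvision-backed `load_file`:

```python
import os
import tempfile


def load_bytes(data: bytes, decode_file):
    """Decode in-memory video bytes via a temp file, returning the
    (video, metadata) result and always removing the temp file."""
    tmp = tempfile.NamedTemporaryFile(suffix='.mp4', delete=False)
    try:
        tmp.write(data)
        tmp.close()
        # Propagate the decode result to the caller instead of dropping it.
        return decode_file(tmp.name)
    finally:
        # Always clean up, even if decoding raises, to avoid leaking
        # temp files on disk across repeated video loads.
        os.unlink(tmp.name)
```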
```python
# collect all preprocessing result from messages
preps = [x['content'] for x in messages if x['role'] == 'preprocess']
assert len(preps) == 1
preps = preps[0]

# split prompt into segments and validate data
segs = prompt.split(self.vision_start_token + self.video_token + self.vision_end_token)
assert len(segs) == len(preps) + 1, (f'the number of {self.video_token} is not equal '
                                     f'to input videos, {len(segs) - 1} vs {len(preps)}')
```
`to_pytorch_aux_video()` uses the full `preps` list (which includes images/time_series too) when validating the number of video placeholders and computing offsets, so mixed-modality inputs will assert/crash. Filter `preps` to only video entries (and likewise ensure the prompt splitting matches only the video placeholders).
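A minimal sketch of the suggested filtering, assuming each preprocessing entry carries a `'modality'` key (the exact key name is an assumption):

```python
def filter_preps(preps: list, modality: str) -> list:
    """Keep only the preprocessing entries matching the requested modality."""
    return [p for p in preps if p.get('modality') == modality]


# Illustrative mixed-modality batch: only the video entries should be
# counted against the video placeholders in the prompt.
preps = [{'modality': 'image'}, {'modality': 'video'}, {'modality': 'video'}]
video_preps = filter_preps(preps, 'video')
```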
```python
mm_inputs = [input_mm.get('mm_data', []) for input_mm in context.input_multimodals]
# flatten batch
mm_inputs = [item for sublist in mm_inputs for item in sublist]

if len(mm_inputs) > 0:
    modality = mm_inputs[0].modality
    pixel_values = torch.cat([inp.data for inp in mm_inputs])

    image_token_id = mm_inputs[0].meta.get('image_token_id')
    video_token_id = mm_inputs[0].meta.get('video_token_id')
    mm_token_id = image_token_id if modality == Modality.IMAGE else video_token_id
    image_mask = (input_ids == mm_token_id)
```
This assumes all multimodal inputs share the same modality (`modality = mm_inputs[0].modality`) and derives a single placeholder token id from it. Mixed image+video inputs would be mishandled (masking and grid concatenation won't align). Add validation that all modalities match, or extend this to handle multiple modalities explicitly.
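A minimal sketch of the suggested validation, mirroring the `.modality` attribute from the snippet above (everything else here is illustrative):

```python
from types import SimpleNamespace


def check_single_modality(mm_inputs):
    """Raise early if a batch mixes modalities, instead of silently
    masking with the wrong placeholder token id."""
    modalities = {inp.modality for inp in mm_inputs}
    if len(modalities) > 1:
        raise ValueError(f'mixed modalities are not supported: {modalities}')
    return modalities.pop() if modalities else None


# A homogeneous batch passes and yields its single modality.
same = check_single_modality([SimpleNamespace(modality='image'),
                              SimpleNamespace(modality='image')])
```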
```python
# collect all preprocessing result from messages
preps = [x['content'] for x in messages if x['role'] == 'preprocess']
assert len(preps) == 1
preps = preps[0]

# split prompt into segments and validate data
segs = prompt.split(self.vision_start_token + self.video_token + self.vision_end_token)
assert len(segs) == len(preps) + 1, (f'the number of {self.video_token} is not equal '
                                     f'to input videos, {len(segs) - 1} vs {len(preps)}')
```
`to_pytorch_aux_video()` assumes every preprocessing entry corresponds to a video (`len(segs) == len(preps) + 1`), but `preps` currently contains all modalities produced by `preprocess()` (including images). This will assert/crash for mixed image+video inputs. Filter `preps` to only video items (e.g., by modality) or validate that only videos are present before calling this path.
```python
                meta=dict(ts_token_id=ts_token_id, ts_lens=ts_lens, ts_sr=ts_sr))
        else:
            modality = input_mm.get('modality')
            if modality == Modality.IMAGE:
```
Extracting the code in each branch into its own function would make this more readable, for example:

```python
if modality == IMAGE:
    mm_data = self.make_image_mm(...)
elif modality == VIDEO:
    mm_data = self.make_video_mm(...)
...
```
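A runnable variant of that suggestion, with stub builders standing in for the real packing logic (all names here are hypothetical, not the model's actual methods):

```python
class MMBuilder:
    """One small method per modality keeps each branch independently testable."""

    def make_image_mm(self, item):
        return {'kind': 'image', 'item': item}

    def make_video_mm(self, item):
        return {'kind': 'video', 'item': item}

    def build(self, modality, item):
        # Dispatch table instead of a growing if/elif chain.
        dispatch = {'image': self.make_image_mm, 'video': self.make_video_mm}
        return dispatch[modality](item)


mm_data = MMBuilder().build('video', {'num_frames': 8})
```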
Objective
Test
HTTP url sample
File url sample
Data url sample
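Since the sample above is collapsed, here is a generic sketch of building a base64 data-URL content item for an image; the `image_url` shape follows the common OpenAI message format and is not specific to this PR:

```python
import base64


def image_item_from_bytes(raw: bytes, mime: str = 'image/jpeg') -> dict:
    """Wrap raw image bytes in an OpenAI-style data-URL content item."""
    b64 = base64.b64encode(raw).decode('utf-8')
    return {'type': 'image_url',
            'image_url': {'url': f'data:{mime};base64,{b64}'}}


# JPEG magic bytes stand in for a real image payload here.
item = image_item_from_bytes(b'\xff\xd8\xff')
```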
Media IO kwargs
In certain cases, users may want to control video sampling parameters via `num_frames` or `fps` to reduce input size. Here is an example of passing per-request `media_io_kwargs` for video sampling.

Media IO kwargs sample
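Since the sample itself is collapsed, here is a minimal sketch of a request payload carrying `media_io_kwargs`; the nesting of `num_frames`/`fps` under a `'video'` key is an assumption — check the updated schema in lmdeploy/serve/openai/protocol.py for the exact shape:

```python
payload = {
    'model': 'Qwen3-VL',
    'messages': [{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'Describe this clip.'},
            {'type': 'video_url',
             'video_url': {'url': 'https://example.com/clip.mp4'}},
        ],
    }],
    # Per-request sampling controls for video decoding (assumed shape).
    'media_io_kwargs': {'video': {'num_frames': 16, 'fps': 2.0}},
}
```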
TODO