Support video inputs#4360

Open
CUHKSZzxy wants to merge 24 commits into InternLM:main from CUHKSZzxy:support-video-inputs

Conversation

Collaborator

@CUHKSZzxy commented Feb 13, 2026

Objective

  1. Refactor for better multi-modal support.
  • Rename and extend image-specific handling to multimodal.
  • Abstract media IO.
  2. Support native video inputs for the Qwen3 VL, Qwen3.5, and InternS1 Pro series.
  • Update the related preprocessing and implement the special video mrope calculations.
  • Support different video loaders (defaults to OpenCV, aligning with vLLM).
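
The video loader's sampling decision described above boils down to choosing which frame indices to decode, given either a frame budget (num_frames) or a target sampling rate (fps). A minimal sketch with illustrative names, not the PR's actual API:

```python
# Sketch of uniform frame-index selection for a video loader.
# `sample_frame_indices` and its parameters are illustrative, not the PR's API.
import numpy as np


def sample_frame_indices(total_frames: int,
                         num_frames: int = None,
                         fps: float = None,
                         video_fps: float = 30.0) -> np.ndarray:
    """Return the frame indices to decode from a clip of `total_frames` frames."""
    if num_frames is not None:
        # evenly spaced indices covering the whole clip
        return np.linspace(0, total_frames - 1, num_frames).astype(int)
    if fps is not None:
        # keep every k-th frame so the effective rate is roughly `fps`
        step = max(int(round(video_fps / fps)), 1)
        return np.arange(0, total_frames, step)
    return np.arange(total_frames)  # no sampling: decode everything
```

An OpenCV-backed loader would then seek to each index (e.g. via cv2.CAP_PROP_POS_FRAMES) and decode only those frames.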

Test

  1. HTTP URL
HTTP URL sample
curl http://0.0.0.0:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-VL-8B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this video."
                },
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "https://raw.githubusercontent.com/CUHKSZzxy/Online-Data/main/clip_3_removed.mp4"
                    }
                }
            ]
        }
    ],
    "max_tokens": 100
  }'
  2. File URL (local file path)
File URL sample
curl http://0.0.0.0:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-VL-8B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this video."
                },
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "file://your_path_to_video.mp4"
                    }
                }
            ]
        }
    ],
    "max_tokens": 100
  }'
  3. Data URL (base64)
Data URL sample
import pybase64
from openai import OpenAI

openai_api_key = 'EMPTY'
openai_api_base = 'http://0.0.0.0:23333/v1'

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Use a base64-encoded data URL in the payload
model_path = 'Qwen/Qwen3-VL-8B-Instruct'
video_path = '/nvme1/zhouxinyu/lmdeploy_fp8/clip_3_removed.mp4'

with open(video_path, "rb") as f:
    video_b64 = pybase64.b64encode(f.read()).decode('utf-8')

chat_completion_from_url = client.chat.completions.create(
    messages=[{
        'role':
        'user',
        'content': [
            {
                'type': 'text',
                'text': "Describe this video.",
            },
            {
                'type': 'video_url',
                'video_url': {
                    'url': f'data:video/mp4;base64,{video_b64}'
                },
            },
        ],
    }],
    model=model_path,
    max_completion_tokens=100,
)

result = chat_completion_from_url.choices[0].message.content
print('Chat completion output:\n', result)

Media IO kwargs

In certain cases, users may want to control video sampling via num_frames or fps to reduce the input size.
Here is an example of passing per-request media_io_kwargs for video sampling.

media io kwargs sample
curl http://0.0.0.0:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-VL-8B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this video."
                },
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "https://raw.githubusercontent.com/CUHKSZzxy/Online-Data/main/clip_3_removed.mp4"
                    }
                }
            ]
        }
    ],
    "max_tokens": 100,
    "media_io_kwargs":{
        "video": {
            "num_frames": 10,
            "fps": 2
        }
    }
  }'
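
For reference, the same request can be issued from Python with only the standard library. build_video_request and post_chat_completion are illustrative helpers; media_io_kwargs is the per-request field this PR adds, the rest follows the OpenAI-compatible schema above:

```python
import json
from urllib import request


def build_video_request(video_url: str, num_frames: int = 10, fps: int = 2) -> dict:
    """Assemble the chat-completions payload with per-request media_io_kwargs."""
    return {
        'model': 'Qwen/Qwen3-VL-8B-Instruct',
        'messages': [{
            'role': 'user',
            'content': [
                {'type': 'text', 'text': 'Describe this video.'},
                {'type': 'video_url', 'video_url': {'url': video_url}},
            ],
        }],
        'max_tokens': 100,
        'media_io_kwargs': {'video': {'num_frames': num_frames, 'fps': fps}},
    }


def post_chat_completion(payload: dict, base_url: str = 'http://0.0.0.0:23333/v1') -> dict:
    """POST the payload to the OpenAI-compatible endpoint and decode the reply."""
    req = request.Request(f'{base_url}/chat/completions',
                          data=json.dumps(payload).encode('utf-8'),
                          headers={'Content-Type': 'application/json'})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Usage would be `post_chat_completion(build_video_request('file://your_path_to_video.mp4'))` against a running server.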

TODO

  • Better media IO abstractions
  • Support video inputs for Qwen3 VL, Qwen3.5, InternS1 Pro
  • Runtime media_io_kwargs

@CUHKSZzxy force-pushed the support-video-inputs branch from 748039b to 29fc130 on February 27, 2026 13:33
@CUHKSZzxy marked this pull request as ready for review March 4, 2026 07:27
Copilot AI review requested due to automatic review settings March 4, 2026 07:27
@CUHKSZzxy changed the title from "[WIP] Support video inputs" to "Support video inputs" Mar 4, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR refactors multimodal media I/O and model preprocessing to add native video input support (notably for Qwen3-VL / Qwen3.5 / InternS1 Pro), while also unifying handling of images and time-series through shared “media IO” abstractions.

Changes:

  • Introduces MediaIO abstractions + URL/data/file loading helpers, and adds video/time-series/image encode/decode utilities.
  • Updates server-side OpenAI-style multimodal parsing to emit unified {type, data, ...params} items and threads media_io_kwargs through the OpenAI request path.
  • Refactors many VL model preprocessors to consume collect_multimodal_items() and adds video token handling paths for Qwen3/InternS1Pro families; updates PyTorch multimodal data container to include modality.

Reviewed changes

Copilot reviewed 69 out of 69 changed files in this pull request and generated 18 comments.

File Description
tests/test_lmdeploy/test_vl/test_vl_encode.py Adds encode/decode tests for image/video/time-series and invalid inputs.
tests/test_lmdeploy/test_vl/test_qwen3vl_processor.py Updates test message schema to new {type,data} format; import path update.
requirements/runtime_cuda.txt Adds OpenCV headless runtime dependency for video loading.
lmdeploy/vl/utils.py Replaces ad-hoc loaders with MediaIO-based load_* and encode_*_base64 APIs.
lmdeploy/vl/time_series_utils.py Removes legacy time-series utilities (migrated into MediaIO).
lmdeploy/vl/model/yi.py Switches preprocessing to unified multimodal collection API.
lmdeploy/vl/model/xcomposer2.py Switches preprocessing to unified multimodal collection API.
lmdeploy/vl/model/qwen3_5.py Pulls video token attributes from HF processor; prepares for video inputs.
lmdeploy/vl/model/qwen3.py Adds video preprocessing and PyTorch packing path for Qwen3-VL.
lmdeploy/vl/model/qwen2.py Switches preprocessing to unified multimodal collection API.
lmdeploy/vl/model/qwen.py Switches preprocessing to unified multimodal collection API.
lmdeploy/vl/model/phi3_vision.py Switches preprocessing to unified multimodal collection API.
lmdeploy/vl/model/mllama.py Switches preprocessing to unified multimodal collection API.
lmdeploy/vl/model/llava_next.py Switches preprocessing to unified multimodal collection API.
lmdeploy/vl/model/llava_hf.py Switches preprocessing to unified multimodal collection API.
lmdeploy/vl/model/llava.py Switches preprocessing to unified multimodal collection API.
lmdeploy/vl/model/llama4.py Switches preprocessing to unified multimodal collection API.
lmdeploy/vl/model/internvl3_hf.py Switches preprocessing to unified multimodal collection API.
lmdeploy/vl/model/internvl.py Switches preprocessing to unified multimodal collection API.
lmdeploy/vl/model/interns1_pro.py Adds unified multimodal preprocessing for image/video/time-series, plus packing paths.
lmdeploy/vl/model/glm4_1v.py Switches preprocessing to unified multimodal collection API.
lmdeploy/vl/model/gemma3_vl.py Switches preprocessing to unified multimodal collection API.
lmdeploy/vl/model/deepseek_vl2.py Switches preprocessing to unified multimodal collection API.
lmdeploy/vl/model/deepseek.py Switches preprocessing to unified multimodal collection API.
lmdeploy/vl/model/cogvlm.py Switches preprocessing to unified multimodal collection API.
lmdeploy/vl/model/base.py Replaces collect_images/collect_time_series with collect_multimodal_items().
lmdeploy/vl/media/base.py Introduces abstract MediaIO interface.
lmdeploy/vl/media/connection.py Adds load_from_url() supporting http/data/file/path.
lmdeploy/vl/media/image.py Adds ImageMediaIO for image load/encode.
lmdeploy/vl/media/time_series.py Adds TimeSeriesMediaIO for time-series load/encode.
lmdeploy/vl/media/video_loader.py Adds multi-backend video decoding loaders (OpenCV/Decord/TorchCodec/TorchVision).
lmdeploy/vl/media/video.py Adds VideoMediaIO for video load/encode (JPEG-frame and raw).
lmdeploy/vl/constants.py Adds Modality enum to standardize modality tags.
lmdeploy/vl/__init__.py Re-exports load_* and encode_*_base64 APIs.
lmdeploy/serve/processors/multimodal.py Refactors OpenAI multimodal parsing; adds media_io_kwargs support.
lmdeploy/serve/openai/protocol.py Adds media_io_kwargs to OpenAI request schema.
lmdeploy/serve/openai/api_server.py Threads media_io_kwargs through chat/generate endpoints.
lmdeploy/serve/core/async_engine.py Threads media_io_kwargs into engine generation flow.
lmdeploy/pytorch/multimodal/image_type.py Removes legacy ImageData wrapper type.
lmdeploy/pytorch/multimodal/data_type.py Renames/reshapes multimodal container to MultiModalData + adds modality.
lmdeploy/pytorch/multimodal/__init__.py Updates exports for new multimodal data container.
lmdeploy/pytorch/models/utils/multimodal.py Removes unused multimodal mixin helper.
lmdeploy/pytorch/models/qwen3_vl_moe.py Docstring cleanup while aligning with other forward signatures.
lmdeploy/pytorch/models/qwen3_vl.py Updates multimodal plumbing and adds Qwen3VLInputProcessor for image/video.
lmdeploy/pytorch/models/qwen3_next.py Minor cleanup (removes TODO comment).
lmdeploy/pytorch/models/qwen3_moe.py Type-annotation updates / minor cleanup.
lmdeploy/pytorch/models/qwen3_5_moe.py Switches to new Qwen3VL input processor for multimodal input.
lmdeploy/pytorch/models/qwen3_5.py Switches to new Qwen3VL input processor and adds modality-based token selection.
lmdeploy/pytorch/models/qwen2_vl.py Replaces MultiModalTensor with MultiModalData.
lmdeploy/pytorch/models/qwen2_5_vl.py Replaces MultiModalTensor with MultiModalData.
lmdeploy/pytorch/models/phi3_v.py Replaces MultiModalTensor with MultiModalData.
lmdeploy/pytorch/models/llava.py Replaces MultiModalTensor with MultiModalData.
lmdeploy/pytorch/models/llama4.py Replaces MultiModalTensor with MultiModalData.
lmdeploy/pytorch/models/internvl3_hf.py Replaces MultiModalTensor with MultiModalData.
lmdeploy/pytorch/models/internvl.py Replaces MultiModalTensor with MultiModalData.
lmdeploy/pytorch/models/interns1_pro.py Updates InternS1Pro multimodal handling to new mm_data container.
lmdeploy/pytorch/models/glm4_1v.py Replaces MultiModalTensor with MultiModalData.
lmdeploy/pytorch/models/gemma3_vl.py Replaces MultiModalTensor with MultiModalData.
lmdeploy/pytorch/models/deepseek_vl2.py Replaces MultiModalTensor with MultiModalData.
lmdeploy/pytorch/models/cogvlm.py Replaces MultiModalTensor with MultiModalData.
lmdeploy/pytorch/models/chatglm2.py Replaces MultiModalTensor with MultiModalData.
lmdeploy/pytorch/model_inputs.py Updates step context types to use MultiModalData.
docs/zh_cn/multi_modal/qwen2_5_vl.md Updates docs import path for encode_image_base64.
docs/zh_cn/multi_modal/minicpmv.md Updates docs import path for encode_image_base64.
docs/zh_cn/multi_modal/internvl.md Updates docs import path for encode_image_base64.
docs/en/multi_modal/qwen2_5_vl.md Updates docs import path for encode_image_base64.
docs/en/multi_modal/minicpmv.md Updates docs import path for encode_image_base64.
docs/en/multi_modal/internvl.md Updates docs import path for encode_image_base64.
autotest/tools/pipeline/mllm_case.py Updates imports to use re-exported encode_image_base64.
Comments suppressed due to low confidence (3)

lmdeploy/vl/model/interns1_pro.py:296

  • to_pytorch_aux_ts() validates len(segs) == len(preps)+1, but preps contains all modalities from preprocess(). If the request includes time series plus any other modality (e.g., image), this will assert/crash. Filter preps to only time-series entries (or validate that only time-series inputs are allowed for this model).

lmdeploy/vl/media/video_loader.py:282

  • This load_bytes() also creates a NamedTemporaryFile(delete=False, ...) and never removes it. If torchvision is the chosen backend, repeated video loads can leak temp files on disk. Use a try/finally cleanup or a context manager that deletes the file after decoding.

lmdeploy/vl/media/video_loader.py:282

  • TorchVisionVideoLoader.load_bytes() calls self.load_file(...) but does not return its (video, metadata) result, so callers will get None and crash. Return the value from load_file() here.
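
On the temp-file issues above: a try/finally pattern that always returns the decode result and always deletes the file would address both comments at once. A sketch, with decode_with_backend standing in for the actual torchvision decode call:

```python
import os
import tempfile


def load_bytes_via_tempfile(video_bytes: bytes, decode_with_backend):
    """Write bytes to a temp file for path-based decoders, then always clean up."""
    tmp = tempfile.NamedTemporaryFile(suffix='.mp4', delete=False)
    try:
        tmp.write(video_bytes)
        tmp.close()  # close before decoding so the backend can reopen the path
        return decode_with_backend(tmp.name)  # propagate the (video, metadata) result
    finally:
        os.unlink(tmp.name)  # removed even if decoding raises
```
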


Comment on lines +244 to +252
# collect all preprocessing result from messages
preps = [x['content'] for x in messages if x['role'] == 'preprocess']
assert len(preps) == 1
preps = preps[0]

# split prompt into segments and validate data
segs = prompt.split(self.vision_start_token + self.video_token + self.vision_end_token)
assert len(segs) == len(preps) + 1, (f'the number of {self.video_token} is not equal '
                                     f'to input videos, {len(segs) - 1} vs {len(preps)}')

Copilot AI Mar 4, 2026


to_pytorch_aux_video() uses the full preps list (which includes images/time_series too) when validating the number of video placeholders and computing offsets, so mixed-modality inputs will assert/crash. Filter preps to only video entries (and likewise ensure the prompt splitting matches only the video placeholders).

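The filtering fix suggested in this review comment could look like the following sketch, assuming each preprocess entry carries a modality tag (the enum here is an illustrative mirror of the PR's Modality in lmdeploy/vl/constants.py, not a copy of it):

```python
from enum import Enum


class Modality(str, Enum):
    # illustrative mirror of the PR's Modality enum; member names are assumed
    IMAGE = 'image'
    VIDEO = 'video'
    TIME_SERIES = 'time_series'


def split_video_segments(prompt: str, preps: list, video_placeholder: str):
    """Validate placeholder count against *video* entries only, not all modalities."""
    video_preps = [p for p in preps if p.get('modality') == Modality.VIDEO]
    segs = prompt.split(video_placeholder)
    assert len(segs) == len(video_preps) + 1, (
        f'the number of video placeholders is not equal to input videos, '
        f'{len(segs) - 1} vs {len(video_preps)}')
    return video_preps, segs
```

With this shape, a mixed image+video request no longer trips the assertion, because image entries are excluded before counting.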
Comment on lines +1058 to +1069
mm_inputs = [input_mm.get('mm_data', []) for input_mm in context.input_multimodals]
# flatten batch
mm_inputs = [item for sublist in mm_inputs for item in sublist]

if len(mm_inputs) > 0:
    modality = mm_inputs[0].modality
    pixel_values = torch.cat([inp.data for inp in mm_inputs])

    image_token_id = mm_inputs[0].meta.get('image_token_id')
    video_token_id = mm_inputs[0].meta.get('video_token_id')
    mm_token_id = image_token_id if modality == Modality.IMAGE else video_token_id
    image_mask = (input_ids == mm_token_id)

Copilot AI Mar 4, 2026


This assumes all multimodal inputs are the same modality (modality = mm_inputs[0].modality) and derives a single placeholder token id from that. Mixed image+video inputs would be mishandled (masking and grid concatenation won’t align). Add validation that all modalities match or extend this to handle multiple modalities explicitly.

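One way to add the validation this comment asks for is to check modality homogeneity before deriving a single token id. A sketch against objects shaped like the snippet's mm_inputs (exposing .modality and .meta); this is illustrative, not the PR's actual code:

```python
def resolve_mm_token_id(mm_inputs):
    """Require a single modality across the batch, then pick its placeholder id."""
    modalities = {inp.modality for inp in mm_inputs}
    if len(modalities) != 1:
        # fail loudly instead of silently masking with the wrong token id
        raise ValueError(f'mixed modalities are not supported here: {sorted(modalities)}')
    modality = modalities.pop()
    key = 'image_token_id' if modality == 'image' else 'video_token_id'
    return mm_inputs[0].meta[key]
```

Handling genuinely mixed batches would instead require grouping mm_inputs by modality and masking each group with its own token id.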
Comment on lines +174 to +182
# collect all preprocessing result from messages
preps = [x['content'] for x in messages if x['role'] == 'preprocess']
assert len(preps) == 1
preps = preps[0]

# split prompt into segments and validate data
segs = prompt.split(self.vision_start_token + self.video_token + self.vision_end_token)
assert len(segs) == len(preps) + 1, (f'the number of {self.video_token} is not equal '
                                     f'to input videos, {len(segs) - 1} vs {len(preps)}')

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to_pytorch_aux_video() assumes every preprocessing entry corresponds to a video (len(segs) == len(preps)+1), but preps currently contains all modalities produced by preprocess() (including images). This will assert/crash for mixed image+video inputs. Filter preps to only video items (e.g., by modality) or validate that only videos are present before calling this path.

@CUHKSZzxy requested review from grimoire and lvhan028 March 5, 2026 04:13
    meta=dict(ts_token_id=ts_token_id, ts_lens=ts_lens, ts_sr=ts_sr))
else:
    modality = input_mm.get('modality')
    if modality == Modality.IMAGE:
Collaborator


Extracting the code in each branch into a function would make the code more readable, for example:

if modality == IMAGE:
    mm_data = self.make_image_mm(...)
elif modality == VIDEO:
    mm_data = self.make_video_mm(...)
...
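
Spelled out a little further, the suggestion amounts to a small dispatch table. The make_* builders below are hypothetical helpers, not existing lmdeploy functions:

```python
def build_mm_data(modality, item, builders):
    """Dispatch per-modality construction through a table instead of an if/elif chain."""
    builder = builders.get(modality)
    if builder is None:
        raise ValueError(f'unsupported modality: {modality}')
    return builder(item)


# hypothetical per-modality constructors, stand-ins for make_image_mm / make_video_mm / ...
builders = {
    'image': lambda item: ('image_mm', item),
    'video': lambda item: ('video_mm', item),
    'time_series': lambda item: ('ts_mm', item),
}
```

A table also makes the set of supported modalities explicit and keeps the error path in one place.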
