from PIL import Image
import requests
import numpy as np
import av
from huggingface_hub import hf_hub_download
from transformers import VideoLlavaProcessor, VideoLlavaForConditionalGeneration

def read_video_pyav(container, indices):
    '''
    Decode the video with the PyAV decoder.
    Args:
        container (av.container.input.InputContainer): PyAV container.
        indices (List[int]): List of frame indices to decode.
    Returns:
        np.ndarray: np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])
model = VideoLlavaForConditionalGeneration.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")
processor = VideoLlavaProcessor.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")
prompt = "USER: <video>Why is this video funny? ASSISTANT:"
video_path = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
container = av.open(video_path)
# sample uniformly 8 frames from the video
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip = read_video_pyav(container, indices)
inputs = processor(text=prompt, videos=clip, return_tensors="pt")
# Generate
generate_ids = model.generate(**inputs, max_length=80)
print(processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])
# Generate from images and videos mix
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = [
    "USER: <image> How many cats are there in the image? ASSISTANT:",
    "USER: <video>Why is this video funny? ASSISTANT:"
]
inputs = processor(text=prompt, images=image, videos=clip, padding=True, return_tensors="pt")
# Generate
generate_ids = model.generate(**inputs, max_length=50)
print(processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True))
Traceback:
Expanding inputs for image tokens in Video-LLaVa should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.44.
Expanding inputs for image tokens in Video-LLaVa should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.
USER: Why is this video funny? ASSISTANT: The and? and??????????? [? [ and, [ [ [ [ [ [ [ [ [ [, [, [ and, [ and, and, and, and, and, and, and, and, and, and, and, and, [
Traceback (most recent call last):
File "/home/jiqing/test_llava.py", line 58, in <module>
generate_ids = model.generate(**inputs, max_length=50)
File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/jiqing/transformers/src/transformers/generation/utils.py", line 2231, in generate
result = self._sample(
File "/home/jiqing/transformers/src/transformers/generation/utils.py", line 3222, in _sample
outputs = self(**model_inputs, return_dict=True)
File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/jiqing/transformers/src/transformers/models/video_llava/modeling_video_llava.py", line 663, in forward
outputs = self.language_model(
File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/jiqing/transformers/src/transformers/models/llama/modeling_llama.py", line 1204, in forward
outputs = self.model(
File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/jiqing/transformers/src/transformers/models/llama/modeling_llama.py", line 955, in forward
layer_outputs = decoder_layer(
File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/jiqing/transformers/src/transformers/models/llama/modeling_llama.py", line 685, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/jiqing/transformers/src/transformers/models/llama/modeling_llama.py", line 611, in forward
attn_output = torch.nn.functional.scaled_dot_product_attention(
RuntimeError: The size of tensor a (2332) must match the size of tensor b (22) at non-singleton dimension 3
The causal mask shape: [2, 1, 1, 22]
Expected behavior
With transformers==4.45.2, the model outputs the correct generated text:
Expanding inputs for image tokens in Video-LLaVa should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.44.
Expanding inputs for image tokens in Video-LLaVa should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
USER: Why is this video funny? ASSISTANT: The video is funny because it shows a baby sitting on a bed and playing with a Wii remote, which is an unusual sight.Ъ because babies are not typically known for playing video games. The baby's actions with the remote control create a humorous and unexpected scene, making it entertain
['USER: How many cats are there in the image? ASSISTANT: There are two cats in the image. (or three, depending on the interpretation of the image).', 'USER: Why is this video funny? ASSISTANT: The video is funny because it shows a baby sitting on a bed and playing with a Wii remote... The baby is holding the']
The causal mask shape [2, 1, 1, 2332]
jiqing-feng changed the title from "Llava model's generation error due to causal mask shape mismatch" to "Video-Llava model's generation error due to causal mask shape mismatch" on Nov 12, 2024.
Thanks for providing those details and the traceback.
The core issue seems to be that tensor sizes do not match inside the attention mechanism. Here are some steps that may help narrow it down:
1. Update the model's processing config: make sure patch_size and vision_feature_select_strategy are set on the processor, as the deprecation warning suggests (a minimal sketch follows this list).
2. Debug the shapes: print the shapes of the processor outputs and the attention mask before passing them to the model and check that they are consistent:
print({k: v.shape for k, v in inputs.items()})
3. Compare configurations: make sure the configuration used with the newer transformers release matches the one used with 4.45.2; even small changes in default settings can lead to issues like this.
4. Review model changes: diff modeling_llama.py and modeling_video_llava.py between the two versions to spot differences in the forward calls or in how the inputs are handled.
5. Community help: if none of the above works, raise the issue on the official GitHub repository or the Hugging Face forums; the traceback and the detailed description you provided here will be very helpful to anyone trying to assist.
I hope these steps help you narrow down and resolve the issue. If you need more specific advice or further assistance, don't hesitate to ask.
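For step 1, here is a minimal sketch of what setting those attributes could look like, reusing the processor, model, prompt, and clip from the repro script above. The concrete values (patch size 14, the "default" strategy) are assumptions based on a typical CLIP-style ViT-L/14 vision tower; confirm them against the checkpoint's config (e.g. model.config.vision_config.patch_size) before relying on them:

# Assumed values -- verify against the checkpoint's vision config before use.
processor.patch_size = 14
processor.vision_feature_select_strategy = "default"
# With these set, the processor itself expands the <video>/<image> placeholder tokens,
# so the text sequence length and the vision token count should line up at generation time.
inputs = processor(text=prompt, videos=clip, return_tensors="pt")
generate_ids = model.generate(**inputs, max_length=80)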
System Info
The regression happens after transformers==4.45.2.
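For reference, a trivial check to confirm which transformers version is active in the failing environment:

import transformers
print(transformers.__version__)  # the failure reproduces on versions newer than 4.45.2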
Who can help?
@ArthurZucker
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
The code above is from LanguageBind/Video-LLaVA-7B-hf; it is also the official example code shown in modeling_video_llava.