Video-Llava model's generation error due to causal mask shape mismatch #34696

jiqing-feng opened this issue Nov 12, 2024 · 1 comment

jiqing-feng commented Nov 12, 2024

System Info

The regression appears in versions after transformers==4.45.2; with 4.45.2 itself the script works correctly.

- `transformers` version: 4.47.0.dev0
- Platform: Linux-6.6.0-gnr.bkc.6.6.9.3.15.x86_64-x86_64-with-glibc2.34
- Python version: 3.10.15
- Huggingface_hub version: 0.26.1
- Safetensors version: 0.4.5
- Accelerate version: 1.1.0
- Accelerate config:    not found
- PyTorch version (GPU?): 2.6.0.dev20241014+cpu (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: <fill in>

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The reproduction code is taken from the LanguageBind/Video-LLaVA-7B-hf model card; it is also the official example in modeling_video_llava.

```python

from PIL import Image
import requests
import numpy as np
import av
from huggingface_hub import hf_hub_download
from transformers import VideoLlavaProcessor, VideoLlavaForConditionalGeneration

def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.

    Args:
        container (av.container.input.InputContainer): PyAV container.
        indices (List[int]): List of frame indices to decode.

    Returns:
        np.ndarray: np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

model = VideoLlavaForConditionalGeneration.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")
processor = VideoLlavaProcessor.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")

prompt = "USER: <video>Why is this video funny? ASSISTANT:"
video_path = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
container = av.open(video_path)

# sample uniformly 8 frames from the video
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip = read_video_pyav(container, indices)

inputs = processor(text=prompt, videos=clip, return_tensors="pt")

# Generate
generate_ids = model.generate(**inputs, max_length=80)
print(processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])

# Generate from images and videos mix
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = [
    "USER: <image> How many cats are there in the image? ASSISTANT:",
    "USER: <video>Why is this video funny? ASSISTANT:"
]
inputs = processor(text=prompt, images=image, videos=clip, padding=True, return_tensors="pt")

# Generate
generate_ids = model.generate(**inputs, max_length=50)
print(processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True))
```

Traceback:

```
Expanding inputs for image tokens in Video-LLaVa should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.44.
Expanding inputs for image tokens in Video-LLaVa should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.
USER: Why is this video funny? ASSISTANT: The and? and??????????? [? [ and, [ [ [ [ [ [ [ [ [ [, [, [ and, [ and, and, and, and, and, and, and, and, and, and, and, and, [
Traceback (most recent call last):
  File "/home/jiqing/test_llava.py", line 58, in <module>
    generate_ids = model.generate(**inputs, max_length=50)
  File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/jiqing/transformers/src/transformers/generation/utils.py", line 2231, in generate
    result = self._sample(
  File "/home/jiqing/transformers/src/transformers/generation/utils.py", line 3222, in _sample
    outputs = self(**model_inputs, return_dict=True)
  File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jiqing/transformers/src/transformers/models/video_llava/modeling_video_llava.py", line 663, in forward
    outputs = self.language_model(
  File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jiqing/transformers/src/transformers/models/llama/modeling_llama.py", line 1204, in forward
    outputs = self.model(
  File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jiqing/transformers/src/transformers/models/llama/modeling_llama.py", line 955, in forward
    layer_outputs = decoder_layer(
  File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jiqing/transformers/src/transformers/models/llama/modeling_llama.py", line 685, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jiqing/transformers/src/transformers/models/llama/modeling_llama.py", line 611, in forward
    attn_output = torch.nn.functional.scaled_dot_product_attention(
RuntimeError: The size of tensor a (2332) must match the size of tensor b (22) at non-singleton dimension 3
```

The causal mask shape is [2, 1, 1, 22], while the key/value sequence length after video-token expansion is 2332.
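
A minimal sketch of where the shapes diverge, assuming the same `processor` and `inputs` as in the reproduction above (the concrete sizes are taken from the traceback, not re-measured here):

```python
# Sketch only: inspect the shapes involved in the mismatch (same objects as the repro above).
# The processor emits only the raw text tokens, while the model later expands each
# <image>/<video> placeholder into per-frame patch features, growing the key/value
# length to roughly 2332. The causal mask, however, is still built from the short mask.
print(inputs["input_ids"].shape)       # e.g. torch.Size([2, 22])
print(inputs["attention_mask"].shape)  # e.g. torch.Size([2, 22])
# scaled_dot_product_attention then receives a [2, 1, 1, 22] attn_mask that cannot
# broadcast against attention scores whose last dimension is 2332, hence the RuntimeError.
```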

Expected behavior

With transformers==4.45.2, the script generates the expected text:

```
Expanding inputs for image tokens in Video-LLaVa should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.44.
Expanding inputs for image tokens in Video-LLaVa should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
USER: Why is this video funny? ASSISTANT: The video is funny because it shows a baby sitting on a bed and playing with a Wii remote, which is an unusual sight.Ъ because babies are not typically known for playing video games. The baby's actions with the remote control create a humorous and unexpected scene, making it entertain
['USER:  How many cats are there in the image? ASSISTANT: There are two cats in the image. (or three, depending on the interpretation of the image).', 'USER: Why is this video funny? ASSISTANT: The video is funny because it shows a baby sitting on a bed and playing with a Wii remote... The baby is holding the']
```

The causal mask shape is [2, 1, 1, 2332], matching the expanded sequence length.

jiqing-feng changed the title from "Llava model's generation error due to causal mask shape mismatch" to "Video-Llava model's generation error due to causal mask shape mismatch" on Nov 12, 2024
cw235 commented Nov 12, 2024

Thanks for providing those details and the traceback.

It seems like the core issue is a size mismatch between the causal attention mask and the expanded sequence inside the model's attention computation. Here are some steps to potentially resolve this:

  1. Update Model's Processing Config: Ensure that `patch_size` and `vision_feature_select_strategy` are set in the processor's config (see the sketch after this list for concrete values).

    processor.patch_size = <appropriate_patch_size>
    processor.vision_feature_select_strategy = <appropriate_strategy>
  2. Debugging Shapes: Check the shapes of the inputs and the attention mask before passing them to the model to make sure they match. The processor returns a dict-like `BatchFeature`, so print each tensor's shape:

    for name, tensor in inputs.items():
        print(name, tensor.shape)
  3. Compare Configurations: Ensure that your configurations in the newer version of transformers align with those used in version 4.45.2. Sometimes, even minor changes in default settings can lead to such issues.

  4. Review Model Changes: Look into the changes made in the modeling_llama.py and modeling_video_llava.py files between these versions to spot differences in the implementation of forward calls or handling of inputs.

  5. Community Help: If none of the above steps work, it might be useful to raise this issue on the official GitHub repository or the Hugging Face forums. Including the traceback and detailed description you’ve provided here will be very helpful for anyone trying to assist.
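
For steps 1 and 2 combined, here is a minimal sketch. The concrete values are assumptions rather than values read from the checkpoint config (Video-LLaVA's vision towers are ViT-L/14-style, so `patch_size = 14` and `vision_feature_select_strategy = "default"` are the commonly used settings):

```python
# Assumed values for LanguageBind/Video-LLaVA-7B-hf; verify them against the
# checkpoint's processor/model config before relying on them.
processor.patch_size = 14
processor.vision_feature_select_strategy = "default"

# With these attributes set, the processor itself expands the <image>/<video>
# placeholders, so input_ids and attention_mask should already have the full,
# expanded sequence length (and the deprecation warning should disappear).
inputs = processor(text=prompt, images=image, videos=clip, padding=True, return_tensors="pt")
for name, tensor in inputs.items():
    print(name, tuple(tensor.shape))
```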

I hope these steps help you narrow down and resolve the issue! If you need more specific advice or further assistance, don't hesitate to ask.
