
Video-LLaVA model generation error due to causal mask shape mismatch #34696

Closed
@jiqing-feng

Description

System Info

The regression was introduced after transformers==4.45.2; that release still generates correctly (see Expected behavior below).
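As a quick sanity check before running the reproduction below, the installed version can be printed (a minimal sketch, not part of the original report):

```python
import transformers

# Per this report: 4.45.2 generates correctly, while 4.47.0.dev0 hits the
# causal-mask shape mismatch on the mixed image+video batch.
print(transformers.__version__)
```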

- `transformers` version: 4.47.0.dev0
- Platform: Linux-6.6.0-gnr.bkc.6.6.9.3.15.x86_64-x86_64-with-glibc2.34
- Python version: 3.10.15
- Huggingface_hub version: 0.26.1
- Safetensors version: 0.4.5
- Accelerate version: 1.1.0
- Accelerate config:    not found
- PyTorch version (GPU?): 2.6.0.dev20241014+cpu (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: <fill in>

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The code is taken from the LanguageBind/Video-LLaVA-7B-hf model card; it is also the official example in modeling_video_llava.

```python
from PIL import Image
import requests
import numpy as np
import av
from huggingface_hub import hf_hub_download
from transformers import VideoLlavaProcessor, VideoLlavaForConditionalGeneration

def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.

    Args:
        container (av.container.input.InputContainer): PyAV container.
        indices (List[int]): List of frame indices to decode.

    Returns:
        np.ndarray: np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

model = VideoLlavaForConditionalGeneration.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")
processor = VideoLlavaProcessor.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")

prompt = "USER: <video>Why is this video funny? ASSISTANT:"
video_path = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
container = av.open(video_path)

# sample uniformly 8 frames from the video
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip = read_video_pyav(container, indices)

inputs = processor(text=prompt, videos=clip, return_tensors="pt")

# Generate
generate_ids = model.generate(**inputs, max_length=80)
print(processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])

# Generate from images and videos mix
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = [
    "USER: <image> How many cats are there in the image? ASSISTANT:",
    "USER: <video>Why is this video funny? ASSISTANT:"
]
inputs = processor(text=prompt, images=image, videos=clip, padding=True, return_tensors="pt")

# Generate
generate_ids = model.generate(**inputs, max_length=50)
print(processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True))
```

Traceback:

Expanding inputs for image tokens in Video-LLaVa should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.44.
Expanding inputs for image tokens in Video-LLaVa should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.
USER: Why is this video funny? ASSISTANT: The and? and??????????? [? [ and, [ [ [ [ [ [ [ [ [ [, [, [ and, [ and, and, and, and, and, and, and, and, and, and, and, and, [
Traceback (most recent call last):
  File "/home/jiqing/test_llava.py", line 58, in <module>
    generate_ids = model.generate(**inputs, max_length=50)
  File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/jiqing/transformers/src/transformers/generation/utils.py", line 2231, in generate
    result = self._sample(
  File "/home/jiqing/transformers/src/transformers/generation/utils.py", line 3222, in _sample
    outputs = self(**model_inputs, return_dict=True)
  File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jiqing/transformers/src/transformers/models/video_llava/modeling_video_llava.py", line 663, in forward
    outputs = self.language_model(
  File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jiqing/transformers/src/transformers/models/llama/modeling_llama.py", line 1204, in forward
    outputs = self.model(
  File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jiqing/transformers/src/transformers/models/llama/modeling_llama.py", line 955, in forward
    layer_outputs = decoder_layer(
  File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jiqing/transformers/src/transformers/models/llama/modeling_llama.py", line 685, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/jiqing/miniforge3/envs/torch_2_6/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jiqing/transformers/src/transformers/models/llama/modeling_llama.py", line 611, in forward
    attn_output = torch.nn.functional.scaled_dot_product_attention(
RuntimeError: The size of tensor a (2332) must match the size of tensor b (22) at non-singleton dimension 3

The causal mask shape is [2, 1, 1, 22].
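For context, the crash can be illustrated directly with torch.nn.functional.scaled_dot_product_attention: the query/key length is the multimodal-expanded sequence (2332 tokens), while the causal mask only covers the original text tokens (22). The sketch below reproduces just this shape relationship; the head count and head dimension are scaled down and are not the model's real values.

```python
import torch
import torch.nn.functional as F

# Shapes taken from the report: batch 2, expanded sequence length 2332,
# text-only mask length 22. Heads and head_dim are shrunk for the sketch;
# only the last mask dimension vs. the key length matters here.
batch, heads, seq_len, head_dim = 2, 2, 2332, 8
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

good_mask = torch.zeros(batch, 1, 1, seq_len)  # covers the expanded sequence
bad_mask = torch.zeros(batch, 1, 1, 22)        # covers only the text tokens

F.scaled_dot_product_attention(q, k, v, attn_mask=good_mask)  # runs fine
F.scaled_dot_product_attention(q, k, v, attn_mask=bad_mask)
# RuntimeError: mismatched sizes at non-singleton dimension 3, as in the traceback above
```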

Expected behavior

With transformers==4.45.2, the same script generates correct text:

Expanding inputs for image tokens in Video-LLaVa should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.44.
Expanding inputs for image tokens in Video-LLaVa should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
USER: Why is this video funny? ASSISTANT: The video is funny because it shows a baby sitting on a bed and playing with a Wii remote, which is an unusual sight.Ъ because babies are not typically known for playing video games. The baby's actions with the remote control create a humorous and unexpected scene, making it entertain
['USER:  How many cats are there in the image? ASSISTANT: There are two cats in the image. (or three, depending on the interpretation of the image).', 'USER: Why is this video funny? ASSISTANT: The video is funny because it shows a baby sitting on a bed and playing with a Wii remote... The baby is holding the']

The causal mask shape is [2, 1, 1, 2332].
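Separately from the crash, both runs print deprecation warnings asking for `patch_size` and `vision_feature_select_strategy` on the processor. A hedged sketch of how they could be set, following the warning's own suggestion; the concrete values are assumptions (a ViT-L/14-style patch size and the usual LLaVA selection strategy), not taken from this report:

```python
from transformers import VideoLlavaProcessor

processor = VideoLlavaProcessor.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")

# Assumed values for illustration only: verify them against the checkpoint's
# vision config before relying on this to silence the warning.
processor.patch_size = 14
processor.vision_feature_select_strategy = "default"
```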
