Describe the bug
ZeRO-3 Qwen2-VL training hangs when using a mixed multimodal dataset.
When different GPUs receive mini-batches of different modalities, the multimodal-related variables have different shapes across GPUs.
For example, the video-related tensor video_grid_thw has values on GPU0 but is None on GPU1.
The training hangs when dealing with this variable.
The hang DOES NOT occur when using ZeRO-2.
Is it caused by communication of this variable between GPUs in ZeRO-3?
What is the right way to train on mixed-modality data with ZeRO-3?
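To make the mismatch concrete, here is a minimal sketch of the data condition described above. It is not taken from the actual training code and does not touch DeepSpeed: `find_modality_mismatch` and the toy batch contents are hypothetical, and only the name video_grid_thw comes from the real run. It simply reports which multimodal inputs are present on some ranks and None on others.

```python
from typing import Any, Dict, List, Optional


def find_modality_mismatch(
    rank_batches: List[Dict[str, Optional[Any]]],
    keys: List[str],
) -> Dict[str, List[int]]:
    """For each key, list the ranks where it is None or missing,
    but only when the ranks disagree (present on some, absent on others)."""
    mismatched: Dict[str, List[int]] = {}
    for key in keys:
        missing = [
            rank
            for rank, batch in enumerate(rank_batches)
            if batch.get(key) is None
        ]
        # 0 < len(missing) < world size means the ranks disagree on this key.
        if 0 < len(missing) < len(rank_batches):
            mismatched[key] = missing
    return mismatched


# Toy per-rank batches (illustrative values only).
rank_batches = [
    {"input_ids": [[1, 2, 3]], "video_grid_thw": [[1, 4, 4]]},  # GPU0
    {"input_ids": [[4, 5, 6]], "video_grid_thw": None},         # GPU1
]

mismatch = find_modality_mismatch(rank_batches, ["video_grid_thw"])
print(mismatch)  # {'video_grid_thw': [1]}
```

In the real run, a check like this would need the per-rank batches to be collected in one place (or compared through a collective), which this sketch does not attempt.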
dataset: mixture of pure-text and image-text
model: qwen2-vl
training on: 8xA100
stage3 config: