Padding-free seems like an excellent idea, but how do you prevent information leakage? Since the batch is flattened, later items in the batch may attend to tokens from earlier items, which I don't think should happen in any case, even during pretraining.
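To make the concern concrete, here is a minimal sketch of the isolation I would expect between packed items. It assumes the usual approach for flattened batches: cumulative sequence boundaries, position ids that restart for each item, and a block-diagonal causal mask. The function names are hypothetical and for illustration only. This is not the library's actual implementation.

```python
from itertools import accumulate


def cumulative_boundaries(seq_lens):
    """Offsets [0, l1, l1+l2, ...] marking where each packed item starts and ends."""
    return [0] + list(accumulate(seq_lens))


def packed_position_ids(seq_lens):
    """Position ids that restart at 0 for every item in the flattened batch."""
    return [pos for n in seq_lens for pos in range(n)]


def block_causal_mask(seq_lens):
    """mask[q][k] is True only if token q may attend to token k.

    A query may attend to a key only when both belong to the same item
    and the key is not in the future (k <= q).
    """
    # Item index of every token in the flattened batch.
    item_id = [i for i, n in enumerate(seq_lens) for _ in range(n)]
    total = len(item_id)
    return [
        [item_id[q] == item_id[k] and k <= q for k in range(total)]
        for q in range(total)
    ]


# Two items of length 3 and 2, flattened into 5 tokens.
seq_lens = [3, 2]
boundaries = cumulative_boundaries(seq_lens)
position_ids = packed_position_ids(seq_lens)
mask = block_causal_mask(seq_lens)
```

With this mask, token 3 (the first token of the second item) cannot see tokens 0 to 2. If the attention path ignores the boundaries and applies a plain causal mask over the whole flattened batch, those entries would be visible, which is the leakage I am asking about.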
I have not thoroughly tested the feature, but I fine-tuned Qwen2.5VL and found that padding-free negatively affects performance, causing the model to hallucinate information that is not even in the image.