Question on 2D pixel shuffle in InternVL-2.5 #962

franciszchen · 2025-03-21T18:37:48Z

Thanks for sharing this great project. I have a question about the 2D pixel shuffle. vit_embeds has shape [N, L, C]. It is first reshaped to [N, h, w, C], then pixel-shuffled along both spatial dimensions (via reshape, permute, and contiguous) to [N, h/2, w/2, C*4], and finally reshaped to [N, L/4, C*4] for further use. Why not apply the pixel shuffle directly to the [N, L, C] tensor to get [N, L/4, C*4]? If we modify the inference code to use this direct version, would the change significantly affect performance?

InternVL/internvl_chat/internvl/model/internvl_chat/modeling_internvl_chat.py

Line 287 in 34a8100

vit_embeds = self.pixel_shuffle(vit_embeds, scale_factor=self.downsample_ratio)
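To make the two options concrete, here is a minimal pure-Python sketch of which token indices each one merges into a single output token. It assumes the L tokens are stored in row-major order over an h x w grid and a downsample factor of 2; the helper names (`blocks_2d`, `blocks_flat`, `to_coords`) are made up for illustration and are not part of the InternVL code. It only tracks token grouping, not the exact channel ordering that the real pixel_shuffle produces.

```python
def blocks_2d(h, w, s=2):
    """Token indices merged by a 2D pixel shuffle: one s x s spatial block per output token."""
    groups = []
    for r in range(0, h, s):
        for c in range(0, w, s):
            # Collect the s*s neighbours of the block whose top-left corner is (r, c).
            groups.append([(r + dr) * w + (c + dc) for dr in range(s) for dc in range(s)])
    return groups


def blocks_flat(h, w, s=2):
    """Token indices merged by a plain reshape [N, L, C] -> [N, L/(s*s), C*s*s]."""
    k = s * s
    # A flat reshape simply concatenates k consecutive tokens of the sequence.
    return [list(range(i, i + k)) for i in range(0, h * w, k)]


def to_coords(group, w):
    """Convert flat token indices back to (row, col) positions on the grid."""
    return [divmod(i, w) for i in group]


h = w = 4
print(to_coords(blocks_2d(h, w)[0], w))    # a 2 x 2 square of neighbouring patches
print(to_coords(blocks_flat(h, w)[0], w))  # a 1 x 4 horizontal strip of patches
```

Under these assumptions both variants produce L/4 output tokens, but the 2D version merges square neighbourhoods while the flat reshape merges horizontal strips. If the downstream projector was trained on the 2D layout, swapping in the flat version at inference time would feed it differently arranged features, so some impact on performance seems likely, though this sketch does not measure it.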
