Transformer input batch_first confusion #1038
@exoticism4869 it's the perspective that's important and the comment is correct... 'batch first' in this case is from the perspective of the attention implementation: whether it expects inputs that are batch first or sequence first. Originally nn.MHA was sequence first, and the original CLIP models were too. But eventually the default in many other cases became batch first, with things like F.sdpa, etc., and nn.MHA added the argument because people wanted to be able to use it batch first as well.
The outer model module is always batch first, so the inputs to the Transformer are NLD. So the logic in your code snippet, if `batch_first=False` was passed to the attention impl in the resblocks the…
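A minimal sketch of that perspective (an illustration, not the actual open_clip resblock code): the outer module always hands the block batch-first `(N, L, D)` tensors, and the block only transposes to sequence-first `(L, N, D)` when the attention implementation it wraps is not batch first.

```python
import torch
import torch.nn as nn


class ToyBlock(nn.Module):
    """Hypothetical residual-block stand-in showing the batch_first perspective."""

    def __init__(self, dim: int = 64, heads: int = 4, batch_first: bool = False):
        super().__init__()
        self.batch_first = batch_first
        # nn.MultiheadAttention defaults to sequence-first (L, N, D) inputs
        # unless batch_first=True is passed.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=batch_first)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x arrives batch first: (N, L, D), matching the outer model's convention.
        if not self.batch_first:
            x = x.transpose(0, 1)      # (N, L, D) -> (L, N, D) for the attn impl
        out, _ = self.attn(x, x, x, need_weights=False)
        if not self.batch_first:
            out = out.transpose(0, 1)  # back to (N, L, D) for the caller
        return out


x = torch.randn(2, 16, 64)             # batch-first input: N=2, L=16, D=64
assert ToyBlock(batch_first=False)(x).shape == (2, 16, 64)
assert ToyBlock(batch_first=True)(x).shape == (2, 16, 64)
```

Either way the caller sees batch-first tensors; the `batch_first` flag only describes what the inner attention op expects.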