Transformer input batch_first confusion #1038

Answered by rwightman
exoticism4869 asked this question in Q&A

@exoticism4869 it's the perspective that matters, and the comment is correct... 'batch first' here is from the perspective of the attention implementation: whether it expects inputs that are batch first or sequence first. Originally nn.MHA was sequence first, and so were the original CLIP models. But eventually the default in many other cases became batch first, with things like F.sdpa, etc., and nn.MHA added the argument because people wanted to be able to use it batch first as well.

The outer model module is always batch first, so the inputs to the Transformer are NLD. So for the logic in your code snippet, if batch_first=False was passed to the attention impl in the resblocks, the…
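
To illustrate the idea (a minimal sketch, not the open_clip source: `TinyBlock` and its shapes are assumptions for illustration), a batch-first outer module only needs to transpose NLD to LND when the inner attention implementation is sequence first; with a batch-first attention it can pass the input straight through:

```python
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """Hypothetical block: outer model always feeds NLD (batch, seq, dim)."""

    def __init__(self, dim: int, heads: int, batch_first: bool):
        super().__init__()
        self.batch_first = batch_first
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=batch_first)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x arrives as NLD from the outer (always batch-first) module.
        if not self.batch_first:
            x = x.transpose(0, 1)   # NLD -> LND for a sequence-first attn impl
        x, _ = self.attn(x, x, x, need_weights=False)
        if not self.batch_first:
            x = x.transpose(0, 1)   # LND -> NLD back to batch first
        return x

x = torch.randn(2, 16, 64)          # (batch, seq, dim)
out_seq_first = TinyBlock(64, 4, batch_first=False)(x)
out_batch_first = TinyBlock(64, 4, batch_first=True)(x)
assert out_seq_first.shape == out_batch_first.shape == x.shape
```

Either way the block's external contract stays NLD; batch_first only controls whether the transposes happen around the attention call.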

Answer selected by exoticism4869
This discussion was converted from issue #1037 on February 24, 2025 05:10.