Transformer input batch_first confusion #1038
@exoticism4869 it's the perspective that's important and the comment is correct... 'batch first' in this case is from the perspective of the attention implementation: whether it expects inputs that are batch first or sequence first. Originally nn.MHA was sequence first, and the original CLIP models were too. But eventually the default in many other cases became batch first, with things like F.sdpa, etc., and nn.MHA added the argument because people wanted to be able to use it batch first as well.
The outer model module is always batch first, so the inputs to the Transformer are NLD. So the logic in your code snippet, if `batch_first=False` was passed to the attention impl in the resblocks the…
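A minimal sketch of that perspective (an illustration, not the actual open_clip resblock code): the outer module always hands the block batch-first `(N, L, D)` tensors, and the block only transposes to sequence-first `(L, N, D)` when the attention implementation it wraps is not batch first.

```python
import torch
import torch.nn as nn


class ToyBlock(nn.Module):
    """Hypothetical residual-block stand-in showing the batch_first perspective."""

    def __init__(self, dim: int = 64, heads: int = 4, batch_first: bool = False):
        super().__init__()
        self.batch_first = batch_first
        # nn.MultiheadAttention defaults to sequence-first (L, N, D) inputs
        # unless batch_first=True is passed.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=batch_first)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x arrives batch first: (N, L, D), matching the outer model's convention.
        if not self.batch_first:
            x = x.transpose(0, 1)      # (N, L, D) -> (L, N, D) for the attn impl
        out, _ = self.attn(x, x, x, need_weights=False)
        if not self.batch_first:
            out = out.transpose(0, 1)  # back to (N, L, D) for the caller
        return out


x = torch.randn(2, 16, 64)             # batch-first input: N=2, L=16, D=64
assert ToyBlock(batch_first=False)(x).shape == (2, 16, 64)
assert ToyBlock(batch_first=True)(x).shape == (2, 16, 64)
```

Either way the caller sees batch-first tensors; the `batch_first` flag only describes what the inner attention op expects.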