We're encountering a runtime error in AttentionPairBiasMask.backward() during training:
```
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
```
The error occurs at line 358 in attention_pair_bias_torch.py:
```python
grad_out_mask = grad_out_mask.view(B, ctx.num_heads, -1)
```
All input tensors passed to cuet.attention_pair_bias() in the forward pass are verified to be contiguous, and the forward pass always completes successfully. The error only appears during backpropagation, when the incoming gradient tensors are non-contiguous.
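For illustration, here is a minimal, self-contained reproduction of the mechanism (hypothetical code, not the cuEquivariance implementation): a transpose between the op's output and the loss is enough to make the gradient reaching a custom backward() non-contiguous, at which point .view() fails while .reshape() would not.

```python
import torch

class ViewInBackward(torch.autograd.Function):
    """Toy stand-in for an op whose backward reshapes the gradient with .view()."""

    @staticmethod
    def forward(ctx, x):
        ctx.num_heads = 4
        ctx.in_shape = x.shape
        return 2.0 * x

    @staticmethod
    def backward(ctx, grad_out):
        B = grad_out.shape[0]
        # Raises the same RuntimeError when grad_out is non-contiguous;
        # grad_out.reshape(B, ctx.num_heads, -1) would copy only if needed.
        grad_heads = grad_out.view(B, ctx.num_heads, -1)
        return (2.0 * grad_heads).view(ctx.in_shape)

x = torch.randn(2, 8, 16, requires_grad=True)  # contiguous forward input
y = ViewInBackward.apply(x)
# The transpose downstream of the op makes the gradient flowing back into
# backward() a non-contiguous view, even though x itself is contiguous.
y.transpose(1, 2).sum().backward()  # RuntimeError at the .view() above
```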
Environment
- Hardware: H100 GPUs
- cuEquivariance version: 0.6.1
- PyTorch version: 2.7
- CUDA version: 12.x
We've implemented a monkey patch that makes the incoming gradient tensors contiguous before calling the original backward function (a sketch is included after the questions below), which resolves the issue. However:
- Should AttentionPairBiasMask.backward() handle non-contiguous gradient tensors? (A simple fix would be changing view() to reshape().)
- Or is it expected that all gradient inputs will be contiguous? If so, what's the recommended approach for ensuring this?
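For reference, the workaround is roughly the following (an illustrative sketch, not the exact patch; the import path of AttentionPairBiasMask is omitted). It simply forces every incoming gradient to be contiguous before delegating to the library's own backward.

```python
import torch
# from ... import AttentionPairBiasMask  # import path omitted; the autograd
# Function defined in attention_pair_bias_torch.py

_original_backward = AttentionPairBiasMask.backward

def _contiguous_backward(ctx, *grad_outputs):
    # Make every gradient tensor contiguous so the .view() at line 358 succeeds.
    grad_outputs = tuple(
        g.contiguous() if isinstance(g, torch.Tensor) else g
        for g in grad_outputs
    )
    return _original_backward(ctx, *grad_outputs)

AttentionPairBiasMask.backward = staticmethod(_contiguous_backward)
```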