
AttentionPairBiasMask.backward() fails with non-contiguous gradient tensors despite contiguous forward inputs #177

@gcroci2

Description


We're encountering a runtime error in AttentionPairBiasMask.backward() during training:

RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

The error occurs at line 358 in attention_pair_bias_torch.py:

grad_out_mask = grad_out_mask.view(B, ctx.num_heads, -1)

We have verified that all input tensors passed to cuet.attention_pair_bias() in the forward pass are contiguous, and the forward pass completes successfully every time. The failure occurs only during backpropagation, when the incoming gradient tensors can be non-contiguous.
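
For reference, here is a minimal standalone illustration of the failure mode (the shapes and the transpose are invented for the example and are not the ones produced by our model):

import torch

B, H, S, D = 2, 4, 8, 16

# A gradient whose trailing dims are not laid out contiguously,
# e.g. because an upstream op handed back a transposed view.
grad = torch.randn(B, H, D, S).transpose(-1, -2)  # shape (B, H, S, D), non-contiguous

grad.reshape(B, H, -1)  # works: falls back to a copy when the layout requires it
grad.view(B, H, -1)     # raises the RuntimeError quoted above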

Environment

  • Hardware: H100 GPUs
  • cuEquivariance version: 0.6.1
  • PyTorch version: 2.7
  • CUDA version: 12.x

We've implemented a monkey patch that makes the gradient tensors contiguous before calling the original backward function (sketched after the questions below), which resolves the issue. However:

  • Should AttentionPairBiasMask.backward() handle non-contiguous gradient tensors? (A simple fix would be changing the view() call to reshape().)
  • Or is it expected that all gradient inputs will be contiguous? If so, what's the recommended approach for ensuring this?
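
The workaround we are using is roughly the following. The import path is a guess based on the file name mentioned above (attention_pair_bias_torch.py) and may need to be adjusted for your install; the patch simply wraps the original backward and forces every incoming gradient to be contiguous before it is viewed:

import torch

# NOTE: this import path is an assumption; adjust it to wherever
# AttentionPairBiasMask is actually defined in your cuEquivariance install.
from cuequivariance_torch.primitives import attention_pair_bias_torch as apb

_original_backward = apb.AttentionPairBiasMask.backward

def _contiguous_backward(ctx, *grad_outputs):
    # Make every incoming gradient contiguous before the original
    # backward calls .view() on it; None gradients pass through unchanged.
    grad_outputs = tuple(
        g.contiguous() if isinstance(g, torch.Tensor) else g
        for g in grad_outputs
    )
    return _original_backward(ctx, *grad_outputs)

apb.AttentionPairBiasMask.backward = staticmethod(_contiguous_backward)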
