Hi team,

While training with TorchRec ShardedEmbeddingCollection in a distributed environment, some ranks may receive zero input for the last global batch. In that case, fbgemm raises errors; see the issue I filed earlier. Even when that bug is bypassed (skipped), another bug surfaces while constructing the output KJT. I'm not sure whether this is a known issue; do you have any plan to fix it? This matters for model evaluation, where the last batch cannot be dropped. Thanks!

Here is the traceback:
[rank0]: ~~~~~~~~~^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/torchrec/distributed/types.py", line 504, in impl
[rank0]: LazyAwaitable._wait_async(lhs), LazyAwaitable._wait_async(rhs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/torchrec/distributed/types.py", line 394, in _wait_async
[rank0]: obj._result = obj.wait()
[rank0]: ^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/torchrec/distributed/types.py", line 319, in wait
[rank0]: ret: W = self._wait_impl()
[rank0]: ^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/torchrec/distributed/embedding.py", line 446, in _wait_impl
[rank0]: construct_jagged_tensors(
[rank0]: File "/usr/local/lib/python3.12/dist-packages/torchrec/modules/utils.py", line 247, in construct_jagged_tensors
[rank0]: lengths = features.lengths().view(-1, features.stride())
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
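For context, I believe the root cause is that when a rank receives an empty local batch, the KJT's stride() is 0, so view() cannot infer the -1 dimension (reshaping 0 elements into [-1, 0] is ambiguous). A minimal standalone repro of that view() behavior, under my stride-0 assumption:

```python
import torch

# Simulate the lengths tensor of a KeyedJaggedTensor on a rank that
# received an empty local batch: zero elements and (assumed) stride() == 0.
lengths = torch.empty(0, dtype=torch.int64)
stride = 0  # assumption: features.stride() is 0 when this rank gets no input

# Raises RuntimeError: the unspecified dimension size -1 can be any
# value and is ambiguous, since the other dimension is 0.
lengths.view(-1, stride)
```

Note that view(-1, stride) works fine for a zero-element tensor as long as stride > 0 (the -1 is then inferred as 0), which is why only ranks with an empty batch hit this.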
It should not be illegal to specify -1 when viewing a zero-sized tensor.
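In case it helps, here is the kind of guard I have in mind for the failing line in torchrec/modules/utils.py (a sketch only, not a tested patch; whether view(num_keys, 0) is the right empty-batch semantics for the code downstream is an open question):

```python
# Sketch of a guard around the failing line in construct_jagged_tensors.
# `features` is the KeyedJaggedTensor from the traceback above.
stride = features.stride()
if stride == 0:
    # Empty local batch: lengths has zero elements, so view() cannot
    # infer -1; give it explicit dimensions instead.
    lengths = features.lengths().view(len(features.keys()), 0)
else:
    lengths = features.lengths().view(-1, stride)
```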