Support empty batch for EmbeddingCollection #2773

JacoCheung · 2025-02-28T09:24:22Z

Hi team,

When training with TorchRec's ShardedEmbeddingCollection in a distributed environment, some ranks may receive zero input for the last global batch. In that case, fbgemm can raise errors. See the issue that I created.

Even if that bug is bypassed, another error occurs when constructing the output KJT. I'm not sure whether this is a known issue. Do you have any plans to fix it? This matters for model evaluation, where the last batch cannot be dropped. Thanks!


[rank0]:           ~~~~~~~~~^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torchrec/distributed/types.py", line 504, in impl
[rank0]:     LazyAwaitable._wait_async(lhs), LazyAwaitable._wait_async(rhs)
[rank0]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torchrec/distributed/types.py", line 394, in _wait_async
[rank0]:     obj._result = obj.wait()
[rank0]:                   ^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torchrec/distributed/types.py", line 319, in wait
[rank0]:     ret: W = self._wait_impl()
[rank0]:              ^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torchrec/distributed/embedding.py", line 446, in _wait_impl
[rank0]:     construct_jagged_tensors(
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torchrec/modules/utils.py", line 247, in construct_jagged_tensors
[rank0]:     lengths = features.lengths().view(-1, features.stride())
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It should not be illegal to specify -1 when viewing a zero-sized tensor.
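To illustrate the suspected failure mode, here is a minimal sketch in plain PyTorch. It assumes the failing case is a rank where `features.stride()` is 0, so the call reduces to `view(-1, 0)` on a zero-element tensor; the traceback above does not include the final error line, so this is an assumption. The helper `view_lengths` and its `num_keys` parameter are hypothetical names for illustration, not TorchRec API and not a proposed patch.

```python
import torch


def view_lengths(lengths: torch.Tensor, num_keys: int, stride: int) -> torch.Tensor:
    """Reshape flat lengths to (num_keys, stride) without relying on -1.

    Hypothetical helper: passing both dimensions explicitly avoids the
    ambiguity that -1 runs into when stride == 0.
    """
    return lengths.view(num_keys, stride)


# A rank with no input: zero lengths in total.
empty_lengths = torch.empty(0, dtype=torch.int64)

# With a non-zero second dimension, -1 can be inferred (it becomes 0).
nonzero_stride_view = empty_lengths.view(-1, 4)

# With a zero second dimension, -1 is ambiguous and PyTorch raises.
try:
    empty_lengths.view(-1, 0)
    inferred_ok = True
except RuntimeError:
    inferred_ok = False

# Giving the row count explicitly works even when stride == 0.
explicit_view = view_lengths(empty_lengths, num_keys=3, stride=0)
```

If this assumption holds, `-1` itself is fine on a zero-sized tensor; the view only fails when the other dimension is also 0, because any row count would then satisfy the shape.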
