Support empty batch for EmbeddingCollection #2773

Open
JacoCheung opened this issue Feb 28, 2025 · 0 comments

Hi team,

When training with a TorchRec ShardedEmbeddingCollection in a distributed environment, some ranks may receive zero inputs for the last global batch. In that case, fbgemm can raise errors; see the issue I created.

Even if that bug is bypassed, another error occurs when constructing the output KJT. I'm not sure whether this is a known issue. Do you have any plans to fix it? This matters for model evaluation, where the last batch cannot be dropped. Thanks!


[rank0]:           ~~~~~~~~~^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torchrec/distributed/types.py", line 504, in impl
[rank0]:     LazyAwaitable._wait_async(lhs), LazyAwaitable._wait_async(rhs)
[rank0]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torchrec/distributed/types.py", line 394, in _wait_async
[rank0]:     obj._result = obj.wait()
[rank0]:                   ^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torchrec/distributed/types.py", line 319, in wait
[rank0]:     ret: W = self._wait_impl()
[rank0]:              ^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torchrec/distributed/embedding.py", line 446, in _wait_impl
[rank0]:     construct_jagged_tensors(
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torchrec/modules/utils.py", line 247, in construct_jagged_tensors
[rank0]:     lengths = features.lengths().view(-1, features.stride())
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It should not be illegal to specify -1 when viewing a zero-sized tensor.
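
To make the failing `view(-1, features.stride())` call concrete, here is a pure-Python sketch of how a `-1` dimension is usually inferred. It is not TorchRec or PyTorch code: `infer_view_shape` and `lengths_view_shape` are hypothetical names used only for illustration. It also assumes the failure happens because `features.stride()` is 0 on the empty rank, which the truncated traceback above does not confirm. Under that assumption, the `-1` cannot be resolved, because any row count is consistent with zero elements.

```python
from math import prod
from typing import Sequence, Tuple


def infer_view_shape(numel: int, shape: Sequence[int]) -> Tuple[int, ...]:
    """Resolve at most one -1 in `shape` for a tensor holding `numel` elements."""
    if list(shape).count(-1) > 1:
        raise ValueError("only one dimension can be inferred")
    # Product of the explicitly given dimensions (1 if there are none).
    known = prod(d for d in shape if d != -1)
    if -1 not in shape:
        if known != numel:
            raise ValueError(f"shape {tuple(shape)} is invalid for {numel} elements")
        return tuple(shape)
    if known == 0:
        # With a zero-sized known dimension, every value of -1 gives 0 elements,
        # so the inferred size is ambiguous.
        raise ValueError(f"cannot infer -1 in {tuple(shape)}: dimension is ambiguous")
    if numel % known != 0:
        raise ValueError(f"shape {tuple(shape)} is invalid for {numel} elements")
    return tuple(numel // known if d == -1 else d for d in shape)


def lengths_view_shape(num_lengths: int, stride: int, num_keys: int) -> Tuple[int, ...]:
    """Hypothetical guard: pick an explicit shape when the local batch is empty."""
    if stride == 0:
        # Assumption: an empty rank should still expose one row per key.
        return (num_keys, 0)
    return infer_view_shape(num_lengths, (-1, stride))
```

An empty tensor with a non-zero stride still resolves cleanly (`infer_view_shape(0, (-1, 4))` gives `(0, 4)`), so a guard of this kind would only be needed for the zero-stride case.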
