Add Amsgrad #137
Merge updates from DLRM
Parallelizing the pre-processing of the dataset. (facebookresearch#117)
Fix typo in dynamic axis names (facebookresearch#134)
Please see above PR. Can be tested with:
Requires at least ~24GB, either on a single GPU or distributed across multiple. Should achieve something like |
Adding AMSGrad improves numerical performance. This naive implementation requires dense gradients, which is inefficient.
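For reference, here is a minimal sketch of a dense AMSGrad update in plain Python. It is not the code in this PR: the names `init_state` and `amsgrad_step` are hypothetical, the hyperparameter defaults are assumptions, and bias correction is omitted for brevity. It illustrates why the update is dense: the moment buffers of every element are read and written on every step, even where the gradient is zero.

```python
import math


def init_state(n):
    """Create zeroed optimizer state for n parameters (hypothetical helper)."""
    return {"m": [0.0] * n, "v": [0.0] * n, "v_max": [0.0] * n}


def amsgrad_step(params, grads, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one dense AMSGrad update in place (no bias correction)."""
    m, v, v_max = state["m"], state["v"], state["v_max"]
    for i, g in enumerate(grads):
        # Exponential moving averages of the gradient and its square.
        m[i] = beta1 * m[i] + (1.0 - beta1) * g
        v[i] = beta2 * v[i] + (1.0 - beta2) * g * g
        # AMSGrad: keep the running maximum of the second moment so the
        # effective step size never grows back after a large gradient.
        v_max[i] = max(v_max[i], v[i])
        params[i] -= lr * m[i] / (math.sqrt(v_max[i]) + eps)
    return params


params = [1.0, 1.0]
state = init_state(len(params))
amsgrad_step(params, [0.5, 0.0], state)  # only the first element has a gradient
amsgrad_step(params, [0.0, 0.0], state)  # first element still moves: m has not decayed to zero
```

Because the first-moment buffer decays gradually, an element keeps moving for several steps after its last nonzero gradient, which is why a sparse update that touches only the current batch's rows is not equivalent.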
This PR also includes a heuristic for better distribution of the embedding layers among the devices when using parallel training.
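The description above does not spell out the heuristic, so the following is only an illustration of one common approach, not necessarily what this PR implements: a greedy "largest table first, onto the least-loaded device" placement. The function name `place_embeddings` and the use of a single size number per table are assumptions.

```python
import heapq


def place_embeddings(table_sizes, num_devices):
    """Greedily assign each embedding table to the currently least-loaded device.

    table_sizes: one cost per table (for example rows * embedding dimension).
    Returns a list where entry i is the device index chosen for table i.
    """
    # Min-heap of (current load, device index).
    heap = [(0, d) for d in range(num_devices)]
    heapq.heapify(heap)
    placement = [None] * len(table_sizes)
    # Visit tables from largest to smallest so big tables are spread out first.
    order = sorted(range(len(table_sizes)), key=lambda i: table_sizes[i], reverse=True)
    for idx in order:
        load, dev = heapq.heappop(heap)
        placement[idx] = dev
        heapq.heappush(heap, (load + table_sizes[idx], dev))
    return placement


print(place_embeddings([10, 1, 1, 8], num_devices=2))  # [0, 1, 1, 1]
```

In this example both devices end up with a total load of 10, whereas a round-robin assignment by table index would give loads of 11 and 9.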