
[ft] Skip extra quorum when using semi-sync training #1221


Merged
merged 1 commit into pytorch:main from diloco on May 27, 2025

Conversation

H-Huang (Member) commented May 23, 2025

We missed that the ftOptimizer was still being used for fault-tolerant communication with HSDP. We should skip it when semi-sync training is enabled and only run quorum when the replica groups sync.

@facebook-github-bot added the "CLA Signed" label (managed by the Meta Open Source bot) May 23, 2025
@H-Huang H-Huang force-pushed the diloco branch 2 times, most recently from 93f426d to 6fab72e Compare May 24, 2025 02:26
@H-Huang H-Huang changed the title from "[ft] dont do HSDP for semi_sync" to "[ft] Skip extra quorum when using semi-sync training" May 24, 2025
@@ -192,7 +192,8 @@ def __init__(
         }
         self.cache_state_dict: dict[str, Any] = {}
         self._ft_optimizer = ft.Optimizer(ft_manager, self)
-        self._call_from_ft: bool = False
+        # Originally this is False; True means we just call step() normally
+        self._call_from_ft: bool = True
Contributor commented:
Why change this to True? step() manually updates _call_from_ft to ensure that the call path is correctly routed through ft.Optimizer.step() and then OptimizersContainer.step(). If we set it to True, it will only go through OptimizersContainer.step(), not ft.Optimizer.step().
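A minimal sketch of the re-entrant routing being discussed may help. All names here (FTOptimizerWrapper, use_ft_optimizer) are illustrative stand-ins, not the actual torchtitan/torchft API:

# Illustrative sketch only; not the real torchtitan/torchft code.

class FTOptimizerWrapper:
    """Stand-in for ft.Optimizer: runs quorum, then re-enters the container."""

    def __init__(self, container):
        self.container = container

    def step(self):
        # ... fault-tolerant quorum work would happen here ...
        self.container._call_from_ft = True
        try:
            # Re-enter the container; with the flag set it takes the plain path.
            self.container.step()
        finally:
            self.container._call_from_ft = False


class OptimizersContainer:
    def __init__(self, use_ft_optimizer: bool = True):
        self._ft_optimizer = FTOptimizerWrapper(self)
        # False: step() detours through the FT wrapper for an extra quorum.
        # True: step() runs the plain optimizer path directly (semi-sync case).
        self._call_from_ft = not use_ft_optimizer

    def step(self):
        if not self._call_from_ft:
            # Route through ft.Optimizer.step(), which re-enters this method.
            self._ft_optimizer.step()
            return
        # Plain optimizer step, with no per-step quorum.
        print("plain optimizer step")

With use_ft_optimizer=True, every external step() call detours through the wrapper once; with False (the semi-sync case), the wrapper is never touched.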

H-Huang (Member, Author) commented May 27, 2025:
Yeah, I had originally hardcoded this to get the semi-sync training path working. I've updated it so the value is set by an argument in the constructor (see the usage sketch below).

We only want to go through OptimizersContainer.step(), not ft.Optimizer.step(), when doing localsgd/diloco.
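A hypothetical wiring of that constructor argument, continuing the sketch above (the flag name is still illustrative):

# Semi-sync (LocalSGD/DiLoCo): skip the per-step FT quorum entirely and
# let quorum run only when the replica groups sync.
semi_sync_enabled = True
optimizers = OptimizersContainer(use_ft_optimizer=not semi_sync_enabled)
optimizers.step()  # goes straight through OptimizersContainer.step()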

@H-Huang H-Huang force-pushed the diloco branch 2 times, most recently from d8acd57 to abdadd9 Compare May 27, 2025 18:10
@H-Huang H-Huang marked this pull request as ready for review May 27, 2025 18:47
fegin (Contributor) left a comment:

LGTM. I think the fault-tolerance logic in train.py is getting large enough that it should be moved into ft.py; we can do that refactor once semi-sync training is more stable.

@H-Huang H-Huang merged commit 84df885 into pytorch:main May 27, 2025
6 checks passed