
[ft] Skip extra quorum when using semi-sync training #1221


Merged
merged 1 commit into pytorch:main from diloco on May 27, 2025

Conversation

H-Huang (Member) commented May 23, 2025

We missed that the ftOptimizer was still being used for fault-tolerant communication with HSDP. We should skip it when semi-sync training is enabled and only run quorum when the replica groups sync.

@facebook-github-bot added the "CLA Signed" label (managed by the Meta Open Source bot) May 23, 2025
@H-Huang H-Huang force-pushed the diloco branch 2 times, most recently from 93f426d to 6fab72e Compare May 24, 2025 02:26
@H-Huang H-Huang changed the title from "[ft] dont do HSDP for semi_sync" to "[ft] Skip extra quorum when using semi-sync training" May 24, 2025
@@ -192,7 +192,8 @@ def __init__(
         }
         self.cache_state_dict: dict[str, Any] = {}
         self._ft_optimizer = ft.Optimizer(ft_manager, self)
-        self._call_from_ft: bool = False
+        # Originally this is False; True means we just call step() normally
+        self._call_from_ft: bool = True
Contributor commented:
Why change this to True? step() manually updates _call_from_ft to ensure that the call path is correctly routed through ft.Optimizer.step() and then OptimizersContainer.step(). If we set it to True, it will only go through OptimizersContainer.step(), not ft.Optimizer.step().
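A minimal sketch of the re-entrant routing being discussed may help. All names here (FTOptimizerWrapper, use_ft_optimizer) are illustrative stand-ins, not the actual torchtitan/torchft API:

# Illustrative sketch only; not the real torchtitan/torchft code.

class FTOptimizerWrapper:
    """Stand-in for ft.Optimizer: runs quorum, then re-enters the container."""

    def __init__(self, container):
        self.container = container

    def step(self):
        # ... fault-tolerant quorum work would happen here ...
        self.container._call_from_ft = True
        try:
            # Re-enter the container; with the flag set it takes the plain path.
            self.container.step()
        finally:
            self.container._call_from_ft = False


class OptimizersContainer:
    def __init__(self, use_ft_optimizer: bool = True):
        self._ft_optimizer = FTOptimizerWrapper(self)
        # False: step() detours through the FT wrapper for an extra quorum.
        # True: step() runs the plain optimizer path directly (semi-sync case).
        self._call_from_ft = not use_ft_optimizer

    def step(self):
        if not self._call_from_ft:
            # Route through ft.Optimizer.step(), which re-enters this method.
            self._ft_optimizer.step()
            return
        # Plain optimizer step, with no per-step quorum.
        print("plain optimizer step")

With use_ft_optimizer=True, every external step() call detours through the wrapper once; with False (the semi-sync case), the wrapper is never touched.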

H-Huang (Member, Author) commented May 27, 2025:
Yeah, I had originally hardcoded this to get the semi-sync training path working. I've updated it so the value is set by an argument in the constructor (see the usage sketch below).

We only want to go through OptimizersContainer.step(), not ft.Optimizer.step(), when doing localsgd/diloco.
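A hypothetical wiring of that constructor argument, continuing the sketch above (the flag name is still illustrative):

# Semi-sync (LocalSGD/DiLoCo): skip the per-step FT quorum entirely and
# let quorum run only when the replica groups sync.
semi_sync_enabled = True
optimizers = OptimizersContainer(use_ft_optimizer=not semi_sync_enabled)
optimizers.step()  # goes straight through OptimizersContainer.step()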

@H-Huang H-Huang force-pushed the diloco branch 2 times, most recently from d8acd57 to abdadd9 Compare May 27, 2025 18:10
@H-Huang H-Huang marked this pull request as ready for review May 27, 2025 18:47
fegin (Contributor) left a comment:

LGTM. I think the fault-tolerance logic in train.py is getting large enough that it should be moved into ft.py; we can do that refactor once semi-sync training is more stable.

@H-Huang H-Huang merged commit 84df885 into pytorch:main May 27, 2025
6 checks passed