[cp][flex_attention] integration test trial #1160
base: gh/XilunWu/18/base
Conversation
@@ -74,12 +74,13 @@ def update_from_config(self, job_config: JobConfig, tokenizer: Tokenizer) -> None:
        "FlexAttention is not compatible with selective AC yet. "
        "See https://github.com/pytorch/pytorch/issues/147879"
    )

"""
You can just remove this block.
mask_mod = FlexAttention._get_causal_mask_mod()
batch_dimension = 1
seq_len = inputs.shape[1]
block_mask = FlexAttention.compiled_create_block_mask(
We should let flex attention provide this compiled_create_block_mask, to minimize the dependency on users' code when parallelizing with CP. cc @drisspg
meaning that Flex provides the compiled partial with no mask_mod args?
For CP + flex_attention, this PR creates 3 compiled BlockMask objects, one for each mask_mod use site:
1. QKV sharding -- this requires a compiled BlockMask built from the global batch input and the mask_mod in order to load balance.
2. Actual training -- the first FlexAttention module in the model creates a compiled BlockMask from the sharded batch input and the mask_mod. Note that applying this mask_mod to the sharded batch input is meaningless, so this BlockMask is not used in the actual CP flex_attention computation.
3. Actual training -- when forward flex_attention is called over the sharded batch input for the first time in the current step, a BlockMask is created from the sharded batch input and a remapped mask_mod that corresponds to the local region of the attention score (the Q_LEN by KV_LEN rectangle).

(1) introduces a dependency in user code in order to adopt CP flex_attention. (2) is how we define the mask_mod in torchtitan and can be modified. Ideally (1) and (2) would be merged so that using CP requires neither the redundancy nor the user code modification.
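For reference, a minimal sketch of the block-mask pattern discussed above, using the public `torch.nn.attention.flex_attention` API. This is not the torchtitan code; `causal_mask_mod`, `make_block_mask`, and the shapes are illustrative.

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention


def causal_mask_mod(b, h, q_idx, kv_idx):
    # Standard causal mask: each query position attends to itself and earlier keys.
    return q_idx >= kv_idx


# Compile create_block_mask once so mask construction is not re-traced every step.
compiled_create_block_mask = torch.compile(create_block_mask)


def make_block_mask(inputs: torch.Tensor):
    # inputs: token ids of shape (batch, seq_len); B=None / H=None broadcast the
    # mask over the batch and head dimensions.
    seq_len = inputs.shape[1]
    return compiled_create_block_mask(
        causal_mask_mod, B=None, H=None, Q_LEN=seq_len, KV_LEN=seq_len
    )


# Toy usage (assumes a CUDA device, the default for create_block_mask).
tokens = torch.zeros(1, 128, dtype=torch.long, device="cuda")
q = k = v = torch.randn(1, 8, 128, 64, device="cuda")
out = flex_attention(q, k, v, block_mask=make_block_mask(tokens))
```

The open question above is who owns `compiled_create_block_mask` (user code vs. the FlexAttention/CP machinery), since CP needs to rebuild the mask for the sharded inputs with the remapped mask_mod.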
if self.model_args.use_flex_attn:
    from torchtitan.models.attention import FlexAttention

    mask_mod = FlexAttention._get_causal_mask_mod()
I think mask_mod should be the input of context_parallel(), and we can directly call compiled_create_block_mask. See the comment below.
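To illustrate the suggestion (hypothetical only: mask_mod is not an argument of today's context_parallel API, and the helper name below is made up), the idea is that CP would own the compiled BlockMask construction instead of user code:

```python
# Hypothetical sketch of what context_parallel() could do internally once it
# accepts mask_mod: build the compiled global BlockMask needed for
# load-balanced QKV sharding, so user code never calls
# compiled_create_block_mask itself.
from typing import Callable

import torch
from torch.nn.attention.flex_attention import create_block_mask

compiled_create_block_mask = torch.compile(create_block_mask)


def cp_block_mask_from_mask_mod(mask_mod: Callable, seq_len: int):
    return compiled_create_block_mask(
        mask_mod, B=None, H=None, Q_LEN=seq_len, KV_LEN=seq_len
    )
```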
Stack from ghstack (oldest at bottom):