Conversation

@pstjohn (Collaborator) commented Jan 2, 2026

Makes a number of updates in preparation for llama3 context-parallel training. CP training is still not working: the model needs further updates to handle the cu_seq_lens_q_padded kwargs, and I'd like to add a single-GPU CP test that uses BSHD inputs so this code is at least exercised in CI.

This PR:

  • Only materializes the dataloader on cp_rank=0, returning None on the other ranks (see the sketch after this list).
  • Uses a scatter operation in the dataloader to synchronize StopIteration across CP ranks.
  • Adds tests for the CP dataloader on 1- and 2-GPU machines.
  • Moves llama3 to use DLCM data as the sanity dataset and turns off some genome collation options by default. The DLCM sequences are longer than the dummy sequences currently used in training, which ensures we can fill a few batches in CP testing. We may want to revert this once llama3 bring-up is done, since it triggers the tokenizer download during testing.
  • Removes lazy tokenization from llama3; this won't work. See https://nvidia.slack.com/archives/C074Z808N05/p1767818883160949
  • Starts adding CP files for llama3.
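
For illustration, here's a minimal sketch of how the first two bullets could fit together; this is not the code in this PR, and `build_dataloader`, `cp_group`, and the tensor-only batch are assumptions. The idea is that only cp_rank=0 builds the real dataloader, and a single `scatter_object_list` collective carries both the per-rank batch shards and the end-of-epoch sentinel, so every rank stops on the same step:

```python
import torch.distributed as dist


class CpRank0Dataloader:
    """Hypothetical sketch: materialize the dataloader only on cp_rank=0.

    Rank 0 iterates the underlying loader and scatters one shard per CP
    rank; when the loader is exhausted it scatters a sentinel (None) so
    every rank ends iteration on the same step instead of hanging.
    """

    def __init__(self, build_dataloader, cp_group=None):
        self.group = cp_group
        self.rank = dist.get_rank(cp_group)
        self.world = dist.get_world_size(cp_group)
        # Only cp_rank=0 holds a real dataloader; other ranks get None.
        self.loader = build_dataloader() if self.rank == 0 else None

    def __iter__(self):
        it = iter(self.loader) if self.rank == 0 else None
        while True:
            if self.rank == 0:
                try:
                    batch = next(it)
                    # Placeholder for the real CP sharding: split the
                    # batch along the sequence dim, one shard per rank.
                    shards = list(batch.chunk(self.world, dim=1))
                except StopIteration:
                    # Scatter a sentinel instead of raising locally, so
                    # the exhaustion is synchronized across the group.
                    shards = [None] * self.world
            else:
                shards = None
            out = [None]
            # Collective runs on every rank; src is a global rank here.
            dist.scatter_object_list(out, shards, src=0, group=self.group)
            if out[0] is None:
                return  # synchronized StopIteration on all ranks
            yield out[0]
```

Real batches are dicts rather than a single tensor, so the sharding step above is only a stand-in for the actual CP collation.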

Closes BIO-11

@copy-pr-bot (bot) commented Jan 2, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@pstjohn force-pushed the pstjohn/llama-context-parallel branch from 65f1073 to 8d28094 on January 7, 2026
@pstjohn added 14 commits on January 9, 2026
@pstjohn force-pushed the pstjohn/llama-context-parallel branch from a9f6d9e to b06126d on January 9, 2026
@pstjohn changed the title from "Pstjohn/llama context parallel" to "Llama3 pre-context parallel dataloader changes" on Jan 9, 2026
@pstjohn marked this pull request as ready for review on January 9, 2026