Conversation

@pstjohn (Collaborator) commented Jan 2, 2026

Makes a number of updates in preparation for llama3 context-parallel training. CP training is still not working: the model needs further updates to handle the cu_seq_lens_q_padded kwargs, and I'd like to add a single-GPU CP test that uses BSHD inputs so this code is at least exercised in CI.

This PR:

  • Only materializes the dataloader on cp_rank=0, returning None on the other ranks (see the sketch after this list).
  • Uses a scatter operation in the dataloader to synchronize StopIteration across CP ranks.
  • Adds tests for the CP dataloader on 1- and 2-GPU machines.
  • Moves llama3 to use DLCM data as the sanity dataset and turns off some genome collation options by default. The DLCM sequences are longer than the dummy sequences currently used in training, which ensures we can fill a few batches in CP testing. We may want to revert this once llama3 bring-up is done, since it triggers the tokenizer download during testing.
  • Removes lazy tokenization from llama3; this won't work. See https://nvidia.slack.com/archives/C074Z808N05/p1767818883160949
  • Starts adding CP files for llama3.
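
For illustration, here's a minimal sketch of how the first two bullets could fit together; this is not the code in this PR, and `build_dataloader`, `cp_group`, and the tensor-only batch are assumptions. The idea is that only cp_rank=0 builds the real dataloader, and a single `scatter_object_list` collective carries both the per-rank batch shards and the end-of-epoch sentinel, so every rank stops on the same step:

```python
import torch.distributed as dist


class CpRank0Dataloader:
    """Hypothetical sketch: materialize the dataloader only on cp_rank=0.

    Rank 0 iterates the underlying loader and scatters one shard per CP
    rank; when the loader is exhausted it scatters a sentinel (None) so
    every rank ends iteration on the same step instead of hanging.
    """

    def __init__(self, build_dataloader, cp_group=None):
        self.group = cp_group
        self.rank = dist.get_rank(cp_group)
        self.world = dist.get_world_size(cp_group)
        # Only cp_rank=0 holds a real dataloader; other ranks get None.
        self.loader = build_dataloader() if self.rank == 0 else None

    def __iter__(self):
        it = iter(self.loader) if self.rank == 0 else None
        while True:
            if self.rank == 0:
                try:
                    batch = next(it)
                    # Placeholder for the real CP sharding: split the
                    # batch along the sequence dim, one shard per rank.
                    shards = list(batch.chunk(self.world, dim=1))
                except StopIteration:
                    # Scatter a sentinel instead of raising locally, so
                    # the exhaustion is synchronized across the group.
                    shards = [None] * self.world
            else:
                shards = None
            out = [None]
            # Collective runs on every rank; src is a global rank here.
            dist.scatter_object_list(out, shards, src=0, group=self.group)
            if out[0] is None:
                return  # synchronized StopIteration on all ranks
            yield out[0]
```

Real batches are dicts rather than a single tensor, so the sharding step above is only a stand-in for the actual CP collation.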

Closes BIO-11

@copy-pr-bot (bot) commented Jan 2, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@pstjohn force-pushed the pstjohn/llama-context-parallel branch from 65f1073 to 8d28094 on January 7, 2026
@pstjohn added 14 commits on January 9, 2026
@pstjohn force-pushed the pstjohn/llama-context-parallel branch from a9f6d9e to b06126d on January 9, 2026
@pstjohn changed the title from "Pstjohn/llama context parallel" to "Llama3 pre-context parallel dataloader changes" on Jan 9, 2026
@pstjohn marked this pull request as ready for review on January 9, 2026