[train][FullyAsync] Implement customizable weight sync frequency #1293
Open
tamoghnokandar wants to merge 7 commits into NovaSky-AI:main from
Conversation
Contributor
Code Review
This pull request successfully implements customizable weight sync frequency in fully-async training by introducing a distinction between an outer train_batch_size and an inner policy_mini_batch_size. The changes are consistently applied across the trainer logic, configuration files, and documentation. The addition of new tests ensures the new outer-batch semantics and resume logic are working as expected. My only feedback is a minor correction in the documentation.
Comment on lines 302 to +304
```diff
  - AReal: https://arxiv.org/abs/2505.24298v3
  - PipelineRL: https://arxiv.org/abs/2509.19128v2
- - ScaleRL: https://arxiv.org/abs/2510.13786 (no newline at end of file)
+ - ScaleRL: https://arxiv.org/abs/2510.13786
```
Contributor
…kyRL into add_weight_sync
Contributor
Author
@CharlieFRuan This PR is ready for review.
Summary
This PR changes fully-async training to treat `trainer.train_batch_size` as the outer async training step, while keeping `trainer.policy_mini_batch_size` as the inner optimizer minibatch size. Each fully-async step now waits for a full outer batch of prompt-groups, runs the existing inner minibatch loop over that batch, performs a single weight sync, and then advances `global_step`. Fixes issue #1205.

What Changed
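At a glance, the new outer-step flow can be sketched as follows. This is an illustrative sketch, not the PR's actual code: method names such as `wait_for_groups`, `build_training_batch`, and `sync_weights` are assumed; only `_run_training` and the two batch-size config keys appear in this PR.

```python
def num_policy_minibatches_per_outer_step(train_batch_size: int,
                                          policy_mini_batch_size: int) -> int:
    """Validate the new config invariants and return the inner minibatch count."""
    assert train_batch_size >= policy_mini_batch_size
    assert train_batch_size % policy_mini_batch_size == 0
    return train_batch_size // policy_mini_batch_size


def run_outer_step(trainer, train_batch_size: int, policy_mini_batch_size: int) -> None:
    # Block until a full outer batch of completed prompt-groups is available.
    groups = trainer.wait_for_groups(train_batch_size)
    batch = trainer.build_training_batch(groups)

    # The existing inner optimizer loop slices this batch into
    # num_policy_minibatches_per_outer_step minibatches of policy_mini_batch_size.
    trainer._run_training(batch)

    trainer.sync_weights()    # exactly one weight sync per outer step
    trainer.global_step += 1  # global_step advances once per outer step
```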
Fully-Async trainer semantics

- `train_batch_size` is now the unit for the outer async training step.
- `policy_mini_batch_size` remains the inner slicing unit used by the existing optimizer path.
- Added `num_policy_minibatches_per_outer_step` to make the outer-step to inner-minibatch relationship explicit.
- Config validation now requires `train_batch_size >= policy_mini_batch_size` and `train_batch_size % policy_mini_batch_size == 0`.

Training loop behavior

- Waits for `train_batch_size` completed generation groups before building a training batch.
- Builds the training batch from those `train_batch_size` groups and passes it to `_run_training(...)`.
- Marks the `train_batch_size` UIDs consumed and advances `global_step` once per outer step.

Staleness and capacity
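The worker-count bound this section describes can be captured as a small predicate. The helper name is illustrative (not the PR's actual code); the parameter names mirror the config keys mentioned in the PR.

```python
def worker_count_in_bounds(train_batch_size: int,
                           num_parallel_generation_workers: int,
                           max_staleness_steps: int) -> bool:
    """Check: train_batch_size <= workers <= train_batch_size * (max_staleness_steps + 1)."""
    lower = train_batch_size
    upper = train_batch_size * (max_staleness_steps + 1)
    return lower <= num_parallel_generation_workers <= upper


# With train_batch_size = 1024 and max_staleness_steps = 1,
# the allowed worker range is [1024, 2048].
assert worker_count_in_bounds(1024, 2048, 1)
assert not worker_count_in_bounds(1024, 512, 1)
```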
Fully-async staleness capacity now uses `train_batch_size` as the consumer quantum. Worker-count bounds are now enforced as:

`train_batch_size <= num_parallel_generation_workers <= train_batch_size * (max_staleness_steps + 1)`

This aligns `max_staleness_steps` with outer fully-async training steps rather than inner optimizer minibatches.

Tests
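A minimal sketch of the kind of invariant such tests can assert, assuming `global_step` is recovered from the count of consumed prompt-group UIDs on resume. The helper is hypothetical, not the repository's actual test code.

```python
def resumed_global_step(uids_consumed: int, train_batch_size: int) -> int:
    """On resume, one outer step has elapsed per full outer batch of consumed UIDs."""
    return uids_consumed // train_batch_size


# Outer-batch semantics: global_step advances once per train_batch_size UIDs.
assert resumed_global_step(4 * 1024, 1024) == 4
# A partially consumed outer batch does not advance the step.
assert resumed_global_step(1023, 1024) == 0
```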
Added/updated tests in `tests/train/test_fully_async_trainer.py` to cover:

- the new outer-batch semantics driven by `train_batch_size`
- resume logic that treats `train_batch_size` as the outer step unit

Impact
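As a quick arithmetic check of the example config below (the numbers come from this PR description):

```python
train_batch_size = 1024
policy_mini_batch_size = 256

# Inner policy minibatches per fully-async outer step.
inner_minibatches = train_batch_size // policy_mini_batch_size
assert inner_minibatches == 4

# Per outer step: 1024 prompt-groups consumed, 4 inner minibatches,
# one weight sync, and global_step incremented once.
```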
With this change, fully-async training behaves as intended for configs such as:
- `train_batch_size = 1024`
- `policy_mini_batch_size = 256`

In that setup, one fully-async outer step consumes 1024 prompt-groups, runs 4 inner policy minibatches, performs one weight sync, and increments `global_step` once.

Reward Curve
Eval Curve
Epochs
I ran for around 9 epochs before running out of GPU credits. PR description written by Codex.