
[train][FullyAsync] Implement customizable weight sync frequency#1293

Open
tamoghnokandar wants to merge 7 commits into NovaSky-AI:main from tamoghnokandar:add_weight_sync

Conversation


@tamoghnokandar tamoghnokandar commented Mar 8, 2026

Summary

This PR changes fully-async training to treat trainer.train_batch_size as the size of the outer async training step, while keeping trainer.policy_mini_batch_size as the inner optimizer minibatch size. Each fully-async step now waits for a full outer batch of prompt-groups, runs the existing inner minibatch loop over that batch, performs a single weight sync, and then advances global_step. Fixes issue #1205.

What Changed

Fully-Async trainer semantics

  • Introduced explicit outer-batch semantics in skyrl/train/fully_async_trainer.py.
  • train_batch_size is now the unit for:
    • outer-step aggregation
    • steps-per-epoch calculation
    • consumed UID accounting
    • staleness/capacity tracking
    • worker-count validation
  • policy_mini_batch_size remains the inner slicing unit used by the existing optimizer path.
  • Added num_policy_minibatches_per_outer_step to make the outer-step to inner-minibatch relationship explicit.
  • Relaxed validation from equality to divisibility:
    • train_batch_size >= policy_mini_batch_size
    • train_batch_size % policy_mini_batch_size == 0
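
The relaxed validation described above can be sketched as follows. This is an illustrative sketch only; the function name and error messages are assumptions, not the actual skyrl API.

```python
# Sketch of the relaxed batch-size validation (hypothetical helper, not skyrl's code).
def num_policy_minibatches_per_outer_step(train_batch_size: int,
                                          policy_mini_batch_size: int) -> int:
    """Validate divisibility and return the number of inner minibatches per outer step."""
    if train_batch_size < policy_mini_batch_size:
        raise ValueError("train_batch_size must be >= policy_mini_batch_size")
    if train_batch_size % policy_mini_batch_size != 0:
        raise ValueError("train_batch_size must be divisible by policy_mini_batch_size")
    return train_batch_size // policy_mini_batch_size
```

For example, with train_batch_size=1024 and policy_mini_batch_size=256 this returns 4 inner minibatches per outer step.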

Training loop behavior

  • The fully-async loop now waits until it has train_batch_size completed generation groups before building a training batch.
  • A single outer step now:
    1. collects train_batch_size groups
    2. converts them into one training batch
    3. runs _run_training(...)
    4. marks train_batch_size UIDs consumed
    5. performs one pause/sync/resume cycle
    6. increments global_step once
  • Per-epoch accounting and resume assertions were updated to use outer-batch units instead of policy minibatch units.
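
The six numbered steps above can be sketched as a single function. This is a counter-only illustration of the outer-step accounting, assuming the semantics described in this PR; none of the names below are skyrl's real API.

```python
from queue import Queue

def run_outer_step(groups_queue: Queue, train_batch_size: int,
                   policy_mini_batch_size: int, state: dict) -> None:
    """One fully-async outer step (sketch); `state` tracks counters only."""
    # 1. Collect train_batch_size completed generation groups.
    groups = [groups_queue.get() for _ in range(train_batch_size)]
    # 2-3. Build one training batch and run the inner minibatch loop over it.
    for start in range(0, train_batch_size, policy_mini_batch_size):
        _minibatch = groups[start:start + policy_mini_batch_size]
        state["optimizer_steps"] += 1   # stand-in for one optimizer update
    # 4. Mark train_batch_size UIDs consumed.
    state["consumed_uids"] += train_batch_size
    # 5. Exactly one pause/sync/resume weight-sync cycle per outer step.
    state["weight_syncs"] += 1
    # 6. Increment global_step once.
    state["global_step"] += 1
```

Running this with an outer batch of 1024 and minibatch of 256 yields 4 optimizer steps but only one weight sync and one global_step increment, matching the intended semantics.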

Staleness and capacity

  • Fully-async staleness capacity now uses train_batch_size as the consumer quantum.

  • Worker-count bounds are now enforced as:

    train_batch_size <= num_parallel_generation_workers <= train_batch_size * (max_staleness_steps + 1)

  • This aligns max_staleness_steps with outer fully-async training steps rather than inner optimizer minibatches.
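
The worker-count bound above can be expressed as a small check. This is a sketch assuming the bound stated in this PR; the function name and message wording are hypothetical.

```python
# Hypothetical check for the outer-step worker-count bounds (names assumed).
def validate_worker_count(num_parallel_generation_workers: int,
                          train_batch_size: int,
                          max_staleness_steps: int) -> None:
    """Enforce train_batch_size <= workers <= train_batch_size * (max_staleness_steps + 1)."""
    lower = train_batch_size
    upper = train_batch_size * (max_staleness_steps + 1)
    if not (lower <= num_parallel_generation_workers <= upper):
        raise ValueError(
            f"num_parallel_generation_workers must be in [{lower}, {upper}], "
            f"got {num_parallel_generation_workers}"
        )
```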

Tests

Added/updated tests in tests/train/test_fully_async_trainer.py to cover:

  • steps-per-epoch and effective dataloader length using train_batch_size
  • outer-step staleness capacity using train_batch_size
  • aggregation of a full outer batch before running inner minibatch optimizer steps
  • resume-time consumed UID accounting using completed outer steps
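
As an illustration of the first coverage point, a steps-per-epoch check in the new outer-batch units might look like the following. This is a hypothetical pytest-style sketch, not the actual test from tests/train/test_fully_async_trainer.py.

```python
# Hypothetical sketch of the steps-per-epoch calculation under outer-batch semantics.
def steps_per_epoch(num_prompt_groups: int, train_batch_size: int) -> int:
    # One epoch yields one outer step per train_batch_size prompt-groups.
    return num_prompt_groups // train_batch_size

def test_steps_per_epoch_uses_outer_batch():
    # 4096 prompt-groups with an outer batch of 1024 -> 4 outer steps per epoch.
    assert steps_per_epoch(4096, 1024) == 4
```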

Impact

With this change, fully-async training behaves as intended for configs such as:

  • train_batch_size = 1024
  • policy_mini_batch_size = 256

In that setup, one fully-async outer step consumes 1024 prompt-groups, runs 4 inner policy minibatches, performs one weight sync, and increments global_step once.

Reward Curve

[image: reward_curve]

Eval Curve

[image: eval_curve]

Epochs

[image: epoch_curve]

I ran for around 9 epochs before running out of GPU credits. PR description written by Codex.



@devin-ai-integration devin-ai-integration bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 5 additional findings.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request successfully implements customizable weight sync frequency in fully-async training by introducing a distinction between an outer train_batch_size and an inner policy_mini_batch_size. The changes are consistently applied across the trainer logic, configuration files, and documentation. The addition of new tests ensures the new outer-batch semantics and resume logic are working as expected. My only feedback is a minor correction in the documentation.

Comment on lines 302 to +304
- AReal: https://arxiv.org/abs/2505.24298v3
- PipelineRL: https://arxiv.org/abs/2509.19128v2
- ScaleRL: https://arxiv.org/abs/2510.13786 (no newline at end of file)
- ScaleRL: https://arxiv.org/abs/2510.13786


medium

The arXiv links in the references section appear to be placeholders pointing to future dates (e.g., 2505.24298v3 for May 2025). This could be misleading for readers. Please replace these with the correct references or remove them if they are not yet available.

@tamoghnokandar
Contributor Author

@CharlieFRuan This PR is ready for review.
