
Remove batch-level padding from tokenize_sft_batch#582

Merged
Kovbo merged 4 commits into main from remove-sft-batch-padding on Feb 26, 2026

Conversation


@Kovbo Kovbo commented Feb 25, 2026

Summary

  • Remove unnecessary padding from tokenize_sft_batch: each trajectory tensor now keeps its natural sequence length instead of being padded to the longest in the batch
  • Add padding-strip in unsloth training loop (service.py): trims any trailing padding before .to(device) for robustness
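The padding-strip described in the second bullet could look roughly like this (a hypothetical sketch, not the actual service.py code; the helper name and tensor layout are assumptions):

```python
import torch

def strip_trailing_padding(input_ids: torch.Tensor,
                           attention_mask: torch.Tensor):
    """Trim trailing padding from one trajectory before moving it to GPU.

    Illustrative helper: the attention mask tells us how many tokens are
    real, so we slice both tensors down to that length and never spend
    device memory or compute on pad tokens.
    """
    real_len = int(attention_mask.sum().item())
    return input_ids[:real_len], attention_mask[:real_len]

# Example: a trajectory padded out to length 5.
ids = torch.tensor([5, 6, 7, 0, 0])
mask = torch.tensor([1, 1, 1, 0, 0])
ids, mask = strip_trailing_padding(ids, mask)
# ids and mask now have length 3; only then would the loop call .to(device).
```

Because the slice is a no-op when the tensors are already unpadded, this keeps the loop correct whether its inputs arrive padded or not.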

Context

tokenize_sft_batch was padding all trajectories to the longest sequence in the batch. Since every consumer (unsloth, megatron) processes trajectories individually with gradient accumulation, this padding was pure waste — inflating CPU tensors and forcing extra GPU compute on padding tokens. The megatron trainer already stripped padding on its own.

Now padding is eliminated at the source. The unsloth training loop also defensively strips padding before moving to GPU, so it works correctly whether tensors arrive padded or not.
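The shape of the change at the source can be sketched as follows (an illustrative outline, not the real tokenize_sft_batch; the tokenizer interface here is an assumption):

```python
import torch

def tokenize_sft_batch(trajectories, tokenizer):
    """Sketch: tokenize each trajectory at its natural length.

    No batch-level padding is applied; the result is a list of 1-D
    tensors of varying lengths, which suits consumers that process
    trajectories one at a time with gradient accumulation.
    """
    tensors = []
    for text in trajectories:
        ids = tokenizer(text)  # assumed to return a list of token ids
        tensors.append(torch.tensor(ids, dtype=torch.long))
    return tensors
```

Under the old behavior every tensor in the returned batch would share the longest length; here a two-token trajectory stays two tokens long.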

tokenize_sft_batch was padding all trajectories to the longest sequence
in the batch, but every consumer (unsloth, megatron) processes them
individually. This wasted CPU memory and GPU compute on padding tokens.

Now each trajectory tensor keeps its natural length. The unsloth
training loop strips any residual padding before .to(device) for
robustness.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Kovbo Kovbo marked this pull request as draft February 25, 2026 20:03
Kovbo and others added 2 commits February 25, 2026 20:05
Match the serverless-training microbatch approach: process trajectories
in configurable microbatch groups with padding trimmed to the longest
in each group. Changing microbatch_size is a one-line change.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
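The microbatch approach described in that commit could be sketched like this (a hypothetical illustration of the idea, later removed from the PR; function and variable names are assumptions):

```python
import torch
import torch.nn.functional as F

def iter_microbatches(trajectory_tensors, microbatch_size=1):
    """Yield microbatches padded only to the longest sequence per group.

    Each group of `microbatch_size` trajectories is right-padded to the
    group's own max length, so wasted compute is bounded by the length
    spread within a microbatch rather than the whole batch.
    """
    for i in range(0, len(trajectory_tensors), microbatch_size):
        group = trajectory_tensors[i:i + microbatch_size]
        max_len = max(t.size(0) for t in group)
        yield torch.stack([
            F.pad(t, (0, max_len - t.size(0)))  # right-pad with zeros
            for t in group
        ])
```

Changing `microbatch_size` is the advertised one-line change: at 1, no padding is ever added; larger values trade a little padding for bigger fused batches.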
@Kovbo Kovbo marked this pull request as ready for review February 25, 2026 20:25
@Kovbo Kovbo requested a review from angkywilliam February 25, 2026 20:50
# Process trajectories in microbatches, trimming padding to the
# longest sequence in each microbatch to avoid wasted GPU compute.
microbatch_size = 1
for i in range(0, len(batch.trajectory_tensors), microbatch_size):
Collaborator

Not sure if we need this flow.
We already removed the extra padding logic in the processing flow above.

Collaborator Author

You are right, no need to have it. Removed!

Padding is now removed at the source in tokenize_sft_batch, so the
training loop doesn't need microbatch trimming logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Kovbo Kovbo merged commit 5207db0 into main Feb 26, 2026
2 checks passed
