GRPO/GKD on top of Megatron, orchestrated by Ray. The student/actor is trained with Megatron, generates completions with vLLM, and — for GKD — is distilled with a teacher model.
# via the helper scripts
CUDA_VISIBLE_DEVICES=0,1,2,3 bash examples/ray/gkd/run.sh
# or directly
megatron rlhf --use_ray true --config examples/ray/gkd/rollout_colocate_teacher_colocate.yamlThe YAML is split into a top-level section (shared args) and per-role groups
(train, rollout, and optionally teacher). Each group's gpus: field sets how
many GPUs that role uses; CUDA_VISIBLE_DEVICES must expose at least the total
number of GPUs the chosen placement needs (see below).
The gkd/ folder ships three ready-to-run configs. The file name encodes the two
independent choices — rollout placement and teacher mode:
| file | rollout | teacher |
|---|---|---|
rollout_colocate_teacher_colocate.yaml |
colocate (shares train GPUs) | colocated (shares train GPUs) |
rollout_separate_teacher_colocate.yaml |
separate (own GPUs) | colocated (shares train GPUs) |
rollout_colocate_teacher_standalone.yaml |
colocate (shares train GPUs) | standalone vLLM replicas (own GPUs) |
This is controlled by colocate_groups plus each role's gpus.
| Placement | colocate_groups |
GPUs needed | When to use |
|---|---|---|---|
| colocate | [[train, rollout]] |
train.gpus — all roles in the group must set the same gpus (one shared set) |
default; fewer GPUs, train and rollout time-share the same devices |
| separate | omit | train.gpus + rollout.gpus (disjoint sets) |
more GPUs, rollout overlaps with training |
-
colocate — train and rollout live on the same devices and take turns. Set
offload_model/offload_optimizer(+offload_teacher_modelfor GKD) andsleep_level: 1so the idle role releases GPU memory to the active one. Example:train.gpus=4,rollout.gpus=4,colocate_groups: [[train, rollout]]→ 4 GPUs total, with TP2 giving DP2. -
separate — train and rollout occupy disjoint GPU sets; weights are pushed to the rollout engine every step.
Pick exactly one. gkd_logits_topk: K selects top-k distillation;
omit it for full-vocab distillation.
| Mode | How to configure | top-k | full-vocab | Status |
|---|---|---|---|---|
Colocated teacher_model |
set top-level teacher_model: (+ offload_teacher_model: true) |
✅ | ✅ | supported |
| Standalone teacher replicas | add a teacher: group with gpus, model, and vllm_engine_kwargs.max_logprobs |
✅ | ❌ | supported |
2a. Colocated teacher (rollout_colocate_teacher_colocate.yaml, rollout_separate_teacher_colocate.yaml)
The teacher shares the train GPUs and is offloaded to CPU between teacher forwards. It is the only mode that supports full-vocab distillation, and it works with both colocate and separate rollout placements.
teacher_model: Qwen/Qwen3.5-4B
offload_teacher_model: true
gkd_logits_topk: 64 # omit for full-vocabThe teacher runs as its own set of Ray-managed vLLM replicas on separate GPUs and returns prompt top-k logprobs; the driver fetches them per step.
gkd_logits_topk: 64 # REQUIRED — replicas are top-k only
# do NOT set top-level teacher_model here (that would also load a colocated teacher)
teacher:
gpus: 4
model: Qwen/Qwen3.5-4B # the teacher checkpoint these replicas serve
vllm_engine_kwargs: {"max_logprobs": 64} # MUST be >= gkd_logits_topkmax_logprobsmust be>= gkd_logits_topk, or vLLM rejects theprompt_logprobsrequest.- GPUs needed = colocated train+rollout set +
teacher.gpus.
- top-k (
gkd_logits_topk: K): the teacher exposes only the top-K logprobs per position. Much lower memory, works for every teacher mode. - full-vocab (omit
gkd_logits_topk): distill the full vocabulary distribution. Colocated teacher only, and memory-heavy (caches per-rank vocab-sharded teacher logits). If you OOM: switch to top-k, lowermicro_batch_size
OPSD lets the teacher see a different (privileged) prompt than the student while
scoring the same on-policy response — e.g. the teacher sees the problem + a
reference solution. A dataset preprocessor (loaded via external_plugins) emits a
per-row teacher_prompt; the loss aligns the shared response tokens by mask.
external_plugins: examples/train/rlhf/opsd/opsd_plugin.py # registers teacher_prompt
teacher_model: Qwen/Qwen3.5-4B
gkd_logits_topk: 64- Supported in Ray with top-k (
gkd_logits_topk) for both a colocated teacher and standalone teacher replicas (teacher.gpus > 0). - No extra flag is needed: OPSD activates automatically when rows carry a non-empty
teacher_prompt; otherwise training falls back to plain GKD.
- Sequence length: the encoder budget is
max_length + max_completion_length(prompt is capped atmax_length, the on-policy completion adds up tomax_completion_length). Sizevllm_max_model_lenaccordingly. padding_free: truepacks a micro-batch into one sequence; pair withsequence_parallel: truewhentensor_model_parallel_size > 1.- Parallelism / DP: data parallel size =
gpus / (TP * PP * CP). e.g. 4 GPUs withtensor_model_parallel_size: 2→ DP2. - Memory release (colocate):
offload_model,offload_optimizer,offload_teacher_model, andsleep_level: 1are what let colocated roles fit. - GRPO specifics: rewards via
reward_funcs+external_plugins; sampling vianum_generations/steps_per_generation; noteacher_*settings.