Ray-based Megatron RLHF examples (GKD & GRPO)

These examples run GRPO and GKD on top of Megatron, orchestrated by Ray. The student (actor) is trained with Megatron, generates completions with vLLM, and, for GKD, is distilled from a teacher model.

How to run

# via the helper scripts
CUDA_VISIBLE_DEVICES=0,1,2,3 bash examples/ray/gkd/run.sh

# or directly
megatron rlhf --use_ray true --config examples/ray/gkd/rollout_colocate_teacher_colocate.yaml

The YAML is split into a top-level section (shared args) and per-role groups (train, rollout, and optionally teacher). Each group's gpus: field sets how many GPUs that role uses; CUDA_VISIBLE_DEVICES must expose at least the total number of GPUs the chosen placement needs (see below).

The gkd/ folder ships three ready-to-run configs. The file name encodes the two independent choices — rollout placement and teacher mode:

| File | Rollout | Teacher |
| --- | --- | --- |
| `rollout_colocate_teacher_colocate.yaml` | colocate (shares train GPUs) | colocated (shares train GPUs) |
| `rollout_separate_teacher_colocate.yaml` | separate (own GPUs) | colocated (shares train GPUs) |
| `rollout_colocate_teacher_standalone.yaml` | colocate (shares train GPUs) | standalone vLLM replicas (own GPUs) |

1. GPU placement: colocate vs separate

This is controlled by colocate_groups plus each role's gpus.

| Placement | `colocate_groups` | GPUs needed | When to use |
| --- | --- | --- | --- |
| colocate | `[[train, rollout]]` | `train.gpus` (all roles in the group must set the same `gpus`; one shared set) | default; fewer GPUs, train and rollout time-share the same devices |
| separate | omit | `train.gpus + rollout.gpus` (disjoint sets) | more GPUs, rollout overlaps with training |

colocate: train and rollout live on the same devices and take turns. Set offload_model and offload_optimizer (plus offload_teacher_model for GKD) and sleep_level: 1 so the idle role releases GPU memory to the active one. Example: train.gpus=4, rollout.gpus=4, colocate_groups: [[train, rollout]] → 4 GPUs total; with tensor_model_parallel_size: 2 this gives a data-parallel size of 2.
separate: train and rollout occupy disjoint GPU sets; weights are pushed to the rollout engine every step.
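
The GPU arithmetic for the two placements can be sketched as a small helper. This is only an illustration of the rules above, not code from the repository; the function name `gpus_needed` and its arguments are assumptions made for the example.

```python
def gpus_needed(train_gpus: int, rollout_gpus: int, colocate: bool) -> int:
    """Return how many GPUs train + rollout need under the given placement.

    Hypothetical helper: it mirrors the placement table, nothing more.
    """
    if colocate:
        # Roles in one colocate group share a single device set,
        # so their `gpus` values have to match.
        if train_gpus != rollout_gpus:
            raise ValueError("colocated roles must set the same gpus")
        return train_gpus
    # Separate placement: disjoint device sets, so the counts add up.
    return train_gpus + rollout_gpus


print(gpus_needed(4, 4, colocate=True))   # 4
print(gpus_needed(4, 4, colocate=False))  # 8
```

CUDA_VISIBLE_DEVICES then has to expose at least the returned number of devices.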

2. Teacher modes (GKD only)

Pick exactly one. gkd_logits_topk: K selects top-k distillation; omit it for full-vocab distillation.

| Mode | How to configure | top-k | full-vocab | Status |
| --- | --- | --- | --- | --- |
| Colocated `teacher_model` | set top-level `teacher_model:` (+ `offload_teacher_model: true`) | ✅ | ✅ | supported |
| Standalone teacher replicas | add a `teacher:` group with `gpus`, `model`, and `vllm_engine_kwargs.max_logprobs` | ✅ | ❌ | supported |

2a. Colocated teacher (`rollout_colocate_teacher_colocate.yaml`, `rollout_separate_teacher_colocate.yaml`)

The teacher shares the train GPUs and is offloaded to CPU between teacher forwards. It is the only mode that supports full-vocab distillation, and it works with both colocate and separate rollout placements.

teacher_model: Qwen/Qwen3.5-4B
offload_teacher_model: true
gkd_logits_topk: 64      # omit for full-vocab

2b. Standalone teacher replicas (`rollout_colocate_teacher_standalone.yaml`)

The teacher runs as its own set of Ray-managed vLLM replicas on separate GPUs and returns prompt top-k logprobs; the driver fetches them per step.

gkd_logits_topk: 64                 # REQUIRED — replicas are top-k only
# do NOT set top-level teacher_model here (that would also load a colocated teacher)
teacher:
  gpus: 4
  model: Qwen/Qwen3.5-4B            # the teacher checkpoint these replicas serve
  vllm_engine_kwargs: {"max_logprobs": 64}   # MUST be >= gkd_logits_topk

max_logprobs must be >= gkd_logits_topk, or vLLM rejects the prompt_logprobs request.
GPUs needed = colocated train+rollout set + teacher.gpus.
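
Both constraints for standalone replicas can be checked before launching. The sketch below is a hypothetical pre-flight check, not part of the launcher; `check_standalone_teacher` and the `shared_gpus` argument (the size of the colocated train+rollout set) are names invented for this example, and the numbers are illustrative.

```python
from typing import Optional


def check_standalone_teacher(gkd_logits_topk: Optional[int], max_logprobs: int,
                             shared_gpus: int, teacher_gpus: int) -> int:
    """Validate a standalone-teacher setup and return the total GPU count."""
    if gkd_logits_topk is None:
        # Standalone replicas only return top-k logprobs.
        raise ValueError("standalone teacher replicas require gkd_logits_topk")
    if max_logprobs < gkd_logits_topk:
        # Otherwise vLLM rejects the prompt_logprobs request.
        raise ValueError("max_logprobs must be >= gkd_logits_topk")
    # Teacher replicas sit on their own GPUs, next to the shared set.
    return shared_gpus + teacher_gpus


print(check_standalone_teacher(64, 64, shared_gpus=4, teacher_gpus=4))  # 8
```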

3. top-k vs full-vocab distillation

top-k (gkd_logits_topk: K): the teacher exposes only the top-K logprobs per position. Much lower memory, works for every teacher mode.
full-vocab (omit gkd_logits_topk): distill the full vocabulary distribution. Colocated teacher only, and memory-heavy (it caches per-rank vocab-sharded teacher logits). If you run out of memory, switch to top-k or lower micro_batch_size.
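
To get a feel for the gap between the two modes, here is a rough back-of-the-envelope estimate. It is a sketch under loud assumptions, not a measurement of the actual implementation: it assumes 4-byte values, 8-byte token indices for top-k, an evenly divisible vocabulary across tensor-parallel ranks, and a hypothetical vocabulary of 150,000 entries. The real cache layout and dtypes may differ.

```python
from typing import Optional


def teacher_cache_bytes(num_tokens: int, vocab_size: int, topk: Optional[int] = None,
                        tp_size: int = 1, value_bytes: int = 4,
                        index_bytes: int = 8) -> int:
    """Estimate per-rank bytes of cached teacher outputs (assumed layout)."""
    if topk is None:
        # Full-vocab: each tensor-parallel rank holds its vocab shard per token.
        return num_tokens * (vocab_size // tp_size) * value_bytes
    # Top-k: K (value, index) pairs per token, independent of vocab size.
    return num_tokens * topk * (value_bytes + index_bytes)


full = teacher_cache_bytes(8192, 150_000, tp_size=2)
topk = teacher_cache_bytes(8192, 150_000, topk=64, tp_size=2)
print(f"full-vocab: {full / 2**20:.0f} MiB, top-64: {topk / 2**20:.0f} MiB")
```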

4. OPSD (On-Policy / privileged Distillation)

OPSD lets the teacher see a different (privileged) prompt than the student while scoring the same on-policy response — e.g. the teacher sees the problem + a reference solution. A dataset preprocessor (loaded via external_plugins) emits a per-row teacher_prompt; the loss aligns the shared response tokens by mask.

external_plugins: examples/train/rlhf/opsd/opsd_plugin.py   # registers teacher_prompt
teacher_model: Qwen/Qwen3.5-4B
gkd_logits_topk: 64

Supported in Ray with top-k (gkd_logits_topk) for both a colocated teacher and standalone teacher replicas (teacher.gpus > 0).
No extra flag is needed: OPSD activates automatically when rows carry a non-empty teacher_prompt; otherwise training falls back to plain GKD.
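
The per-row behaviour can be illustrated with plain dictionaries. This is a hypothetical sketch and not the interface of opsd_plugin.py: the function names and the `problem` and `solution` fields are assumptions; only the `teacher_prompt` field name comes from the text above.

```python
def add_teacher_prompt(row: dict) -> dict:
    """Attach a privileged prompt (problem + reference solution) when available."""
    out = dict(row)  # do not mutate the input row
    solution = row.get("solution")
    if solution:
        out["teacher_prompt"] = (
            f"{row['problem']}\n\nReference solution:\n{solution}"
        )
    return out


def uses_opsd(row: dict) -> bool:
    """OPSD applies only to rows with a non-empty teacher_prompt."""
    return bool(row.get("teacher_prompt"))


with_ref = add_teacher_prompt({"problem": "What is 2 + 2?", "solution": "4"})
without_ref = add_teacher_prompt({"problem": "What is 3 + 3?"})
print(uses_opsd(with_ref), uses_opsd(without_ref))  # True False
```

Rows like `without_ref` would be trained with plain GKD, as described above.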

5. Things to know (common knobs & pitfalls)

Sequence length: the encoder budget is max_length + max_completion_length (the prompt is capped at max_length, and the on-policy completion adds up to max_completion_length). Size vllm_max_model_len accordingly.
padding_free: true packs a micro-batch into one sequence; pair with sequence_parallel: true when tensor_model_parallel_size > 1.
Parallelism / DP: data parallel size = gpus / (TP * PP * CP). e.g. 4 GPUs with tensor_model_parallel_size: 2 → DP2.
Memory release (colocate): offload_model, offload_optimizer, offload_teacher_model, and sleep_level: 1 are what let colocated roles fit.
GRPO specifics: rewards via reward_funcs + external_plugins; sampling via num_generations / steps_per_generation; no teacher_* settings.
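
The sequence-length and data-parallel rules above are simple arithmetic; a minimal sketch follows. The helper names are hypothetical, and the length values in the usage lines are made-up examples rather than defaults.

```python
def data_parallel_size(gpus: int, tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    """Data-parallel size = gpus / (TP * PP * CP)."""
    model_parallel = tp * pp * cp
    if gpus % model_parallel != 0:
        raise ValueError("gpus must be divisible by TP * PP * CP")
    return gpus // model_parallel


def min_vllm_max_model_len(max_length: int, max_completion_length: int) -> int:
    """Smallest vllm_max_model_len that fits a capped prompt plus a full completion."""
    return max_length + max_completion_length


print(data_parallel_size(4, tp=2))         # 2
print(min_vllm_max_model_len(2048, 1024))  # 3072
```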
