Make per-turn max_tokens handling (vs. truncation at the end) more explicit and better for multi-turn

Currently, we truncate to the max generation length only after the agent loop finishes. Ideally, the agent loop should stop generating naturally with `stop_reason="length"`; if `max_tokens` is handled in each turn of the agent loop, the post-hoc truncation should have little effect.

We should make it explicit that the max generation length is not a per-turn `max_tokens`, but the total generation length of the trajectory.
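
One way to picture the per-turn handling is a small budget tracker that derives each turn's `max_tokens` from what is left of the trajectory-level budget. This is only a sketch: `TurnBudget`, `per_turn_cap`, and `run_turns` are hypothetical names, not part of SkyRL, and it assumes the budget counts model-generated tokens only (whether observation tokens should count is a separate decision).

```python
from dataclasses import dataclass


@dataclass
class TurnBudget:
    """Tracks a trajectory-level generation budget (hypothetical helper)."""

    total_generate_length: int  # budget for generated tokens over the whole trajectory
    per_turn_cap: int           # upper bound on max_tokens for any single turn
    generated_so_far: int = 0

    def next_max_tokens(self) -> int:
        """max_tokens to request for the next turn; 0 means the budget is spent."""
        remaining = self.total_generate_length - self.generated_so_far
        return max(0, min(self.per_turn_cap, remaining))

    def record(self, num_generated: int) -> None:
        self.generated_so_far += num_generated


def run_turns(budget: TurnBudget, turn_lengths: list[int]) -> tuple[list[int], str]:
    """Simulate an agent loop; turn_lengths are the lengths the model would emit uncapped."""
    emitted = []
    for wanted in turn_lengths:
        max_tokens = budget.next_max_tokens()
        if max_tokens == 0:
            return emitted, "length"
        n = min(wanted, max_tokens)
        budget.record(n)
        emitted.append(n)
        if n < wanted:
            # The engine cut this turn off at max_tokens, so the loop stops here.
            return emitted, "length"
    return emitted, "complete"
```

With a budget of 1000 tokens and a per-turn cap of 400, turns that would emit 300, 300, and 500 tokens end with `stop_reason == "length"` on the third turn, and nothing is left for a final truncation to remove.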

This should also be tied to the engine config `max_model_len`.
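
As a rough illustration of that relationship, the sketch below clamps a requested `max_tokens` to the room left in the context window and checks the configured lengths up front. It assumes `max_model_len` bounds prompt plus generated tokens together; `clamp_max_tokens` and `check_lengths` are hypothetical helpers, not existing SkyRL or engine functions.

```python
def clamp_max_tokens(requested_max_tokens: int, context_length: int, max_model_len: int) -> int:
    """Return a max_tokens value that keeps context + generation within max_model_len."""
    room = max_model_len - context_length
    return max(0, min(requested_max_tokens, room))


def check_lengths(max_input_length: int, max_generate_length: int, max_model_len: int) -> None:
    """Fail early if the configured lengths cannot fit in the engine's context window."""
    if max_input_length + max_generate_length > max_model_len:
        raise ValueError(
            f"max_input_length ({max_input_length}) + max_generate_length "
            f"({max_generate_length}) exceeds max_model_len ({max_model_len})"
        )


# Example: 7800 tokens of context in an 8192-token window leaves room for 392 more.
print(clamp_max_tokens(1024, 7800, 8192))
```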

https://github.com/NovaSky-AI/SkyRL/blob/31fb6b86bc61d8939d640aea4475c622b565a366/skyrl-train/examples/terminal_bench/generator/terminal_bench_generator.py#L148-L156

https://github.com/NovaSky-AI/SkyRL/blob/31fb6b86bc61d8939d640aea4475c622b565a366/skyrl-train/skyrl_train/generators/skyrl_gym_generator.py#L280-L290

https://github.com/NovaSky-AI/SkyRL/blob/31fb6b86bc61d8939d640aea4475c622b565a366/skyrl-train/examples/mini_swe_agent/mini_swe_generator.py#L198-L208

	# Determine stop reason
	max_response_tokens = (
	    self.generator_cfg.sampling_params.max_generate_length
	    + self.generator_cfg.max_input_length
	    - initial_prompt_length
	)
	stop_reason = "complete"  # Default for trial completion
	if len(response_ids) > max_response_tokens:
	    stop_reason = "length"

	# need to truncate loss mask correctly for responses that go to max length
	if self.max_turns > 1:
	    # max total resp length = max tokens (max length of final turn generation) + max_input_length (max input for any generation turn) - len(original prompt)
	    max_response_tokens = max_tokens + max_input_length - initial_prompt_length
	else:
	    max_response_tokens = max_tokens

	if len(response_ids) > max_response_tokens:
	    stop_reason = "length"
	    response_ids = response_ids[:max_response_tokens]
	    loss_mask = loss_mask[:max_response_tokens]

	# Calculate maximum response tokens allowed
	max_response_tokens = max_tokens + max_input_length - initial_prompt_length

	# Determine stop reason
	stop_reason = "complete"  # Default for trial completion
	if len(response_ids) > max_response_tokens:
	    stop_reason = "length"

	# Truncate to maximum allowed length
	response_ids = response_ids[:max_response_tokens]
	loss_mask = loss_mask[:max_response_tokens]
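
The three snippets above compute essentially the same bound. A possible consolidation, with concrete numbers, might look like the sketch below; `max_response_tokens_for` and `truncate_response` are hypothetical names, and the example values (1024, 4096, 512) are made up for illustration.

```python
def max_response_tokens_for(
    max_tokens: int, max_input_length: int, initial_prompt_length: int, max_turns: int
) -> int:
    """Upper bound on the total response length, following the formula in the snippets above."""
    if max_turns > 1:
        return max_tokens + max_input_length - initial_prompt_length
    return max_tokens


def truncate_response(response_ids: list[int], loss_mask: list[int], limit: int):
    """Truncate ids and mask together; report 'length' only if something was cut."""
    stop_reason = "length" if len(response_ids) > limit else "complete"
    return response_ids[:limit], loss_mask[:limit], stop_reason


# Multi-turn: 1024 + 4096 - 512 = 4608, far more than one turn's max_tokens.
multi = max_response_tokens_for(1024, 4096, 512, max_turns=4)
# Single-turn: the bound is just max_tokens.
single = max_response_tokens_for(1024, 4096, 512, max_turns=1)
print(multi, single)
```

The multi-turn bound is much larger than a single turn's `max_tokens`, which is why the setting reads as a trajectory-level limit rather than a per-turn one.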

Make per-turn max_tokens handling (vs. truncation at the end) more explicit and better for multi-turn #406
