Open
Conversation
Greptile SummaryThis PR implements proper separation of termination and truncation signals to enable bootstrapped value estimation in RL training. The changes allow the policy to distinguish between true episode endings (reaching goals, collisions) and artificial timeouts (episode length limits), which can improve training by using bootstrapped values on truncations. Key Changes
Issues Found
Confidence Score: 3/5
Important Files Changed
Flowchartflowchart TD
A[c_step starts] --> B[Increment timestep<br/>Reset terminals and truncations arrays]
B --> C[Move agents and compute collision states]
C --> D{Collision detected?}
D -->|Vehicle/Offroad collision<br/>with STOP/REMOVE behavior| E[Set terminal flag]
D -->|No collision| F[Continue]
E --> F
F --> G{Goal reached?}
G -->|GOAL_RESPAWN mode| H[Set terminal flag<br/>and respawn_agent]
G -->|GOAL_STOP mode| I[Set terminal flag<br/>and stop agent]
G -->|GOAL_GENERATE_NEW| J[sample_new_goal<br/>Set terminal flag]
G -->|Not reached| K[Check episode termination]
H --> K
I --> K
J --> K
K --> L{timestep+1 >= episode_length<br/>OR all agents respawned?}
L -->|Yes| M[Set ALL truncation flags<br/>add_log and c_reset]
L -->|No| N[compute_observations<br/>Return to Python]
N --> O[PufferRL: Bootstrap on truncations<br/>Add gamma times V from previous step]
Last reviewed commit: 615d812 |
| break; | ||
| } | ||
| } | ||
| int reached_time_limit = (env->timestep + 1) >= env->episode_length; |
There was a problem hiding this comment.
off-by-one error: should be env->timestep >= env->episode_length (not +1). original code used ==, but this causes truncation one step early
Suggested change
| int reached_time_limit = (env->timestep + 1) >= env->episode_length; | |
| int reached_time_limit = env->timestep >= env->episode_length; |
615d812 to
c4a0e02
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Reimplementation of already merged work .
Bifurcates between truncations and termination to let RL policy use bootstrapped value in case of truncation, potentially aiding training.
Small modification -