# Bug Report: `RuntimeError: normal expects all elements of std >= 0.0` during PPO update on quadruped robot

## Description

Training crashes during the PPO update with `RuntimeError: normal expects all elements of std >= 0.0` in the policy's distribution sampling. Despite `nan_guard` being enabled, no NaN dump is generated (no `/tmp/mjlab/nan_dumps` directory is created), suggesting the invalid values are negative or zero std rather than NaN/Inf.
## Robot & Training Configuration

| Parameter    | Value                          |
|--------------|--------------------------------|
| Robot Type   | Quadruped (wheeled-legged)     |
| Max Velocity | linear 2 m/s, angular 3 rad/s  |
| Terrain      | Flat ground                    |
| Algorithm    | PPO                            |
## Error Trace

```
Traceback (most recent call last):
  File "/home/me/unitree_rl/mjlab/src/mjlab/scripts/train.py", line 256, in <module>
    main()
  File "/home/me/unitree_rl/mjlab/src/mjlab/scripts/train.py", line 252, in main
    launch_training(task_id=chosen_task, args=args)
  File "/home/me/unitree_rl/mjlab/src/mjlab/scripts/train.py", line 203, in launch_training
    run_train(task_id, args, log_dir)
  File "/home/me/unitree_rl/mjlab/src/mjlab/scripts/train.py", line 173, in run_train
    runner.learn(
  File "/home/me/unitree_rl/rsl_rl/rsl_rl/runners/on_policy_runner.py", line 108, in learn
    loss_dict = self.alg.update()
                ^^^^^^^^^^^^^^^^^
  File "/home/me/unitree_rl/rsl_rl/rsl_rl/algorithms/ppo.py", line 256, in update
    self.actor(
  File ".../torch/nn/modules/module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/me/unitree_rl/rsl_rl/rsl_rl/models/mlp_model.py", line 106, in forward
    return self.distribution.sample()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/me/unitree_rl/rsl_rl/rsl_rl/modules/distribution.py", line 180, in sample
    return self._distribution.sample()  # type: ignore
  File ".../torch/distributions/normal.py", line 81, in sample
    return torch.normal(self.loc.expand(shape), self.scale.expand(shape))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: normal expects all elements of std >= 0.0
```
## Key Observations

- **Crash location:** `Normal.sample()` receives `std <= 0` (likely from the policy's `log_std` exp or softplus)
- **nan_guard ineffective:** no NaN dump produced → the invalid values are not NaN/Inf, but negative or zero std
- **Potential trigger:** `action_clip` is set to `None`, removing the bounds protection that might prevent extreme policy outputs
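For reference, the crash can be reproduced in isolation: `torch.normal` validates its `std` argument element-wise, so a single negative entry (e.g. from a drifting `log_std` head) raises the exact error in the trace above. This minimal sketch is independent of mjlab/rsl_rl:

```python
import torch

# A batch of stds where one element has gone slightly negative,
# mimicking a collapsed log_std parameter after exp()/softplus underflow
# or a mis-parameterized std head.
loc = torch.zeros(3)
std = torch.tensor([0.1, 0.2, -1e-6])

try:
    torch.normal(loc, std)
except RuntimeError as e:
    print(e)  # normal expects all elements of std >= 0.0
```

Note that `std == 0` passes this check (the message says `>= 0.0`), so the crashing tensor must contain strictly negative values.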
## Environment

| Item    | Version               |
|---------|-----------------------|
| OS      | Ubuntu 20.04          |
| Python  | 3.11.14               |
| PyTorch | (conda env: rl_mjlab) |
| Library | mjlab + rsl_rl        |
## Steps to Reproduce

1. Configure the quadruped robot with max velocity 2 m/s on flat terrain
2. Set `action_clip: None` (disable action clipping)
3. Enable `nan_guard` in the config
4. Start training with PPO
5. Wait for the crash during `runner.learn()` → `alg.update()`
## Actual Behavior

- Training crashes with `std <= 0`
- `nan_guard` stays silent (no dump directory created)
- Crash occurs mid-training at a seemingly random iteration, not at initialization
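Since `nan_guard` only flags NaN/Inf, a nonpositive std slips past it and the first visible symptom is the crash inside `torch.normal`. A hypothetical pre-sampling check (the function name and diagnostics are illustrative, not mjlab API) could fail earlier and with more context:

```python
import torch

def check_std(std: torch.Tensor, step: int) -> None:
    """Hypothetical guard run just before distribution sampling.

    nan_guard-style checks test isnan/isinf and therefore miss
    negative or zero std; this check fails loudly with the offending
    statistics instead of crashing deep inside torch.normal.
    """
    bad = std <= 0
    if bad.any():
        raise RuntimeError(
            f"step {step}: {int(bad.sum())} std element(s) <= 0, "
            f"min={std.min().item():.3e}"
        )

# Example: one std element has gone negative mid-training.
std = torch.tensor([0.15, -2e-7, 0.3])
try:
    check_std(std, step=1234)
except RuntimeError as e:
    print(e)
```

Logging `step` and the minimum std at failure time would also help confirm whether `log_std` drifts gradually or jumps in a single update.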
## Suggested Investigation

- **Root cause:** why does `log_std` collapse to `-inf` or very negative values?
- **Missing clipping:** does disabling `action_clip` and `obs_clip` allow unbounded actions → unstable gradients?
- **Defensive programming:** clamp std to `min=1e-6` before sampling as a safeguard
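The clamping safeguard suggested above could look like the following sketch. It is not the actual rsl_rl code; the function and attribute names are illustrative, and it assumes the policy parameterizes std as `exp(log_std)`:

```python
import torch
from torch.distributions import Normal

STD_MIN = 1e-6  # floor suggested above

def make_action_distribution(mean: torch.Tensor, log_std: torch.Tensor) -> Normal:
    """Build the policy's Gaussian with a clamped std.

    Clamping after exp() guarantees std >= STD_MIN even when log_std
    has drifted to a large negative value (where exp() underflows to
    0.0), so Normal.sample() cannot hit
    "normal expects all elements of std >= 0.0".
    """
    std = torch.exp(log_std).clamp_min(STD_MIN)
    return Normal(mean, std)

# Even with extreme log_std values, sampling now succeeds.
dist = make_action_distribution(torch.zeros(3), torch.tensor([-100.0, 0.0, -1e9]))
action = dist.sample()
```

An alternative is clamping `log_std` itself (e.g. to `[-20, 2]`) before the exp, which also keeps the entropy and log-prob terms finite; the clamp treats the symptom either way, so the root-cause questions above still apply.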