
Bounded Action Space #81


Open
wants to merge 1 commit into main

Conversation

AntoineRichard

Hi there!

This PR adds support for bounded action spaces directly into the agent.
The main difference from clipping is that this ensures actions are sampled within a fixed range, and that rewards on actions are not computed on clipped actions.

To accommodate this, two options are provided:

  1. The "SAC style", where a gaussian based policy is bounded to the [-1, 1] range with a tanh on the mean and a tanh on the sampled actions. This is accounted for in the calculation of the action log dist. (Appendix C here: https://arxiv.org/pdf/1801.01290) Or one could look at: https://github.com/DLR-RM/stable-baselines3/blob/ea913a848242b2fca3cbcac255097e1d144207df/stable_baselines3/common/distributions.py#L207 ?
  2. A "beta policy", where rather than sampling on a probability distribution that's unbounded (the normal distribution for instance), we sample actions using a bounded probability distribution (the beta distribution). Original paper here: https://proceedings.mlr.press/v70/chou17a/chou17a.pdf . This is then rescaled to whatever is needed.

To allow for the smooth calculation of the KL divergence between two beta distributions, I had to slightly rework the transitions to store the distribution parameters rather than just the mean and std. Hence, for the normal distribution I save the mean and std_dev, while for the beta distribution I save alpha and beta.

Then, instead of manually computing the KL divergence, I let torch do the heavy lifting.
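
As a sketch of what that can look like (the function and tensor names below are illustrative, not the PR's API), torch's registered closed-form KL implementations are used directly on the stored parameters:

    # Illustrative sketch: computing the PPO KL term from stored distribution
    # parameters via torch.distributions.kl_divergence.
    import torch
    from torch.distributions import Normal, Beta, kl_divergence

    def mean_kl(dist_cls, old_params, new_params):
        """old_params/new_params are tuples of parameter tensors, e.g. (mean, std)
        for Normal or (alpha, beta) for Beta."""
        old_dist, new_dist = dist_cls(*old_params), dist_cls(*new_params)
        return kl_divergence(old_dist, new_dist).sum(dim=-1).mean()

    # dummy batches of shape (num_samples, action_dim)
    mean, std = torch.zeros(8, 3), torch.ones(8, 3)
    kl_normal = mean_kl(Normal, (mean, std), (mean + 0.1, std))

    alpha, beta = torch.full((8, 3), 2.0), torch.full((8, 3), 3.0)
    kl_beta = mean_kl(Beta, (alpha, beta), (alpha + 0.5, beta))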

Configuration-wise, it could look like this:

Beta

    policy = RslRlPpoActorCriticBetaCfg(
        init_noise_std=1.0,
        actor_hidden_dims=[32, 32],
        critic_hidden_dims=[32, 32],
        activation="elu",
        clip_actions=True,  # Note: this flag is ignored here, since the beta policy always bounds actions.
        clip_actions_range=[-1.0, 1.0],
    )

Normal

    policy = RslRlPpoActorCriticCfg(
        init_noise_std=1.0,
        actor_hidden_dims=[32, 32],
        critic_hidden_dims=[32, 32],
        activation="elu",
        clip_actions=True,  # Defaults to False
        clip_actions_range=[-1.0, 1.0],
    )

I know this significantly changes the way PPO updates are done, and it's a BREAKING CHANGE, so I totally understand if the beta policy doesn't make it into the main repo! Having a reliable action clipping mechanism would be nice, though :).

LMK if you want me to change anything, I'd be happy to!

Best,

Antoine

@AntoineRichard AntoineRichard changed the title Added action clipping SAC style, and created a BetaPolicy which has a… Bounded Action Space Apr 1, 2025
@ClemensSchwarke
Collaborator

Hi Antoine,
thanks for your PR! Distributions are definitely on our list for a future release. We will review your code once we are at that point.
Thanks :)

@Mayankm96
Member

If you'd like to help add this feature, I think it would make sense to have a general distribution class through which beta or gaussian essentially become different options users can configure. Eventually we want to add support for categorical distributions and other types as well.
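
Purely as a hypothetical sketch of that idea (the names below are illustrative, not an existing rsl_rl interface), such a class could look roughly like:

    # Hypothetical interface sketch, not existing rsl_rl code.
    from abc import ABC, abstractmethod
    import torch
    import torch.nn.functional as F
    from torch.distributions import Beta

    class ActionDistribution(ABC):
        """Common interface the actor-critic could instantiate from its config."""

        @abstractmethod
        def update(self, params: torch.Tensor) -> None:
            """Rebuild the distribution from the network's raw outputs."""

        @abstractmethod
        def sample(self) -> torch.Tensor: ...

        @abstractmethod
        def log_prob(self, actions: torch.Tensor) -> torch.Tensor: ...

    class BetaActionDistribution(ActionDistribution):
        def update(self, params):
            raw_alpha, raw_beta = params.chunk(2, dim=-1)
            # softplus + 1 keeps both concentration parameters positive and > 1
            self.dist = Beta(F.softplus(raw_alpha) + 1.0, F.softplus(raw_beta) + 1.0)

        def sample(self):
            return self.dist.rsample()

        def log_prob(self, actions):
            return self.dist.log_prob(actions).sum(dim=-1)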

@jingyang-huang

Hello, have you tested this modification? I tried to train with my env but it fails.

@AntoineRichard
Author

AntoineRichard commented May 15, 2025

Hey! Yes we use it to train all our robots at the University of Luxembourg! Is it crashing? Or just not training?

@jingyang-huang

jingyang-huang commented May 15, 2025

Hey! Yes we use it to train all our robots at the University of Luxembourg! Is it crashing? Or just not training?

Hey Antoine, glad to hear from you so fast! I merged your PR into my branch and used the Normal one to train my agent. I kept the env and rewards all the same and just changed the policy. However, my training fails and the mean reward did not increase, which is quite abnormal (the previous one works well). I was wondering if we need to adjust some hyperparameters to adapt to the new policy?
The specific training log is shown below, in which the orange curve is the newly modified PPO.
[image: training reward curves]

Looking forward to your reply!

@AntoineRichard
Author

And that's the Beta or the Squashed Gaussian?

@jingyang-huang

jingyang-huang commented May 15, 2025

And that's the Beta or the Squashed Gaussian?

It's the Squashed Gaussian. Besides, I have also tried the Beta, but sadly neither of them works for my case 😭

@AntoineRichard
Author

Any chance it's just not outputting values in a range that makes sense for you? I would recommend looking at the std_dev values.

@jingyang-huang

Any chance it's just not outputting values in a range that makes sense for you? I would recommend looking at the std_dev values.

Okay, thank you for your advice, Antoine. I will try to debug and find the problem.

@jingyang-huang

jingyang-huang commented May 17, 2025

Any chance it's just not outputting values in a range that makes sense for you? I would recommend looking at the std_dev values.

Hi Antoine, I was wondering how the Beta and Squashed Gaussian perform on your tasks. I increased num_steps_per_env to enable training for them, but it turns out they are not as good as the original one on my task and could not converge to the optimum. How about you, do they perform better?

[image: training reward curves]
