
Commit cb93fb0

Complete PR #771 (Tune preference comparison example hyperparameters) (#782)
* Tune preference comparison example hyperparameters

The preference comparison example previously did not show significant learning. It usually ended with a reward < -1000, which can be considered "failed" in the Pendulum environment. This commit updates the parameters to avoid this.

It could be argued that hyperparameter optimization for the examples is bad, since it gives a skewed impression of the library. I think it is acceptable as long as we acknowledge that the parameters were optimized, and a working example is a much nicer starting point.

I tuned the hyperparameters with a mix of syne_tune [1] and manual tuning. Since training can have very high variance, I repeated each training run multiple (up to 100) times and used multi-fidelity optimization (PASHA and ASHA) to find a good configuration. I set the objective to the 90% upper confidence bound of the mean final-evaluation reward over all the training runs. Unfortunately, the optimization process was a bit messy since I was just getting started with syne_tune, so it is difficult to provide a full script that cleanly reproduces the results. I used something akin to this configuration space:

```py
import syne_tune.config_space as cs

config_space = {
    "reward_epochs": cs.randint(1, 20),
    "ppo_clip_range": cs.uniform(0.0, 0.3),
    "ppo_ent_coef": cs.uniform(0.0, 0.01),
    "ppo_gae_lambda": cs.uniform(0.9, 0.99),
    "ppo_n_epochs": cs.randint(5, 25),
    "discount_factor": cs.uniform(0.9, 1.0),
    "use_sde": cs.choice(["true", "false"]),
    "sde_sample_freq": cs.randint(1, 5),
    "ppo_lr": cs.loguniform(1e-4, 5e-3),
    "exploration_frac": cs.uniform(0, 0.1),
    "num_iterations": cs.randint(5, 100),
    "initial_comparison_frac": cs.uniform(0.05, 0.25),
    "initial_epoch_multiplier": cs.randint(1, 4),
    "query_schedule": cs.choice(["constant", "hyperbolic", "inverse_quadratic"]),
    "total_timesteps": 50_000,
    "total_comparisons": 200,
    "max_evals": 100,
}
```

and this is the configuration I selected in the end:

```py
{
    "reward_epochs": 10,
    "ppo_clip_range": 0.1,
    "ppo_ent_coef": 0.01,
    "ppo_gae_lambda": 0.90,
    "ppo_n_epochs": 15,
    "discount_factor": 0.97,
    "use_sde": "false",
    "sde_sample_freq": 1,
    "ppo_lr": 2e-3,
    "exploration_frac": 0.05,
    "num_iterations": 60,
    "initial_comparison_frac": 0.10,
    "initial_epoch_multiplier": 4,
    "query_schedule": "hyperbolic",
}
```

Here are the (rounded) evaluation results of the 100 runs with this configuration:

```
[-155, -100, -132, -150, -164, -110, -195, -194, -168, -148,
 -177, -113, -176, -205, -106, -169, -123, -104, -151, -169,
 -157, -184, -130, -151, -108, -111, -202, -142, -198, -138,
 -178, -104, -174, -149, -113, -107, -122, -198, -428, -221,
 -217, -141, -192, -158, -139, -219, -230, -209, -141, -173,
 -118, -176, -108, -290, -810, -182, -159, -178, -247, -205,
 -165, -672, -250, -138, -166, -282, -133, -147, -111, -145,
 -148, -116, -436, -140, -190, -137, -194, -177, -193, -1043,
 -243, -183, -156, -183, -184, -186, -141, -144, -194, -112,
 -178, -146, -140, -130, -143, -618, -402, -236, -171, -163]
```

Mean (before rounding): -196.49
Fraction of runs < -800: 2/100
Fraction of runs > -200: 79/100

This is far from perfect. I didn't include all parameters in the optimization, and the 50,000 steps and 200 queries are likely overkill. Still, it significantly improves the example that users see first.

I only changed the example on the main documentation page, not the notebooks. Those are already out of sync with the main example, so I am not sure how best to proceed with them.
[1] https://github.com/awslabs/syne-tune

* Add changes to notebook
* Change number notation in cell
* Clear outputs from notebook
* Remove empty code cell
* Fix variable name in preference_comparison
* Run black
* Remove whitespace

---------

Co-authored-by: Timo Kaufmann <[email protected]>
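For readers who want to reproduce the selection criterion, here is a minimal sketch (not part of the commit) of the objective described above: the 90% upper confidence bound of the mean final-evaluation reward across repeated runs of one configuration. It assumes a normal approximation for the mean; the function name and the fixed z-value are illustrative choices, not taken from the commit.

```py
import numpy as np


def ucb90_of_mean(final_rewards):
    """90% upper confidence bound of the mean final-evaluation reward.

    Hypothetical helper, not from the commit: it assumes the mean over
    repeated training runs is approximately normal and uses the one-sided
    90% quantile of the standard normal (z ~= 1.2816).
    """
    rewards = np.asarray(final_rewards, dtype=float)
    mean = rewards.mean()
    stderr = rewards.std(ddof=1) / np.sqrt(len(rewards))  # standard error of the mean
    return mean + 1.2816 * stderr


# Usage: pass the final-evaluation rewards of the repeated runs for one
# candidate configuration, then maximize this value across configurations.
```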
1 parent 5c85ebf commit cb93fb0

2 files changed (+55, -31 lines)

docs/algorithms/preference_comparisons.rst

Lines changed: 19 additions & 8 deletions
```diff
@@ -18,7 +18,8 @@ to predict the preference comparison.
 Example
 =======
 
-Detailed example notebook: :doc:`../tutorials/5_train_preference_comparisons`
+You can copy this example to train `PPO <https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html>`_ on `Pendulum <https://www.gymlibrary.dev/environments/classic_control/pendulum/>`_ using a reward model trained on 200 synthetic preference comparisons.
+For a more detailed example, refer to :doc:`../tutorials/5_train_preference_comparisons`.
 
 .. testcode::
     :skipif: skip_doctests
@@ -50,7 +51,7 @@ Detailed example notebook: :doc:`../tutorials/5_train_preference_comparisons`
     reward_trainer = preference_comparisons.BasicRewardTrainer(
         preference_model=preference_model,
         loss=preference_comparisons.CrossEntropyRewardLoss(),
-        epochs=3,
+        epochs=10,
         rng=rng,
     )
 
@@ -62,29 +63,39 @@ Detailed example notebook: :doc:`../tutorials/5_train_preference_comparisons`
         ),
         env=venv,
         n_steps=2048 // venv.num_envs,
+        clip_range=0.1,
+        ent_coef=0.01,
+        gae_lambda=0.95,
+        n_epochs=10,
+        gamma=0.97,
+        learning_rate=2e-3,
     )
 
     trajectory_generator = preference_comparisons.AgentTrainer(
         algorithm=agent,
         reward_fn=reward_net,
         venv=venv,
-        exploration_frac=0.0,
+        exploration_frac=0.05,
         rng=rng,
     )
 
     pref_comparisons = preference_comparisons.PreferenceComparisons(
         trajectory_generator,
         reward_net,
-        num_iterations=5,
+        num_iterations=5,  # Set to 60 for better performance
         fragmenter=fragmenter,
         preference_gatherer=gatherer,
         reward_trainer=reward_trainer,
-        initial_epoch_multiplier=1,
+        initial_epoch_multiplier=4,
+        initial_comparison_frac=0.1,
+        query_schedule="hyperbolic",
     )
-    pref_comparisons.train(total_timesteps=5_000, total_comparisons=200)
+    pref_comparisons.train(total_timesteps=50_000, total_comparisons=200)
 
-    reward, _ = evaluate_policy(agent.policy, venv, 10)
-    print("Reward:", reward)
+    n_eval_episodes = 10
+    reward_mean, reward_std = evaluate_policy(agent.policy, venv, n_eval_episodes)
+    reward_stderr = reward_std / np.sqrt(n_eval_episodes)
+    print(f"Reward: {reward_mean:.0f} +/- {reward_stderr:.0f}")
 
 .. testoutput::
     :hide:
```

docs/tutorials/5_train_preference_comparisons.ipynb

Lines changed: 36 additions & 23 deletions
```diff
@@ -23,13 +23,11 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"import random\n",
 "from imitation.algorithms import preference_comparisons\n",
 "from imitation.rewards.reward_nets import BasicRewardNet\n",
 "from imitation.util.networks import RunningNorm\n",
 "from imitation.util.util import make_vec_env\n",
 "from imitation.policies.base import FeedForward32Policy, NormalizeFeaturesExtractor\n",
-"import gym\n",
 "from stable_baselines3 import PPO\n",
 "import numpy as np\n",
 "\n",
@@ -54,6 +52,12 @@
 "    rng=rng,\n",
 ")\n",
 "\n",
+"\n",
+"# Several hyperparameters (reward_epochs, ppo_clip_range, ppo_ent_coef,\n",
+"# ppo_gae_lambda, ppo_n_epochs, discount_factor, use_sde, sde_sample_freq,\n",
+"# ppo_lr, exploration_frac, num_iterations, initial_comparison_frac,\n",
+"# initial_epoch_multiplier, query_schedule) used in this example have been\n",
+"# approximately fine-tuned to reach a reasonable level of performance.\n",
 "agent = PPO(\n",
 "    policy=FeedForward32Policy,\n",
 "    policy_kwargs=dict(\n",
@@ -64,31 +68,35 @@
 "    seed=0,\n",
 "    n_steps=2048 // venv.num_envs,\n",
 "    batch_size=64,\n",
-"    ent_coef=0.0,\n",
-"    learning_rate=0.0003,\n",
+"    ent_coef=0.01,\n",
+"    learning_rate=2e-3,\n",
+"    clip_range=0.1,\n",
+"    gae_lambda=0.95,\n",
+"    gamma=0.97,\n",
 "    n_epochs=10,\n",
 ")\n",
 "\n",
 "trajectory_generator = preference_comparisons.AgentTrainer(\n",
 "    algorithm=agent,\n",
 "    reward_fn=reward_net,\n",
 "    venv=venv,\n",
-"    exploration_frac=0.0,\n",
+"    exploration_frac=0.05,\n",
 "    rng=rng,\n",
 ")\n",
 "\n",
 "pref_comparisons = preference_comparisons.PreferenceComparisons(\n",
 "    trajectory_generator,\n",
 "    reward_net,\n",
-"    num_iterations=5,\n",
+"    num_iterations=5,  # Set to 60 for better performance\n",
 "    fragmenter=fragmenter,\n",
 "    preference_gatherer=gatherer,\n",
 "    reward_trainer=reward_trainer,\n",
 "    fragment_length=100,\n",
 "    transition_oversampling=1,\n",
 "    initial_comparison_frac=0.1,\n",
 "    allow_variable_horizon=False,\n",
-"    initial_epoch_multiplier=1,\n",
+"    initial_epoch_multiplier=4,\n",
+"    query_schedule=\"hyperbolic\",\n",
 ")"
 ]
 },
@@ -106,8 +114,8 @@
 "outputs": [],
 "source": [
 "pref_comparisons.train(\n",
-"    total_timesteps=5_000,  # For good performance this should be 1_000_000\n",
-"    total_comparisons=200,  # For good performance this should be 5_000\n",
+"    total_timesteps=5_000,\n",
+"    total_comparisons=200,\n",
 ")"
 ]
 },
@@ -126,15 +134,14 @@
 "source": [
 "from imitation.rewards.reward_wrapper import RewardVecEnvWrapper\n",
 "\n",
-"\n",
 "learned_reward_venv = RewardVecEnvWrapper(venv, reward_net.predict_processed)"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Now we can train an agent, that only sees those learned reward."
+"Next, we train an agent that sees only the shaped, learned reward."
 ]
 },
 {
@@ -143,20 +150,24 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"from stable_baselines3 import PPO\n",
-"from stable_baselines3.ppo import MlpPolicy\n",
-"\n",
 "learner = PPO(\n",
-"    policy=MlpPolicy,\n",
-"    env=learned_reward_venv,\n",
 "    seed=0,\n",
+"    policy=FeedForward32Policy,\n",
+"    policy_kwargs=dict(\n",
+"        features_extractor_class=NormalizeFeaturesExtractor,\n",
+"        features_extractor_kwargs=dict(normalize_class=RunningNorm),\n",
+"    ),\n",
+"    env=learned_reward_venv,\n",
 "    batch_size=64,\n",
-"    ent_coef=0.0,\n",
-"    learning_rate=0.0003,\n",
+"    ent_coef=0.01,\n",
 "    n_epochs=10,\n",
-"    n_steps=64,\n",
+"    n_steps=2048 // learned_reward_venv.num_envs,\n",
+"    clip_range=0.1,\n",
+"    gae_lambda=0.95,\n",
+"    gamma=0.97,\n",
+"    learning_rate=2e-3,\n",
 ")\n",
-"learner.learn(1000)  # Note: set to 100000 to train a proficient expert"
+"learner.learn(1_000)  # Note: set to 100_000 to train a proficient expert"
 ]
 },
 {
@@ -174,8 +185,10 @@
 "source": [
 "from stable_baselines3.common.evaluation import evaluate_policy\n",
 "\n",
-"reward, _ = evaluate_policy(learner.policy, venv, 10)\n",
-"print(reward)"
+"n_eval_episodes = 10\n",
+"reward_mean, reward_std = evaluate_policy(learner.policy, venv, n_eval_episodes)\n",
+"reward_stderr = reward_std / np.sqrt(n_eval_episodes)\n",
+"print(f\"Reward: {reward_mean:.0f} +/- {reward_stderr:.0f}\")"
 ]
 }
 ],
@@ -198,7 +211,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.8.13"
+"version": "3.9.16"
 }
 },
 "nbformat": 4,
```
