
Commit cb93fb0

Complete PR #771 (Tune preference comparison example hyperparameters) (#782)
* Tune preference comparison example hyperparameters

The preference comparison example previously did not show significant learning. It usually ended with a reward < -1000, which can be considered "failed" in the Pendulum environment. This commit updates the parameters to avoid this.

It could be argued that hyperparameter optimization for the examples is bad, since it gives a skewed impression of the library. I think it is acceptable as long as we acknowledge that the parameters were optimized, and a working example is a much nicer starting point.

I tuned the hyperparameters with a mix of syne_tune [1] and manual tuning. Since training can have very high variance, I repeated each training run multiple (up to 100) times and used multi-fidelity optimization (PASHA and ASHA) to find a good configuration. I set the objective to the 90% upper confidence bound of the mean final-evaluation reward over all the training runs. Unfortunately, the optimization process was a bit messy since I was just getting started with syne_tune, so it is difficult to provide a full script that cleanly reproduces the results. I used something akin to this configuration space:

```py
import syne_tune.config_space as cs

config_space = {
    "reward_epochs": cs.randint(1, 20),
    "ppo_clip_range": cs.uniform(0.0, 0.3),
    "ppo_ent_coef": cs.uniform(0.0, 0.01),
    "ppo_gae_lambda": cs.uniform(0.9, 0.99),
    "ppo_n_epochs": cs.randint(5, 25),
    "discount_factor": cs.uniform(0.9, 1.0),
    "use_sde": cs.choice(["true", "false"]),
    "sde_sample_freq": cs.randint(1, 5),
    "ppo_lr": cs.loguniform(1e-4, 5e-3),
    "exploration_frac": cs.uniform(0, 0.1),
    "num_iterations": cs.randint(5, 100),
    "initial_comparison_frac": cs.uniform(0.05, 0.25),
    "initial_epoch_multiplier": cs.randint(1, 4),
    "query_schedule": cs.choice(["constant", "hyperbolic", "inverse_quadratic"]),
    "total_timesteps": 50_000,
    "total_comparisons": 200,
    "max_evals": 100,
}
```

and this is the configuration I selected in the end:

```py
{
    "reward_epochs": 10,
    "ppo_clip_range": 0.1,
    "ppo_ent_coef": 0.01,
    "ppo_gae_lambda": 0.90,
    "ppo_n_epochs": 15,
    "discount_factor": 0.97,
    "use_sde": "false",
    "sde_sample_freq": 1,
    "ppo_lr": 2e-3,
    "exploration_frac": 0.05,
    "num_iterations": 60,
    "initial_comparison_frac": 0.10,
    "initial_epoch_multiplier": 4,
    "query_schedule": "hyperbolic",
}
```

Here are the (rounded) evaluation results of the 100 runs with this configuration:

```
[-155, -100, -132, -150, -164, -110, -195, -194, -168, -148,
 -177, -113, -176, -205, -106, -169, -123, -104, -151, -169,
 -157, -184, -130, -151, -108, -111, -202, -142, -198, -138,
 -178, -104, -174, -149, -113, -107, -122, -198, -428, -221,
 -217, -141, -192, -158, -139, -219, -230, -209, -141, -173,
 -118, -176, -108, -290, -810, -182, -159, -178, -247, -205,
 -165, -672, -250, -138, -166, -282, -133, -147, -111, -145,
 -148, -116, -436, -140, -190, -137, -194, -177, -193, -1043,
 -243, -183, -156, -183, -184, -186, -141, -144, -194, -112,
 -178, -146, -140, -130, -143, -618, -402, -236, -171, -163]
```

Mean (before rounding): -196.49
Fraction of runs < -800: 2/100
Fraction of runs > -200: 79/100

This is far from perfect. I didn't include all parameters in the optimization, and the 50,000 steps and 200 queries are likely overkill. Still, it significantly improves the example that users see first.

I only changed the example on the main documentation page, not the notebooks. Those are already out of sync with the main example, so I am not sure how best to proceed with them.
[1] https://github.com/awslabs/syne-tune

* Add changes to notebook
* Change number notation in cell
* Clear outputs from notebook
* Remove empty code cell
* Fix variable name in preference_comparison
* Run black
* Remove whitespace

---------

Co-authored-by: Timo Kaufmann <[email protected]>
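For readers who want to reproduce the selection criterion, here is a minimal sketch (not part of the commit) of the objective described above: the 90% upper confidence bound of the mean final-evaluation reward across repeated runs of one configuration. It assumes a normal approximation for the mean; the function name and the fixed z-value are illustrative choices, not taken from the commit.

```py
import numpy as np


def ucb90_of_mean(final_rewards):
    """90% upper confidence bound of the mean final-evaluation reward.

    Hypothetical helper, not from the commit: it assumes the mean over
    repeated training runs is approximately normal and uses the one-sided
    90% quantile of the standard normal (z ~= 1.2816).
    """
    rewards = np.asarray(final_rewards, dtype=float)
    mean = rewards.mean()
    stderr = rewards.std(ddof=1) / np.sqrt(len(rewards))  # standard error of the mean
    return mean + 1.2816 * stderr


# Usage: pass the final-evaluation rewards of the repeated runs for one
# candidate configuration, then maximize this value across configurations.
```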
1 parent 5c85ebf commit cb93fb0

2 files changed (+55, -31 lines)

docs/algorithms/preference_comparisons.rst

Lines changed: 19 additions & 8 deletions
```diff
@@ -18,7 +18,8 @@ to predict the preference comparison.
 Example
 =======
 
-Detailed example notebook: :doc:`../tutorials/5_train_preference_comparisons`
+You can copy this example to train `PPO <https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html>`_ on `Pendulum <https://www.gymlibrary.dev/environments/classic_control/pendulum/>`_ using a reward model trained on 200 synthetic preference comparisons.
+For a more detailed example, refer to :doc:`../tutorials/5_train_preference_comparisons`.
 
 .. testcode::
     :skipif: skip_doctests
@@ -50,7 +51,7 @@ Detailed example notebook: :doc:`../tutorials/5_train_preference_comparisons`
     reward_trainer = preference_comparisons.BasicRewardTrainer(
         preference_model=preference_model,
         loss=preference_comparisons.CrossEntropyRewardLoss(),
-        epochs=3,
+        epochs=10,
         rng=rng,
     )
 
@@ -62,29 +63,39 @@ Detailed example notebook: :doc:`../tutorials/5_train_preference_comparisons`
         ),
         env=venv,
         n_steps=2048 // venv.num_envs,
+        clip_range=0.1,
+        ent_coef=0.01,
+        gae_lambda=0.95,
+        n_epochs=10,
+        gamma=0.97,
+        learning_rate=2e-3,
     )
 
     trajectory_generator = preference_comparisons.AgentTrainer(
         algorithm=agent,
         reward_fn=reward_net,
         venv=venv,
-        exploration_frac=0.0,
+        exploration_frac=0.05,
         rng=rng,
     )
 
     pref_comparisons = preference_comparisons.PreferenceComparisons(
         trajectory_generator,
         reward_net,
-        num_iterations=5,
+        num_iterations=5,  # Set to 60 for better performance
         fragmenter=fragmenter,
         preference_gatherer=gatherer,
         reward_trainer=reward_trainer,
-        initial_epoch_multiplier=1,
+        initial_epoch_multiplier=4,
+        initial_comparison_frac=0.1,
+        query_schedule="hyperbolic",
     )
-    pref_comparisons.train(total_timesteps=5_000, total_comparisons=200)
+    pref_comparisons.train(total_timesteps=50_000, total_comparisons=200)
 
-    reward, _ = evaluate_policy(agent.policy, venv, 10)
-    print("Reward:", reward)
+    n_eval_episodes = 10
+    reward_mean, reward_std = evaluate_policy(agent.policy, venv, n_eval_episodes)
+    reward_stderr = reward_std / np.sqrt(n_eval_episodes)
+    print(f"Reward: {reward_mean:.0f} +/- {reward_stderr:.0f}")
 
 .. testoutput::
     :hide:
```

docs/tutorials/5_train_preference_comparisons.ipynb

Lines changed: 36 additions & 23 deletions
```diff
@@ -23,13 +23,11 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"import random\n",
 "from imitation.algorithms import preference_comparisons\n",
 "from imitation.rewards.reward_nets import BasicRewardNet\n",
 "from imitation.util.networks import RunningNorm\n",
 "from imitation.util.util import make_vec_env\n",
 "from imitation.policies.base import FeedForward32Policy, NormalizeFeaturesExtractor\n",
-"import gym\n",
 "from stable_baselines3 import PPO\n",
 "import numpy as np\n",
 "\n",
@@ -54,6 +52,12 @@
 "    rng=rng,\n",
 ")\n",
 "\n",
+"\n",
+"# Several hyperparameters (reward_epochs, ppo_clip_range, ppo_ent_coef,\n",
+"# ppo_gae_lambda, ppo_n_epochs, discount_factor, use_sde, sde_sample_freq,\n",
+"# ppo_lr, exploration_frac, num_iterations, initial_comparison_frac,\n",
+"# initial_epoch_multiplier, query_schedule) used in this example have been\n",
+"# approximately fine-tuned to reach a reasonable level of performance.\n",
 "agent = PPO(\n",
 "    policy=FeedForward32Policy,\n",
 "    policy_kwargs=dict(\n",
@@ -64,31 +68,35 @@
 "    seed=0,\n",
 "    n_steps=2048 // venv.num_envs,\n",
 "    batch_size=64,\n",
-"    ent_coef=0.0,\n",
-"    learning_rate=0.0003,\n",
+"    ent_coef=0.01,\n",
+"    learning_rate=2e-3,\n",
+"    clip_range=0.1,\n",
+"    gae_lambda=0.95,\n",
+"    gamma=0.97,\n",
 "    n_epochs=10,\n",
 ")\n",
 "\n",
 "trajectory_generator = preference_comparisons.AgentTrainer(\n",
 "    algorithm=agent,\n",
 "    reward_fn=reward_net,\n",
 "    venv=venv,\n",
-"    exploration_frac=0.0,\n",
+"    exploration_frac=0.05,\n",
 "    rng=rng,\n",
 ")\n",
 "\n",
 "pref_comparisons = preference_comparisons.PreferenceComparisons(\n",
 "    trajectory_generator,\n",
 "    reward_net,\n",
-"    num_iterations=5,\n",
+"    num_iterations=5,  # Set to 60 for better performance\n",
 "    fragmenter=fragmenter,\n",
 "    preference_gatherer=gatherer,\n",
 "    reward_trainer=reward_trainer,\n",
 "    fragment_length=100,\n",
 "    transition_oversampling=1,\n",
 "    initial_comparison_frac=0.1,\n",
 "    allow_variable_horizon=False,\n",
-"    initial_epoch_multiplier=1,\n",
+"    initial_epoch_multiplier=4,\n",
+"    query_schedule=\"hyperbolic\",\n",
 ")"
 ]
 },
@@ -106,8 +114,8 @@
 "outputs": [],
 "source": [
 "pref_comparisons.train(\n",
-"    total_timesteps=5_000,  # For good performance this should be 1_000_000\n",
-"    total_comparisons=200,  # For good performance this should be 5_000\n",
+"    total_timesteps=5_000,\n",
+"    total_comparisons=200,\n",
 ")"
 ]
 },
@@ -126,15 +134,14 @@
 "source": [
 "from imitation.rewards.reward_wrapper import RewardVecEnvWrapper\n",
 "\n",
-"\n",
 "learned_reward_venv = RewardVecEnvWrapper(venv, reward_net.predict_processed)"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Now we can train an agent, that only sees those learned reward."
+"Next, we train an agent that sees only the shaped, learned reward."
 ]
 },
 {
@@ -143,20 +150,24 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"from stable_baselines3 import PPO\n",
-"from stable_baselines3.ppo import MlpPolicy\n",
-"\n",
 "learner = PPO(\n",
-"    policy=MlpPolicy,\n",
-"    env=learned_reward_venv,\n",
 "    seed=0,\n",
+"    policy=FeedForward32Policy,\n",
+"    policy_kwargs=dict(\n",
+"        features_extractor_class=NormalizeFeaturesExtractor,\n",
+"        features_extractor_kwargs=dict(normalize_class=RunningNorm),\n",
+"    ),\n",
+"    env=learned_reward_venv,\n",
 "    batch_size=64,\n",
-"    ent_coef=0.0,\n",
-"    learning_rate=0.0003,\n",
+"    ent_coef=0.01,\n",
 "    n_epochs=10,\n",
-"    n_steps=64,\n",
+"    n_steps=2048 // learned_reward_venv.num_envs,\n",
+"    clip_range=0.1,\n",
+"    gae_lambda=0.95,\n",
+"    gamma=0.97,\n",
+"    learning_rate=2e-3,\n",
 ")\n",
-"learner.learn(1000)  # Note: set to 100000 to train a proficient expert"
+"learner.learn(1_000)  # Note: set to 100_000 to train a proficient expert"
 ]
 },
 {
@@ -174,8 +185,10 @@
 "source": [
 "from stable_baselines3.common.evaluation import evaluate_policy\n",
 "\n",
-"reward, _ = evaluate_policy(learner.policy, venv, 10)\n",
-"print(reward)"
+"n_eval_episodes = 10\n",
+"reward_mean, reward_std = evaluate_policy(learner.policy, venv, n_eval_episodes)\n",
+"reward_stderr = reward_std / np.sqrt(n_eval_episodes)\n",
+"print(f\"Reward: {reward_mean:.0f} +/- {reward_stderr:.0f}\")"
 ]
 }
 ],
@@ -198,7 +211,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.8.13"
+"version": "3.9.16"
 }
 },
 "nbformat": 4,
```
