diff --git a/lab3/HC ROMS.zip b/lab3/HC ROMS.zip
new file mode 100644
index 00000000..d1aa7fd0
Binary files /dev/null and b/lab3/HC ROMS.zip differ
diff --git a/lab3/RL.ipynb b/lab3/RL.ipynb
index 05abf42d..6079fbc2 100644
--- a/lab3/RL.ipynb
+++ b/lab3/RL.ipynb
@@ -611,6 +611,15 @@
 "id": "lbYHLr66i15n"
 },
 "source": [
+ "from google.colab import files\n",
+ "# The following command will open an upload window. \n",
+ "# To load the pong ROMs, please upload both HC_ROMS.zip and ROMS.zip to this window.\n",
+ "uploaded = files.upload()\n",
+ "\n",
+ "!pip install atari_py\n",
+ "\n",
+ "!python -m atari_py.import_roms .\n",
+ "\n",
 "def create_pong_env(): \n",
 " return gym.make(\"Pong-v0\", frameskip=5)\n",
 "env = create_pong_env()\n",
@@ -805,8 +814,8 @@
 "id": "YBLVfdpv7ajG"
 },
 "source": [
- "Let's also consider the fact that, unlike CartPole, the Pong environment has an additional element of uncertainty -- regardless of what action the agent takes, we don't know how the opponent will play. That is, the environment is changing over time, based on *both* the actions we take and the actions of the opponent, which result in motion of the ball and motion of the paddles.\r\n",
- "\r\n",
+ "Let's also consider the fact that, unlike CartPole, the Pong environment has an additional element of uncertainty -- regardless of what action the agent takes, we don't know how the opponent will play. That is, the environment is changing over time, based on *both* the actions we take and the actions of the opponent, which result in motion of the ball and motion of the paddles.\n",
+ "\n",
 "Therefore, to capture the dynamics, we also consider how the environment changes by looking at the difference between a previous observation (image frame) and the current observation (image frame). We've implemented a helper function, `pong_change`, that pre-processes two frames, calculates the change between the two, and then re-normalizes the values. Let's inspect this to visualize how the environment can change:"
 ]
 },
@@ -816,15 +825,15 @@
 "id": "ItWrUwM87ZBw"
 },
 "source": [
- "next_observation, _,_,_ = env.step(np.random.choice(n_actions))\r\n",
- "diff = mdl.lab3.pong_change(observation, next_observation)\r\n",
- "\r\n",
- "f, ax = plt.subplots(1, 3, figsize=(15,15))\r\n",
- "for a in ax:\r\n",
- " a.grid(False)\r\n",
- " a.axis(\"off\")\r\n",
- "ax[0].imshow(observation); ax[0].set_title('Previous Frame');\r\n",
- "ax[1].imshow(next_observation); ax[1].set_title('Current Frame');\r\n",
+ "next_observation, _,_,_ = env.step(np.random.choice(n_actions))\n",
+ "diff = mdl.lab3.pong_change(observation, next_observation)\n",
+ "\n",
+ "f, ax = plt.subplots(1, 3, figsize=(15,15))\n",
+ "for a in ax:\n",
+ " a.grid(False)\n",
+ " a.axis(\"off\")\n",
+ "ax[0].imshow(observation); ax[0].set_title('Previous Frame');\n",
+ "ax[1].imshow(next_observation); ax[1].set_title('Current Frame');\n",
 "ax[2].imshow(np.squeeze(diff)); ax[2].set_title('Difference (Model Input)');"
 ],
 "execution_count": null,
@@ -845,14 +854,14 @@
 "id": "YiJLu9SEAJu6"
 },
 "source": [
- "### Rollout function\r\n",
- "\r\n",
- "We're now set up to define our key action algorithm for the game of Pong, which will ultimately be used to train our Pong agent. This function can be thought of as a \"rollout\", where the agent will 1) make an observation of the environment, 2) select an action based on its state in the environment, 3) execute a policy based on that action, resulting in some reward and a change to the environment, and 4) finally add memory of that action-reward to its `Memory` buffer. We will define this algorithm in the `collect_rollout` function below, and use it soon within a training block.\r\n",
- "\r\n",
- "Earlier you visually inspected the raw environment frames, the pre-processed frames, and the difference between previous and current frames. As you may have gathered, in a dynamic game like Pong, it can actually be helpful to consider the difference between two consecutive observations. This gives us information about the movement between frames -- how the game is changing. We will do this using the `pong_change` function we explored above (which also pre-processes frames for us).\r\n",
- "\r\n",
- "We will use differences between frames as the input on which actions will be selected. These observation changes will be forward propagated through our Pong agent, the CNN network model, which will then predict the next action to take based on this observation. The raw reward will be computed. The observation, action, and reward will be recorded into memory. This will loop until a particular game ends -- the rollout is completed.\r\n",
- "\r\n",
+ "### Rollout function\n",
+ "\n",
+ "We're now set up to define our key action algorithm for the game of Pong, which will ultimately be used to train our Pong agent. This function can be thought of as a \"rollout\", where the agent will 1) make an observation of the environment, 2) select an action based on its state in the environment, 3) execute a policy based on that action, resulting in some reward and a change to the environment, and 4) finally add memory of that action-reward to its `Memory` buffer. We will define this algorithm in the `collect_rollout` function below, and use it soon within a training block.\n",
+ "\n",
+ "Earlier you visually inspected the raw environment frames, the pre-processed frames, and the difference between previous and current frames. As you may have gathered, in a dynamic game like Pong, it can actually be helpful to consider the difference between two consecutive observations. This gives us information about the movement between frames -- how the game is changing. We will do this using the `pong_change` function we explored above (which also pre-processes frames for us).\n",
+ "\n",
+ "We will use differences between frames as the input on which actions will be selected. These observation changes will be forward propagated through our Pong agent, the CNN network model, which will then predict the next action to take based on this observation. The raw reward will be computed. The observation, action, and reward will be recorded into memory. This will loop until a particular game ends -- the rollout is completed.\n",
+ "\n",
 "For now, we will define `collect_rollout` such that a batch of observations (i.e., from a batch of agent-environment worlds) can be processed serially (i.e., one at a time, in sequence). We will later utilize a parallelized version of this function that will parallelize batch processing to help speed up training! Let's get to it."
 ]
 },
@@ -935,17 +944,17 @@
 "id": "msNBRcULHbrd"
 },
 "source": [
- "### Rollout with untrained Pong model ###\r\n",
- "\r\n",
- "# Model\r\n",
- "test_model = create_pong_model()\r\n",
- "\r\n",
- "# Rollout with single batch\r\n",
- "single_batch_size = 1\r\n",
- "memories = collect_rollout(single_batch_size, env, test_model, choose_action)\r\n",
- "rollout_video = mdl.lab3.save_video_of_memory(memories[0], \"Pong-Random-Agent.mp4\")\r\n",
- "\r\n",
- "# Play back video of memories\r\n",
+ "### Rollout with untrained Pong model ###\n",
+ "\n",
+ "# Model\n",
+ "test_model = create_pong_model()\n",
+ "\n",
+ "# Rollout with single batch\n",
+ "single_batch_size = 1\n",
+ "memories = collect_rollout(single_batch_size, env, test_model, choose_action)\n",
+ "rollout_video = mdl.lab3.save_video_of_memory(memories[0], \"Pong-Random-Agent.mp4\")\n",
+ "\n",
+ "# Play back video of memories\n",
 "mdl.lab3.play_video(rollout_video)"
 ],
 "execution_count": null,
@@ -979,27 +988,27 @@
 "id": "FaEHTMRVMRXP"
 },
 "source": [
- "### Hyperparameters and setup for training ###\r\n",
- "# Rerun this cell if you want to re-initialize the training process\r\n",
- "# (i.e., create new model, reset loss, etc)\r\n",
- "\r\n",
- "# Hyperparameters\r\n",
- "learning_rate = 1e-3\r\n",
- "MAX_ITERS = 1000 # increase the maximum to train longer\r\n",
- "batch_size = 5 # number of batches to run\r\n",
- "\r\n",
- "# Model, optimizer\r\n",
- "pong_model = create_pong_model()\r\n",
- "optimizer = tf.keras.optimizers.Adam(learning_rate)\r\n",
- "iteration = 0 # counter for training steps\r\n",
- "\r\n",
- "# Plotting\r\n",
- "smoothed_reward = mdl.util.LossHistory(smoothing_factor=0.9)\r\n",
- "smoothed_reward.append(0) # start the reward at zero for baseline comparison\r\n",
- "plotter = mdl.util.PeriodicPlotter(sec=15, xlabel='Iterations', ylabel='Win Percentage (%)')\r\n",
- "\r\n",
- "# Batches and environment\r\n",
- "# To parallelize batches, we need to make multiple copies of the environment.\r\n",
+ "### Hyperparameters and setup for training ###\n",
+ "# Rerun this cell if you want to re-initialize the training process\n",
+ "# (i.e., create new model, reset loss, etc)\n",
+ "\n",
+ "# Hyperparameters\n",
+ "learning_rate = 1e-3\n",
+ "MAX_ITERS = 1000 # increase the maximum to train longer\n",
+ "batch_size = 5 # number of batches to run\n",
+ "\n",
+ "# Model, optimizer\n",
+ "pong_model = create_pong_model()\n",
+ "optimizer = tf.keras.optimizers.Adam(learning_rate)\n",
+ "iteration = 0 # counter for training steps\n",
+ "\n",
+ "# Plotting\n",
+ "smoothed_reward = mdl.util.LossHistory(smoothing_factor=0.9)\n",
+ "smoothed_reward.append(0) # start the reward at zero for baseline comparison\n",
+ "plotter = mdl.util.PeriodicPlotter(sec=15, xlabel='Iterations', ylabel='Win Percentage (%)')\n",
+ "\n",
+ "# Batches and environment\n",
+ "# To parallelize batches, we need to make multiple copies of the environment.\n",
 "envs = [create_pong_env() for _ in range(batch_size)] # For parallelization"
 ],
 "execution_count": null,
diff --git a/lab3/ROMS.zip b/lab3/ROMS.zip
new file mode 100644
index 00000000..1017439c
Binary files /dev/null and b/lab3/ROMS.zip differ
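The notebook text above describes `mdl.lab3.pong_change` only at a high level: it "pre-processes two frames, calculates the change between the two, and then re-normalizes the values." The actual implementation lives in the course's `mdl` package; the sketch below is a hypothetical stand-in (the grayscale conversion and min-max re-normalization are assumptions, not the lab's exact pre-processing) to make the idea concrete.

```python
import numpy as np

def frame_change(prev_frame, curr_frame):
    """Toy stand-in for `mdl.lab3.pong_change`: pre-process two frames,
    take their difference, and re-normalize the result."""
    # Pre-process: collapse RGB to grayscale, scale pixel values into [0, 1]
    prev = prev_frame.mean(axis=-1) / 255.0
    curr = curr_frame.mean(axis=-1) / 255.0
    # Change between the two observations -- captures motion between frames
    diff = curr - prev
    # Re-normalize to [0, 1] (a pair of identical frames stays all-zero)
    rng = diff.max() - diff.min()
    if rng > 0:
        diff = (diff - diff.min()) / rng
    # Add a trailing channel axis so the result can feed a CNN
    return diff[..., np.newaxis]
```

This matches the shape convention suggested by the notebook's `np.squeeze(diff)` call when plotting: the model input carries a singleton channel dimension that must be squeezed away for `imshow`.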
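The rollout described in the markdown above (observe, act, record, loop until the game ends) can be sketched in plain Python. This is not the lab's `collect_rollout` (which takes a batch size, a gym environment, the CNN model, and `choose_action`); here the environment and action chooser are reduced to hypothetical callables so the loop structure stands on its own, and observations are plain numbers so the frame-difference step is just a subtraction.

```python
import random

class Memory:
    """Minimal memory buffer: records (observation, action, reward) per step."""
    def __init__(self):
        self.observations, self.actions, self.rewards = [], [], []
    def add_to_memory(self, observation, action, reward):
        self.observations.append(observation)
        self.actions.append(action)
        self.rewards.append(reward)

def collect_rollout_sketch(env_reset, env_step, choose_action, max_steps=100):
    """Sketch of one rollout: 1) observe, 2) choose an action from the
    change in observation, 3) step the environment, 4) record to memory."""
    memory = Memory()
    observation = env_reset()
    previous_observation = observation
    for _ in range(max_steps):
        # The model input is the change between consecutive observations
        obs_change = observation - previous_observation
        action = choose_action(obs_change)
        previous_observation = observation
        observation, reward, done = env_step(action)
        memory.add_to_memory(obs_change, action, reward)
        if done:  # the game ended -- the rollout is completed
            break
    return memory

# Toy environment: the observation is a step counter; the episode ends
# after 5 steps, with a reward of 1.0 on the final step.
state = {"t": 0}
def env_reset():
    state["t"] = 0
    return 0
def env_step(action):
    state["t"] += 1
    done = state["t"] >= 5
    return state["t"], (1.0 if done else 0.0), done

memory = collect_rollout_sketch(env_reset, env_step,
                                lambda change: random.choice([0, 1]))
```

The serial batch version in the lab simply runs this loop once per agent-environment world and returns a list of `Memory` objects, which is why the untrained-model demo above indexes `memories[0]`.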
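The training setup above tracks progress with `mdl.util.LossHistory(smoothing_factor=0.9)` and seeds it with `append(0)` as a baseline. The exact behavior of that class is internal to the course package; assuming it is a standard exponential moving average, the idea can be sketched as:

```python
class SmoothedReward:
    """Hypothetical sketch of an exponentially smoothed reward tracker,
    in the spirit of mdl.util.LossHistory(smoothing_factor=0.9)."""
    def __init__(self, smoothing_factor=0.9):
        self.alpha = smoothing_factor
        self.history = []

    def append(self, value):
        if self.history:
            # Blend the new raw value with the running smoothed value
            value = self.alpha * self.history[-1] + (1 - self.alpha) * value
        self.history.append(value)

smoothed = SmoothedReward()
smoothed.append(0)      # start at zero for baseline comparison
smoothed.append(10.0)   # one noisy episode reward
print(smoothed.history[-1])  # approximately 1.0: the spike is heavily damped
```

With a smoothing factor of 0.9, a single win moves the tracked value only 10% of the way toward the raw reward, which is why the plotted win percentage changes gradually even though individual Pong games are won or lost outright.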