refactor a2c, acer, acktr, ppo2, deepq, and trpo_mpi (#490)
* exported rl-algs
* more stuff from rl-algs
* run slow tests
* re-exported rl_algs
* re-exported rl_algs - fixed problems with serialization test and test_cartpole
* replaced atari_arg_parser with common_arg_parser
* run.py can run algos from both baselines and rl_algs
* added approximate humanoid reward with ppo2 into the README for reference
* dummy commit to RUN BENCHMARKS
* dummy commit to RUN BENCHMARKS
* dummy commit to RUN BENCHMARKS
* dummy commit to RUN BENCHMARKS
* very dummy commit to RUN BENCHMARKS
* serialize variables as a dict, not as a list
* running_mean_std uses tensorflow variables
* fixed import in vec_normalize
* dummy commit to RUN BENCHMARKS
* dummy commit to RUN BENCHMARKS
* flake8 complaints
* save all variables to make sure we save the vec_normalize normalization
* benchmarks on ppo2 only RUN BENCHMARKS
* make_atari_env compatible with mpi
* run ppo_mpi benchmarks only RUN BENCHMARKS
* hardcode names of retro environments
* add defaults
* changed default ppo2 lr schedule to linear RUN BENCHMARKS
* non-tf normalization benchmark RUN BENCHMARKS
* use ncpu=1 for mujoco sessions - gives a bit of a performance speedup
* reverted running_mean_std to use property decorators for mean, var, count
* reverted VecNormalize to use RunningMeanStd (no tf)
* reverted VecNormalize to use RunningMeanStd (no tf)
* profiling wip
* use VecNormalize with regular RunningMeanStd
* added acer runner (missing import)
* flake8 complaints
* added a note in README about TfRunningMeanStd and serialization of VecNormalize
* dummy commit to RUN BENCHMARKS
* merged benchmarks branch
will set entropy coefficient to 0.1, construct a fully connected network with 3 layers of 32 hidden units each, and create a separate network for value function estimation (so that its parameters are not shared with the policy network, but the structure is the same)
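A command along these lines would produce that configuration (a sketch only: the algorithm, environment id, and the `--ent_coef`, `--num_hidden`, `--num_layers`, and `--value_network` flags below are assumptions about how `baselines.run` forwards extra arguments, not a command copied from this README):

```bash
# sketch: entropy coefficient 0.1, 3-layer MLP with 32 hidden units each,
# and a separate (copied) value-function network
python -m baselines.run --alg=ppo2 --env=Humanoid-v2 --network=mlp --num_timesteps=2e7 \
    --ent_coef=0.1 --num_hidden=32 --num_layers=3 --value_network=copy
```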
See docstrings in [common/models.py](common/models.py) for a description of the network parameters for each type of model, and the docstring for [baselines/ppo2/ppo2.py/learn()](ppo2/ppo2.py) for a description of the ppo2 hyperparameters.
### Example 2. DQN on Atari
DQN with Atari is at this point a classic benchmark. To run the baselines implementation of DQN on Atari Pong:
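A minimal sketch of that invocation (assuming the standard gym id `PongNoFrameskip-v4`; the timestep budget is illustrative):

```bash
# sketch: DQN (deepq) on Atari Pong through the unified run script
python -m baselines.run --alg=deepq --env=PongNoFrameskip-v4 --num_timesteps=1e6
```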
## Saving, loading and visualizing models
The algorithms' serialization API is not properly unified yet; however, there is a simple method to save and restore trained models.
The `--save_path` and `--load_path` command-line options save the tensorflow state to a given path after training, and load it from a given path before training, respectively.
Let's imagine you'd like to train ppo2 on Atari Pong, save the model, and then later visualize what it has learned.
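A sketch of such a training run (the save path is hypothetical; `--save_path` writes the tensorflow state after training finishes):

```bash
# sketch: train ppo2 on Pong and save the resulting model (path is illustrative)
python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=2e7 --save_path=~/models/pong_20M_ppo2
```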
This should get the mean reward per episode to about 5k. To load and visualize the model, we'll do the following - load the model, train it for 0 steps, and then visualize:
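A matching sketch of the load-and-visualize step (zero training steps, then `--play` renders the policy; the load path mirrors the hypothetical save path above):

```bash
# sketch: reload the saved model, skip training, and watch the policy play
python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=0 --load_path=~/models/pong_20M_ppo2 --play
```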
*NOTE:* At the moment, Mujoco training uses the VecNormalize wrapper for the environment, which is not being saved correctly; as a result, loading models trained on Mujoco will not work well if the environment is recreated. If necessary, you can work around that by replacing RunningMeanStd with TfRunningMeanStd in [baselines/common/vec_env/vec_normalize.py](baselines/common/vec_env/vec_normalize.py#L12). This way, the mean and std of the environment-normalizing wrapper are saved in tensorflow variables and included in the model file; however, training is slower that way, hence it is not enabled by default.
## Subpackages
- [A2C](baselines/a2c)
Main entrypoint for the A2C algorithm: trains a policy with a given network architecture on a given environment using A2C (a usage sketch follows the parameter list below).
Parameters:
-----------
network: policy network architecture. Either string (mlp, lstm, lnlstm, cnn_lstm, cnn, cnn_small, conv_only - see baselines.common/models.py for full list) specifying the standard network architecture, or a function that takes a tensorflow tensor as input and returns a tuple (output_tensor, extra_feed), where output_tensor is the last network layer output, extra_feed is None for feed-forward neural nets, and extra_feed is a dictionary describing how to feed state into the network for recurrent neural nets. See baselines.common/policies.py/lstm for more details on using recurrent nets in policies.
env: RL environment. Should implement interface similar to VecEnv (baselines.common/vec_env) or be wrapped with DummyVecEnv (baselines.common/vec_env/dummy_vec_env.py)
seed: seed to make random number sequence in the algorithm reproducible. By default is None, which means the seed comes from the system noise generator (not reproducible)
nsteps: int, number of steps of the vectorized environment per update (i.e. batch size is nsteps * nenv where nenv is the number of environment copies simulated in parallel)
total_timesteps: int, total number of timesteps to train on (default: 80M)
vf_coef: float, coefficient in front of value function loss in the total loss function (default: 0.5)
ent_coef: float, coefficient in front of the policy entropy in the total loss function (default: 0.01)
max_gradient_norm: float, gradient is clipped to have global L2 norm no more than this value (default: 0.5)
lr: float, learning rate for RMSProp (current implementation has RMSProp hardcoded in) (default: 7e-4)
lrschedule: schedule of learning rate. Can be 'linear', 'constant', or a function [0..1] -> [0..1] that takes the fraction of the training progress as input and returns the fraction of the learning rate (specified as lr) as output
epsilon: float, RMSProp epsilon (stabilizes square root computation in denominator of RMSProp update) (default: 1e-5)
log_interval: int, specifies how frequently the logs are printed out (default: 100)
**network_kwargs: keyword arguments to the policy / network builder. See baselines.common/policies.py/build_policy and arguments to a particular type of network. For instance, the 'mlp' network architecture has arguments num_hidden and num_layers.
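As a usage sketch (not taken from the repository docs): the call below assumes gym's `CartPole-v0` wrapped in `DummyVecEnv`, and passes `num_layers` / `num_hidden` through `**network_kwargs` to the 'mlp' builder; the argument names follow the parameter list above.

```python
# Hypothetical usage sketch of the learn() entrypoint documented above.
import gym
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.a2c.a2c import learn

# wrap a single gym environment so it exposes the vectorized (VecEnv-style) interface
env = DummyVecEnv([lambda: gym.make('CartPole-v0')])

# train a small MLP policy; keyword arguments mirror the parameter list above,
# and num_layers / num_hidden are forwarded to the 'mlp' network builder
model = learn(
    network='mlp',
    env=env,
    seed=0,
    nsteps=5,
    total_timesteps=20000,
    vf_coef=0.5,
    ent_coef=0.01,
    lr=7e-4,
    lrschedule='linear',
    epsilon=1e-5,
    log_interval=100,
    num_layers=2,
    num_hidden=32,
)
```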