## Main road

- [x] Create toy environments
  - [x] Create new toy environments (@timorl's [repo](https://github.com/timorl/safety-gridworlds-gym))
  - [x] Clean up toy environments for use with Gym API
  - [x] Add toy environments as a dependency (#38)
  - [x] Debug toy environments (david-lindner/safe-grid-gym#15)
- [x] Refactor for use with Gym API (#32)
  - [x] Modify ai_safety_gridworlds_gym to fit our needs (@david-lindner's [fork](https://github.com/david-lindner/gym_ai_safety_gridworlds))
  - [x] Improve dependency management #31
  - [x] Switch all code referencing envs to use Gym env
- [x] Improved tooling for hyperparameter tuning (e.g. Ray)
- [x] Estimate compute costs and finalize logistics
  - First guess for an upper bound: 1 agent x 4 environments x 3 experiments = 12 sets of hyperparameters to tune x ~30 training runs = 360 runs x 2 hours
- [ ] Do experiments

  **Start with experiments January 11**

  - [ ] Check if hparams tuned on Solver generalize to Cheater (vice versa too, but less important/rigorous)
- ~Investigate corrupt versions of harder environments~
  - ~Maybe bigger / more realistic boat race~
  - ~Maybe a modified Atari env~
  - ~Maybe a modified MuJoCo env~
  - ~Maybe modified BipedalWalker env~

**Finish experiments February 15**

**Deadline February 22**

## Environments:

- [x] TomatoWateringCRMDP
- [x] TransitionBoatRaceCRMDP
- [x] Toy environments
  - [x] corrupt corners (satisfies our assumptions for guaranteed learnability)
  - [x] corrupt path to goal (does not satisfy assumptions for guaranteed learnability)

## Experiments per env

- Baseline (learns corrupt reward)
- Cheater (learns with access to true reward)
- Solver (learns intended behavior from corrupt reward)

## Optional

- [ ] Generalize PPO #17
- [ ] Improve test coverage #29
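
The upper bound on compute from the main road list can be checked with a few lines of Python. This is only a rough sketch: it assumes the four environments counted are the two CRMDP environments plus the two toy variants listed under Environments, and the single agent name is a placeholder, not something the roadmap specifies.

```python
from itertools import product

# Labels are taken from the roadmap; "agent" is a placeholder name (assumption).
AGENTS = ["agent"]
ENVIRONMENTS = [
    "TomatoWateringCRMDP",
    "TransitionBoatRaceCRMDP",
    "toy: corrupt corners",
    "toy: corrupt path to goal",
]
EXPERIMENTS = ["Baseline", "Cheater", "Solver"]

RUNS_PER_SETTING = 30  # "~30 training runs" per set of hyperparameters
HOURS_PER_RUN = 2      # "2 hours" per run

# One set of hyperparameters to tune per (agent, environment, experiment) triple.
settings = list(product(AGENTS, ENVIRONMENTS, EXPERIMENTS))
total_runs = len(settings) * RUNS_PER_SETTING
total_hours = total_runs * HOURS_PER_RUN

print(f"{len(settings)} hyperparameter sets, {total_runs} runs, {total_hours} compute hours")
# prints "12 hyperparameter sets, 360 runs, 720 compute hours"
```

Under these assumptions the bound comes to 720 compute hours in total, before any parallelism across machines.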