Maze problem with Reinforcement Learning

The environment can be represented as:

Results

After 50 episodes, the number of moves per episode converges to the optimum, and the reward approaches 1.
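
To make the convergence claim concrete, here is a minimal sketch of how moves per episode could be tracked during training. Everything in it is an assumption for illustration, not this project's actual setup: the algorithm (tabular Q-learning with epsilon-greedy exploration), the 4x4 `MAZE` layout, the reward scheme (1 on reaching the goal, 0 otherwise), the hyperparameters, and the helper names `step`, `train`, and `greedy_path_length`.

```python
import random

# Hypothetical 4x4 maze (not the project's layout):
# 'S' start, 'G' goal, '#' wall, '.' free cell.
MAZE = [
    "S..#",
    ".#..",
    "...#",
    "#..G",
]
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
START = (0, 0)


def step(state, action):
    """Apply an action; hitting a wall or the border leaves the state unchanged."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    inside = 0 <= nr < len(MAZE) and 0 <= nc < len(MAZE[0])
    if not inside or MAZE[nr][nc] == "#":
        nr, nc = r, c
    done = MAZE[nr][nc] == "G"
    # Assumed reward scheme: 1 on reaching the goal, 0 otherwise.
    return (nr, nc), (1.0 if done else 0.0), done


def train(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.1, max_steps=200, seed=0):
    """Tabular Q-learning; returns the Q-table and the moves taken in each episode."""
    rng = random.Random(seed)
    q = {(r, c): [0.0] * 4 for r in range(len(MAZE)) for c in range(len(MAZE[0]))}
    steps_per_episode = []
    for _ in range(episodes):
        state, steps, done = START, 0, False
        while not done and steps < max_steps:
            if rng.random() < epsilon:
                action = rng.randrange(4)  # explore
            else:
                best = max(q[state])  # exploit, breaking ties at random
                action = rng.choice([a for a in range(4) if q[state][a] == best])
            nxt, reward, done = step(state, action)
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][action] += alpha * (target - q[state][action])
            state, steps = nxt, steps + 1
        steps_per_episode.append(steps)
    return q, steps_per_episode


def greedy_path_length(q, max_steps=50):
    """Follow the greedy policy from the start; None if the goal is not reached."""
    state, steps, done = START, 0, False
    while not done and steps < max_steps:
        action = max(range(4), key=lambda a: q[state][a])
        state, _, done = step(state, action)
        steps += 1
    return steps if done else None


if __name__ == "__main__":
    q, steps = train()
    print("moves in first 5 episodes:", steps[:5])
    print("moves in last 5 episodes:", steps[-5:])
    print("greedy path length:", greedy_path_length(q))
```

In this hypothetical layout the shortest route from start to goal takes 6 moves, so that is the value the per-episode move count should settle near. Because exploration stays on (`epsilon=0.1`), individual training episodes can still take a few extra moves; the greedy rollout gives the cleaner measure of whether the learned policy is optimal.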