listing 3.5 epsilon is not greedy #44

Ahmed-Mahmod-Salem opened this issue Mar 16, 2025 · 0 comments

In the code provided for listing 3.5 (implementing experience replay), the value of epsilon is never updated during training, so the agent never becomes greedier over time; if you reset epsilon to 1.0 before running, the agent always chooses random actions. A sketch of a possible fix follows the listing below.

```python
from collections import deque
epochs = 5000
losses = []
mem_size = 1000 #A
batch_size = 200 #B
replay = deque(maxlen=mem_size) #C
max_moves = 50 #D
h = 0
for i in range(epochs):
    game = Gridworld(size=4, mode='random')
    state1_ = game.board.render_np().reshape(1,64) + np.random.rand(1,64)/100.0
    state1 = torch.from_numpy(state1_).float()
    status = 1
    mov = 0
    while(status == 1):
        mov += 1
        qval = model(state1) #E
        qval_ = qval.data.numpy()
        if (random.random() < epsilon): #F
            action_ = np.random.randint(0,4)
        else:
            action_ = np.argmax(qval_)

        action = action_set[action_]
        game.makeMove(action)
        state2_ = game.board.render_np().reshape(1,64) + np.random.rand(1,64)/100.0
        state2 = torch.from_numpy(state2_).float()
        reward = game.reward()
        done = True if reward > 0 else False
        exp = (state1, action_, reward, state2, done) #G
        replay.append(exp) #H
        state1 = state2

        if len(replay) > batch_size: #I
            minibatch = random.sample(replay, batch_size) #J
            state1_batch = torch.cat([s1 for (s1,a,r,s2,d) in minibatch]) #K
            action_batch = torch.Tensor([a for (s1,a,r,s2,d) in minibatch])
            reward_batch = torch.Tensor([r for (s1,a,r,s2,d) in minibatch])
            state2_batch = torch.cat([s2 for (s1,a,r,s2,d) in minibatch])
            done_batch = torch.Tensor([d for (s1,a,r,s2,d) in minibatch])

            Q1 = model(state1_batch) #L
            with torch.no_grad():
                Q2 = model(state2_batch) #M

            Y = reward_batch + gamma * ((1 - done_batch) * torch.max(Q2,dim=1)[0]) #N
            X = Q1.gather(dim=1,index=action_batch.long().unsqueeze(dim=1)).squeeze()
            loss = loss_fn(X, Y.detach())
            print(i, loss.item())
            clear_output(wait=True)
            optimizer.zero_grad()
            loss.backward()
            losses.append(loss.item())
            optimizer.step()

        if reward != -1 or mov > max_moves: #O
            status = 0
            mov = 0

losses = np.array(losses)

#A Set the total size of the experience replay memory
#B Set the minibatch size
#C Create the memory replay as a deque list
#D Maximum number of moves before game is over
#E Compute Q-values from input state in order to select action
#F Select action using epsilon-greedy strategy
#G Create experience of state, reward, action and next state as a tuple
#H Add experience to experience replay list
#I If replay list is at least as long as minibatch size, begin minibatch training
#J Randomly sample a subset of the replay list
#K Separate out the components of each experience into separate minibatch tensors
#L Re-compute Q-values for minibatch of states to get gradients
#M Compute Q-values for minibatch of next states but don't compute gradients
#N Compute the target Q-values we want the DQN to learn
#O If game is over, reset status and mov number
```
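A minimal sketch of one way to address this: decay epsilon at the end of each epoch so the epsilon-greedy policy gradually shifts from exploration toward exploitation. The starting value of 1.0, the 0.1 exploration floor, and the linear schedule below are illustrative assumptions, not values taken from the listing:

```python
# Illustrative sketch only: decay epsilon once per epoch so the policy
# becomes greedier over time. The 1.0 start, 0.1 floor, and linear
# schedule are assumptions, not part of listing 3.5 itself.
epochs = 5000
epsilon = 1.0
for i in range(epochs):
    # ... play one episode and run the minibatch update from listing 3.5 ...
    if epsilon > 0.1:              # keep a small amount of residual exploration
        epsilon -= (1 / epochs)    # linear decay from 1.0 toward ~0.1
```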
