listing 3.5 epsilon is not greedy #44

Ahmed-Mahmod-Salem opened this issue Mar 16, 2025 · 0 comments

In the code provided for listing 3.5 (implementing experience replay), the value of epsilon is never updated during training, so the agent never becomes greedier over time; if you reset epsilon to 1.0 before running, the agent always chooses random actions. A sketch of a possible fix follows the listing below.

```python
from collections import deque
epochs = 5000
losses = []
mem_size = 1000 #A
batch_size = 200 #B
replay = deque(maxlen=mem_size) #C
max_moves = 50 #D
h = 0
for i in range(epochs):
    game = Gridworld(size=4, mode='random')
    state1_ = game.board.render_np().reshape(1,64) + np.random.rand(1,64)/100.0
    state1 = torch.from_numpy(state1_).float()
    status = 1
    mov = 0
    while(status == 1):
        mov += 1
        qval = model(state1) #E
        qval_ = qval.data.numpy()
        if (random.random() < epsilon): #F
            action_ = np.random.randint(0,4)
        else:
            action_ = np.argmax(qval_)

        action = action_set[action_]
        game.makeMove(action)
        state2_ = game.board.render_np().reshape(1,64) + np.random.rand(1,64)/100.0
        state2 = torch.from_numpy(state2_).float()
        reward = game.reward()
        done = True if reward > 0 else False
        exp = (state1, action_, reward, state2, done) #G
        replay.append(exp) #H
        state1 = state2

        if len(replay) > batch_size: #I
            minibatch = random.sample(replay, batch_size) #J
            state1_batch = torch.cat([s1 for (s1,a,r,s2,d) in minibatch]) #K
            action_batch = torch.Tensor([a for (s1,a,r,s2,d) in minibatch])
            reward_batch = torch.Tensor([r for (s1,a,r,s2,d) in minibatch])
            state2_batch = torch.cat([s2 for (s1,a,r,s2,d) in minibatch])
            done_batch = torch.Tensor([d for (s1,a,r,s2,d) in minibatch])

            Q1 = model(state1_batch) #L
            with torch.no_grad():
                Q2 = model(state2_batch) #M

            Y = reward_batch + gamma * ((1 - done_batch) * torch.max(Q2,dim=1)[0]) #N
            X = Q1.gather(dim=1,index=action_batch.long().unsqueeze(dim=1)).squeeze()
            loss = loss_fn(X, Y.detach())
            print(i, loss.item())
            clear_output(wait=True)
            optimizer.zero_grad()
            loss.backward()
            losses.append(loss.item())
            optimizer.step()

        if reward != -1 or mov > max_moves: #O
            status = 0
            mov = 0

losses = np.array(losses)

#A Set the total size of the experience replay memory
#B Set the minibatch size
#C Create the memory replay as a deque list
#D Maximum number of moves before game is over
#E Compute Q-values from input state in order to select action
#F Select action using epsilon-greedy strategy
#G Create experience of state, reward, action and next state as a tuple
#H Add experience to experience replay list
#I If replay list is at least as long as minibatch size, begin minibatch training
#J Randomly sample a subset of the replay list
#K Separate out the components of each experience into separate minibatch tensors
#L Re-compute Q-values for minibatch of states to get gradients
#M Compute Q-values for minibatch of next states but don't compute gradients
#N Compute the target Q-values we want the DQN to learn
#O If game is over, reset status and mov number
```
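A minimal sketch of one way to address this: decay epsilon at the end of each epoch so the epsilon-greedy policy gradually shifts from exploration toward exploitation. The starting value of 1.0, the 0.1 exploration floor, and the linear schedule below are illustrative assumptions, not values taken from the listing:

```python
# Illustrative sketch only: decay epsilon once per epoch so the policy
# becomes greedier over time. The 1.0 start, 0.1 floor, and linear
# schedule are assumptions, not part of listing 3.5 itself.
epochs = 5000
epsilon = 1.0
for i in range(epochs):
    # ... play one episode and run the minibatch update from listing 3.5 ...
    if epsilon > 0.1:              # keep a small amount of residual exploration
        epsilon -= (1 / epochs)    # linear decay from 1.0 toward ~0.1
```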
