Some questions about the bootstrapped λ-returns in DreamerV3 #181

Hello! Thank you for your amazing work and for sharing this implementation 🙏

I have a small question regarding the Critic Learning section in the DreamerV3 paper.

In this section, the λ-return is computed using the following formula:

![Image](https://github.com/user-attachments/assets/28eb2bba-23b5-424b-b496-993adfbc5fee)

However, in the DreamerV2 paper, the λ-return is computed differently, as shown here:

![Image](https://github.com/user-attachments/assets/f9c56d25-f989-4175-9f2c-ad2dcb3285e9)

My question is about the use of the value term $v_t$ in formula (5) of DreamerV3. In DreamerV2, it is $v_{t+1}$ that appears. I’m wondering why DreamerV3 uses $v_t$ instead?

From my understanding, the reward $r_t$ corresponds to time step $t$, so using $v_t$ in the same step might double-count the same information. Could you clarify the motivation or reasoning behind this change?
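
To make the indexing question concrete, here is a minimal NumPy sketch of the recursion as I read it. The function name `lambda_returns`, its arguments, and the `use_next_value` flag are my own for illustration and are not taken from the DreamerV3 codebase. It also assumes a constant discount and no continuation flag, which may differ from the actual formulas in the images.

```python
import numpy as np

def lambda_returns(rewards, values, gamma=0.99, lam=0.95, use_next_value=True):
    """Hypothetical sketch of a bootstrapped lambda-return.

    rewards: length T, rewards[t] is r_t
    values:  length T + 1, values[t] is v_t, values[T] bootstraps the tail
    use_next_value=True mixes in v_{t+1} (the indexing I read in DreamerV2);
    use_next_value=False mixes in v_t (the indexing I read in DreamerV3 eq. 5).
    """
    T = len(rewards)
    returns = np.zeros(T + 1)
    returns[T] = values[T]  # the return at the horizon is the value estimate
    for t in reversed(range(T)):
        v = values[t + 1] if use_next_value else values[t]
        returns[t] = rewards[t] + gamma * ((1 - lam) * v + lam * returns[t + 1])
    return returns[:T]

# Same rewards and values, only the value index differs.
r = [1.0, 2.0]
v = [3.0, 4.0, 5.0]
print(lambda_returns(r, v, gamma=0.5, lam=0.0, use_next_value=True))
print(lambda_returns(r, v, gamma=0.5, lam=0.0, use_next_value=False))
```

With `lam=0` and `use_next_value=False`, the target reduces to $r_t + \gamma v_t$, which is the case that looks like double counting to me if $r_t$ and $v_t$ really refer to the same time step. If the two papers simply index rewards or values differently, the two variants might describe the same computation.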

Looking forward to your response, and thank you again for this great work!
