Top-K Off-Policy Correction for a REINFORCE Recommender System
Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, Ed Chi
https://arxiv.org/pdf/1812.02353.pdf
This agent is similar to the base REINFORCE agent in TF-Agents, with a few modifications:
- off-policy correction: an additional scaling factor applied to the standard REINFORCE gradient, based on the ratio between the (target) policy we are training and the behaviour/collection policy, i.e. a ratio of action probabilities.
- Top-K off-policy correction: since the agent proposes k actions at a time, the correction factor has to be further adjusted, e.g. by the multiplier lambda_K(s, a) = K * (1 - pi(a|s))^(K-1) applied alongside the gradient of log(action_prob); see the sketch after this list.
- estimating the behavior policy: the behaviour policy is predicted/approximated by another softmax head in the network, and is used for computing the importance ratio.
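A minimal sketch (not the agent's actual implementation) of how the per-action correction weight could be computed from the two heads' log-probabilities; the ratio cap is an assumed variance-reduction choice, not a documented default:

import tensorflow as tf

def topk_corrected_weight(pi_log_prob, beta_log_prob, k, ratio_cap=10.0):
  # Importance ratio pi(a|s) / beta(a|s), capped to reduce variance.
  importance_ratio = tf.minimum(tf.exp(pi_log_prob - beta_log_prob), ratio_cap)
  # Top-K multiplier from the paper: lambda_K(s, a) = K * (1 - pi(a|s))**(K - 1).
  lambda_k = k * (1.0 - tf.exp(pi_log_prob)) ** (k - 1.0)
  # The REINFORCE update then scales return * grad(log pi(a|s)) by this weight.
  return importance_ratio * lambda_k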
1. ScaNN layer in action space (policy)
- when ScaNN is used, logits will not be computed for all actions.
- So in this case, either do not emit logits in the policy, or emit a tuple <logits, candidate_actions> and update metrics such as WeightedReturn to use this structure.
- Currently it is not possible to detect whether ScaNN is used inside the policy, only in the agent.
2. Confirm using emit_logits_as_info and ScaNN (policy)
- emit_logits_as_info should not be used in conjunction with ScaNN (provided through get_candidate_actions_fn), since in that case the logits will not be computed for all actions.
3. Agent trajectories
- Does the agent assume a trajectory is a single episode, such that trajectory.step_type and trajectory.discount are essentially ignored?
4. Optimize train step:

import tensorflow as tf
from tf_agents.utils import common

@common.function(autograph=False)
def _train_step_fn(data):
  # Run one agent train step on each replica and average the per-replica losses.
  def replicated_train_step(experience):
    return tf_agent.train(experience).loss

  per_replica_losses = distribution_strategy.run(
      replicated_train_step, args=(data,))
  return distribution_strategy.reduce(
      tf.distribute.ReduceOp.MEAN, per_replica_losses, axis=None)
Create an experiment comparing:
[1] rnn
- off_policy_correction_exponent = None
- use_supervised_loss_for_main_policy = True
[2] REINFORCE
- off_policy_correction_exponent = None
- use_supervised_loss_for_main_policy = False
[3] topk REINFORCE
- off_policy_correction_exponent = ~16
- use_supervised_loss_for_main_policy = False
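A sketch of the three configurations as plain Python, assuming the two flags are parameters accepted by the agent constructor (names and values copied from the list above; 16 is the approximate exponent noted there):

EXPERIMENTS = {
    'rnn': dict(off_policy_correction_exponent=None,
                use_supervised_loss_for_main_policy=True),
    'reinforce': dict(off_policy_correction_exponent=None,
                      use_supervised_loss_for_main_policy=False),
    'topk_reinforce': dict(off_policy_correction_exponent=16,
                           use_supervised_loss_for_main_policy=False),
}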
For each user, we consider a sequence of the user's historical interactions with the RecSys, recording the actions taken by the recommender (e.g., items recommended) as well as user feedback (e.g., ratings). Given such a sequence, we predict the next action to take, i.e., items to recommend, so that user satisfaction metrics, e.g., as indicated by ratings, improve.
MDP definition
We translate this setup into a Markov Decision Process (MDP) (S, A, P, R, ρ0, γ):
- S: a continuous state space describing the user states
- A: a discrete action space, containing items available for recommendation
- P: S × A × S → R is the state transition probability
- R: S × A → R is the reward function, where r(s, a) is the immediate reward obtained by performing action a at user state s
- ρ0: the initial state distribution
- γ: the discount factor for future rewards
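A minimal sketch of how S and A might map onto TF-Agents specs for this setup; the corpus size and user-state dimensionality are assumptions:

import tensorflow as tf
from tf_agents.specs import tensor_spec
from tf_agents.trajectories import time_step as ts

NUM_ITEMS = 50000   # assumption: size of the recommendable item corpus (A)
STATE_DIM = 128     # assumption: dimensionality of the learned user state (S)

observation_spec = tf.TensorSpec(shape=(STATE_DIM,), dtype=tf.float32,
                                 name='user_state')
action_spec = tensor_spec.BoundedTensorSpec(shape=(), dtype=tf.int64,
                                            minimum=0, maximum=NUM_ITEMS - 1,
                                            name='item_id')
time_step_spec = ts.time_step_spec(observation_spec)  # adds reward, discount, step_type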
A Trajectory represents a sequence of aligned time steps. It captures:
- observation and step_type from the current time step, together with the computed action and policy_info
- discount, reward and next_step_type from the next time step.

We allow experience to contain trajectories of different lengths in the time dimension, but these have to be padded with dummy values to a constant size of T in the time dimension.
- Both trajectory.reward and weights have to be 0 for these dummy values.
- experience can be provided in other formats, such as Transitions, if they can be converted into Trajectories.
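A minimal sketch (hypothetical episode lengths) of building the per-step weights so that padded steps contribute nothing; rewards on padded steps must likewise be zeroed:

import tensorflow as tf

T = 8                                        # fixed time dimension after padding
actual_lengths = tf.constant([5, 8, 3])      # assumed true episode lengths in the batch
weights = tf.sequence_mask(actual_lengths, maxlen=T, dtype=tf.float32)
# Any reward tensor must be 0 on the padded steps as well, e.g.:
rewards = tf.random.uniform([3, T]) * weights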
A TimeStep contains the data emitted by an environment at each step of interaction. It includes:
- a step_type,
- an observation (e.g., a NumPy array, dict, or list of arrays),
- and an associated reward and discount.
Sequential ordering:
- the first TimeStep in a sequence has step_type equal to StepType.FIRST
- the final TimeStep in a sequence has step_type equal to StepType.LAST
- all other TimeSteps in a sequence have step_type equal to StepType.MID
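For illustration, the TF-Agents helpers that produce each step type (the observation shape is an assumption):

import numpy as np
from tf_agents.trajectories import time_step as ts

obs = np.zeros(128, dtype=np.float32)   # assumed user-state observation
first = ts.restart(obs)                 # step_type == StepType.FIRST
mid = ts.transition(obs, reward=0.5)    # step_type == StepType.MID
last = ts.termination(obs, reward=1.0)  # step_type == StepType.LAST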
A discounting factor is introduced for:
- Reducing variance
- Prescribing the effective time horizon we optimize over
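For example, the discounted return being optimized is R_t = sum over t' >= t of γ^(t'-t) * r_t'; with γ = 0.99 the effective horizon is roughly 1 / (1 - γ) = 100 steps. A tiny sketch:

def discounted_return(rewards, gamma=0.99):
  # Accumulate right-to-left: ret_t = r_t + gamma * ret_{t+1}.
  ret = 0.0
  for r in reversed(rewards):
    ret = r + gamma * ret
  return ret

print(discounted_return([1.0, 0.0, 0.5]))  # 1.0 + 0.99**2 * 0.5 = 1.49005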
Top-K Off-Policy Correction for a REINFORCE Recommender System, section 4.2 "Estimating the behavior policy β":
- Despite a substantial sharing of parameters between the two policy heads π_θ and β_θ′, there are two noticeable differences between them:
- (1) While the main policy π_θ is effectively trained using a weighted softmax to take into account the long-term reward, the behavior policy head β_θ′ is trained using only the state-action pairs;
- (2) While the main policy head π_θ is trained using only items on the trajectory with non-zero reward, the behavior policy β_θ′ is trained using all of the items on the trajectory, to avoid introducing bias into the β estimate.
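A minimal sketch, assuming a Keras implementation, of the two softmax heads on a shared user-state trunk and their different losses; the layer sizes, corpus size, and the stop_gradient on the behavior head are assumptions, not details taken from the paper or this agent:

import tensorflow as tf

NUM_ITEMS = 50000
STATE_DIM = 128

state_in = tf.keras.Input(shape=(STATE_DIM,), name='user_state')
shared = tf.keras.layers.Dense(256, activation='relu')(state_in)      # shared trunk
pi_logits = tf.keras.layers.Dense(NUM_ITEMS, name='pi_head')(shared)  # main policy pi_theta
beta_logits = tf.keras.layers.Dense(NUM_ITEMS, name='beta_head')(
    tf.stop_gradient(shared))  # behavior head beta_theta' (gradient blocking is an assumption)
model = tf.keras.Model(state_in, [pi_logits, beta_logits])

# Hypothetical batch of (state, item, reward):
items = tf.constant([3, 7, 1], dtype=tf.int64)
rewards = tf.constant([1.0, 0.0, 0.5])
pi_l, beta_l = model(tf.zeros([3, STATE_DIM]))
ce = tf.nn.sparse_softmax_cross_entropy_with_logits
pi_loss = tf.reduce_sum(rewards * ce(labels=items, logits=pi_l))   # (1) reward-weighted; zero-reward items drop out, per (2)
beta_loss = tf.reduce_mean(ce(labels=items, logits=beta_l))        # all state-action pairs, unweighted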