ACER (off-policy learning w/ Retrace)

type

Post

status

Published

date

May 17, 2024

slug

acer

summary

Not Acer, the laptop maker 🤣

Importance weight truncation with bias correction

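The Retrace estimator for off-policy learning is defined by the following recursion:

$$Q^{\text{ret}}(x_t, a_t) = r_t + \gamma \bar{\rho}_{t+1}\left[Q^{\text{ret}}(x_{t+1}, a_{t+1}) - Q(x_{t+1}, a_{t+1})\right] + \gamma V(x_{t+1})$$

where $\bar{\rho}_t = \min\{c, \rho_t\}$ is the truncated importance weight and $\rho_t = \pi(a_t \mid x_t) / \mu(a_t \mid x_t)$.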

Note:

Only the case $\lambda = 1$ is considered here.

Here $Q^{\text{ret}}$ plays the role of the MC (return-based) estimator, while $Q$ is the current critic estimate.

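The Retrace estimate is biased, but the Retrace operator is a contraction mapping, so (in the tabular case) it eventually converges to the true value $Q^\pi$.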
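The recursion is easiest to read when evaluated backward along a sampled trajectory. Below is a minimal NumPy sketch, assuming a single trajectory of length T and critic estimates that are already computed; the function name `retrace_targets` and its argument names are illustrative, not taken from the paper.

```python
import numpy as np

def retrace_targets(rewards, q_taken, values, rhos, bootstrap_value,
                    gamma=0.99, c=10.0):
    """Backward recursion for Q^ret along one trajectory (lambda = 1).

    rewards[t]      : r_t
    q_taken[t]      : critic estimate Q(x_t, a_t) for the action actually taken
    values[t]       : critic estimate V(x_t)
    rhos[t]         : importance weight pi(a_t|x_t) / mu(a_t|x_t)
    bootstrap_value : V(x_T), or 0.0 if x_T is terminal
    """
    T = len(rewards)
    q_ret = np.zeros(T)
    # Quantity carried from step t+1 to step t:
    # rho_bar_{t+1} * (Q^ret_{t+1} - Q_{t+1}) + V_{t+1}
    carried = bootstrap_value
    for t in reversed(range(T)):
        q_ret[t] = rewards[t] + gamma * carried
        rho_bar = min(c, rhos[t])  # truncated importance weight
        carried = rho_bar * (q_ret[t] - q_taken[t]) + values[t]
    return q_ret
```

With all weights equal to 1 and V equal to Q, the targets reduce to the discounted Monte Carlo return; with all weights equal to 0 they reduce to the one-step target $r_t + \gamma V(x_{t+1})$.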

truncation with bias correction trick

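Recall: the off-policy policy gradient of Off-PAC uses the marginal (limiting) state distribution $\beta$ of the behavior policy $\mu$:

$$g^{\text{marg}} = \mathbb{E}_{x_t \sim \beta,\, a_t \sim \mu}\left[\rho_t \nabla_\theta \log \pi_\theta(a_t \mid x_t)\, Q^\pi(x_t, a_t)\right]$$

It can be rewritten as:

$$g^{\text{marg}} = \mathbb{E}_{x_t}\left[\mathbb{E}_{a_t \sim \mu}\left[\bar{\rho}_t \nabla_\theta \log \pi_\theta(a_t \mid x_t)\, Q^\pi(x_t, a_t)\right] + \mathbb{E}_{a \sim \pi}\left(\left[\frac{\rho_t(a) - c}{\rho_t(a)}\right]_+ \nabla_\theta \log \pi_\theta(a \mid x_t)\, Q^\pi(x_t, a)\right)\right]$$

where $\bar{\rho}_t = \min\{c, \rho_t\}$, $\rho_t(a) = \pi(a \mid x_t) / \mu(a \mid x_t)$ and $[x]_+ = \max\{0, x\}$.

proof:

Split the importance weight as $\rho_t = \min\{c, \rho_t\} + [\rho_t - c]_+$. Since $\mu(a \mid x_t) = \pi(a \mid x_t) / \rho_t(a)$, one can show that for any function $h$

$$\mathbb{E}_{a \sim \mu}\left[[\rho_t(a) - c]_+\, h(a)\right] = \mathbb{E}_{a \sim \pi}\left[\left[\frac{\rho_t(a) - c}{\rho_t(a)}\right]_+ h(a)\right]$$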

Note:

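What is corrected here is the policy gradient. Because the first term uses truncated importance sampling, the policy gradient would be biased on its own, so the second term acts as a correction term that keeps the final policy gradient estimate unbiased.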
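A quick numerical check of this claim: on a small finite action set, the truncated term plus the correction term should equal the untruncated importance-sampled expectation exactly. This is only a sketch; the probabilities and the per-action values `f` (standing in for $\nabla_\theta \log \pi_\theta(a \mid x)\, Q^\pi(x, a)$, reduced to a scalar) are made-up numbers.

```python
import numpy as np

def truncated_with_correction(pi, mu, f, c):
    """Exact expectations over a finite action set at a single state.

    pi, mu : target / behavior action probabilities
    f      : per-action stand-in for grad log pi(a|x) * Q(x, a)
    c      : truncation threshold
    """
    rho = pi / mu
    # Term 1: E_{a~mu}[ min(c, rho(a)) * f(a) ]
    truncated = np.sum(mu * np.minimum(c, rho) * f)
    # Term 2: E_{a~pi}[ [(rho(a) - c) / rho(a)]_+ * f(a) ]
    correction = np.sum(pi * np.maximum(0.0, (rho - c) / rho) * f)
    return truncated, correction

pi = np.array([0.7, 0.2, 0.1])   # hypothetical target policy
mu = np.array([0.1, 0.3, 0.6])   # hypothetical behavior policy
f = np.array([1.0, -2.0, 0.5])   # hypothetical per-action values
truncated, correction = truncated_with_correction(pi, mu, f, c=2.0)
full_is = np.sum(mu * (pi / mu) * f)  # untruncated E_{a~mu}[rho * f]
```

Here only the first action has $\rho > c$, so the correction term is nonzero only for that action, and the truncated term alone differs from `full_is`.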

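Addendum: this is still built on Off-PAC's off-policy policy gradient. It only corrects for the fact that the action at the current step was drawn from the behavior policy; the mismatch in the marginal state distribution is not corrected.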

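Value estimation (learning the critic) still uses Retrace unchanged: a single MSE update is biased, but the contraction mapping guarantees eventual convergence to the correct fixed point. Since the policy is in effect trained against the critic we model ourselves, we assume the critic is unbiased when it is used in the policy gradient.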

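We use a network $Q_{\theta_v}$ to model the $Q^\pi$ inside the correction term, while the first term uses $Q^{\text{ret}}$:

$$\hat{g}^{\text{marg}} = \mathbb{E}_{x_t}\left[\mathbb{E}_{a_t \sim \mu}\left[\bar{\rho}_t \nabla_\theta \log \pi_\theta(a_t \mid x_t)\, Q^{\text{ret}}(x_t, a_t)\right] + \mathbb{E}_{a \sim \pi}\left(\left[\frac{\rho_t(a) - c}{\rho_t(a)}\right]_+ \nabla_\theta \log \pi_\theta(a \mid x_t)\, Q_{\theta_v}(x_t, a)\right)\right] \tag{8}$$

Note: the authors call Eq. (8) the truncation with bias correction trick.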

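Then, by sampling trajectories, we obtain the policy gradient estimate (the off-policy ACER gradient):

$$\hat{g}_t^{\text{acer}} = \bar{\rho}_t \nabla_\theta \log \pi_\theta(a_t \mid x_t)\left[Q^{\text{ret}}(x_t, a_t) - V_{\theta_v}(x_t)\right] + \mathbb{E}_{a \sim \pi}\left(\left[\frac{\rho_t(a) - c}{\rho_t(a)}\right]_+ \nabla_\theta \log \pi_\theta(a \mid x_t)\left[Q_{\theta_v}(x_t, a) - V_{\theta_v}(x_t)\right]\right) \tag{9}$$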

Note: a baseline $V_{\theta_v}(x_t)$ is also modeled here to reduce variance.
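For a discrete action space the expectation over $a \sim \pi$ can be computed exactly. The sketch below evaluates the off-policy ACER gradient at one time step, under several assumptions that are mine rather than the post's: the policy is a softmax over logits, the gradient is taken with respect to those logits, and the baseline is $V(x_t) = \sum_a \pi(a \mid x_t) Q_{\theta_v}(x_t, a)$. The function name `acer_gradient_logits` is hypothetical.

```python
import numpy as np

def acer_gradient_logits(logits, mu, a_t, q_ret, q_values, c=10.0):
    """ACER gradient w.r.t. softmax logits at one time step (discrete actions).

    logits   : policy logits at x_t
    mu       : behavior-policy probabilities at x_t
    a_t      : index of the action actually taken (sampled from mu)
    q_ret    : Retrace target Q^ret(x_t, a_t)
    q_values : critic estimates Q(x_t, a) for every action a
    """
    z = logits - logits.max()                 # numerically stable softmax
    pi = np.exp(z) / np.exp(z).sum()
    v = np.dot(pi, q_values)                  # baseline V(x_t)
    rho = pi / mu                             # importance weight per action
    grad_log_pi = np.eye(len(pi)) - pi        # row a = d log pi(a) / d logits

    # Truncated importance-sampling term, using the sampled action only.
    first = min(c, rho[a_t]) * grad_log_pi[a_t] * (q_ret - v)
    # Bias-correction term, as an exact expectation over a ~ pi.
    coef = pi * np.maximum(0.0, 1.0 - c / rho) * (q_values - v)
    second = coef @ grad_log_pi
    return first + second
```

When `c` is very large the correction term vanishes and the result is the ordinary importance-weighted actor-critic gradient; when `c` is small most of the signal comes from the critic-based correction term.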

Efficient TRPO

average policy network: a running average of past policies; the updated policy is constrained not to deviate far from this average.
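A running average of past policies is typically kept as a soft update of the parameters, $\theta_a \leftarrow \alpha \theta_a + (1 - \alpha)\theta$. A minimal sketch, assuming the parameters live in a dict of NumPy arrays; the name `update_average_policy` and the default `alpha` are illustrative choices.

```python
import numpy as np

def update_average_policy(theta_avg, theta, alpha=0.99):
    """Soft update: theta_avg <- alpha * theta_avg + (1 - alpha) * theta.

    theta_avg, theta : dicts mapping parameter names to NumPy arrays
    alpha            : averaging coefficient in [0, 1); closer to 1 means
                       the average policy moves more slowly
    """
    return {
        name: alpha * theta_avg[name] + (1.0 - alpha) * theta[name]
        for name in theta_avg
    }
```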

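To apply a trust region, a strategy different from TRPO is used here: a linear constraint. Writing $\phi_\theta(x_t)$ for the distribution statistics output by the policy network, $f(\cdot \mid \phi_\theta(x_t))$ for the resulting policy, $\theta_a$ for the average policy network parameters, and taking $\hat{g}_t^{\text{acer}}$ with respect to $\phi_\theta(x_t)$, we solve

$$\min_z \ \frac{1}{2}\left\|\hat{g}_t^{\text{acer}} - z\right\|_2^2 \quad \text{s.t.} \quad \nabla_{\phi_\theta(x_t)} D_{KL}\left[f(\cdot \mid \phi_{\theta_a}(x_t)) \,\|\, f(\cdot \mid \phi_\theta(x_t))\right]^T z \le \delta$$

With $k = \nabla_{\phi_\theta(x_t)} D_{KL}\left[f(\cdot \mid \phi_{\theta_a}(x_t)) \,\|\, f(\cdot \mid \phi_\theta(x_t))\right]$, the solution has the closed form

$$z^* = \hat{g}_t^{\text{acer}} - \max\left\{0, \frac{k^T \hat{g}_t^{\text{acer}} - \delta}{\|k\|_2^2}\right\} k \tag{12}$$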

Note: $z^*$ is what we ultimately use as the gradient.

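Each training iteration has two phases:

Compute the raw gradient according to Eq. (9), then obtain $z^*$ from Eq. (12).

Use the chain rule to compute the final gradient with respect to $\theta$: $\frac{\partial \phi_\theta(x_t)}{\partial \theta} z^*$.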
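The two phases can be sketched as follows. The projection implements the closed form $z^* = g - \max\{0, (k^T g - \delta) / \|k\|_2^2\}\, k$, where $g$ is the raw gradient and $k$ the gradient of the KL term, both with respect to the policy-network output $\phi$. For the chain-rule step the sketch assumes a toy linear "network" $\phi = Wx$, which is my simplification; a real implementation would instead pass $z^*$ backward through the actual network.

```python
import numpy as np

def trust_region_project(g, k, delta):
    """Closed-form solution of: min_z 0.5 * ||g - z||^2  s.t.  k^T z <= delta."""
    k_sq = float(np.dot(k, k))
    if k_sq == 0.0:
        # The KL gradient vanishes, so the constraint cannot bind for delta >= 0.
        return g.copy()
    scale = max(0.0, (float(np.dot(k, g)) - delta) / k_sq)
    return g - scale * k

def two_phase_gradient(x, g_phi, k, delta):
    """Phase 1: project the raw gradient. Phase 2: chain rule through phi = W @ x.

    For the linear map phi = W @ x, d(phi_i)/d(W_ij) = x_j, so the gradient
    with respect to W is the outer product of z* and x.
    """
    z_star = trust_region_project(g_phi, k, delta)
    return np.outer(z_star, x)
```

If the raw gradient already satisfies the constraint, it is returned unchanged; otherwise it is shifted along $k$ just far enough that $k^T z^* = \delta$.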

remark: the algorithm advanced in this section can be thought of as a general strategy for modifying the backward messages in back-propagation so as to stabilize the activations