Partial episode bootstrapping (PEB)

type

Post

status

Published

date

Apr 23, 2024

slug

rl_peb

summary

How to handle the truncation signal correctly in RL

category

Reinforcement learning

Property

Jul 24, 2025 07:34 AM

paper: Time limits in reinforcement learning (ICML2018)

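Some gym environments have a timeout mechanism (e.g., MuJoCo tasks with a time limit of 1000 steps). When the limit is reached, the trajectory is truncated and the environment returns done=True. Truncation and termination are really two different signals, though: termination is a property of the environment itself, whereas truncation is a trade-off made on the algorithm side. Newer versions of gym therefore report the two signals separately.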
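To make the distinction concrete, here is a minimal sketch of a timeout wrapper that reports the two signals separately. `ToyWalk` and `TimeLimit` are hypothetical classes written for this note, not the gym implementation; they only mimic the idea that `terminated` comes from the task and `truncated` comes from the time limit.

```python
class ToyWalk:
    """Hypothetical 1-D walk: the episode truly terminates at position 3."""

    def __init__(self):
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        self.pos += action  # action is +1 or -1
        terminated = self.pos >= 3  # a property of the task itself
        reward = 1.0 if terminated else 0.0
        return self.pos, reward, terminated


class TimeLimit:
    """Hypothetical wrapper that adds a timeout and reports it separately."""

    def __init__(self, env, max_steps):
        self.env = env
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, terminated = self.env.step(action)
        self.t += 1
        # Design choice of this sketch: the timeout is imposed from outside
        # and is only reported when the task did not end on the same step.
        truncated = (self.t >= self.max_steps) and not terminated
        return obs, reward, terminated, truncated


env = TimeLimit(ToyWalk(), max_steps=5)
env.reset()
for _ in range(5):
    obs, reward, terminated, truncated = env.step(-1)  # walk away from the goal
# The episode ended by timeout only, so the last state is not terminal.
```

A training loop would stop collecting on either flag, but only `terminated` says that the value of the next state is zero.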

Because the state reached at a timeout is not a true terminal state, a bootstrapped TD update that treats it as terminal introduces error.

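E.g., the state one step before the timeout, $s_{T-2}$, and the last state, $s_{T-1}$, may differ very little (here $r_{t+1}$ denotes the reward received on the transition out of $s_t$). Yet the target value of the former is $r_{T-1} + \gamma V(s_{T-1})$, while the target value of the latter becomes just $r_T$. For a value function trained with function approximation, two similar states end up with a gap between their targets that should not exist.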

Solution: partial-episode bootstrapping (PEB)

The method was proposed by Pardo et al. in the paper Time limits in reinforcement learning.

we continue bootstrapping from the next state during the training when the episode ends due to timeout.

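For the example above, the target value of the last state $s_{T-1}$ becomes $r_T + \gamma V(s_T)$, where $s_T$ is the state reached at the timeout.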
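A small numeric sketch shows how large the gap between the two targets can be. The numbers are made up for illustration: `v_next` stands for an assumed critic estimate of the state reached at the timeout.

```python
gamma = 0.99
reward = 1.0    # reward on the final transition before the timeout
v_next = 50.0   # assumed critic estimate of the state reached at the timeout

# Treating the timeout as a real terminal state drops the bootstrap term.
target_as_terminal = reward

# PEB keeps bootstrapping, because the timed-out state is not truly terminal.
target_peb = reward + gamma * v_next

# Size of the artificial discontinuity that the value function would have to fit.
gap = target_peb - target_as_terminal
```

With these values the terminal-style target is 1.0 while the PEB target is about 50.5, even though the neighbouring states look almost identical to the critic.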

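The state-value function of a policy $\pi$ at time $t$ can be rewritten in terms of the time-limited return $G_{t:T}$ and the value from the last state $v_\pi(S_T)$:

$$v_\pi(s) = \mathbb{E}_\pi\left[ G_{t:T} + \gamma^{T-t}\, v_\pi(S_T) \mid S_t = s \right]$$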
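As a sanity check, the decomposition into a time-limited return plus a discounted bootstrap value can be verified numerically on a toy task. The task is an assumption chosen for convenience: a reward of 1 on every step and no termination, so every state has value 1 / (1 - gamma).

```python
gamma = 0.9
t, T = 3, 10  # current step and time limit (arbitrary illustrative values)

# Assumed toy task: reward 1 on every step, never terminating,
# so the true value of every state is 1 / (1 - gamma).
v_true = 1.0 / (1.0 - gamma)

# Time-limited return: discounted rewards collected from step t up to T.
g_limited = sum(gamma**k * 1.0 for k in range(T - t))

# Add the discounted value of the state reached at the time limit.
v_reconstructed = g_limited + gamma ** (T - t) * v_true

# Dropping the bootstrap term (treating the timeout as terminal) underestimates.
v_without_bootstrap = g_limited
```

Here `v_reconstructed` matches `v_true` up to floating-point error, while `v_without_bootstrap` falls short by exactly the discounted value of the state at the time limit.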

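Trick: when computing the value target, the bootstrap term can be zeroed out (e.g., by setting the discount $\gamma$ to 0) only at true termination, so that the targets for a whole batch can be computed at once.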
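One way to sketch this trick is a mask built from the termination flag alone. The function name `td_targets` and the data layout are assumptions for illustration; plain lists are used to keep the example dependency-free, and the same expression works elementwise on array types.

```python
def td_targets(rewards, next_values, terminated, gamma=0.99):
    """One-step TD targets for a batch of transitions.

    The mask uses `terminated` only. A truncated (timed-out) transition has
    terminated=False and therefore still bootstraps from next_values.
    """
    return [
        r + gamma * (0.0 if term else 1.0) * v_next
        for r, v_next, term in zip(rewards, next_values, terminated)
    ]


# Three transitions: ordinary step, timeout (truncated), true termination.
rewards = [1.0, 1.0, 1.0]
next_values = [10.0, 10.0, 10.0]
terminated = [False, False, True]

targets = td_targets(rewards, next_values, terminated, gamma=0.9)
# The first two targets bootstrap (1 + 0.9 * 10); the last is the reward alone.
```

The timed-out transition gets the same target as the ordinary step, which is exactly the PEB behaviour. Passing a combined `done` flag instead of `terminated` would reintroduce the error described above.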

Experimental results:

The mistake described above hurts performance considerably, and PEB effectively fixes it.