The effect of multi-step methods on overestimation in Deep Reinforcement Learning

Lingheng Meng, Rob Gorbet, Dana Kulić

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review

4 Citations (Scopus)

Abstract

Multi-step (also called n-step) methods in Reinforcement Learning (RL) have been shown, both theoretically and empirically, to be more efficient than 1-step methods due to faster propagation of the reward signal in tasks using a tabular representation of the value function. Recently, research in Deep Reinforcement Learning (DRL) has also shown that multi-step methods improve learning speed and final performance in applications where the value function and policy are represented with deep neural networks. However, there is a lack of understanding of what actually contributes to this boost in performance. In this work, we analyze the effect of multi-step methods on alleviating the overestimation problem in DRL, where multi-step experiences are sampled from a replay buffer. Specifically, building on Deep Deterministic Policy Gradient (DDPG), we propose Multi-step DDPG (MDDPG), where different step sizes are manually set, and a variant called Mixed Multi-step DDPG (MMDDPG), where an average over different multi-step backups is used as the update target for the Q-value function. Empirically, we show that both MDDPG and MMDDPG are significantly less affected by the overestimation problem than DDPG with 1-step backup, which consequently results in better final performance and faster learning. We also discuss the advantages and disadvantages of different ways to perform multi-step expansion in order to reduce approximation error, and expose the tradeoff between overestimation and underestimation that underlies offline multi-step methods. Finally, we compare the computational resource needs of MDDPG and MMDDPG with those of Twin Delayed Deep Deterministic Policy Gradient (TD3), a state-of-the-art algorithm proposed to address overestimation in actor-critic methods, since the three show comparable final performance and learning speed.
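As context for the abstract, the n-step backup and its mixture can be sketched in a few lines. This is a hypothetical illustration, not the authors' implementation: the function names (`n_step_target`, `mixed_target`) and the plain-list interface are assumptions, and the bootstrapped Q-values stand in for target-network evaluations Q(s_{t+n}, μ(s_{t+n})).

```python
# Hypothetical sketch of multi-step TD targets of the kind discussed in
# the abstract. Rewards are assumed to come from a trajectory segment
# sampled from a replay buffer.

def n_step_target(rewards, gamma, bootstrap_q):
    """n-step target: discounted sum of n rewards plus a bootstrapped Q-value.

    rewards:     [r_t, ..., r_{t+n-1}] from the sampled segment
    gamma:       discount factor
    bootstrap_q: stand-in for Q(s_{t+n}, mu(s_{t+n})) from target networks
    """
    n = len(rewards)
    discounted = sum(gamma ** k * r for k, r in enumerate(rewards))
    return discounted + gamma ** n * bootstrap_q

def mixed_target(rewards, gamma, bootstrap_qs):
    """Average of the 1-step through n-step targets (an MMDDPG-style mixture).

    bootstrap_qs[k] is the bootstrapped Q-value after k+1 steps.
    """
    targets = [n_step_target(rewards[:k + 1], gamma, bootstrap_qs[k])
               for k in range(len(rewards))]
    return sum(targets) / len(targets)

# Example: a 3-step target with discount 0.99.
y = n_step_target([1.0, 0.5, 0.25], gamma=0.99, bootstrap_q=2.0)
```

Larger n propagates reward information further per update but bootstraps from a state reached by the (possibly stale) behavior that generated the buffer, which is one face of the overestimation/underestimation tradeoff the paper examines.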

Original language: English
Title of host publication: Proceedings of ICPR 2020, 25th International Conference on Pattern Recognition
Editors: Kim Boyer, Brian C. Lovell, Marcello Pelillo, Nicu Sebe, Rene Vidal, Jingyi Yu
Place of Publication: Piscataway NJ USA
Publisher: IEEE, Institute of Electrical and Electronics Engineers
Pages: 347-353
Number of pages: 7
ISBN (Electronic): 9781728188089
ISBN (Print): 9781728188096
DOIs
Publication status: Published - 2021
Event: International Conference on Pattern Recognition 2020 - Virtual, Milano, Italy
Duration: 10 Jan 2021 – 15 Jan 2021
Conference number: 25th
https://ieeexplore-ieee-org.ezproxy.lib.monash.edu.au/xpl/conhome/9411940/proceeding (Proceedings)
https://www.micc.unifi.it/icpr2020/ (Website)

Publication series

Name: Proceedings - International Conference on Pattern Recognition
Publisher: IEEE, Institute of Electrical and Electronics Engineers
ISSN (Print): 1051-4651

Conference

Conference: International Conference on Pattern Recognition 2020
Abbreviated title: ICPR 2020
Country/Territory: Italy
City: Milano
Period: 10/01/21 – 15/01/21
Internet address
