TY - GEN
T1 - An Empirical Study of On-Policy and Off-Policy Actor-Critic Algorithms in the Context of Exploration-Exploitation Dilemma
AU - Seshagiri, Supriya
AU - Prema, K. V.
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - The Exploration-Exploitation dilemma in Reinforcement Learning (RL) is the choice between selecting a sub-optimal path to the outcome, thereby acquiring a more varied understanding of the environment, and selecting the greedy path to maximize rewards. It is a fundamental challenge in RL that influences the learning efficiency of algorithms. Whether an RL algorithm is designed to be on-policy or off-policy affects its ability to explore non-greedy actions and, in turn, its learning ability. This paper presents the results of experiments conducted to analyze the effect of on-policy versus off-policy design and of entropy-based exploration in actor-critic algorithms, and to investigate the root causes of effective learning in these algorithms. An empirical comparison of the off-policy Soft Actor-Critic (SAC) and the on-policy Proximal Policy Optimization (PPO) algorithms is performed in several continuous OpenAI Gym environments, and the effect of exploration strategies such as entropy regularization, off-policy target updates, and the Generalized Advantage Estimation (GAE) factor on the bias-variance balance of these algorithms is analyzed.
AB - The Exploration-Exploitation dilemma in Reinforcement Learning (RL) is the choice between selecting a sub-optimal path to the outcome, thereby acquiring a more varied understanding of the environment, and selecting the greedy path to maximize rewards. It is a fundamental challenge in RL that influences the learning efficiency of algorithms. Whether an RL algorithm is designed to be on-policy or off-policy affects its ability to explore non-greedy actions and, in turn, its learning ability. This paper presents the results of experiments conducted to analyze the effect of on-policy versus off-policy design and of entropy-based exploration in actor-critic algorithms, and to investigate the root causes of effective learning in these algorithms. An empirical comparison of the off-policy Soft Actor-Critic (SAC) and the on-policy Proximal Policy Optimization (PPO) algorithms is performed in several continuous OpenAI Gym environments, and the effect of exploration strategies such as entropy regularization, off-policy target updates, and the Generalized Advantage Estimation (GAE) factor on the bias-variance balance of these algorithms is analyzed.
UR - https://www.scopus.com/pages/publications/85180803108
U2 - 10.1109/ICETCI58599.2023.10331400
DO - 10.1109/ICETCI58599.2023.10331400
M3 - Conference contribution
AN - SCOPUS:85180803108
T3 - Proceedings of the 2023 International Conference on Emerging Techniques in Computational Intelligence, ICETCI 2023
SP - 238
EP - 243
BT - Proceedings of the 2023 International Conference on Emerging Techniques in Computational Intelligence, ICETCI 2023
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 3rd International Conference on Emerging Techniques in Computational Intelligence, ICETCI 2023
Y2 - 21 September 2023 through 23 September 2023
ER -