Exploration Decay Policy (EDP) to Enhanced Exploration-Exploitation Trade-Off in DDPG for Continuous Action Control Optimization

Bibliographic Details
Published in: 2023 IEEE 21st Student Conference on Research and Development (SCOReD), pp. 19 - 26
Main Authors: Sumiea, Ebrahim Hamid; AbdulKadir, Said Jadid; Alhussian, Hitham; Al-Selwi, Safwan Mahmood; Ragab, Mohammed Gamal; Alqushaibi, Alawi
Format: Conference Proceeding
Language: English
Published: IEEE, 13.12.2023
Summary: The optimization of continuous action control tasks is a crucial step in deep reinforcement learning (DRL) applications. The goal is to identify optimal actions through the accumulation of experience in continuous action control tasks. This process can be achieved through DRL, which trains agents to develop a policy that maximizes the cumulative rewards gained from decision-making in dynamic environments. Balancing exploration and exploitation is a crucial challenge in acquiring this policy. To address the exploration-exploitation trade-off, the Exploration Decay Policy (EDP) implements a dynamic exploration noise strategy that adapts to the current training progress, enabling efficient exploration in the initial phases while gradually reducing exploration to focus on exploitation as training progresses. However, the fluctuating training stability across episodes in dynamic environments poses a challenge for exploitation policies to adapt accordingly. In this paper, we propose EDP to address the exploration-exploitation trade-off dilemma. The objective is to dynamically modulate the noise scale, maintaining it during periods of low training stability to promote exploration and gradually decreasing it during periods of high training stability to favor exploitation. The study introduces the EDP-DDPG method, enhancing continuous control tasks in Box2D environments. EDP-DDPG outperforms the standard DDPG by achieving higher rewards and quicker convergence. Its success stems from dynamically adjusting the exploration noise every 25 episodes, balancing exploration and exploitation. This adaptive approach, which reduces the noise by 10% every 25 episodes, lets the agent's behavior evolve from random to strategic limb movements, optimizing policy exploitation and adaptability in dynamic settings.
ISSN: 2643-2447
DOI: 10.1109/SCOReD60679.2023.10563810
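
Code sketch (not from the paper): the summary above describes the core EDP mechanism as a 10% reduction of the exploration-noise scale every 25 episodes within DDPG. The minimal Python sketch below illustrates such a schedule under stated assumptions; the names (ExplorationDecaySchedule, noisy_action), the use of Gaussian action noise, the initial scale, the minimum floor, and the action clipping range are illustrative assumptions, not the authors' implementation.

import numpy as np


class ExplorationDecaySchedule:
    """Decays the exploration-noise scale by a fixed factor at a fixed episode interval."""

    def __init__(self, initial_scale=1.0, decay_rate=0.10, interval=25, min_scale=0.01):
        self.scale = initial_scale    # current noise scale applied to actions (assumed start value)
        self.decay_rate = decay_rate  # 10% reduction per schedule step, as stated in the abstract
        self.interval = interval      # apply the reduction every 25 episodes, as stated in the abstract
        self.min_scale = min_scale    # floor so some exploration always remains (assumption)

    def update(self, episode):
        """Call once at the end of each episode; returns the scale to use in the next episode."""
        if episode > 0 and episode % self.interval == 0:
            self.scale = max(self.scale * (1.0 - self.decay_rate), self.min_scale)
        return self.scale


def noisy_action(policy_action, noise_scale, low=-1.0, high=1.0):
    """Adds zero-mean Gaussian exploration noise to the deterministic DDPG action and clips it."""
    noise = np.random.normal(0.0, noise_scale, size=np.shape(policy_action))
    return np.clip(policy_action + noise, low, high)


# Hypothetical usage inside a DDPG training loop:
# schedule = ExplorationDecaySchedule()
# for episode in range(num_episodes):
#     ...
#     action = noisy_action(actor(state), schedule.scale)
#     ...
#     schedule.update(episode)

In this sketch the decay is applied on a fixed episode schedule; the paper additionally ties the modulation to training stability, which would require an extra stability signal (e.g., recent reward variance) that is not shown here.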