Is Vanilla Policy Gradient Overlooked? Analyzing Deep Reinforcement Learning for Hanabi

Bibliographic Details
Main Authors: Grooten, Bram; Wemmenhove, Jelle; Poot, Maurice; Portegies, Jim
Format: Journal Article
Language: English
Published: 22.03.2022
Summary: In pursuit of enhanced multi-agent collaboration, we analyze several on-policy deep reinforcement learning algorithms in the recently published Hanabi benchmark. Our research suggests a perhaps counter-intuitive finding: Proximal Policy Optimization (PPO) is outperformed by Vanilla Policy Gradient over multiple random seeds in a simplified environment of the multi-agent cooperative card game. In our analysis of this behavior, we look into Hanabi-specific metrics and hypothesize a reason for PPO's plateau. In addition, we provide proofs for the maximum length of a perfect game (71 turns) and of any game (89 turns). Our code can be found at: https://github.com/bramgrooten/DeepRL-for-Hanabi
DOI: 10.48550/arxiv.2203.11656
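
To make the comparison in the summary concrete, here is a minimal sketch of the two objectives the paper contrasts. This is an illustrative PyTorch snippet, not code from the authors' repository; the function names, the toy batch, and the clip coefficient clip_eps=0.2 are assumptions chosen for the example.

    import torch

    def vpg_loss(log_probs, advantages):
        # Vanilla Policy Gradient: maximize E[log pi(a|s) * A],
        # so minimize the negative of that expectation.
        return -(log_probs * advantages).mean()

    def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
        # PPO clipped surrogate: the probability ratio pi_new / pi_old is
        # clipped so a single update cannot move the policy too far.
        ratio = torch.exp(log_probs - old_log_probs)
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
        return -torch.min(ratio * advantages, clipped * advantages).mean()

    # Toy batch: log-probabilities of the actions taken and their advantage estimates.
    log_probs = torch.tensor([-1.2, -0.8, -2.1], requires_grad=True)
    old_log_probs = (log_probs + torch.tensor([0.1, -0.05, 0.2])).detach()
    advantages = torch.tensor([0.5, -0.3, 1.1])

    print("VPG loss:", vpg_loss(log_probs, advantages).item())
    print("PPO loss:", ppo_clip_loss(log_probs, old_log_probs, advantages).item())

The structural difference is that VPG's update is unconstrained while PPO's clipping caps the per-update policy change; the abstract does not state the authors' hypothesized reason for PPO's plateau, so this sketch only illustrates the objectives being compared.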