Is Vanilla Policy Gradient Overlooked? Analyzing Deep Reinforcement Learning for Hanabi
Main Authors | , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 22.03.2022 |
Summary | In pursuit of enhanced multi-agent collaboration, we analyze several on-policy deep reinforcement learning algorithms in the recently published Hanabi benchmark. Our research suggests a perhaps counter-intuitive finding: Proximal Policy Optimization (PPO) is outperformed by Vanilla Policy Gradient over multiple random seeds in a simplified environment of the multi-agent cooperative card game. In our analysis of this behavior we look into Hanabi-specific metrics and hypothesize a reason for PPO's plateau. In addition, we provide proofs for the maximum length of a perfect game (71 turns) and of any game (89 turns). Our code can be found at: https://github.com/bramgrooten/DeepRL-for-Hanabi |
DOI | 10.48550/arxiv.2203.11656 |
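
The abstract contrasts two on-policy objectives, Vanilla Policy Gradient and PPO. As a rough illustration only, and not the authors' implementation (which is available in the linked repository), the sketch below shows the standard VPG surrogate loss next to PPO's clipped surrogate loss in PyTorch; the tensor names `logp`, `logp_old`, and `adv` are assumed placeholders for per-timestep log-probabilities and advantage estimates.

```python
# Minimal sketch (not the paper's code) of the two surrogate objectives
# named in the abstract: Vanilla Policy Gradient vs. PPO's clipped loss.
import torch


def vpg_loss(logp: torch.Tensor, adv: torch.Tensor) -> torch.Tensor:
    """Vanilla Policy Gradient surrogate: -E[ log pi(a|s) * A ]."""
    return -(logp * adv).mean()


def ppo_clip_loss(logp: torch.Tensor, logp_old: torch.Tensor,
                  adv: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    """PPO clipped surrogate: -E[ min(r * A, clip(r, 1-eps, 1+eps) * A) ],
    where r = pi(a|s) / pi_old(a|s) is the importance ratio."""
    ratio = torch.exp(logp - logp_old)                      # r
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()


if __name__ == "__main__":
    # Dummy batch: log-probs under current/old policies and advantage estimates.
    logp = torch.randn(32, requires_grad=True)
    logp_old = logp.detach() + 0.1 * torch.randn(32)
    adv = torch.randn(32)
    print("VPG loss:", vpg_loss(logp, adv).item())
    print("PPO loss:", ppo_clip_loss(logp, logp_old, adv).item())
```

The `clip_eps` term is what bounds the size of PPO's policy updates relative to the old policy; the paper's comparison concerns how these two objectives perform in a simplified Hanabi environment.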