A multi-action deep reinforcement learning framework for flexible Job-shop scheduling problem
| Published in | Expert Systems with Applications, Vol. 205, p. 117796 |
|---|---|
| Main Authors | , , , , , , |
| Format | Journal Article |
| Language | English |
| Published | Elsevier Ltd, 01.11.2022 |
Summary:
- An end-to-end DRL-based framework is introduced to solve the FJSP.
- Multi-PPO is used to learn job operation action and machine action sub-policies in MPGN.
- The proposed DRL shows its robustness via random and benchmark test instances.
This paper presents an end-to-end deep reinforcement learning framework that automatically learns a policy for solving the flexible Job-shop scheduling problem (FJSP) using a graph neural network. In the FJSP environment, the reinforcement learning agent must schedule an operation belonging to a job on an eligible machine chosen from a set of compatible machines at each timestep, so the agent has to control multiple actions simultaneously. Such a multi-action problem is formulated as a multiple Markov decision process (MMDP). To solve the MMDP, we propose a multi-pointer graph network (MPGN) architecture and a training algorithm called multi-Proximal Policy Optimization (multi-PPO) that learns two sub-policies: a job operation action policy and a machine action policy that assigns the selected job operation to a machine. The MPGN architecture consists of two encoder-decoder components, which define the job operation action policy and the machine action policy for predicting probability distributions over operations and machines, respectively. We introduce a disjunctive graph representation of the FJSP and use a graph neural network to embed the local state encountered during scheduling. The computational experiments show that the agent learns a high-quality dispatching policy, outperforming handcrafted heuristic dispatching rules in solution quality and meta-heuristic algorithms in running time. Moreover, the results on random and benchmark instances demonstrate that the learned policies generalize well to real-world instances and to significantly larger-scale instances with up to 2000 operations.
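The abstract describes a hierarchical two-sub-policy decision at each timestep: first an operation sub-policy selects an eligible job operation, then a machine sub-policy selects a compatible machine for it. The sketch below illustrates that action-selection structure only; it is not the authors' implementation, and the embedding dimensions, layer shapes, and all names (e.g. `TwoSubPolicyDecoder`, `op_scorer`) are assumptions made for the example.

```python
# Illustrative sketch of the two-sub-policy (operation, then machine) action
# selection described in the abstract. All sizes and names are assumptions.
import torch
import torch.nn as nn
from torch.distributions import Categorical

class TwoSubPolicyDecoder(nn.Module):
    def __init__(self, embed_dim: int = 64):
        super().__init__()
        # Scores each operation node embedding (e.g. produced by a GNN over
        # the disjunctive graph) against a pooled embedding of the state.
        self.op_scorer = nn.Linear(2 * embed_dim, 1)
        # Scores each machine embedding conditioned on the chosen operation.
        self.mach_scorer = nn.Linear(2 * embed_dim, 1)

    def forward(self, op_emb, mach_emb, op_mask, mach_mask, state_emb):
        # op_emb:    (num_ops, d)       operation node embeddings
        # mach_emb:  (num_machines, d)  machine embeddings
        # op_mask / mach_mask: bool tensors, True where the action is feasible
        # state_emb: (d,)               pooled embedding of the current state

        # --- operation sub-policy ---
        ctx = state_emb.expand(op_emb.size(0), -1)
        op_logits = self.op_scorer(torch.cat([op_emb, ctx], dim=-1)).squeeze(-1)
        op_logits = op_logits.masked_fill(~op_mask, float("-inf"))
        op_dist = Categorical(logits=op_logits)
        op_action = op_dist.sample()

        # --- machine sub-policy, conditioned on the chosen operation ---
        chosen = op_emb[op_action].expand(mach_emb.size(0), -1)
        m_logits = self.mach_scorer(torch.cat([mach_emb, chosen], dim=-1)).squeeze(-1)
        m_logits = m_logits.masked_fill(~mach_mask, float("-inf"))
        mach_dist = Categorical(logits=m_logits)
        mach_action = mach_dist.sample()

        # Joint log-probability that a PPO-style update could use for the
        # clipped surrogate objective of each sub-policy.
        log_prob = op_dist.log_prob(op_action) + mach_dist.log_prob(mach_action)
        return op_action.item(), mach_action.item(), log_prob
```

Masking infeasible operations and machines with `-inf` logits keeps the sampled actions valid at every timestep, which matches the abstract's requirement that only eligible machines from the compatible set can be chosen.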
ISSN: 0957-4174, 1873-6793
DOI: 10.1016/j.eswa.2022.117796