HeterPS: Distributed Deep Learning With Reinforcement Learning Based Scheduling in Heterogeneous Environments

Deep neural networks (DNNs) exploit many layers and a large number of parameters to achieve excellent performance. The training process of DNN models generally handles large-scale input data with many sparse features, which incurs high Input/Output (IO) cost, while some layers are compute-intensive....

Full description

Saved in:
Bibliographic Details
Published in arXiv.org
Main Authors Liu, Ji, Wu, Zhihua, Yu, Dianhai, Ma, Yanjun, Feng, Danlei, Zhang, Minxu, Wu, Xinxuan, Yao, Xuefeng, Dou, Dejing
Format Paper
Language English
Published Ithaca: Cornell University Library, arXiv.org, 07.06.2023
Subjects
Online Access Get full text

Abstract Deep neural networks (DNNs) exploit many layers and a large number of parameters to achieve excellent performance. The training process of DNN models generally handles large-scale input data with many sparse features, which incurs high Input/Output (IO) cost, while some layers are compute-intensive. The training process generally exploits distributed computing resources to reduce training time. In addition, heterogeneous computing resources, e.g., CPUs, GPUs of multiple types, are available for the distributed training process. Thus, the scheduling of multiple layers to diverse computing resources is critical for the training process. To efficiently train a DNN model using the heterogeneous computing resources, we propose a distributed framework, i.e., Paddle-Heterogeneous Parameter Server (Paddle-HeterPS), composed of a distributed architecture and a Reinforcement Learning (RL)-based scheduling method. The advantages of Paddle-HeterPS are three-fold compared with existing frameworks. First, Paddle-HeterPS enables efficient training process of diverse workloads with heterogeneous computing resources. Second, Paddle-HeterPS exploits an RL-based method to efficiently schedule the workload of each layer to appropriate computing resources to minimize the cost while satisfying throughput constraints. Third, Paddle-HeterPS manages data storage and data communication among distributed computing resources. We carry out extensive experiments to show that Paddle-HeterPS significantly outperforms state-of-the-art approaches in terms of throughput (14.5 times higher) and monetary cost (312.3% smaller). The codes of the framework are publicly available at: https://github.com/PaddlePaddle/Paddle.
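The abstract describes placing each layer's workload on the computing resource that minimizes monetary cost while still meeting a throughput target (for example, IO-bound sparse layers on CPUs and compute-intensive dense layers on GPUs). The sketch below illustrates that placement decision with a toy greedy cost model; all class names, prices, and throughput figures are hypothetical assumptions for exposition, and the actual Paddle-HeterPS scheduler is RL-based rather than this greedy rule.

from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    cpu_throughput: float  # samples/s on a CPU worker (assumed figure)
    gpu_throughput: float  # samples/s on a GPU worker (assumed figure)

# Assumed per-hour prices, purely illustrative.
CPU_PRICE = 0.05
GPU_PRICE = 0.90

def place_layers(layers, required_throughput):
    """Place each layer on the cheapest resource that still meets the
    per-layer throughput target (every stage must keep up with the pipeline)."""
    plan = {}
    for layer in layers:
        candidates = []
        if layer.cpu_throughput >= required_throughput:
            candidates.append(("CPU", CPU_PRICE))
        if layer.gpu_throughput >= required_throughput:
            candidates.append(("GPU", GPU_PRICE))
        if not candidates:
            raise ValueError(f"{layer.name}: no single resource meets the target")
        device, _ = min(candidates, key=lambda c: c[1])
        plan[layer.name] = device
    return plan

if __name__ == "__main__":
    layers = [
        Layer("sparse_embedding", cpu_throughput=12000, gpu_throughput=9000),
        Layer("dense_mlp", cpu_throughput=3000, gpu_throughput=15000),
    ]
    # Expected placement: the IO-bound embedding stays on CPU, the dense MLP goes to GPU.
    print(place_layers(layers, required_throughput=8000))

An RL-based scheduler would instead learn a placement policy from observed throughput and cost, rather than relying on fixed estimates like the ones assumed above.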
Author Liu, Ji
Yao, Xuefeng
Zhang, Minxu
Yu, Dianhai
Dou, Dejing
Wu, Xinxuan
Wu, Zhihua
Ma, Yanjun
Feng, Danlei
ContentType Paper
Copyright 2023. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
DatabaseName ProQuest SciTech Collection
ProQuest Technology Collection
Materials Science & Engineering Collection
ProQuest Central (Alumni)
ProQuest Central
ProQuest Central Essentials
ProQuest Central
Technology Collection
ProQuest One Community College
ProQuest Central Korea
SciTech Premium Collection
ProQuest Engineering Collection
Engineering Database
Publicly Available Content Database
ProQuest One Academic Eastern Edition (DO NOT USE)
ProQuest One Academic
ProQuest One Academic UKI Edition
ProQuest Central China
Engineering Collection
Discipline Physics
EISSN 2331-8422
Genre Working Paper/Pre-Print
IsOpenAccess true
IsPeerReviewed false
IsScholarly false
Language English
OpenAccessLink https://www.proquest.com/docview/2601159810
PQID 2601159810
PQPubID 2050157
ParticipantIDs proquest_journals_2601159810
PublicationDate 2023-06-07
PublicationPlace Ithaca
PublicationTitle arXiv.org
PublicationYear 2023
Publisher Cornell University Library, arXiv.org
SecondaryResourceType preprint
SourceID proquest
SourceType Aggregation Database
SubjectTerms Artificial neural networks
Computer networks
Data communication
Data storage
Deep learning
Distributed processing
Machine learning
Mathematical models
Parameters
Scheduling
Training
Workload
Title HeterPS: Distributed Deep Learning With Reinforcement Learning Based Scheduling in Heterogeneous Environments
URI https://www.proquest.com/docview/2601159810
hasFullText 1
inHoldings 1
linkProvider ProQuest