LSTM Training System Based on Heterogeneous Hardware (基于异构硬件的LSTM训练系统)

In the era of big data, deep neural network models represented by LSTM can process massive data and perform excellently in fields such as language processing, speech recognition, and time-series prediction. However, as model complexity grows, training cost rises significantly. Existing LSTM training systems use acceleration techniques such as operator fusion and multi-stream execution, but they neglect the parallelism available inside a single training operator, which leads to low utilization of computing resources and long training times. This paper therefore designs TurboLSTM, an LSTM training system based on a fine-grained model partitioning method and a multi-stream parallel scheduling strategy. New underlying training operators, built on two kinds of heterogeneous hardware (NVIDIA GPUs and the domestic Ascend NPU), let tasks make reasonable use of computing resources. Compared with existing training systems, TurboLSTM shortens single-operator training time by about 23% and overall model training time by about 17% on the GPU, and shortens single-operator training time by about 15% on the NPU, while significantly improving the utilization of computing resources. This shows that the proposed acceleration scheme is efficient and generalizes well.
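
The abstract's key claim is that a single LSTM training operator still contains exploitable parallelism. The sketch below illustrates the general idea under stated assumptions, in PyTorch rather than the paper's custom GPU/NPU operators: the four gate projections of an LSTM cell are mutually independent, so they can be split out of the usual fused GEMM (a fine-grained partition) and dispatched on separate CUDA streams (multi-stream scheduling). All tensor names and shapes are hypothetical, not taken from the paper.

```python
# Minimal sketch, NOT TurboLSTM's actual implementation: partition the fused
# LSTM gate projection into four independent GEMMs and schedule them on
# separate CUDA streams so they may overlap on idle compute units.
# Requires a CUDA-capable GPU; all sizes are hypothetical.
import torch

batch, hidden = 32, 256
dev = "cuda"
x = torch.randn(batch, hidden, device=dev)   # input at one time step
h = torch.randn(batch, hidden, device=dev)   # previous hidden state
c = torch.randn(batch, hidden, device=dev)   # previous cell state

# One weight matrix per gate (i, f, g, o). A fused kernel would concatenate
# these into a single [2*hidden, 4*hidden] GEMM; here they stay separate.
ws = [torch.randn(2 * hidden, hidden, device=dev) for _ in range(4)]
streams = [torch.cuda.Stream() for _ in range(4)]

xh = torch.cat([x, h], dim=1)                # shared [batch, 2*hidden] input
torch.cuda.synchronize()                     # make xh visible to side streams

gates = [None] * 4
for k in range(4):
    with torch.cuda.stream(streams[k]):      # each gate GEMM on its own stream
        gates[k] = xh @ ws[k]
torch.cuda.synchronize()                     # join before the pointwise stage

i = torch.sigmoid(gates[0])                  # input gate
f = torch.sigmoid(gates[1])                  # forget gate
g = torch.tanh(gates[2])                     # candidate cell state
o = torch.sigmoid(gates[3])                  # output gate
c_next = f * c + i * g                       # standard LSTM cell update
h_next = o * torch.tanh(c_next)
```

Whether four small GEMMs beat one fused GEMM depends on kernel-launch overhead and how saturated the device already is, which is presumably why the paper builds new underlying operators for GPU and NPU rather than composing framework calls as above.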

Bibliographic Details
Published in 大数据 (Big Data Research), Vol. 10, No. 4, pp. 172-188
Main Authors 黄为新 (HUANG Weixin), 胡伟方 (HU Weifang), 曹雪娇 (CAO Xuejiao), 石宣化 (SHI Xuanhua)
Format Journal Article
Language Chinese
Published 人民邮电出版社有限公司 (China InfoCom Media Group), 15.07.2024
Author Affiliations 华中科技大学计算机科学与技术学院,湖北武汉 430074;华中科技大学大数据技术与系统国家地方联合工程研究中心,服务计算技术与系统教育部重点实验室,湖北武汉 430074
Subjects 训练加速 (training acceleration), LSTM, 多流调度 (multi-stream scheduling), 细粒度并行 (fine-grained parallelism)
Classification Code TP183
Online Access https://doaj.org/article/4b592d59205c4dfda94beadfdf903d46 ; https://d.wanfangdata.com.cn/periodical/dasj202404014
ISSN 2096-0271
DOI 10.11959/j.issn.2096-0271.2024053
