LSTM Training System Based on Heterogeneous Hardware (基于异构硬件的LSTM训练系统)
Published in | 大数据 Vol. 10; no. 4; pp. 172 - 188 |
---|---|
Main Authors | 黄为新 (HUANG Weixin), 胡伟方 (HU Weifang), 曹雪娇 (CAO Xuejiao), 石宣化 (SHI Xuanhua) |
Format | Journal Article |
Language | Chinese |
Published | 人民邮电出版社有限公司 (China InfoCom Media Group), 15.07.2024 |
Subjects | LSTM; multi-stream scheduling; fine-grained parallelism; training acceleration |
Online Access | https://doaj.org/article/4b592d59205c4dfda94beadfdf903d46
ISSN | 2096-0271 |
DOI | 10.11959/j.issn.2096-0271.2024053 |
Abstract | In the era of big data, deep neural network models represented by LSTM can process massive data and perform well in language processing, speech recognition, and time series prediction. As model complexity grows, however, training costs rise sharply. Existing LSTM training systems use acceleration techniques such as operator fusion and multi-stream execution, but they overlook the parallelism available inside individual training operators, leading to low utilization of computing resources and long overall training times. To address this, TurboLSTM, an LSTM training system based on fine-grained model partitioning and a multi-stream parallel scheduling method, was designed; new low-level training operators built for two kinds of heterogeneous hardware, NVIDIA GPUs and the domestic Ascend NPU, let tasks make sound use of computing resources. Compared with existing training systems, TurboLSTM shortens single-operator training time by 23% and overall model training time by 17% on the GPU, and shortens single-operator training time by 15% on the NPU, with a significant increase in the utilization of computing resources. This shows that the proposed acceleration scheme is efficient and generalizes well. |
---|---|
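The abstract's key claim is technical: inside a single LSTM training operator, the four gate projections are mutually independent, so fine-grained partitioning plus multi-stream scheduling can overlap them instead of running them back to back. The sketch below is not TurboLSTM's implementation (the paper builds custom low-level operators for NVIDIA GPUs and Ascend NPUs); it is a minimal PyTorch illustration of the idea for one cell step, with hypothetical shapes and names, assuming a CUDA device is available.

```python
import torch

# Minimal sketch of multi-stream scheduling for one LSTM cell step.
# NOT the TurboLSTM implementation; shapes and names are hypothetical.

batch, inp, hidden = 64, 512, 512
dev = torch.device("cuda")

x = torch.randn(batch, inp, device=dev)      # step input
h = torch.randn(batch, hidden, device=dev)   # previous hidden state
c = torch.randn(batch, hidden, device=dev)   # previous cell state

# One (W, U, b) parameter triple per gate: input, forget, cell, output.
gates = [(torch.randn(inp, hidden, device=dev),
          torch.randn(hidden, hidden, device=dev),
          torch.zeros(hidden, device=dev)) for _ in range(4)]

streams = [torch.cuda.Stream() for _ in range(4)]
pre = [None] * 4
default = torch.cuda.current_stream()

# Fine-grained partition: each gate's pre-activation x@W + h@U + b is an
# independent task, so the four GEMM pairs can be issued on separate
# streams and overlap instead of serializing on one stream.
for k, ((W, U, b), s) in enumerate(zip(gates, streams)):
    s.wait_stream(default)        # inputs were produced on the default stream
    with torch.cuda.stream(s):
        pre[k] = x @ W + h @ U + b

# Join point: the element-wise cell update needs all four gate outputs.
for s in streams:
    default.wait_stream(s)

i, f = torch.sigmoid(pre[0]), torch.sigmoid(pre[1])
g, o = torch.tanh(pre[2]), torch.sigmoid(pre[3])
c_next = f * c + i * g            # standard LSTM cell update
h_next = o * torch.tanh(c_next)
```

Whether the overlap pays off depends on whether one gate GEMM alone can saturate the device; the reported 23% (GPU) and 15% (NPU) single-operator gains suggest that at realistic model sizes the individual kernels leave resources idle.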
Author | 黄为新 (HUANG Weixin); 胡伟方 (HU Weifang); 曹雪娇 (CAO Xuejiao); 石宣化 (SHI Xuanhua)
AuthorAffiliation | School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, Hubei; National-Local Joint Engineering Research Center for Big Data Technology and System, and Key Laboratory of Services Computing Technology and System, Ministry of Education, Huazhong University of Science and Technology, Wuhan 430074, Hubei |
Author_FL | HU Weifang SHI Xuanhua HUANG Weixin CAO Xuejiao |
ClassificationCodes | TP183 |
ContentType | Journal Article |
Copyright | Copyright © Wanfang Data Co. Ltd. All Rights Reserved. |
DOI | 10.11959/j.issn.2096-0271.2024053 |
DatabaseName | 国家哲学社会科学文献中心 (National Center for Philosophy and Social Sciences Documentation); Wanfang Data Journals; Wanfang Data Journals - Hong Kong (万方数据期刊 - 香港版); WANFANG Data Centre; China Online Journals (COJ); DOAJ Directory of Open Access Journals |
DocumentTitle_FL | LSTM training system based on heterogeneous hardware |
EndPage | 188 |
ExternalDocumentID | oai_doaj_org_article_4b592d59205c4dfda94beadfdf903d46 dasj202404014 DSJ2024004014 |
ISSN | 2096-0271 |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | false |
IsScholarly | true |
Issue | 4 |
Keywords | LSTM; training acceleration (训练加速); multi-stream scheduling (多流调度); fine-grained parallelism (细粒度并行) |
Language | Chinese |
OpenAccessLink | https://doaj.org/article/4b592d59205c4dfda94beadfdf903d46 |
PageCount | 17 |
PublicationCentury | 2000 |
PublicationDate | 2024-07-15 |
PublicationDecade | 2020 |
PublicationTitle | 大数据 |
PublicationTitle_FL | Big Data Research |
PublicationYear | 2024 |
Publisher | 人民邮电出版社有限公司 (Posts and Telecom Press Co., Ltd.); China InfoCom Media Group |
SourceID | doaj wanfang cass |
SourceType | Open Website Aggregation Database |
StartPage | 172 |
SubjectTerms | LSTM; multi-stream scheduling; fine-grained parallelism; training acceleration |
Title | LSTM training system based on heterogeneous hardware (基于异构硬件的LSTM训练系统)
URI | https://www.ncpssd.cn/Literature/articleinfo?id=DSJ2024004014&type=eJournalArticle&typename=中文期刊文章&nav=1&langType=1&pageUrl=https%253A%252F%252Fwww.ncpssd.org%252Fjournal%252Fdetails%253Fgch%253D211192%2526nav%253D1%2526langType%253D2 https://d.wanfangdata.com.cn/periodical/dasj202404014 https://doaj.org/article/4b592d59205c4dfda94beadfdf903d46 |
Volume | 10 |