Moka: Model-based concurrent kernel analysis


Bibliographic Details
Published in 2017 IEEE International Symposium on Workload Characterization (IISWC), pp. 197-206
Main Authors Leiming Yu, Xun Gong, Yifan Sun, Qianqian Fang, Norm Rubin, David Kaeli
Format Conference Proceeding
Language English
Published IEEE 01.10.2017
Subjects
Online Access Get full text
DOI 10.1109/IISWC.2017.8167777

Abstract GPUs continue to increase the number of compute resources with each new generation. Many data-parallel applications have been re-engineered to leverage the thousands of cores on the GPU. But not every kernel can fully utilize all the resources available. Many applications contain multiple kernels that could potentially be run concurrently. To better utilize the massive resources on the GPU, device vendors have started to support Concurrent Kernel Execution (CKE). However, the application throughput provided by CKE is subject to a number of factors, including the kernel configuration attributes, the dynamic behavior of each kernel (e.g., compute-intensive vs. memory-intensive), the kernel launch order, and inter-kernel dependencies. Minor changes in any of these factors can have a large impact on the effectiveness of CKE. In this paper, we present Moka, an empirical model for tuning concurrent kernel performance. Moka allows us to accurately predict the resulting performance and scalability of multi-kernel applications when using CKE. We consider both static and dynamic workload characteristics that impact the utility of CKE, and leverage these metrics to drive kernel scheduling decisions on NVIDIA GPUs. The underlying data transfer pattern and GPU resource contention are analyzed in detail. Our model is able to accurately predict the performance ceiling of concurrent kernel execution. We validate our model using several real-world applications that have multiple kernels that can run concurrently, and evaluate CKE performance on an NVIDIA Maxwell GPU. Our model is able to predict the performance of CKE applications accurately, providing estimates that differ by less than 12% from actual runtime performance. Using our estimates, we can quickly find the best CKE strategy for our applications to achieve improved application throughput. We believe we have developed a useful tool to aid application programmers in accelerating their applications using CKE.
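For context on the mechanism the paper models: Concurrent Kernel Execution on NVIDIA GPUs is typically enabled by issuing independent kernels on separate CUDA streams, so the hardware scheduler may overlap them when resources permit. The sketch below is illustrative only (hypothetical kernel names, not taken from the paper); whether the kernels actually overlap depends on the launch configuration and per-kernel resource demands that Moka is designed to model.

// Hypothetical example: two independent kernels issued on separate CUDA
// streams, making them eligible for Concurrent Kernel Execution (CKE).
#include <cuda_runtime.h>

__global__ void kernelA(float *x, int n) {            // assumed compute-bound
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * x[i] + 1.0f;
}

__global__ void kernelB(const float *in, float *out, int n) {  // assumed memory-bound
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Launching on different non-default streams removes the implicit
    // serialization of the default stream; the device may run the two
    // kernels concurrently when registers, shared memory, and SMs allow.
    kernelA<<<(n + 255) / 256, 256, 0, s1>>>(a, n);
    kernelB<<<(n + 255) / 256, 256, 0, s2>>>(b, c, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}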
Author Xun Gong
Yifan Sun
Qianqian Fang
David Kaeli
Leiming Yu
Norm Rubin
Author_xml – sequence: 1
  surname: Leiming Yu
  fullname: Leiming Yu
  email: ylm@ece.neu.edu
– sequence: 2
  surname: Xun Gong
  fullname: Xun Gong
  email: gong.xun@husky.neu.edu
– sequence: 3
  surname: Yifan Sun
  fullname: Yifan Sun
  email: yifansun@ece.neu.edu
– sequence: 4
  surname: Qianqian Fang
  fullname: Qianqian Fang
  email: q.fang@neu.edu
– sequence: 5
  givenname: Norm
  surname: Rubin
  fullname: Rubin, Norm
  email: nrubin@nvidia.com
– sequence: 6
  givenname: David
  surname: Kaeli
  fullname: Kaeli, David
  email: kaeli@ece.neu.edu
ContentType Conference Proceeding
DBID 6IE
6IL
CBEJK
RIE
RIL
DOI 10.1109/IISWC.2017.8167777
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Xplore POP ALL
IEEE Xplore All Conference Proceedings
IEEE/IET Electronic Library
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
EISBN 153861233X
9781538612330
EndPage 206
ExternalDocumentID 8167777
Genre orig-research
IEDL.DBID RIE
IngestDate Thu Jun 29 18:37:36 EDT 2023
IsPeerReviewed false
IsScholarly false
Language English
LinkModel DirectLink
PageCount 10
ParticipantIDs ieee_primary_8167777
PublicationCentury 2000
PublicationDate 2017-Oct.
PublicationDateYYYYMMDD 2017-10-01
PublicationDate_xml – month: 10
  year: 2017
  text: 2017-Oct.
PublicationDecade 2010
PublicationTitle 2017 IEEE International Symposium on Workload Characterization (IISWC)
PublicationTitleAbbrev IISWC
PublicationYear 2017
Publisher IEEE
Publisher_xml – name: IEEE
SourceID ieee
SourceType Publisher
StartPage 197
SubjectTerms Concurrent Kernel Execution
Data transfer
Empirical Model
Engines
GPU
Graphics processing units
Instruction sets
Kernel
Performance evaluation
Tuning
Title Moka: Model-based concurrent kernel analysis
URI https://ieeexplore.ieee.org/document/8167777
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
linkProvider IEEE