Moka: Model-based concurrent kernel analysis
Published in | 2017 IEEE International Symposium on Workload Characterization (IISWC) pp. 197 - 206 |
---|---|
Main Authors | Leiming Yu, Xun Gong, Yifan Sun, Qianqian Fang, Norm Rubin, David Kaeli |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 01.10.2017 |
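Subjects | Concurrent Kernel Execution; Data transfer; Empirical Model; Engines; GPU; Graphics processing units; Instruction sets; Kernel; Performance evaluation; Tuning |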
DOI | 10.1109/IISWC.2017.8167777 |
Abstract | GPUs continue to increase the number of compute resources with each new generation. Many data-parallel applications have been re-engineered to leverage the thousands of cores on the GPU. But not every kernel can fully utilize all the resources available. Many applications contain multiple kernels that could potentially be run concurrently. To better utilize the massive resources on the GPU, device vendors have started to support Concurrent Kernel Execution (CKE). However, the application throughput provided by CKE is subject to a number of factors, including the kernel configuration attributes, the dynamic behavior of each kernel (e.g., compute-intensive vs. memory-intensive), the kernel launch order and inter-kernel dependencies. Minor changes in any of these factors can have a large impact on the effectiveness of CKE. In this paper, we present Moka, an empirical model for tuning concurrent kernel performance. Moka allows us to accurately predict the resulting performance and scalability of multi-kernel applications when using CKE. We consider both static and dynamic workload characteristics that impact the utility of CKE, and leverage these metrics to drive kernel scheduling decisions on NVIDIA GPUs. The underlying data transfer pattern and GPU resource contention are analyzed in detail. Our model is able to accurately predict the performance ceiling of concurrent kernel execution. We validate our model using several real-world applications that have multiple kernels that can run concurrently, and evaluate CKE performance on an NVIDIA Maxwell GPU. Our model is able to predict the performance of CKE applications accurately, providing estimates that differ by less than 12% from actual runtime performance. Using our estimates, we can quickly find the best CKE strategy for our applications to achieve improved application throughput. We believe we have developed a useful tool to help application programmers accelerate their applications using CKE. |
---|---|
Author | Xun Gong; Yifan Sun; Qianqian Fang; Kaeli, David; Leiming Yu; Rubin, Norm |
Author_xml | – sequence: 1 surname: Leiming Yu fullname: Leiming Yu email: ylm@ece.neu.edu – sequence: 2 surname: Xun Gong fullname: Xun Gong email: gong.xun@husky.neu.edu – sequence: 3 surname: Yifan Sun fullname: Yifan Sun email: yifansun@ece.neu.edu – sequence: 4 surname: Qianqian Fang fullname: Qianqian Fang email: q.fang@neu.edu – sequence: 5 givenname: Norm surname: Rubin fullname: Rubin, Norm email: nrubin@nvidia.com – sequence: 6 givenname: David surname: Kaeli fullname: Kaeli, David email: kaeli@ece.neu.edu |
ContentType | Conference Proceeding |
DBID | 6IE 6IL CBEJK RIE RIL |
DOI | 10.1109/IISWC.2017.8167777 |
DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings; IEEE Xplore POP ALL; IEEE Xplore All Conference Proceedings; IEEE/IET Electronic Library; IEEE Proceedings Order Plans (POP All) 1998-Present |
Database_xml | – sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/ sourceTypes: Publisher |
DeliveryMethod | fulltext_linktorsrc |
EISBN | 153861233X 9781538612330 |
EndPage | 206 |
ExternalDocumentID | 8167777 |
Genre | orig-research |
GroupedDBID | 6IE 6IL CBEJK RIE RIL |
ID | FETCH-LOGICAL-i90t-ec0970ed9263dcd0699f8f36c8cecf627a1c821990c7eb05fbdf85d4445c1e83 |
IEDL.DBID | RIE |
IngestDate | Thu Jun 29 18:37:36 EDT 2023 |
IsPeerReviewed | false |
IsScholarly | false |
Language | English |
LinkModel | DirectLink |
MergedId | FETCHMERGED-LOGICAL-i90t-ec0970ed9263dcd0699f8f36c8cecf627a1c821990c7eb05fbdf85d4445c1e83 |
PageCount | 10 |
ParticipantIDs | ieee_primary_8167777 |
PublicationCentury | 2000 |
PublicationDate | 2017-Oct. |
PublicationDateYYYYMMDD | 2017-10-01 |
PublicationDate_xml | – month: 10 year: 2017 text: 2017-Oct. |
PublicationDecade | 2010 |
PublicationTitle | 2017 IEEE International Symposium on Workload Characterization (IISWC) |
PublicationTitleAbbrev | IISWC |
PublicationYear | 2017 |
Publisher | IEEE |
Publisher_xml | – name: IEEE |
Score | 1.6490092 |
Snippet | GPUs continue to increase the number of compute resources with each new generation. Many data-parallel applications have been re-engineered to leverage the... |
SourceID | ieee |
SourceType | Publisher |
StartPage | 197 |
SubjectTerms | Concurrent Kernel Execution; Data transfer; Empirical Model; Engines; GPU; Graphics processing units; Instruction sets; Kernel; Performance evaluation; Tuning |
Title | Moka: Model-based concurrent kernel analysis |
URI | https://ieeexplore.ieee.org/document/8167777 |
hasFullText | 1 |
inHoldings | 1 |
linkProvider | IEEE |
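
Note on Concurrent Kernel Execution (CKE): the abstract above refers to running multiple kernels concurrently on an NVIDIA GPU. As a rough illustration only, and not code from the paper, the following CUDA sketch launches two independent kernels in separate streams, which is the standard way to make kernels eligible for concurrent execution. The kernel names `computeHeavy` and `memoryHeavy`, the problem size, and the launch configuration are hypothetical choices made for this example; whether the two kernels actually overlap depends on the device and on the resources each kernel occupies.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical compute-intensive kernel: many arithmetic operations per element.
__global__ void computeHeavy(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        for (int k = 0; k < 1000; ++k) {
            v = v * 1.0001f + 0.5f;
        }
        x[i] = v;
    }
}

// Hypothetical memory-intensive kernel: one load and one store per element.
__global__ void memoryHeavy(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i];
    }
}

int main() {
    const int n = 1 << 16;                 // assumed problem size
    const size_t bytes = n * sizeof(float);

    float *a = nullptr, *b = nullptr, *c = nullptr;
    cudaMalloc(&a, bytes);
    cudaMalloc(&b, bytes);
    cudaMalloc(&c, bytes);
    cudaMemset(a, 0, bytes);
    cudaMemset(b, 0, bytes);

    // Kernels issued to different non-default streams have no implicit
    // ordering between them, so the hardware may run them concurrently.
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    const int block = 128;
    const int grid = (n + block - 1) / block;

    // The order of these two launches is one of the factors the abstract
    // lists as affecting CKE throughput.
    computeHeavy<<<grid, block, 0, s0>>>(a, n);
    memoryHeavy<<<grid, block, 0, s1>>>(b, c, n);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);

    cudaError_t err = cudaGetLastError();
    printf("status: %s\n", cudaGetErrorString(err));

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return err == cudaSuccess ? 0 : 1;
}
```

Swapping the two launch lines or changing `block` and `grid` is a simple way to observe, with a profiler, how launch order and kernel configuration change the amount of overlap.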