TSM2X: High-performance tall-and-skinny matrix–matrix multiplication on GPUs
Published in: Journal of Parallel and Distributed Computing, Vol. 151, pp. 70–85
Publisher: Elsevier Inc.
Publication Date: 1 May 2021
Format: Journal Article
Language: English
DOI: 10.1016/j.jpdc.2021.02.013
ISSN: 0743-7315 (print); 1096-0848 (electronic)
Copyright: 2021 Elsevier Inc.
Subjects: CUDA; GPU; Matrix–matrix multiplication; Performance optimization; Tall-and-skinny matrix

Main Authors:
• Cody Rivera (ORCID: 0000-0001-7824-4054), The University of Alabama, Tuscaloosa, AL 35487, USA
• Jieyang Chen, Oak Ridge National Laboratory, Oak Ridge, TN 37830, USA
• Nan Xiong, University of California, Riverside, Riverside, CA 92521, USA
• Jing Zhang, University of Colorado Colorado Springs, CO 80918, USA
• Shuaiwen Leon Song, The University of Sydney, NSW 2006, Australia
• Dingwen Tao (ORCID: 0000-0001-5422-4497; dingwen.tao@wsu.edu), The University of Alabama, Tuscaloosa, AL 35487, USA
Abstract
Linear algebra operations are widely used in big data analytics and scientific computing. Much work has gone into optimizing linear algebra operations on GPUs for regular-shaped inputs, but few efforts focus on fully utilizing GPU resources when the input is not regular-shaped. Current optimizations do not fully exploit the available memory bandwidth and computing power, and therefore achieve only sub-optimal performance. In this paper, we propose two efficient algorithms, TSM2R and TSM2L, for two classes of tall-and-skinny matrix–matrix multiplication on GPUs, in which at least one of the input matrices is tall-and-skinny. Specifically, TSM2R is designed for a large regular-shaped matrix multiplying a tall-and-skinny matrix, while TSM2L is designed for a tall-and-skinny matrix multiplying a small regular-shaped matrix. We implement the proposed algorithms and evaluate them on several modern NVIDIA GPU microarchitectures. Experiments show that, compared with the current state of the art, (1) TSM2R speeds up the computation by 1.6x on average and improves memory bandwidth utilization and computing power utilization by 18.1% and 20.5% on average, respectively, when the regular-shaped matrix is relatively large or medium-sized; and (2) TSM2L speeds up the computation by 1.9x on average and improves memory bandwidth utilization by up to 9.3% on average when the regular-shaped matrix is relatively small.
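To make the two input classes concrete, the sketch below sets up both shapes and runs them through cuBLAS, the vendor GEMM library such work is typically compared against. This is a minimal baseline for illustration only, not the TSM2R/TSM2L kernels; the sizes n and k are placeholders chosen so that n >> k.

```cuda
// Shapes of the two tall-and-skinny GEMM classes, run through cuBLAS as a
// baseline. Illustrative sketch only; n and k are placeholder sizes (n >> k).
// Build with: nvcc tsm_shapes.cu -lcublas
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int n = 8192, k = 8;            // n >> k: the tall-and-skinny regime
    const double alpha = 1.0, beta = 0.0;

    // Column-major buffers, as cuBLAS expects; zero-filled so the calls are
    // well defined (real input data is omitted here).
    double *Areg, *Bts, *Ats, *Bsmall, *C1, *C2;
    cudaMalloc(&Areg,   sizeof(double) * n * n);   // large regular-shaped
    cudaMalloc(&Bts,    sizeof(double) * n * k);   // tall-and-skinny
    cudaMalloc(&Ats,    sizeof(double) * n * k);   // tall-and-skinny
    cudaMalloc(&Bsmall, sizeof(double) * k * k);   // small regular-shaped
    cudaMalloc(&C1,     sizeof(double) * n * k);
    cudaMalloc(&C2,     sizeof(double) * n * k);
    cudaMemset(Areg,   0, sizeof(double) * n * n);
    cudaMemset(Bts,    0, sizeof(double) * n * k);
    cudaMemset(Ats,    0, sizeof(double) * n * k);
    cudaMemset(Bsmall, 0, sizeof(double) * k * k);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // TSM2R class: C1 (n x k) = Areg (n x n) * Bts (n x k).
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, k, n, &alpha, Areg, n, Bts, n, &beta, C1, n);

    // TSM2L class: C2 (n x k) = Ats (n x k) * Bsmall (k x k).
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, k, k, &alpha, Ats, n, Bsmall, k, &beta, C2, n);

    cudaDeviceSynchronize();
    cublasDestroy(handle);
    cudaFree(Areg); cudaFree(Bts); cudaFree(Ats);
    cudaFree(Bsmall); cudaFree(C1); cudaFree(C2);
    return 0;
}
```

Because k is tiny relative to n, both products perform few flops per byte of input (roughly 2k flops per matrix element), so they are bandwidth-bound rather than compute-bound; that is the regime in which a general-purpose GEMM tuned for large square operands leaves bandwidth on the table.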
Highlights
• Few works focus on optimizing GEMM on GPUs for irregular-shaped inputs.
• Current optimizations do not fully utilize the memory bandwidth and computing power.
• We propose two efficient algorithms for two classes of tall-and-skinny GEMM on GPUs.
• Our optimizations speed up GEMM by 1.1x∼3.5x for various tall-and-skinny inputs.
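For intuition about the memory-bandwidth point in the highlights, below is a deliberately simple kernel for the TSM2R shape. It is not the paper's algorithm (whose tiling and prefetching are more sophisticated): it only illustrates the register-accumulation idea, where each thread owns one row of C and holds its k partial sums in registers, so each element of the large matrix A is requested from global memory exactly once. The fixed accumulator size, launch configuration, and sizes are assumptions for this sketch.

```cuda
// Illustrative row-per-thread kernel for the TSM2R shape,
// C (n x k) = A (n x n) * B (n x k) with k small; column-major layout.
// Not the paper's algorithm: just a sketch of the register-accumulation idea.
#include <cuda_runtime.h>

__global__ void tsm2r_naive(const double* A, const double* B, double* C,
                            int n, int k) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;

    double acc[8];                               // sketch assumes k <= 8
    for (int j = 0; j < k; ++j) acc[j] = 0.0;

    for (int i = 0; i < n; ++i) {
        double a = A[row + (size_t)i * n];       // A(row, i): read once per thread
        for (int j = 0; j < k; ++j)
            acc[j] += a * B[i + (size_t)j * n];  // B(i, j): reused across rows
    }
    for (int j = 0; j < k; ++j)
        C[row + (size_t)j * n] = acc[j];         // C(row, j)
}

int main() {
    const int n = 8192, k = 8;                   // placeholder sizes, n >> k
    double *A, *B, *C;
    cudaMalloc(&A, sizeof(double) * n * n);
    cudaMalloc(&B, sizeof(double) * n * k);
    cudaMalloc(&C, sizeof(double) * n * k);
    cudaMemset(A, 0, sizeof(double) * n * n);    // real input data omitted
    cudaMemset(B, 0, sizeof(double) * n * k);

    tsm2r_naive<<<(n + 255) / 256, 256>>>(A, B, C, n, k);
    cudaDeviceSynchronize();

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

Even this naive version reads A only once, so its runtime is governed by how efficiently the streaming reads of A use memory bandwidth; bandwidth utilization is precisely the metric on which the abstract reports the measured improvements.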