TSM2X: High-performance tall-and-skinny matrix–matrix multiplication on GPUs

Bibliographic Details
Published in: Journal of Parallel and Distributed Computing, Vol. 151, pp. 70–85
Main Authors: Rivera, Cody; Chen, Jieyang; Xiong, Nan; Zhang, Jing; Song, Shuaiwen Leon; Tao, Dingwen
Format: Journal Article
Language: English
Published: Elsevier Inc., May 2021
Online Access: https://doi.org/10.1016/j.jpdc.2021.02.013

Abstract

Linear algebra operations are widely used in big data analytics and scientific computation. Much work has been done on optimizing linear algebra operations on GPUs for regular-shaped input, but few works focus on fully utilizing GPU resources when the input is not regular-shaped. Current optimizations do not fully exploit the available memory bandwidth and computing power, and therefore achieve only sub-optimal performance. In this paper, we propose two efficient algorithms, TSM2R and TSM2L, for two classes of tall-and-skinny matrix–matrix multiplication on GPUs. Both focus on optimizing multiplications in which at least one input matrix is tall-and-skinny: TSM2R is designed for a large regular-shaped matrix multiplying a tall-and-skinny matrix, while TSM2L is designed for a tall-and-skinny matrix multiplying a small regular-shaped matrix. We implement the proposed algorithms and test them on several modern NVIDIA GPU micro-architectures. Experiments show that, compared to the current state-of-the-art works, (1) TSM2R speeds up the computation by 1.6x on average and improves memory bandwidth utilization and computing power utilization by 18.1% and 20.5% on average, respectively, when the regular-shaped matrix is relatively large or medium-sized; and (2) TSM2L speeds up the computation by 1.9x on average and improves memory bandwidth utilization by up to 9.3% on average when the regular-shaped matrix is relatively small.

Highlights:
• Few works focus on optimizing GEMM on GPUs for irregular-shaped input.
• Current optimizations do not fully utilize the memory bandwidth and computing power.
• We propose two efficient algorithms for two classes of tall-and-skinny GEMM on GPUs.
• Our optimizations speed up GEMM by 1.1x to 3.5x for various tall-and-skinny inputs.
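
To make the two shape classes in the abstract concrete, the sketch below is a minimal, hypothetical example: it sets up the TSM2R shape (a large regular-shaped matrix times a tall-and-skinny matrix) and the TSM2L shape (a tall-and-skinny matrix times a small regular-shaped matrix) as plain cuBLAS GEMM calls, the kind of vendor baseline such custom kernels are typically compared against. It does not implement the authors' TSM2R/TSM2L algorithms, and the sizes n and k, the buffer names, and the build command are illustrative assumptions, not values from the paper.

// Hypothetical shape sketch (illustrative sizes, not from the paper): the two
// tall-and-skinny GEMM classes from the abstract as plain cuBLAS reference
// calls. Buffers are left uninitialized; this exercises shapes, not numerics.
// Assumed build command: nvcc tsm_shapes.cu -lcublas -o tsm_shapes
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int n = 4096;  // large "regular" dimension (illustrative)
    const int k = 16;    // skinny dimension, k << n (illustrative)
    const double alpha = 1.0, beta = 0.0;

    // Device buffers sized for the larger (TSM2R) case.
    double *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(double) * n * n);  // n x n regular-shaped matrix
    cudaMalloc(&dB, sizeof(double) * n * k);  // n x k tall-and-skinny matrix
    cudaMalloc(&dC, sizeof(double) * n * k);  // n x k result

    cublasHandle_t handle;
    cublasCreate(&handle);

    // TSM2R shape class: C (n x k) = A (n x n) * B (n x k), column-major.
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, k, n, &alpha, dA, n, dB, n, &beta, dC, n);

    // TSM2L shape class: C (n x k) = A (n x k) * B (k x k). Here dC from the
    // first call serves as the tall-and-skinny input, dB's first k*k entries
    // serve as the small regular-shaped matrix, and the result lands in the
    // first n*k entries of dA.
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, k, k, &alpha, dC, n, dB, k, &beta, dA, n);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}

An actual benchmark along these lines would fill the buffers, synchronize, and time the calls; the point here is only that k is far smaller than n, which is what leaves a plain GEMM under-utilizing the GPU and motivates the paper's custom kernels.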
Authors

1. Cody Rivera, The University of Alabama, Tuscaloosa, AL 35487, USA (ORCID: 0000-0001-7824-4054)
2. Jieyang Chen, Oak Ridge National Laboratory, Oak Ridge, TN 37830, USA
3. Nan Xiong, University of California, Riverside, Riverside, CA 92521, USA
4. Jing Zhang, University of Colorado Colorado Springs, CO 80918, USA
5. Shuaiwen Leon Song, The University of Sydney, NSW 2006, Australia
6. Dingwen Tao, The University of Alabama, Tuscaloosa, AL 35487, USA (ORCID: 0000-0001-5422-4497; email: dingwen.tao@wsu.edu)

Copyright: 2021 Elsevier Inc.
DOI: 10.1016/j.jpdc.2021.02.013
Discipline: Computer Science
EISSN: 1096-0848
GroupedDBID --K
--M
-~X
.~1
0R~
1B1
1~.
1~5
29L
4.4
457
4G.
5GY
5VS
7-5
71M
8P~
9JN
AACTN
AAEDT
AAEDW
AAIAV
AAIKJ
AAKOC
AALRI
AAOAW
AAQFI
AAQXK
AAXUO
AAYFN
ABBOA
ABEFU
ABFNM
ABFSI
ABJNI
ABMAC
ABTAH
ABXDB
ABYKQ
ACDAQ
ACGFS
ACNNM
ACRLP
ACZNC
ADBBV
ADEZE
ADFGL
ADHUB
ADJOM
ADMUD
ADTZH
AEBSH
AECPX
AEKER
AENEX
AFKWA
AFTJW
AGHFR
AGUBO
AGYEJ
AHHHB
AHJVU
AHZHX
AIALX
AIEXJ
AIKHN
AITUG
AJBFU
AJOXV
ALMA_UNASSIGNED_HOLDINGS
AMFUW
AMRAJ
AOUOD
ASPBG
AVWKF
AXJTR
AZFZN
BJAXD
BKOJK
BLXMC
CAG
COF
CS3
DM4
DU5
E.L
EBS
EFBJH
EFLBG
EJD
EO8
EO9
EP2
EP3
F5P
FDB
FEDTE
FGOYB
FIRID
FNPLU
FYGXN
G-2
G-Q
G8K
GBLVA
GBOLZ
HLZ
HVGLF
HZ~
H~9
IHE
J1W
JJJVA
K-O
KOM
LG5
LG9
LY7
M41
MO0
N9A
O-L
O9-
OAUVE
OZT
P-8
P-9
P2P
PC.
Q38
R2-
RIG
ROL
RPZ
SBC
SDF
SDG
SDP
SES
SET
SEW
SPC
SPCBC
SST
SSV
SSZ
T5K
TN5
TWZ
WUQ
XJT
XOL
XPP
ZMT
ZU3
ZY4
~G-
~G0
AATTM
AAXKI
AAYWO
AAYXX
ABDPE
ABWVN
ACRPL
ACVFH
ADCNI
ADNMO
ADVLN
AEIPS
AEUPX
AFJKZ
AFPUW
AFXIZ
AGCQF
AGQPQ
AGRNS
AIGII
AIIUN
AKBMS
AKRWK
AKYEP
ANKPU
APXCP
BNPGV
CITATION
SSH
ID FETCH-LOGICAL-c410t-c64a13f93a25a44695b275f7096fe9341c21aaf49c9c94cb1490fa6c08bbd3363
IEDL.DBID .~1
ISSN: 0743-7315
Peer Reviewed: Yes
Open Access: Yes
Keywords: CUDA; Matrix–matrix multiplication; Performance optimization; Tall-and-skinny matrix; GPU
Page Count: 16