TSM2X: High-performance tall-and-skinny matrix–matrix multiplication on GPUs

Bibliographic Details
Published in: Journal of Parallel and Distributed Computing, Vol. 151, pp. 70–85
Main Authors: Rivera, Cody; Chen, Jieyang; Xiong, Nan; Zhang, Jing; Song, Shuaiwen Leon; Tao, Dingwen
Format: Journal Article
Language: English
Published: Elsevier Inc., May 2021
Online Access: https://doi.org/10.1016/j.jpdc.2021.02.013

Abstract

Linear algebra operations are widely used in big data analytics and scientific computation. Much work has been done on optimizing linear algebra operations on GPUs for regular-shaped input, but few works focus on fully utilizing GPU resources when the input is not regular-shaped. Current optimizations do not fully exploit the available memory bandwidth and computing power, and therefore achieve only sub-optimal performance. In this paper, we propose two efficient algorithms, TSM2R and TSM2L, for two classes of tall-and-skinny matrix–matrix multiplication on GPUs. Both focus on optimizing multiplications in which at least one input matrix is tall-and-skinny: TSM2R is designed for a large regular-shaped matrix multiplying a tall-and-skinny matrix, while TSM2L is designed for a tall-and-skinny matrix multiplying a small regular-shaped matrix. We implement the proposed algorithms and test them on several modern NVIDIA GPU micro-architectures. Experiments show that, compared to the current state-of-the-art works, (1) TSM2R speeds up the computation by 1.6x on average and improves memory bandwidth utilization and computing power utilization by 18.1% and 20.5% on average, respectively, when the regular-shaped matrix is relatively large or medium-sized; and (2) TSM2L speeds up the computation by 1.9x on average and improves memory bandwidth utilization by up to 9.3% on average when the regular-shaped matrix is relatively small.

Highlights:
• Few works focus on optimizing GEMM on GPUs for irregular-shaped input.
• Current optimizations do not fully utilize the memory bandwidth and computing power.
• We propose two efficient algorithms for two classes of tall-and-skinny GEMM on GPUs.
• Our optimizations speed up GEMM by 1.1x to 3.5x for various tall-and-skinny inputs.
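
To make the two shape classes in the abstract concrete, the sketch below is a minimal, hypothetical example: it sets up the TSM2R shape (a large regular-shaped matrix times a tall-and-skinny matrix) and the TSM2L shape (a tall-and-skinny matrix times a small regular-shaped matrix) as plain cuBLAS GEMM calls, the kind of vendor baseline such custom kernels are typically compared against. It does not implement the authors' TSM2R/TSM2L algorithms, and the sizes n and k, the buffer names, and the build command are illustrative assumptions, not values from the paper.

// Hypothetical shape sketch (illustrative sizes, not from the paper): the two
// tall-and-skinny GEMM classes from the abstract as plain cuBLAS reference
// calls. Buffers are left uninitialized; this exercises shapes, not numerics.
// Assumed build command: nvcc tsm_shapes.cu -lcublas -o tsm_shapes
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int n = 4096;  // large "regular" dimension (illustrative)
    const int k = 16;    // skinny dimension, k << n (illustrative)
    const double alpha = 1.0, beta = 0.0;

    // Device buffers sized for the larger (TSM2R) case.
    double *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(double) * n * n);  // n x n regular-shaped matrix
    cudaMalloc(&dB, sizeof(double) * n * k);  // n x k tall-and-skinny matrix
    cudaMalloc(&dC, sizeof(double) * n * k);  // n x k result

    cublasHandle_t handle;
    cublasCreate(&handle);

    // TSM2R shape class: C (n x k) = A (n x n) * B (n x k), column-major.
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, k, n, &alpha, dA, n, dB, n, &beta, dC, n);

    // TSM2L shape class: C (n x k) = A (n x k) * B (k x k). Here dC from the
    // first call serves as the tall-and-skinny input, dB's first k*k entries
    // serve as the small regular-shaped matrix, and the result lands in the
    // first n*k entries of dA.
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, k, k, &alpha, dC, n, dB, k, &beta, dA, n);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}

An actual benchmark along these lines would fill the buffers, synchronize, and time the calls; the point here is only that k is far smaller than n, which is what leaves a plain GEMM under-utilizing the GPU and motivates the paper's custom kernels.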
Authors

1. Cody Rivera, The University of Alabama, Tuscaloosa, AL 35487, USA (ORCID: 0000-0001-7824-4054)
2. Jieyang Chen, Oak Ridge National Laboratory, Oak Ridge, TN 37830, USA
3. Nan Xiong, University of California, Riverside, Riverside, CA 92521, USA
4. Jing Zhang, University of Colorado Colorado Springs, CO 80918, USA
5. Shuaiwen Leon Song, The University of Sydney, NSW 2006, Australia
6. Dingwen Tao, The University of Alabama, Tuscaloosa, AL 35487, USA (ORCID: 0000-0001-5422-4497; email: dingwen.tao@wsu.edu)

Copyright: 2021 Elsevier Inc.
DOI: 10.1016/j.jpdc.2021.02.013
Discipline: Computer Science
EISSN: 1096-0848
GroupedDBID --K
--M
-~X
.~1
0R~
1B1
1~.
1~5
29L
4.4
457
4G.
5GY
5VS
7-5
71M
8P~
9JN
AACTN
AAEDT
AAEDW
AAIAV
AAIKJ
AAKOC
AALRI
AAOAW
AAQFI
AAQXK
AAXUO
AAYFN
ABBOA
ABEFU
ABFNM
ABFSI
ABJNI
ABMAC
ABTAH
ABXDB
ABYKQ
ACDAQ
ACGFS
ACNNM
ACRLP
ACZNC
ADBBV
ADEZE
ADFGL
ADHUB
ADJOM
ADMUD
ADTZH
AEBSH
AECPX
AEKER
AENEX
AFKWA
AFTJW
AGHFR
AGUBO
AGYEJ
AHHHB
AHJVU
AHZHX
AIALX
AIEXJ
AIKHN
AITUG
AJBFU
AJOXV
ALMA_UNASSIGNED_HOLDINGS
AMFUW
AMRAJ
AOUOD
ASPBG
AVWKF
AXJTR
AZFZN
BJAXD
BKOJK
BLXMC
CAG
COF
CS3
DM4
DU5
E.L
EBS
EFBJH
EFLBG
EJD
EO8
EO9
EP2
EP3
F5P
FDB
FEDTE
FGOYB
FIRID
FNPLU
FYGXN
G-2
G-Q
G8K
GBLVA
GBOLZ
HLZ
HVGLF
HZ~
H~9
IHE
J1W
JJJVA
K-O
KOM
LG5
LG9
LY7
M41
MO0
N9A
O-L
O9-
OAUVE
OZT
P-8
P-9
P2P
PC.
Q38
R2-
RIG
ROL
RPZ
SBC
SDF
SDG
SDP
SES
SET
SEW
SPC
SPCBC
SST
SSV
SSZ
T5K
TN5
TWZ
WUQ
XJT
XOL
XPP
ZMT
ZU3
ZY4
~G-
~G0
AATTM
AAXKI
AAYWO
AAYXX
ABDPE
ABWVN
ACRPL
ACVFH
ADCNI
ADNMO
ADVLN
AEIPS
AEUPX
AFJKZ
AFPUW
AFXIZ
AGCQF
AGQPQ
AGRNS
AIGII
AIIUN
AKBMS
AKRWK
AKYEP
ANKPU
APXCP
BNPGV
CITATION
SSH
ID FETCH-LOGICAL-c410t-c64a13f93a25a44695b275f7096fe9341c21aaf49c9c94cb1490fa6c08bbd3363
IEDL.DBID .~1
ISSN: 0743-7315
Peer Reviewed: Yes
Open Access: Yes
Keywords: CUDA; Matrix–matrix multiplication; Performance optimization; Tall-and-skinny matrix; GPU
Page Count: 16