Landing Stencil Code on Godson-T

The advent of multi-core/many-core chip technology offers both an extraordinary opportunity and a profound challenge. In particular, computer architects and system software designers are faced with a unique opportunity to introduce new architectural features as well as adequate compiler technology...

Bibliographic Details
Published in	Journal of computer science and technology Vol. 25; no. 4; pp. 886 - 894
Main Author	崔慧敏; 王蕾; 范东睿; 冯晓兵
Format	Journal Article
Language	English
Published	Boston: Springer US (Springer Nature B.V.), 01.07.2010
Subjects	Architects; Architecture; Artificial Intelligence; Chips; Compilers; Computation; Computer architecture; Computer programs; Computer Science; Computers; Data Structures and Information Theory; Design; Designers; Information Systems Applications (incl. Internet); Optimization; Optimization techniques; R&D; Research & development; Short Paper; Software; Software Engineering; Synchronism; Synchronization; Theory of Computation; Tiling; optimization techniques; synchronization mechanism; chip technology; computer systems; software architecture; many-core; SPM; fine-grain synchronization; compiler; stencil; Jacobi
ISSN	1000-9000 1860-4749
DOI	10.1007/s11390-010-9373-6

Abstract	The advent of multi-core/many-core chip technology offers both an extraordinary opportunity and a profound challenge. In particular, computer architects and system software designers are faced with a unique opportunity to introduce new architectural features together with adequate compiler technology; combined, these may have a profound impact. This paper presents a case study (using the 1-D Jacobi computation) of compiler-amenable performance optimization techniques on the many-core architecture Godson-T. The Godson-T architecture has several unique features that were chosen for this study: 1) chip-level globally addressable memory, in particular the scratchpad memories (SPM) local to the processing cores; 2) fine-grain memory-based synchronization (e.g., full-empty bits). Leveraging state-of-the-art performance optimization methods for 1-D stencil parallelization (e.g., timed tiling and its variants), we developed and implemented a number of many-core-based optimizations for Godson-T. Our experimental study shows good performance in both execution-time speedup and scalability, validates the value of globally accessible SPM and the fine-grain synchronization mechanism (full-empty bits) on Godson-T, and provides some useful guidelines for future compiler technology for many-core chip architectures.
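For readers unfamiliar with the kernel named in the abstract, the following is a minimal sketch of a sequential 1-D Jacobi computation. It is an illustration only, not the paper's implementation: the function name `jacobi_1d`, the equal-weight three-point average, and the fixed boundary cells are all assumptions, since the abstract does not give the stencil coefficients or boundary handling. The Godson-T-specific parts (scratchpad placement, tiling, full-empty-bit synchronization) are not shown.

```python
def jacobi_1d(a, steps):
    """Run `steps` Jacobi sweeps of a 3-point averaging stencil over `a`.

    Assumptions (not taken from the paper): equal weights of 1/3 and
    boundary cells that keep their initial values.
    """
    cur = list(a)
    nxt = list(a)  # boundary cells are copied once and never rewritten
    for _ in range(steps):
        # Every interior point is computed only from the previous sweep
        # (cur), which is what distinguishes Jacobi from in-place updates.
        for i in range(1, len(cur) - 1):
            nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0
        cur, nxt = nxt, cur  # swap buffers for the next time step
    return cur


if __name__ == "__main__":
    print(jacobi_1d([0.0, 3.0, 0.0], 1))
```

Because each sweep reads only the previous time step, the interior updates within one sweep are independent of each other, while successive sweeps depend on neighbouring points. That dependence pattern is what tiling and fine-grain synchronization schemes for stencils exploit.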
AuthorAffiliation	Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; Graduate University of Chinese Academy of Sciences, Beijing 100039, China
Copyright	Springer 2010
Discipline	Computer Science Architecture
IsPeerReviewed	true
IsScholarly	true
PageCount	9
StartPage	886
SubjectTerms	Architects; Architecture; Artificial Intelligence; Chips; Compilers; Computation; Computer architecture; Computer programs; Computer Science; Computers; Data Structures and Information Theory; Design; Designers; Information Systems Applications (incl. Internet); Optimization; Optimization techniques; R&D; Research & development; Short Paper; Software; Software Engineering; Synchronism; Synchronization; Theory of Computation; Tiling; optimization techniques; synchronization mechanism; chip technology; computer systems; software architecture
Title	Landing Stencil Code on Godson-T
URI	https://link.springer.com/article/10.1007/s11390-010-9373-6 https://www.proquest.com/docview/872094831 https://www.proquest.com/docview/907973778
Volume	25