Spatiotemporal Modeling of Node Temperatures in Supercomputers
Los Alamos National Laboratory is home to many large supercomputing clusters. These clusters require an enormous amount of power (∼500-2000 kW each), and most of this energy is converted into heat. Thus, cooling the components of the supercomputer becomes a critical and expensive endeavor. Recently,...
Published in | Journal of the American Statistical Association Vol. 112; no. 517; pp. 92 - 108 |
---|---|
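Main Authors | Storlie, Curtis B.; Reich, Brian J.; Rust, William N.; Ticknor, Lawrence O.; Bonnie, Amanda M.; Montoya, Andrew J.; Michalak, Sarah E. |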
Format | Journal Article |
Language | English |
Published | Alexandria: Taylor & Francis (Taylor & Francis Group, LLC; Taylor & Francis Ltd), 2017-03-01 |
Subjects | |
Abstract | Los Alamos National Laboratory is home to many large supercomputing clusters. These clusters require an enormous amount of power (∼500-2000 kW each), and most of this energy is converted into heat. Thus, cooling the components of the supercomputer becomes a critical and expensive endeavor. Recently, a project was initiated to investigate the effect that changes to the cooling system in a machine room had on three large machines that were housed there. Coupled with this goal was the aim to develop a general good practice for characterizing the effect of cooling changes and monitoring machine node temperatures in this and other machine rooms. This article focuses on the statistical approach used to quantify the effect that several cooling changes to the room had on the temperatures of the individual nodes of the computers. The largest cluster in the room has 1600 nodes that run a variety of jobs during general use. Since extreme temperatures are important, a Normal distribution plus a generalized Pareto distribution for the upper tail is used to model the marginal distribution, along with a Gaussian process copula to account for spatiotemporal dependence. A Gaussian Markov random field (GMRF) model is used to model the spatial effects on the node temperatures as the cooling changes take place. This model is then used to assess the condition of the node temperatures after each change to the room. The analysis approach was used to uncover the cause of a problematic episode of overheating nodes on one of the supercomputing clusters. This same approach can easily be applied to monitor and investigate cooling systems at other data centers as well. Supplementary materials for this article are available online. |
---|---|
Author | Bonnie, Amanda M.; Montoya, Andrew J.; Michalak, Sarah E.; Ticknor, Lawrence O.; Storlie, Curtis B.; Reich, Brian J.; Rust, William N. |
Author_xml | 1. Storlie, Curtis B. (Statistical Sciences Group, Los Alamos National Laboratory; email: storlie.curt@mayo.edu); 2. Reich, Brian J. (Department of Statistics, North Carolina State University); 3. Rust, William N. (Statistical Sciences Group, Los Alamos National Laboratory); 4. Ticknor, Lawrence O. (Statistical Sciences Group, Los Alamos National Laboratory); 5. Bonnie, Amanda M. (High Performance Computing, Los Alamos National Laboratory); 6. Montoya, Andrew J. (High Performance Computing, Los Alamos National Laboratory); 7. Michalak, Sarah E. (Statistical Sciences Group, Los Alamos National Laboratory) |
CitedBy_id | crossref_primary_10_1111_biom_13066 crossref_primary_10_1080_02664763_2019_1686131 |
ContentType | Journal Article |
Copyright | Copyright © 2017 American Statistical Association |
DBID | AAYXX CITATION 8BJ FQK JBE K9. |
DOI | 10.1080/01621459.2016.1195271 |
DatabaseName | CrossRef; International Bibliography of the Social Sciences (IBSS); ProQuest Health & Medical Complete (Alumni) |
DatabaseTitle | CrossRef International Bibliography of the Social Sciences (IBSS) ProQuest Health & Medical Complete (Alumni) |
DatabaseTitleList | International Bibliography of the Social Sciences (IBSS) |
DeliveryMethod | fulltext_linktorsrc |
Discipline | Statistics |
EISSN | 1537-274X |
EndPage | 108 |
ExternalDocumentID | 10_1080_01621459_2016_1195271 45027902 1195271 |
Genre | Applications and Case Studies |
ID | FETCH-LOGICAL-c440t-f66af9901496326859b579f6c8d1e5310f2fef446068d9fdcf2ffd7f42cf13c3 |
ISSN | 0162-1459 |
IngestDate | Mon Nov 04 11:46:12 EST 2024 Fri Aug 23 02:37:26 EDT 2024 Fri Feb 02 07:18:50 EST 2024 Tue Jul 04 18:18:33 EDT 2023 |
IsDoiOpenAccess | false |
IsOpenAccess | true |
IsPeerReviewed | true |
IsScholarly | true |
Issue | 517 |
Language | English |
LinkModel | OpenURL |
MergedId | FETCHMERGED-LOGICAL-c440t-f66af9901496326859b579f6c8d1e5310f2fef446068d9fdcf2ffd7f42cf13c3 |
OpenAccessLink | http://arxiv.org/pdf/1505.06275 |
PQID | 2337774496 |
PQPubID | 41715 |
PageCount | 17 |
ParticipantIDs | informaworld_taylorfrancis_310_1080_01621459_2016_1195271 jstor_primary_45027902 crossref_primary_10_1080_01621459_2016_1195271 proquest_journals_2337774496 |
PublicationCentury | 2000 |
PublicationDate | 20170301 |
PublicationDateYYYYMMDD | 2017-03-01 |
PublicationDate_xml | – month: 03 year: 2017 text: 20170301 day: 01 |
PublicationDecade | 2010 |
PublicationPlace | Alexandria |
PublicationPlace_xml | – name: Alexandria |
PublicationTitle | Journal of the American Statistical Association |
PublicationYear | 2017 |
Publisher | Taylor & Francis; Taylor & Francis Group, LLC; Taylor & Francis Ltd |
SSID | ssj0000788 |
Score | 2.27327 |
SourceID | proquest crossref jstor informaworld |
SourceType | Aggregation Database Publisher |
StartPage | 92 |
SubjectTerms | Applications and Case Studies; Changes; Clusters; Computer simulation; Computers; Cooling; Cooling effects; Cooling systems; Copula; Data centers; Extreme value; Extremes; Fields (mathematics); Gaussian process; Generalized Pareto distribution; Hierarchical Bayesian modeling; High performance computing; Laboratories; Nodes; Normal distribution; Overheating; Power; Regression analysis; Spatio-temporal; Statistical analysis; Statistical methods; Statistics; Supercomputers |
Title | Spatiotemporal Modeling of Node Temperatures in Supercomputers |
URI | https://www.tandfonline.com/doi/abs/10.1080/01621459.2016.1195271 https://www.jstor.org/stable/45027902 https://www.proquest.com/docview/2337774496 |
Volume | 112 |
hasFullText | 1 |
inHoldings | 1 |
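The abstract describes the marginal model as a Normal distribution with a generalized Pareto distribution (GPD) for the upper tail, linked across nodes and time by a Gaussian process copula. The following is a minimal Python sketch of one common way to splice such a marginal at a threshold and map a temperature onto the latent Gaussian scale a copula works on. The splicing rule, the function names (`gpd_cdf`, `spliced_cdf`, `to_latent_gaussian`), and every parameter value are illustrative assumptions; they are not the article's parameterization or fitted values.

```python
import math
from statistics import NormalDist


def gpd_cdf(z, scale, shape):
    """CDF of a generalized Pareto distribution for an exceedance z >= 0."""
    if z <= 0.0:
        return 0.0
    if abs(shape) < 1e-12:
        # The shape -> 0 limit of the GPD is the exponential distribution.
        return 1.0 - math.exp(-z / scale)
    base = 1.0 + shape * z / scale
    if base <= 0.0:
        # For shape < 0 the GPD has a finite upper endpoint.
        return 1.0
    return 1.0 - base ** (-1.0 / shape)


def spliced_cdf(y, mu, sigma, threshold, scale, shape):
    """Normal CDF up to the threshold, rescaled GPD tail above it (assumed splice)."""
    body = NormalDist(mu, sigma)
    p_u = body.cdf(threshold)  # probability mass at or below the threshold
    if y <= threshold:
        return body.cdf(y)
    return p_u + (1.0 - p_u) * gpd_cdf(y - threshold, scale, shape)


def to_latent_gaussian(y, mu, sigma, threshold, scale, shape):
    """Probability integral transform followed by the standard Normal quantile."""
    u = spliced_cdf(y, mu, sigma, threshold, scale, shape)
    return NormalDist().inv_cdf(u)


# Hypothetical parameter values, chosen only to exercise the functions.
example_params = dict(mu=60.0, sigma=5.0, threshold=70.0, scale=2.0, shape=0.1)
latent_scores = [to_latent_gaussian(t, **example_params) for t in (55.0, 65.0, 72.0, 78.0)]
```

In a copula model of this kind, dependence between nodes and time points would be specified on the latent scores rather than on the raw temperatures, so the heavy upper tail is handled entirely by the marginal transform. Note that `inv_cdf` is undefined at probabilities of exactly 0 or 1, so values far in either tail would need clipping in practice.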