Operational Data Analytics in practice: Experiences from design to deployment in production HPC environments
Netti, Alessio, Ott, Michael, Guillen, Carla, Tafani, Daniele, Schulz, Martin
Published in Parallel computing (01.10.2022)
Journal Article
HPC Hardware Design Reliability Benchmarking With HDFIT
Omland, Patrik, Netti, Alessio, Peng, Yang, Baldovin, Andrea, Paulitsch, Michael, Espinosa, Gustavo, Parra, Jorge, Hinz, Gereon, Knoll, Alois
Published in IEEE transactions on parallel and distributed systems (01.03.2023)
Journal Article
A machine learning approach to online fault classification in HPC systems
Netti, Alessio, Kiziltan, Zeynep, Babaoglu, Ozalp, Sîrbu, Alina, Bartolini, Andrea, Borghesi, Andrea
Published in Future generation computer systems (01.09.2020)
Journal Article
A Conceptual Framework for HPC Operational Data Analytics
Netti, Alessio, Shin, Woong, Ott, Michael, Wilde, Torsten, Bates, Natalie
Published in 2021 IEEE International Conference on Cluster Computing (CLUSTER) (01.09.2021)
Conference Proceeding
AccaSim: a customizable workload management simulator for job dispatching research in HPC systems
Galleguillos, Cristian, Kiziltan, Zeynep, Netti, Alessio, Soto, Ricardo
Published in Cluster computing (01.03.2020)
Journal Article
Mixed precision support in HPC applications: What about reliability?
Netti, Alessio, Peng, Yang, Omland, Patrik, Paulitsch, Michael, Parra, Jorge, Espinosa, Gustavo, Agarwal, Udit, Chan, Abraham, Pattabiraman, Karthik
Published in Journal of parallel and distributed computing (01.11.2023)
Journal Article
Operational Data Analytics in Practice: Experiences from Design to Deployment in Production HPC Environments
Netti, Alessio, Ott, Michael, Guillen, Carla, Tafani, Daniele, Schulz, Martin
Publication date 28.06.2021
Journal Article
A Machine Learning Approach to Online Fault Classification in HPC Systems
Netti, Alessio, Kiziltan, Zeynep, Babaoglu, Ozalp, Sirbu, Alina, Bartolini, Andrea, Borghesi, Andrea
Published in arXiv.org (27.07.2020)
Paper
Journal Article
DCDB Wintermute: Enabling Online and Holistic Operational Data Analytics on HPC Systems
Netti, Alessio, Mueller, Micha, Guillen, Carla, Ott, Michael, Tafani, Daniele, Ozer, Gence, Schulz, Martin
Published in arXiv.org (18.04.2020)
Paper
Journal Article
AccaSim: a Customizable Workload Management Simulator for Job Dispatching Research in HPC Systems
Galleguillos, Cristian, Kiziltan, Zeynep, Netti, Alessio, Soto, Ricardo
Publication date 18.06.2018
Journal Article
From Facility to Application Sensor Data: Modular, Continuous and Holistic Monitoring with DCDB
Netti, Alessio, Mueller, Micha, Auweter, Axel, Guillen, Carla, Ott, Michael, Tafani, Daniele, Schulz, Martin
Published in arXiv.org (14.08.2019)
Paper
Journal Article