Improving Numerical Reproducibility of Scientific Software in Parallel Systems

Bibliographic Details
Published in 2020 IEEE International Conference on Electro Information Technology (EIT), pp. 066-074
Main Authors Jalal Apostal, Sara Faraji; Apostal, David; Marsh, Ronald
Format Conference Proceeding
Language English
Published IEEE 01.07.2020
Summary: Numerical reproducibility has recently received increased emphasis from the scientific community. Software results that are not reproducible make it difficult to examine the science the software supports. A common source of numerical reproducibility errors in computational science is floating-point arithmetic. Finite precision and limited storage for floating-point numbers require computers to truncate or round the results of some math operations, so an approximate value is stored instead of the exact result. As a consequence, floating-point addition is not always associative. One programming idiom that is therefore not always reproducible is the global sum reduction of a distributed array. Changing the number of compute units changes the order in which array elements are added together, which in turn changes the truncation and rounding. This may change the result of individual add operations and of the resulting global sum. This research improves the numerical reproducibility of scientific applications on parallel systems; its innovative contribution is automating that improvement. Two reproducible global sum reduction functions have been implemented and packaged in a software library. The improvement is automated by a source code scanner that recognizes certain MPI-based global sum reductions that may have reproducibility errors and replaces them with calls to the reproducible library functions. Reproducibility and performance testing have demonstrated the effectiveness of the system. This will extend the usefulness of legacy software and can lead to faster rates of discovery and more efficient use of scientists' time.
ISSN:2154-0373
DOI:10.1109/EIT48999.2020.9208338
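
Illustration: the order dependence described in the summary can be reproduced without MPI. The sketch below is not the paper's library or its algorithm. It simulates a distributed sum by splitting an array into contiguous chunks, one per hypothetical "rank", and adding the partial sums in rank order. For contrast it includes one possible order-independent alternative, assumed here purely for illustration: accumulate exactly with rational arithmetic and round once at the end. The function names `naive_sum`, `partitioned_sum`, and `exact_partitioned_sum` are made up for this example.

```python
import math
from fractions import Fraction


def naive_sum(values):
    """Left-to-right floating-point sum; every addition may round."""
    total = 0.0
    for v in values:
        total += v
    return total


def split(values, num_ranks):
    """Split values into contiguous chunks, one per simulated rank."""
    chunk = math.ceil(len(values) / num_ranks)
    return [values[i:i + chunk] for i in range(0, len(values), chunk)]


def partitioned_sum(values, num_ranks):
    """Simulated global sum: local partial sums, then a sum of the partials."""
    partials = [naive_sum(chunk) for chunk in split(values, num_ranks)]
    return naive_sum(partials)


def exact_partitioned_sum(values, num_ranks):
    """Same partitioning, but partial sums are exact rationals.

    Assumption for illustration only: exact accumulation followed by a
    single final rounding. This is not the technique used in the paper.
    """
    partials = [sum((Fraction(v) for v in chunk), Fraction(0))
                for chunk in split(values, num_ranks)]
    return float(sum(partials, Fraction(0)))


if __name__ == "__main__":
    # The exact sum is 2.0, but 1.0 is lost whenever it is added to +/-1e16.
    data = [1.0, 1e16, -1e16, 1.0]
    for ranks in (1, 2, 4):
        print(ranks, partitioned_sum(data, ranks),
              exact_partitioned_sum(data, ranks))
```

With this input, the simulated sum gives 1.0 on one rank and 0.0 on two ranks, while the exact variant returns 2.0 for every rank count. Exact rational arithmetic is far too slow for production reductions; it is used here only because it makes the order independence easy to verify.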