Automatic Path Migration over InfiniBand: Early Experiences

Bibliographic Details
Published in: 2007 IEEE International Parallel and Distributed Processing Symposium, pp. 1-8
Main Authors: Abhinav Vishnu, A.R. Mamidala, Sundeep Narravula, D.K. Panda
Format: Conference Proceeding
Language: English
Published: IEEE, 2007
Summary: The high computational power of commodity PCs, combined with the emergence of low-latency, high-bandwidth interconnects, has accelerated the trend toward cluster computing. Clusters with InfiniBand are being deployed, as reflected in the TOP500 supercomputer rankings. However, the increasing scale of these clusters has reduced the mean time between failures (MTBF) of components. The network is one such component: failure of network interface cards (NICs), cables, and/or switches breaks existing communication paths. InfiniBand provides a hardware mechanism, Automatic Path Migration (APM), which allows user-transparent detection of and recovery from network faults without application restart. In this paper, we design a set of modules that work together to provide network fault tolerance for user-level applications by leveraging the APM feature. Our performance evaluation at the MPI layer shows that APM incurs negligible overhead in the absence of faults in the system. In the presence of network faults, APM incurs negligible overhead for reasonably long-running applications.
ISBN: 1424409098, 9781424409099
ISSN: 1530-2075
DOI: 10.1109/IPDPS.2007.370626