Reproducible High Performance Computing without Redundancy with Nix


Bibliographic Details
Published in: 2022 Seventh International Conference on Parallel, Distributed and Grid Computing (PDGC), pp. 238-242
Main Authors: Goswami, Rohit; S., Ruhila; Goswami, Amrita; Goswami, Sonaly; Goswami, Debabrata
Format: Conference Proceeding
Language: English
Published: IEEE, 25.11.2022
Summary: High performance computing (HPC) clusters are typically managed in a restrictive manner; the large user base makes cluster administrators unwilling to allow privilege escalation. Here we discuss existing methods of package management, including those which have been developed with scalability in mind, and enumerate the drawbacks and advantages of each management methodology. We contrast the paradigms of containerization via Docker, virtualization via KVM, pod infrastructures via Kubernetes, and specialized HPC packaging systems via Spack, and identify key areas of neglect. We demonstrate how functional programming, through its reliance on immutable state, has been leveraged for deterministic package management via Nix language expressions. We show that the associated Nix ecosystem is a prime candidate for HPC package management. We further develop guidelines, identify bottlenecks in the existing structure, and present the methodology by which the Nix ecosystem should be developed further as an optimal tool for HPC package management. We assert that the caveats of the Nix ecosystem can easily be mitigated by considerations relevant only to HPC systems, without compromising on the functional methodology and features of the Nix language. We show that the benefits of adoption, in terms of generating reproducible derivations in a secure manner, allow workflows to be scaled across heterogeneous clusters. In particular, from the implementation hurdles faced during the compilation and running of the d-SEAMS scientific software engine, distributed as a Nix derivation on an HPC cluster, we identify communication protocols for working with SLURM and TORQUE user resource allocation queues. These protocols are heuristically defined and described in terms of the reference implementation required for queue-efficient Nix builds.
ISBN: 9781665454001; 1665454008
ISSN: 2573-3079
DOI: 10.1109/PDGC56933.2022.10053342