RediI: Test Infrastructure to Enable Deterministic Reproduction of Failures for Distributed Systems

Bibliographic Details
Published in	Proceedings / International Conference on Software Engineering pp. 191 - 203
Main Authors	Feng, Yang; Lin, Zheyuan; Zhao, Dongchen; Zhou, Mengbo; Liu, Jia; Jones, James A.
Format	Conference Proceeding
Language	English
Published	IEEE 26.04.2025
Subjects	Computer bugs; Distributed Systems; Infrastructure; Performance analysis; Regression Testing; Runtime; Software systems; Testing
Summary:	Although distributed systems have become a crucial part of modern technology and support many of the software systems that enable modern life, developers face challenges in performing regression testing of these systems. Existing solutions for testing distributed systems are often either: (1) specialized testing environments created specifically for each system by its development team, which require substantial effort from each team, with little-to-no sharing of this effort across teams; or (2) randomized injection tools that are often computationally expensive and, because of their randomness, offer no guarantee of preventing regressions. The challenge of providing a generalized and practical solution to trigger bugs for reproducing and demonstrating failures, as well as to guard against regressions, is largely unaddressed. In this work, we present RediI, an infrastructure for supporting regression testing of distributed systems. RediI contains a dataset of real bugs in common distributed systems, along with a generalizable testing framework, RediT, that enables developers to write tests that reproduce failures by providing ways to deterministically control distributed execution. In addition to the real failures in RediI from multiple distributed systems, RediT provides a reusable, programmable, platform-agnostic, deterministic testing framework for developers of distributed systems. It can help automate the running of such tests for both practitioners and researchers. We demonstrate RediT with 63 bugs that we selected from Jira across 7 large and widely used distributed systems. Our case studies show that RediI allows developers to write tests that effectively reproduce failures in distributed systems and generate specific scenarios for regression testing, and that it provides deterministic failure injection that can help developers and researchers better understand deterministic failures that may occur in distributed systems in the future. Additionally, our studies show that RediI is efficient for real-world system regression testing, providing a powerful tool for developers and researchers in the field of distributed-system testing.
ISSN:	1558-1225
DOI:	10.1109/ICSE55347.2025.00244