An empirical study on the challenges that developers encounter when developing Apache Spark applications

Bibliographic Details
Published in The Journal of systems and software Vol. 194; p. 111488
Main Authors Wang, Zehao; Chen, Tse-Hsun (Peter); Zhang, Haoxiang; Wang, Shaowei
Format Journal Article
Language English
Published Elsevier Inc 01.12.2022
Summary: Apache Spark is one of the most popular big data frameworks that abstract the underlying distributed computation details. However, even though Spark provides various abstractions, developers may still encounter challenges related to the peculiarities of distributed computation and the distributed environment. To understand the challenges that developers encounter and to provide insight for future studies, in this paper we conduct an empirical study on the questions that developers ask. We manually analyze 1,000 randomly selected questions that we collected from Stack Overflow. We find that: 1) Questions related to data processing (e.g., transforming data format) are the most common among the 11 types of questions that we uncovered. 2) Even though data processing questions are the most common ones, they require the least amount of time to receive an answer. Questions related to configuration and performance require the most time to receive an answer. 3) Most of the issues are caused by developers’ insufficient knowledge of API usage, data conversion across frameworks, and environment-related configurations. We also discuss the implications of our findings for researchers and practitioners. In summary, our work provides insights for future research directions and highlights the need for more software engineering research in this area.
• We identify 11 question types that developers encounter in Spark development.
• Among all types, data processing, configuration, and IO are the most common questions.
• Questions on performance and configuration require a longer time to solve.
• Data abstraction and lack of API usage knowledge are the main root causes of questions.
ISSN: 0164-1212, 1873-1228
DOI: 10.1016/j.jss.2022.111488