An empirical study on the challenges that developers encounter when developing Apache Spark applications
Apache Spark is one of the most popular big data frameworks that abstract the underlying distributed computation details. However, even though Spark provides various abstractions, developers may still encounter challenges related to the peculiarity of distributed computation and environment. To unde...
Saved in:
Published in | The Journal of systems and software Vol. 194; p. 111488 |
---|---|
Main Authors | , , , |
Format | Journal Article |
Language | English |
Published |
Elsevier Inc
01.12.2022
|
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | Apache Spark is one of the most popular big data frameworks that abstract the underlying distributed computation details. However, even though Spark provides various abstractions, developers may still encounter challenges related to the peculiarity of distributed computation and environment. To understand the challenges that developers encounter, and provide insight for future studies, in this paper, we conduct an empirical study on the questions that developers encounter. We manually analyze 1,000 randomly selected questions that we collected from Stack Overflow. We find that: 1) questions related to data processing (e.g., transforming data format) are the most common among the 11 types of questions that we uncovered. 2) Even though data processing questions are the most common ones, they require the least amount of time to receive an answer. Questions related to configuration and performance require the most time to receive an answer. 3) Most of the issues are caused by developers’ insufficient knowledge in API usages, data conversation across frameworks, and environment-related configurations. We also discuss the implication of our findings for researchers and practitioners. In summary, our work provides insights for future research directions and highlight the need for more software engineering research in this area.
•We identify 11 question types that developers encounter in Spark development.•Among all, data processing, configuration, and IO are the most common questions.•Questions on performance and configuration require longer time to solve.•Data abstraction and lack of API usage knowledge are main root causes of questions. |
---|---|
ISSN: | 0164-1212 1873-1228 |
DOI: | 10.1016/j.jss.2022.111488 |