Automated Generation of Accessibility Test Reports from Recorded User Transcripts

Bibliographic Details
Published in	Proceedings / International Conference on Software Engineering, pp. 204-216
Main Authors	Huq, Syed Fatiul; Tafreshipour, Mahan; Kalcevich, Kate; Malek, Sam
Format	Conference Proceeding
Language	English
Published	IEEE 26.04.2025
Subjects	crowd-sourced software testing; Large language models; Reproducibility of results; Semantics; Software; software accessibility; Software engineering; Software testing; Systematics; Testing; Usability
Summary:	Testing for accessibility is a significant step when developing software, as it ensures that all users, including those with disabilities, can effectively engage with web and mobile applications. While automated tools exist to detect accessibility issues in software, none are as comprehensive and effective as the process of user testing, where testers with various disabilities evaluate the application for accessibility and usability issues. However, user testing is not popular with software developers as it requires conducting lengthy interviews with users and later parsing through large recordings to derive the issues to fix. In this paper, we explore how large language models (LLMs) like GPT 4.0, which have shown promising results in context comprehension and semantic text generation, can mitigate this issue and streamline the user testing process. Our solution, called Reca11, takes in auto-generated transcripts from user testing video recordings and extracts the accessibility and usability issues mentioned by the tester. Our systematic prompt engineering determines the optimal configuration of input, instruction, context and demonstrations for best results. We evaluate Reca11's effectiveness on 36 user testing sessions across three applications. Based on the findings, we investigate the strengths and weaknesses of using LLMs in this space.
ISSN:	1558-1225
DOI:	10.1109/ICSE55347.2025.00043