Video Swin Transformers in Pain Detection: A Comprehensive Evaluation of Effectiveness, Generalizability, and Explainability

Bibliographic Details
Published in: 2024 12th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), pp. 22-30
Main Authors: Rau, Maximilian; Ertugrul, Itir Onal
Format: Conference Proceeding
Language: English
Published: IEEE, 15.09.2024
DOI: 10.1109/ACIIW63320.2024.00008

Summary: Recent advancements in deep learning, particularly in transformer-based models, offer promising potential for establishing new benchmarks in automated pain assessment through facial expressions. We propose to use the Video Swin Transformer (VST), which leverages temporal dynamics and offers the potential for nuanced pain detection across varying scales. Our study applies the VST and compares its performance against other transformer-based state-of-the-art models such as the Swin Transformer and the Vision Transformer (ViT). Through ablation studies, we demonstrate the positive impact of incorporating increased temporal depth into the model. Additionally, we evaluate the use of focal loss to mitigate the imbalanced class distribution found in the UNBC-McMaster dataset, which turned out to be insufficient. Furthermore, our research also examines the generalizability of our models across different datasets, highlighting the need for more diverse datasets during training. Through the extraction of attention maps, we present insights into explainability, particularly the focus points of our models, confirming that they utilize pain-related regions for decision-making. The results are promising: our best models, VST-0 and VST-1-TD, achieve state-of-the-art performance in automated pain detection with F1 scores of 0.56 and 0.59, respectively. This paper underscores the potential of the VST architecture in automated pain assessment. Code is available at https://github.com/MRausus/VST-APA
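To make two techniques named in the summary concrete, the sketch below pairs a Video Swin backbone with a focal loss for binary pain detection. This is a minimal sketch assuming a PyTorch/torchvision environment, not the authors' published configuration: the alpha and gamma values, the clip shape, the Kinetics-400 pretrained weights, and the replaced classification head are illustrative assumptions (the paper's exact setup is in the linked repository).

```python
import torch
import torch.nn.functional as F
from torchvision.models.video import swin3d_t, Swin3D_T_Weights

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss (Lin et al., 2017): down-weights easy examples so
    rare pain-positive clips contribute more to the gradient than under
    plain cross-entropy on an imbalanced dataset."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# Kinetics-pretrained Video Swin Transformer (tiny variant), with the
# classification head swapped for a single pain/no-pain logit.
model = swin3d_t(weights=Swin3D_T_Weights.KINETICS400_V1)
model.head = torch.nn.Linear(model.head.in_features, 1)

clip = torch.randn(2, 3, 16, 224, 224)   # (batch, channels, frames, H, W)
labels = torch.tensor([[1.0], [0.0]])    # pain / no-pain
loss = focal_loss(model(clip), labels)
loss.backward()
```

The gamma term is what the summary's "mitigate the imbalanced class distribution" refers to: with gamma = 0 and alpha = 0.5 this reduces to ordinary weighted cross-entropy, while larger gamma progressively suppresses the loss from confidently classified (mostly no-pain) clips.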