Video Swin Transformers in Pain Detection: A Comprehensive Evaluation of Effectiveness, Generalizability, and Explainability
Published in: 2024 12th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), pp. 22-30
Main Authors:
Format: Conference Proceeding
Language: English
Published: IEEE, 15.09.2024
DOI: 10.1109/ACIIW63320.2024.00008
Summary: Recent advancements in deep learning, particularly in transformer-based models, offer promising potential for establishing new benchmarks in automated pain assessment through facial expressions. We propose to use the Video Swin Transformer (VST), which leverages temporal dynamics and offers the potential for nuanced pain detection across varying scales. Our study applies the VST and compares its performance against other transformer-based state-of-the-art models such as the Swin Transformer and the Vision Transformer (ViT). Through ablation studies, we demonstrate the positive impact of incorporating increased temporal depth into the model. Additionally, we evaluate the use of focal loss to mitigate the imbalanced class distribution in the UNBC-McMaster dataset, which proved insufficient. We also examine the generalizability of our models across different datasets, highlighting the need for more diverse datasets in training. Through the extraction of attention maps, we present insights into explainability, particularly the focus points of our models, confirming that they rely on pain-related facial regions for decision-making. The results are promising: our best models, VST-0 and VST-1-TD, achieve state-of-the-art performance in automated pain detection with F1 scores of 0.56 and 0.59, respectively. This paper underscores the potential of the VST architecture in automated pain assessment. Code is available at https://github.com/MRausus/VST-APA
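The focal loss the abstract evaluates for class imbalance down-weights the contribution of easy, well-classified examples so that rare-class (painful) frames dominate training. A minimal NumPy sketch of the standard binary form, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t); the alpha and gamma values here are illustrative defaults, not hyperparameters reported in the paper:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    p: predicted probability of the positive (pain) class, in (0, 1).
    y: ground-truth label in {0, 1}.
    alpha, gamma: illustrative defaults, not values from the paper.
    """
    p_t = np.where(y == 1, p, 1.0 - p)            # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

# An easy, confident correct prediction is down-weighted far more
# than a hard misclassified one:
easy = focal_loss(np.array([0.9]), np.array([1]))
hard = focal_loss(np.array([0.1]), np.array([1]))
```

With gamma = 0 and alpha = 1 this reduces to plain cross-entropy, which is why gamma is often described as the "focusing" parameter.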