Prediction of Mild Cognitive Impairment Using a Hybrid Audio‐Visual Approach: An I‐CONECT Study
Published in | Alzheimer's & dementia, Vol. 19, no. S19
Main Authors | , , , ,
Format | Journal Article
Language | English
Published | 01.12.2023
Background
Cognitive decline can affect speech, language, head pose, eye gaze, and facial expressions in older adults. Artificial Intelligence (AI) can assist in automated prediction and monitoring of the progress of cognitive decline. Our previous work shows that unimodal linguistic features or facial features, used separately, can predict cognitive impairment. We hypothesize that combining multimodal features from both audio and video can yield a more accurate AI algorithm for differentiating those with mild cognitive impairment (MCI) from those with normal cognition (NC).
Method
We use deep learning (DL) methods, specifically Transformers, to extract both linguistic and facial features and predict whether an older subject has MCI or NC. Data are collected through the Internet‐Based Conversational Engagement Clinical Trial (I‐CONECT; NCT02871921), which aims to determine the effects of social interactions (video chats) on cognitive function. Videos and the corresponding transcribed audio for three themes, Summertime (30 participants), Halloween (32 participants), and Self‐care (30 participants), are selected based on "Good" video quality. The dataset is balanced (half of the participants are diagnosed with MCI).
Facial features are extracted from video frames with a convolutional autoencoder, and a Bidirectional Encoder Representations from Transformers (BERT) model then captures temporal facial information. Linguistic features are generated by the DistilBERT language model applied to the transcribed audio. We use a 10‐fold cross‐validation approach to train and test the models separately. The probability scores produced by the Transformers for each modality are fused using a majority‐voting method.
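The abstract includes no code, but the two-branch pipeline it describes can be sketched roughly as follows. This is a minimal, illustrative sketch using PyTorch and Hugging Face `transformers`, assuming 64x64 grayscale face crops, a 128-dimensional latent space, and the off-the-shelf `distilbert-base-uncased` checkpoint; the temporal BERT over frame latents, the training loops, and the I‐CONECT preprocessing are omitted, and none of the names below come from the study itself.

```python
# Illustrative sketch of the described audio-visual pipeline (not the authors' code).
# Assumptions: frames are 64x64 grayscale tensors, transcripts are plain strings,
# and PyTorch + Hugging Face `transformers` are available.
import torch
import torch.nn as nn
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification

# --- Visual branch: convolutional autoencoder compresses each frame -------------
class FrameAutoencoder(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (32, 16, 16)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):            # x: (batch, 1, 64, 64)
        z = self.encoder(x)          # latent facial features, later fed as a
        return self.decoder(z), z    # sequence into a BERT-style temporal encoder

# --- Linguistic branch: DistilBERT classifier over the transcribed audio --------
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
text_model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2   # MCI vs. NC
)

def linguistic_probability(transcript: str) -> torch.Tensor:
    """Return P(MCI) for one transcribed conversation (illustrative helper)."""
    inputs = tokenizer(transcript, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = text_model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1]
```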
Result
Accuracy, F1 score, precision, recall, and area under the curve (AUC) are the evaluation metrics in this study. Score‐level fusion outperforms previous non‐hybrid methods of MCI prediction. Audio‐visual fusion leads to promising results for Summertime, Halloween, and Self‐care, with accuracies of 84.8%, 85%, and 84.5% and AUCs of 86%, 86.1%, and 85.7%, respectively, which are marked improvements over the previous findings (48% accuracy using linguistic features alone and 67.8% using facial features alone).
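For illustration, the score‐level fusion and the reported metrics could be computed along the following lines. This is a hedged sketch, not the study's code: the `fuse_and_evaluate` helper, the 0.5 threshold, the tie‐breaking rule, and the dummy scores are all assumptions made for the example.

```python
# Hedged sketch of score-level fusion and the reported metrics, assuming each
# modality yields a per-subject probability of MCI (all names are illustrative).
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def fuse_and_evaluate(p_visual, p_linguistic, y_true, threshold=0.5):
    """Majority-vote the per-modality decisions, then report the study's metrics."""
    votes = np.stack([p_visual >= threshold, p_linguistic >= threshold])
    y_pred = (votes.mean(axis=0) >= 0.5).astype(int)        # ties broken toward MCI
    fused_score = (np.asarray(p_visual) + np.asarray(p_linguistic)) / 2
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, fused_score),
    }

# Example with dummy per-subject scores for four subjects (illustrative only):
print(fuse_and_evaluate(np.array([0.8, 0.3, 0.6, 0.2]),
                        np.array([0.7, 0.4, 0.5, 0.1]),
                        y_true=np.array([1, 0, 1, 0])))
```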
Conclusion
The results of our study show that the proposed multimodal approach improves the accuracy of the AI/ML model in distinguishing MCI from NC. That is, audio‐visual fusion can outperform single‐modality methods.
ISSN | 1552-5260; 1552-5279
DOI | 10.1002/alz.074808