Prediction of Mild Cognitive Impairment Using a Hybrid Audio‐Visual Approach: An I‐CONECT Study
Published in | Alzheimer's & dementia, Vol. 19, no. S19
Main Authors | , , , ,
Format | Journal Article
Language | English
Published | 01.12.2023
Background
Cognitive decline can affect speech, language, head pose, eye gaze, and facial expressions in older adults. Artificial Intelligence (AI) can assist in automated prediction and monitoring of the progress of cognitive decline. Our previous work shows that unimodal linguistic features or facial features, used separately, can predict cognitive impairment. We hypothesize that combining multimodal features from both audio and video can yield a more accurate AI algorithm for differentiating those with mild cognitive impairment (MCI) from those with normal cognition (NC).
Method
We use deep learning (DL) methods, specifically Transformers, to extract both linguistic and facial features and predict whether an older subject has MCI or NC. Data are collected through the Internet‐Based Conversational Engagement Clinical Trial (I‐CONECT; NCT02871921), which aims to determine the effects of social interactions (video chats) on cognitive function. Videos and the corresponding transcribed audio for three themes, Summertime (30 participants), Halloween (32 participants), and Self‐care (30 participants), are selected based on "Good" video quality. The dataset is balanced (half of the participants are diagnosed with MCI).
Facial features are extracted from video frames with a convolutional autoencoder, and a Bidirectional Encoder Representations from Transformers (BERT) model then captures temporal facial information. Linguistic features are generated by the DistilBERT language model applied to the transcribed audio. We use a 10‐fold cross‐validation approach to train and test the models separately. The probability scores produced by the Transformers for each modality are fused using a majority‐voting method.
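The abstract includes no code, but the two-branch pipeline it describes can be sketched roughly as follows. This is a minimal, illustrative sketch using PyTorch and Hugging Face `transformers`, assuming 64x64 grayscale face crops, a 128-dimensional latent space, and the off-the-shelf `distilbert-base-uncased` checkpoint; the temporal BERT over frame latents, the training loops, and the I‐CONECT preprocessing are omitted, and none of the names below come from the study itself.

```python
# Illustrative sketch of the described audio-visual pipeline (not the authors' code).
# Assumptions: frames are 64x64 grayscale tensors, transcripts are plain strings,
# and PyTorch + Hugging Face `transformers` are available.
import torch
import torch.nn as nn
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification

# --- Visual branch: convolutional autoencoder compresses each frame -------------
class FrameAutoencoder(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (32, 16, 16)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):            # x: (batch, 1, 64, 64)
        z = self.encoder(x)          # latent facial features, later fed as a
        return self.decoder(z), z    # sequence into a BERT-style temporal encoder

# --- Linguistic branch: DistilBERT classifier over the transcribed audio --------
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
text_model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2   # MCI vs. NC
)

def linguistic_probability(transcript: str) -> torch.Tensor:
    """Return P(MCI) for one transcribed conversation (illustrative helper)."""
    inputs = tokenizer(transcript, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = text_model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1]
```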
Result
Accuracy, F1 score, precision, recall, and area under the curve (AUC) are the evaluation metrics in this study. Score‐level fusion outperforms previous non‐hybrid methods of MCI prediction. Audio‐visual fusion leads to promising results for Summertime, Halloween, and Self‐care, with accuracies of 84.8%, 85%, and 84.5% and AUCs of 86%, 86.1%, and 85.7%, respectively, which are marked improvements over the previous findings (48% accuracy using linguistic features alone and 67.8% using facial features alone).
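For illustration, the score‐level fusion and the reported metrics could be computed along the following lines. This is a hedged sketch, not the study's code: the `fuse_and_evaluate` helper, the 0.5 threshold, the tie‐breaking rule, and the dummy scores are all assumptions made for the example.

```python
# Hedged sketch of score-level fusion and the reported metrics, assuming each
# modality yields a per-subject probability of MCI (all names are illustrative).
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def fuse_and_evaluate(p_visual, p_linguistic, y_true, threshold=0.5):
    """Majority-vote the per-modality decisions, then report the study's metrics."""
    votes = np.stack([p_visual >= threshold, p_linguistic >= threshold])
    y_pred = (votes.mean(axis=0) >= 0.5).astype(int)        # ties broken toward MCI
    fused_score = (np.asarray(p_visual) + np.asarray(p_linguistic)) / 2
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, fused_score),
    }

# Example with dummy per-subject scores for four subjects (illustrative only):
print(fuse_and_evaluate(np.array([0.8, 0.3, 0.6, 0.2]),
                        np.array([0.7, 0.4, 0.5, 0.1]),
                        y_true=np.array([1, 0, 1, 0])))
```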
Conclusion
The results of our study show that the proposed multimodal approach improves the accuracy of the AI/ML model in distinguishing MCI from NC. That is, audio‐visual fusion can outperform single‐modality methods.
ISSN | 1552-5260; 1552-5279
DOI | 10.1002/alz.074808