Intermediate Layer Attention Mechanism for Multimodal Fusion in Personality and Affect Computing

Bibliographic Details
Published in	IEEE Access, Vol. 12, pp. 112776-112793
Main Authors	Sreevidya, P., Aravinth, J., Samiappan, Sathishkumar
Format	Journal Article
Language	English
Published	Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2024
Subjects	Accuracy; Attention; Attention mechanisms; Audio signals; Big-five personality traits; Classification; Computation; Computational modeling; Context; Correlation coefficients; Data mining; Emotion recognition; Feature extraction; Fusion techniques; Human factors; Long short term memory; Machine learning; Performance measurement; Personality; Search algorithms; Sentiment analysis; State-of-the-art reviews; Task analysis; Visual tasks; Visualization
Summary:	This article introduces a versatile multimodal architecture for personality-aware systems, covering tasks such as personality trait prediction, sentiment analysis, and emotion recognition. This is a unique attempt to develop a general pipeline that is applicable to personality and affect computing applications in the context of multimodal data. The proposed model employs task-specific feature extraction models that are trained for each application. An intermediate layer that uses both inter- and intra-attention mechanisms for fusion is presented. This dual attention mechanism is further improved with a binary search algorithm, which is the key contribution of the work. The fusion model discerns distinctive features crucial for classification and regression tasks. To evaluate the system's efficacy, short-duration video clips and their corresponding transcriptions from databases were used. Low-level acoustic features were derived from audio signals, while high-level and mid-level audio features were extracted by applying a transformer-based sentence-RoBERTa model to the audio transcripts. Visual features were obtained from context and facial images through deep face networks, followed by CNN and LSTM models. Dimensionality reduction and multimodal fusion techniques were applied before the machine learning-based classification and prediction tasks. Mean accuracy and the squared correlation coefficient ($R^{2}$) were chosen as performance metrics for prediction tasks, while accuracy and F1-score were used for classification tasks. The study explored various fusion techniques and dimension-reduction approaches to establish an efficient pipeline, ultimately aiming to reduce uncertainties and enhance robustness. The results indicate that the proposed architecture performs comparably with state-of-the-art systems across all evaluated domains.
ISSN:	2169-3536
DOI:	10.1109/ACCESS.2024.3442377
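
The Summary names four evaluation metrics without defining them. The sketch below shows one plausible reading in plain Python and is not taken from the article. It assumes that "mean accuracy" for trait prediction means one minus the mean absolute error on scores in [0, 1], that the squared correlation coefficient is the square of the Pearson correlation, and that F1 is computed for a single positive class. All function names are hypothetical.

```python
from typing import Sequence


def mean_accuracy(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Assumed definition: 1 - mean absolute error, for scores in [0, 1]."""
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return 1.0 - sum(errors) / len(errors)


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Assumed definition: squared Pearson correlation of targets and predictions.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return (cov * cov) / (var_t * var_p)


def accuracy(labels: Sequence[int], preds: Sequence[int]) -> float:
    """Fraction of predictions that match the reference labels."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


def f1_binary(labels: Sequence[int], preds: Sequence[int], positive: int = 1) -> float:
    """F1-score for one positive class: harmonic mean of precision and recall."""
    pairs = list(zip(labels, preds))
    tp = sum(l == positive and p == positive for l, p in pairs)
    fp = sum(l != positive and p == positive for l, p in pairs)
    fn = sum(l == positive and p != positive for l, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy values for illustration only; they are not results from the article.
    print(mean_accuracy([0.2, 0.4, 0.6, 0.8], [0.3, 0.4, 0.5, 0.8]))
    print(r_squared([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
    print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
    print(f1_binary([1, 0, 1, 1], [1, 0, 0, 1]))
```

If the article instead uses the coefficient of determination for $R^{2}$, or a macro-averaged F1 over several emotion classes, the corresponding functions would need to change accordingly.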