Multi-modal Sentiment Analysis of Audio and Visual Context of the Data using Machine Learning

Bibliographic Details
Published in: 2022 3rd International Conference on Smart Electronics and Communication (ICOSEC), pp. 1198 - 1205
Main Author: Juyal, Prachi
Format: Conference Proceeding
Language: English
Published: IEEE, 20.10.2022
DOI: 10.1109/ICOSEC54921.2022.9951988

Summary: Sentiment analysis on video streams in real time comprises using visual and/or aural data from the stream to identify a subject's emotional expressions over time. Sentiment can be assessed through a variety of modalities, including speech, lip movements, and facial expression. This paper presents a multi-modal deep learning strategy for sentiment classification that fuses features derived from an audiovisual input stream in real time. The proposed system consists of four small deep neural network models that analyse visual and auditory data simultaneously. To produce a final prediction, the visual and audio emotion features are merged into a single stream, and an exponentially weighted moving average is used to aggregate evidence over time. The paper also introduces a method for multimodal sentiment analysis based on feature extraction and emotion recognition from the text and visual modalities using convolutional neural networks. By merging visual, text, and audio features, a 12% performance gain is achieved. Using RNN-COVAREP, a few critical factors that are frequently overlooked in multimodal analysis research have also been examined.
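
The fusion-and-smoothing step described in the summary can be illustrated with a short sketch. The Python snippet below is a minimal, hypothetical rendering, not the paper's implementation: it assumes per-frame class-score vectors from the visual and audio branches, a simple averaging fusion rule, and a smoothing factor alpha of 0.9; none of these names or values come from the paper itself.

    import numpy as np

    ALPHA = 0.9  # assumed smoothing factor; the paper does not specify one

    def fuse_frame(visual_scores, audio_scores):
        """Merge per-frame emotion scores from both modalities into one stream.

        Simple averaging is an assumption; the paper's exact fusion rule is not given.
        """
        return (np.asarray(visual_scores) + np.asarray(audio_scores)) / 2.0

    def ewma_update(running, fused, alpha=ALPHA):
        """Exponentially weighted moving average over successive fused frames."""
        if running is None:
            return fused
        return alpha * running + (1.0 - alpha) * fused

    # Usage on a toy stream of per-frame scores for a 3-class sentiment head
    # (negative, neutral, positive); the class layout is illustrative only.
    running = None
    frames = [
        (np.array([0.7, 0.2, 0.1]), np.array([0.6, 0.3, 0.1])),
        (np.array([0.2, 0.6, 0.2]), np.array([0.3, 0.5, 0.2])),
    ]
    for visual, audio in frames:
        running = ewma_update(running, fuse_frame(visual, audio))

    print("smoothed scores:", running)
    print("predicted class:", int(np.argmax(running)))

Because the moving average weights recent frames by (1 - alpha) and decays older evidence geometrically, the running prediction tracks sustained emotional expressions rather than single-frame flickers, which is the stated purpose of gathering data over time in the proposed system.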