AutoFoley: Artificial Synthesis of Synchronized Sound Tracks for Silent Videos With Deep Learning


Bibliographic Details
Published in: IEEE Transactions on Multimedia, Vol. 23, pp. 1895-1907
Main Authors: Ghose, Sanchita; Prevost, John Jeffrey
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2021
Summary: In movie productions, the Foley artist is responsible for creating an overlay soundtrack that helps the movie come alive for the audience. This requires the artist to identify sounds that enhance the experience for the listener, reinforcing the director's intention for the scene. The artist must decide what artificial sound captures the essence of the sound and action depicted in the scene. In this paper, we present AutoFoley, an automated deep-learning tool for synthesizing a representative audio track for videos. AutoFoley can be used to associate audio with soundless video, or to identify critical scenarios and provide a synthesized, reinforced, and time-synchronized soundtrack. Our algorithm precisely recognizes actions as well as interframe relations in fast-moving video clips by incorporating an interpolation technique and temporal relational networks (TRNs). We employ a robust multiscale recurrent neural network (RNN) together with a convolutional neural network (CNN) to better capture the intricate input-to-output associations. To evaluate AutoFoley, we create an audio-video dataset containing a variety of sounds frequently used as Foley effects in movies. Although the dataset is limited to short-duration videos of representative activities, it demonstrates the capabilities of our proposed system. We show that the synthesized sounds are rendered with accurate temporal synchronization to the associated visual inputs. In human qualitative testing of AutoFoley, more than 73% of the test subjects considered the generated soundtrack to be original, a noteworthy improvement over comparable cross-modal research.
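The TRN component mentioned in the summary reasons over ordered tuples of per-frame features to capture interframe relations. As a rough illustration only (not the authors' implementation), a minimal two-frame relation term can be sketched in NumPy; all dimensions, weights, and function names here are hypothetical:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, b1, w2, b2):
    """Two-layer perceptron with ReLU, applied to a 1-D feature vector."""
    h = np.maximum(x @ w1 + b1, 0.0)
    return h @ w2 + b2

def two_frame_relation(frame_feats, w1, b1, w2, b2):
    """Sum an MLP over ordered frame pairs (the 2-frame term of a TRN).

    frame_feats: (T, D) array of per-frame features (e.g. CNN embeddings).
    Returns a (C,) class-score vector aggregated over all pairs i < j.
    """
    T = frame_feats.shape[0]
    score = 0.0
    for i, j in itertools.combinations(range(T), 2):  # ordered pairs i < j
        pair = np.concatenate([frame_feats[i], frame_feats[j]])
        score = score + mlp(pair, w1, b1, w2, b2)
    return score

# Toy dimensions: 4 frames, 8-D features, 16 hidden units, 3 action classes.
T, D, H, C = 4, 8, 16, 3
feats = rng.standard_normal((T, D))
w1 = rng.standard_normal((2 * D, H)); b1 = np.zeros(H)
w2 = rng.standard_normal((H, C));     b2 = np.zeros(C)

scores = two_frame_relation(feats, w1, b1, w2, b2)
print(scores.shape)  # (3,)
```

A full TRN additionally sums analogous terms over 3-frame, 4-frame, etc. tuples, each with its own MLP, and the paper pairs this with frame interpolation so that fast-moving actions are sampled densely enough.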
ISSN: 1520-9210, 1941-0077
DOI: 10.1109/TMM.2020.3005033