System and Method for Audio Processing using Time-Invariant Speaker Embeddings


Bibliographic Details
Main Authors: Le Roux, Jonathan; Subramanian, Aswin Shanmugam; Böddeker, Christoph; Wichern, Gordon
Format: Patent
Language: English
Published: 12.09.2024
Summary: A system and method for sound processing for performing multi-talker conversation analysis is provided. The sound processing system includes a deep neural network trained to process audio segments of an audio mixture of the multi-talker conversation. The deep neural network includes a speaker-independent layer that produces a speaker-independent output, and a speaker-biased layer that is applied once, independently, to each of the audio segments for each of the multiple speakers in the audio mixture. Each application of the speaker-biased layer is assigned to a corresponding speaker by inputting that speaker's time-invariant embedding. From the combination of speaker-biased outputs, the deep neural network produces data indicative of the time-frequency activity regions of each of the multiple speakers in the audio mixture.
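
The summary describes the architecture only at a high level. As a rough illustration, the following is a minimal PyTorch sketch, assuming hypothetical module names, feature dimensions, and a simple concatenation-based biasing mechanism, of how a shared speaker-independent layer and a per-speaker speaker-biased layer conditioned on time-invariant speaker embeddings might be combined to estimate each speaker's time-frequency activity. None of these names or design choices are taken from the patent itself.

# Minimal sketch (assumption): a shared speaker-independent layer followed by a
# speaker-biased layer applied once per speaker, conditioned on that speaker's
# time-invariant embedding. Shapes and module names are hypothetical.
import torch
import torch.nn as nn


class SpeakerBiasedActivityModel(nn.Module):
    def __init__(self, n_freq: int, hidden: int = 256, emb_dim: int = 128):
        super().__init__()
        # Speaker-independent layer: shared across all speakers and segments.
        self.speaker_independent = nn.Sequential(
            nn.Linear(n_freq, hidden), nn.ReLU(),
        )
        # Speaker-biased layer: same weights for every speaker, but biased per
        # speaker by concatenating the time-invariant embedding to the features.
        self.speaker_biased = nn.Sequential(
            nn.Linear(hidden + emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_freq),  # per-speaker time-frequency activity logits
        )

    def forward(self, segment: torch.Tensor, speaker_embs: torch.Tensor) -> torch.Tensor:
        # segment:      (batch, time, n_freq) features of one audio segment
        # speaker_embs: (num_speakers, emb_dim) time-invariant speaker embeddings
        shared = self.speaker_independent(segment)           # (batch, time, hidden)
        outputs = []
        for emb in speaker_embs:                             # one application per speaker
            bias = emb.expand(shared.shape[0], shared.shape[1], -1)
            biased_in = torch.cat([shared, bias], dim=-1)
            outputs.append(self.speaker_biased(biased_in))   # (batch, time, n_freq)
        # Combine the speaker-biased outputs into per-speaker activity estimates.
        return torch.sigmoid(torch.stack(outputs, dim=1))    # (batch, speakers, time, n_freq)


# Example usage with random data (illustrative only).
model = SpeakerBiasedActivityModel(n_freq=257)
segment = torch.randn(2, 100, 257)         # 2 segments, 100 frames, 257 frequency bins
speaker_embs = torch.randn(3, 128)         # 3 speakers
activity = model(segment, speaker_embs)    # (2, 3, 100, 257), values in [0, 1]
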
Bibliography: Application Number: US202318224659