Time-Frequency Domain Speech Enhancement Framework using Audio Spectrogram Transformer with Masked Multi-head Attention


Bibliographic Details
Published in 2023 8th International Conference on Computers and Devices for Communication (CODEC), pp. 1-2
Main Authors Samui, Suman; Garai, Soumen
Format Conference Proceeding
Language English
Published IEEE 14.12.2023

Summary: Speech enhancement is crucial in applications such as speech recognition, hearing aids, and telecommunications (VoIP). This paper presents a novel time-frequency domain speech enhancement framework that uses an Audio Spectrogram Transformer (AST) with a Masked Multi-head Attention (MMHAt) layer. The AST is a convolution-free deep learning architecture applied directly to an audio spectrogram. Its multi-head attention layer can capture long-range global context in the time-frequency domain. Masking is applied to enforce causality in the multi-head attention mechanism, which is essential for real-time applications.
DOI:10.1109/CODEC60112.2023.10465846
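
The causality constraint described in the summary can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each spectrogram time frame is a plain feature vector, uses the frames themselves as queries, keys, and values (no learned projections), and forms heads by slicing the feature dimension. The function names `causal_attention` and `causal_multi_head_attention` are hypothetical and chosen only for illustration.

```python
import math


def causal_attention(frames):
    """Scaled dot-product self-attention where frame t sees only frames 0..t.

    Assumption: queries, keys, and values are the input frames themselves.
    """
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for t, query in enumerate(frames):
        # Causal mask: drop all future frames. This is equivalent to
        # setting their scores to -inf before the softmax.
        visible = frames[: t + 1]
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in visible]
        # Numerically stable softmax over the visible frames.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the visible value vectors.
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, visible))
            for d in range(dim)
        ])
    return outputs


def causal_multi_head_attention(frames, num_heads):
    """Split features into heads, attend causally per head, then concatenate."""
    dim = len(frames[0])
    if dim % num_heads != 0:
        raise ValueError("feature dimension must be divisible by num_heads")
    head_dim = dim // num_heads
    head_outputs = []
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        head_outputs.append(causal_attention([f[lo:hi] for f in frames]))
    # Concatenate the head outputs frame by frame.
    return [
        [x for head in head_outputs for x in head[t]]
        for t in range(len(frames))
    ]
```

Because each output frame depends only on the current and earlier input frames, the output for a given frame can be computed as soon as that frame arrives, which is the property a real-time system needs. A full model would add learned query, key, value, and output projections on top of this masking pattern.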