Towards Efficient and Real-Time Piano Transcription Using Neural Autoregressive Models
Main Authors | , , |
---|---|
Format | Journal Article |
Language | English |
Published | 10.04.2024 |
Summary: | In recent years, advancements in neural network designs and the availability
of large-scale labeled datasets have led to significant improvements in the
accuracy of piano transcription models. However, most previous work focused on
high-performance offline transcription, neglecting deliberate consideration of
model size. The goal of this work is to implement real-time inference for piano
transcription while keeping the model both high-performing and lightweight. To this
end, we propose novel architectures for convolutional recurrent neural
networks, redesigning an existing autoregressive piano transcription model.
First, we extend the acoustic module by adding a frequency-conditioned FiLM
layer to the CNN module to adapt the convolutional filters on the frequency
axis. Second, we improve note-state sequence modeling by using a pitchwise LSTM
that focuses on note-state transitions within a note. In addition, we augment
the autoregressive connection with an enhanced recursive context. Using these
components, we propose two types of models: one for high performance and the
other for high compactness. Through extensive experiments, we show that the
proposed models are comparable to state-of-the-art models in terms of note
accuracy on the MAESTRO dataset. We also investigate the effective model size
and real-time inference latency by gradually streamlining the architecture.
Finally, we conduct cross-data evaluation on unseen piano datasets and an in-depth
analysis to elucidate the effect of the proposed components with respect to note
length and pitch range. |
---|---|
DOI: | 10.48550/arxiv.2404.06818 |
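
The summary mentions a frequency-conditioned FiLM layer that adapts convolutional features along the frequency axis. The record does not say how the modulation parameters are produced, so the following is only a minimal sketch under stated assumptions: FiLM is taken in its usual form (a feature-wise scale and shift, `gamma * x + beta`), and the scale and shift are assumed to be a linear function of the normalized frequency-bin position. The function name `freq_film` and the parameters `gamma_w`, `gamma_b`, `beta_w`, `beta_b` are hypothetical and do not come from the paper.

```python
def freq_film(features, gamma_w, gamma_b, beta_w, beta_b):
    """Apply a frequency-conditioned scale and shift to a feature map.

    features: nested lists indexed as [channel][freq_bin][time_frame].
    gamma_w, gamma_b, beta_w, beta_b: one float per channel; hypothetical
    parameters of an assumed linear conditioning function.
    """
    n_freq = len(features[0])
    out = []
    for c, channel in enumerate(features):
        rows = []
        for f, row in enumerate(channel):
            # Normalized position of this bin on the frequency axis, in [0, 1].
            pos = f / (n_freq - 1) if n_freq > 1 else 0.0
            # Assumption: scale and shift depend linearly on bin position.
            gamma = gamma_w[c] * pos + gamma_b[c]
            beta = beta_w[c] * pos + beta_b[c]
            # FiLM: the same scale and shift for every frame of this bin.
            rows.append([gamma * x + beta for x in row])
        out.append(rows)
    return out


# One channel, three frequency bins, two time frames.
features = [[[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]]
result = freq_film(features, gamma_w=[1.0], gamma_b=[1.0],
                   beta_w=[0.0], beta_b=[0.5])
print(result)  # [[[1.5, 2.5], [2.0, 3.5], [2.5, 4.5]]]
```

Identical inputs in different frequency bins come out scaled differently, which is the point of conditioning on frequency: an ordinary convolution applies the same filter to every bin. A learned model would likely replace the linear function with a small trainable network.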