LSTM and GPT-2 Synthetic Speech Transfer Learning for Speaker Recognition to Overcome Data Scarcity


Bibliographic Details
Main Authors: Bird, Jordan J; Faria, Diego R; Ekárt, Anikó; Premebida, Cristiano; Ayrosa, Pedro P. S
Format: Journal Article
Language: English
Published: 01.07.2020
Online Access: https://arxiv.org/abs/2007.00659
DOI: 10.48550/arxiv.2007.00659


Abstract: In speech recognition problems, data scarcity often poses an issue because of humans' unwillingness to provide large amounts of data for learning and classification. In this work, we take a set of 5 spoken Harvard sentences from 7 subjects and consider their MFCC attributes. Using character-level LSTMs (supervised learning) and OpenAI's attention-based GPT-2 model, synthetic MFCCs are generated by learning from the data provided on a per-subject basis. A neural network is trained to classify each subject's data against a large dataset of Flickr8k speakers, and is then compared to a transfer-learning network performing the same task but with an initial weight distribution dictated by learning from the synthetic data generated by the two models. For all 7 subjects, the best results came from networks that had been exposed to synthetic data: the model pre-trained with LSTM-produced data achieved the best result 3 times, and the GPT-2 equivalent 5 times (one subject's best result was a draw between the two models). From these results, we argue that speaker classification can be improved by using only a small amount of user data together with exposure to synthetically generated MFCCs, which allows the networks to achieve near-maximum classification scores.
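The transfer-learning setup the abstract describes — pre-train a classifier on synthetic MFCC vectors, then reuse those weights as the starting point for training on a small amount of real speaker data — can be sketched as follows. This is a minimal NumPy illustration, not the paper's architecture or data: the one-hidden-layer network, the 26-coefficient vectors, and both "speakers" are invented stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_net(n_in, n_hidden, n_out):
    """Small one-hidden-layer softmax classifier with random initial weights."""
    return {
        "W1": rng.normal(0.0, 0.1, (n_in, n_hidden)), "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, (n_hidden, n_out)), "b2": np.zeros(n_out),
    }

def forward(p, X):
    """Return hidden activations and class probabilities."""
    h = np.tanh(X @ p["W1"] + p["b1"])
    z = h @ p["W2"] + p["b2"]
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return h, e / e.sum(axis=1, keepdims=True)

def train(p, X, y, epochs=200, lr=0.5):
    """Full-batch gradient descent on the cross-entropy loss."""
    Y = np.eye(p["b2"].size)[y]
    for _ in range(epochs):
        h, probs = forward(p, X)
        d2 = (probs - Y) / len(X)               # gradient at the softmax output
        d1 = (d2 @ p["W2"].T) * (1.0 - h ** 2)  # backprop through tanh
        p["W2"] -= lr * (h.T @ d2); p["b2"] -= lr * d2.sum(axis=0)
        p["W1"] -= lr * (X.T @ d1); p["b1"] -= lr * d1.sum(axis=0)
    return p

# Stand-ins for per-speaker MFCC vectors: two hypothetical speakers, 26 coefficients.
centres = rng.normal(0.0, 1.0, (2, 26))
X_synth = np.vstack([c + 0.3 * rng.normal(size=(50, 26)) for c in centres])
y_synth = np.repeat([0, 1], 50)

# Step 1: pre-train on the (here randomly generated) "synthetic" MFCCs.
net = train(init_net(26, 16, 2), X_synth, y_synth)

# Step 2: fine-tune the pre-trained weights on a much smaller "real" sample.
X_real = np.vstack([c + 0.3 * rng.normal(size=(10, 26)) for c in centres])
y_real = np.repeat([0, 1], 10)
net = train(net, X_real, y_real, epochs=50)

acc = float((forward(net, X_real)[1].argmax(axis=1) == y_real).mean())
print(f"fine-tuned accuracy on the small 'real' set: {acc:.2f}")
```

The point of the comparison in the paper is the initial weight distribution: the fine-tuning step starts from weights shaped by the synthetic data rather than from random initialisation, which is what allows good scores from only a small real sample.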
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0
Resource Type: Preprint
Subjects: Computer Science - Learning; Computer Science - Sound
URI: https://arxiv.org/abs/2007.00659