Fully Unsupervised Topic Clustering of Unlabelled Spoken Audio Using Self-Supervised Representation Learning and Topic Model

Unsupervised topic clustering of spoken audio is an important research topic for zero-resourced unwritten languages. A classical approach is to find a set of spoken terms from only the audio based on dynamic time warping or generative modeling (e.g., hidden Markov model), and apply a topic model to...

Bibliographic Details
Published in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5
Main Authors Maekaku, Takashi; Fujita, Yuya; Chang, Xuankai; Watanabe, Shinji
Format Conference Proceeding
Language English
Published IEEE 04.06.2023
Subjects
Abstract Unsupervised topic clustering of spoken audio is an important research topic for zero-resourced unwritten languages. A classical approach is to discover a set of spoken terms from the audio alone, using dynamic time warping or generative modeling (e.g., a hidden Markov model), and then apply a topic model to classify topics. Spoken term discovery is the most important and most difficult part. In this paper, we propose combining self-supervised representation learning (SSRL) methods, used as a component of spoken term discovery, with probabilistic topic models. Most SSRL methods pre-train a model that predicts high-quality pseudo labels generated from an audio-only corpus. These pseudo labels can be turned into a sequence of pseudo subwords by applying deduplication and a subword model. We then apply a topic model based on latent Dirichlet allocation to these pseudo-subword sequences in an unsupervised manner. Clustering performance is evaluated on the Fisher corpus using normalized mutual information. The proposed method improves on an existing approach that uses dynamic time warping and topic models, although the experimental setups are not directly comparable.
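The two simplest steps named in the abstract, deduplication of pseudo labels and scoring with normalized mutual information, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, deduplication is assumed to mean collapsing runs of identical consecutive frame-level labels, and the arithmetic-mean normalization of NMI is an assumption because the abstract does not say which variant is used. The subword model and the latent Dirichlet allocation step are omitted because they depend on external libraries.

```python
from collections import Counter
from itertools import groupby
from math import log


def deduplicate(frame_labels):
    """Collapse runs of identical consecutive pseudo labels (assumed meaning of 'deduplication')."""
    return [label for label, _ in groupby(frame_labels)]


def normalized_mutual_information(true_topics, predicted_clusters):
    """NMI = I(T; C) / ((H(T) + H(C)) / 2).

    The arithmetic-mean normalization is an assumption; other variants
    (geometric mean, min, max) exist and give different values.
    """
    n = len(true_topics)
    if n == 0 or n != len(predicted_clusters):
        raise ValueError("inputs must be non-empty and of equal length")

    count_t = Counter(true_topics)
    count_c = Counter(predicted_clusters)
    count_tc = Counter(zip(true_topics, predicted_clusters))

    def entropy(counts):
        # Shannon entropy (natural log) of the empirical distribution.
        return -sum((c / n) * log(c / n) for c in counts.values())

    # I(T; C) = sum over observed pairs of p(t, k) * log(p(t, k) / (p(t) * p(k))).
    mutual_info = sum(
        (c / n) * log(c * n / (count_t[t] * count_c[k]))
        for (t, k), c in count_tc.items()
    )
    denominator = (entropy(count_t) + entropy(count_c)) / 2
    # Both labelings constant: treat as perfect agreement by convention.
    return mutual_info / denominator if denominator > 0 else 1.0


# Hypothetical frame-level pseudo labels for one utterance.
units = deduplicate([5, 5, 5, 12, 12, 5, 7, 7])

# Hypothetical reference topics and predicted cluster ids for four recordings.
score = normalized_mutual_information(["a", "a", "b", "b"], [1, 1, 0, 0])
```

Because NMI is invariant to how cluster ids are named, the predicted clusters in the usage lines score as a perfect match even though the ids differ from the reference topic names.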
Author Fujita, Yuya
Chang, Xuankai
Maekaku, Takashi
Watanabe, Shinji
Author_xml – sequence: 1
  givenname: Takashi
  surname: Maekaku
  fullname: Maekaku, Takashi
  organization: Yahoo Japan Corporation, Tokyo, Japan
– sequence: 2
  givenname: Yuya
  surname: Fujita
  fullname: Fujita, Yuya
  organization: Yahoo Japan Corporation, Tokyo, Japan
– sequence: 3
  givenname: Xuankai
  surname: Chang
  fullname: Chang, Xuankai
  organization: Carnegie Mellon University, PA, USA
– sequence: 4
  givenname: Shinji
  surname: Watanabe
  fullname: Watanabe, Shinji
  organization: Carnegie Mellon University, PA, USA
ContentType Conference Proceeding
DOI 10.1109/ICASSP49357.2023.10095280
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library Online
IEEE Proceedings Order Plans (POP) 1998-present
Discipline Engineering
EISBN 1728163277
9781728163277
EISSN 2379-190X
EndPage 5
ExternalDocumentID 10095280
Genre orig-research
IsDoiOpenAccess false
IsOpenAccess true
IsPeerReviewed false
IsScholarly true
Language English
OpenAccessLink https://doi.org/10.1109/icassp49357.2023.10095280
PageCount 5
PublicationCentury 2000
PublicationDate 2023-June-4
PublicationDateYYYYMMDD 2023-06-04
PublicationDate_xml – month: 06
  year: 2023
  text: 2023-June-4
  day: 04
PublicationDecade 2020
PublicationTitle ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
PublicationTitleAbbrev ICASSP
PublicationYear 2023
Publisher IEEE
Publisher_xml – name: IEEE
SourceID ieee
SourceType Publisher
StartPage 1
SubjectTerms Acoustics
Hidden Markov models
HuBERT
LDA
Predictive models
Probabilistic logic
Representation learning
Self-supervised learning
Signal processing
Topic model
Unsupervised
WavLM
Title Fully Unsupervised Topic Clustering of Unlabelled Spoken Audio Using Self-Supervised Representation Learning and Topic Model
URI https://ieeexplore.ieee.org/document/10095280