Fully Unsupervised Topic Clustering of Unlabelled Spoken Audio Using Self-Supervised Representation Learning and Topic Model

Bibliographic Details
Published in	ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) pp. 1 - 5
Main Authors	Maekaku, Takashi; Fujita, Yuya; Chang, Xuankai; Watanabe, Shinji
Format	Conference Proceeding
Language	English
Published	IEEE 04.06.2023
Subjects	Acoustics; Hidden Markov models; HuBERT; LDA; Predictive models; Probabilistic logic; Representation learning; Self-supervised learning; Signal processing; Topic model; Unsupervised; WavLM
Abstract	Unsupervised topic clustering of spoken audio is an important research topic for zero-resourced unwritten languages. A classical approach is to find a set of spoken terms from only the audio based on dynamic time warping or generative modeling (e.g., hidden Markov model), and apply a topic model to classify topics. The spoken term discovery is the most important and difficult part. In this paper, we propose to combine self-supervised representation learning (SSRL) methods as a component of spoken term discovery and probabilistic topic models. Most SSRL methods pre-train a model which predicts high-quality pseudo labels generated from an audio-only corpus. These pseudo labels can be used to produce a sequence of pseudo subwords by applying deduplication and a subword model. Then, we apply a topic model based on latent Dirichlet allocation for these pseudo-subword sequences in an unsupervised manner. The clustering performance is evaluated on the Fisher corpus using normalized mutual information. We confirm the improvement of the proposed method and its effectiveness compared to an existing approach using dynamic time warping and topic models although the experimental setups are not directly comparable.
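Two steps named in the abstract, deduplication of frame-level pseudo labels and evaluation by normalized mutual information (NMI), can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy label values, and the choice to normalize NMI by the arithmetic mean of the two entropies are assumptions made here for illustration, and the subword model and latent Dirichlet allocation steps are omitted.

```python
import math
from collections import Counter
from itertools import groupby


def deduplicate(labels):
    """Collapse each run of identical consecutive pseudo labels into one label."""
    return [label for label, _ in groupby(labels)]


def entropy(counts, total):
    """Shannon entropy (in nats) of a distribution given by raw counts."""
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def normalized_mutual_information(true_topics, predicted_clusters):
    """NMI between two labelings of the same documents.

    Assumption: the mutual information is divided by the arithmetic mean
    of the two entropies; other normalizations (min, max, geometric mean)
    exist, and the abstract does not say which one the paper uses.
    """
    n = len(true_topics)
    if n == 0 or n != len(predicted_clusters):
        raise ValueError("labelings must be non-empty and of equal length")

    true_counts = Counter(true_topics)
    pred_counts = Counter(predicted_clusters)
    joint_counts = Counter(zip(true_topics, predicted_clusters))

    # Mutual information computed from the joint and marginal counts.
    mi = 0.0
    for (t, c), n_tc in joint_counts.items():
        mi += (n_tc / n) * math.log(n_tc * n / (true_counts[t] * pred_counts[c]))

    mean_entropy = (entropy(true_counts.values(), n)
                    + entropy(pred_counts.values(), n)) / 2
    if mean_entropy == 0:
        # Both labelings put every document in a single group.
        return 1.0
    return mi / mean_entropy


# Hypothetical frame-level pseudo labels for one utterance.
frame_labels = [12, 12, 12, 7, 7, 31, 31, 31, 31, 7]
print(deduplicate(frame_labels))

# Hypothetical reference topics and predicted clusters for four recordings.
print(normalized_mutual_information(["a", "a", "b", "b"], [1, 1, 0, 0]))
```

NMI is invariant to how clusters are numbered, so a clustering that matches the reference topics up to relabeling scores 1, while a clustering that is statistically independent of the reference scores 0.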
Author_xml	1. Maekaku, Takashi (Yahoo Japan Corporation, Tokyo, Japan); 2. Fujita, Yuya (Yahoo Japan Corporation, Tokyo, Japan); 3. Chang, Xuankai (Carnegie Mellon University, PA, USA); 4. Watanabe, Shinji (Carnegie Mellon University, PA, USA)
ContentType	Conference Proceeding
DOI	10.1109/ICASSP49357.2023.10095280
DatabaseName	IEEE Electronic Library (IEL) Conference Proceedings IEEE Proceedings Order Plan (POP) 1998-present by volume IEEE Xplore All Conference Proceedings IEEE Electronic Library Online IEEE Proceedings Order Plans (POP) 1998-present
Discipline	Engineering
EISBN	1728163277 9781728163277
EISSN	2379-190X
EndPage	5
ExternalDocumentID	10095280
Genre	orig-research
IsDoiOpenAccess	false
IsOpenAccess	true
IsPeerReviewed	false
IsScholarly	true
Language	English
LinkModel	DirectLink
OpenAccessLink	https://doi.org/10.1109/icassp49357.2023.10095280
PageCount	5
ParticipantIDs	ieee_primary_10095280
PublicationCentury	2000
PublicationDate	2023-June-4
PublicationDateYYYYMMDD	2023-06-04
PublicationDate_xml	– month: 06 year: 2023 text: 2023-June-4 day: 04
PublicationDecade	2020
PublicationTitle	ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
PublicationTitleAbbrev	ICASSP
PublicationYear	2023
Publisher	IEEE
Publisher_xml	– name: IEEE
SSID	ssj0008748
Score	2.274837
SourceID	ieee
SourceType	Publisher
StartPage	1
Title	Fully Unsupervised Topic Clustering of Unlabelled Spoken Audio Using Self-Supervised Representation Learning and Topic Model
URI	https://ieeexplore.ieee.org/document/10095280
hasFullText	1
inHoldings	1