An EM Clustering Algorithm which Produces a Dual Representation

Bibliographic Details
Published in	2011 Tenth International Conference on Machine Learning and Applications, Vol. 2, pp. 90-95
Main Authors	Sun Kim; W. J. Wilbur
Format	Conference Proceeding
Language	English
Published	IEEE, 1 December 2011
Subjects	Algorithm design and analysis; Clustering algorithms; Humans; Parkinson's disease; Probabilistic logic; Vectors
ISBN	9781457721342 1457721341
DOI	10.1109/ICMLA.2011.29

Abstract	Clustering text documents is an important step in mining useful information on the Web or other text-based resources. The common task in text clustering is to handle text in a multi-dimensional space, and to partition documents into groups, where each group contains documents that are similar to each other. However, this strategy lacks a comprehensive view for humans since it cannot explain the subject of each cluster. Utilizing semantic information such as an ontology can solve this problem, but it needs a well-defined database or pre-labeled gold standard set. In this paper, we present a theme-based clustering algorithm for text documents. Given text, subject terms are extracted and used for clustering documents in a probabilistic framework. An EM approach is used to ensure documents are assigned to correct themes, hence it converges to an optimal solution. The proposed method is distinctive because its results are sufficiently explanatory for human understanding as well as efficient for usual clustering performance. The experimental results show that the proposed method provides competitive performance compared to other state-of-the-art approaches. In addition, the extracted themes represent well the topics of clusters on the MEDLINE dataset.
Author Affiliations	Sun Kim (sun.kim@nih.gov) and W. J. Wilbur (wilbur@ncbi.nlm.nih.gov), both at the National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD, USA
EISBN	0769546072 9780769546070
URI	https://ieeexplore.ieee.org/document/6147054