An EM Clustering Algorithm which Produces a Dual Representation
Published in | 2011 Tenth International Conference on Machine Learning and Applications Vol. 2; pp. 90 - 95 |
---|---|
Main Authors | Sun Kim, W. J. Wilbur |
Format | Conference Proceeding |
Language | English |
Published | IEEE 01.12.2011 |
ISBN | 9781457721342 1457721341 |
DOI | 10.1109/ICMLA.2011.29 |
Abstract | Clustering text documents is an important step in mining useful information on the Web or other text-based resources. The common task in text clustering is to handle text in a multi-dimensional space, and to partition documents into groups, where each group contains documents that are similar to each other. However, this strategy lacks a comprehensive view for humans since it cannot explain the subject of each cluster. Utilizing semantic information such as an ontology can solve this problem, but it needs a well-defined database or pre-labeled gold standard set. In this paper, we present a theme-based clustering algorithm for text documents. Given text, subject terms are extracted and used for clustering documents in a probabilistic framework. An EM approach is used to ensure documents are assigned to correct themes, hence it converges to an optimal solution. The proposed method is distinctive because its results are sufficiently explanatory for human understanding as well as efficient for usual clustering performance. The experimental results show that the proposed method provides competitive performance compared to other state-of-the-art approaches. In addition, the extracted themes represent well the topics of clusters on the MEDLINE dataset. |
---|---|
Author | Sun Kim; Wilbur, W. J. |
Author affiliations | 1. Sun Kim (sun.kim@nih.gov), Nat. Center for Biotechnol. Inf., Nat. Inst. of Health, Bethesda, MD, USA; 2. W. J. Wilbur (wilbur@ncbi.nlm.nih.gov), Nat. Center for Biotechnol. Inf., Nat. Inst. of Health, Bethesda, MD, USA |
Discipline | Computer Science |
EISBN | 0769546072 9780769546070 |
SubjectTerms | Algorithm design and analysis; Clustering algorithms; Humans; Parkinson's disease; Probabilistic logic; Vectors |
URI | https://ieeexplore.ieee.org/document/6147054 |