An EM Clustering Algorithm which Produces a Dual Representation

Bibliographic Details
Published in 2011 Tenth International Conference on Machine Learning and Applications, Vol. 2, pp. 90-95
Main Authors Sun Kim; W. J. Wilbur
Format Conference Proceeding
Language English
Published IEEE 01.12.2011
ISBN 9781457721342, 1457721341
DOI 10.1109/ICMLA.2011.29

Abstract Clustering text documents is an important step in mining useful information on the Web or other text-based resources. The common task in text clustering is to handle text in a multi-dimensional space, and to partition documents into groups, where each group contains documents that are similar to each other. However, this strategy lacks a comprehensive view for humans since it cannot explain the subject of each cluster. Utilizing semantic information such as an ontology can solve this problem, but it needs a well-defined database or pre-labeled gold standard set. In this paper, we present a theme-based clustering algorithm for text documents. Given text, subject terms are extracted and used for clustering documents in a probabilistic framework. An EM approach is used to ensure documents are assigned to correct themes, hence it converges to an optimal solution. The proposed method is distinctive because its results are sufficiently explanatory for human understanding as well as efficient for usual clustering performance. The experimental results show that the proposed method provides competitive performance compared to other state-of-the-art approaches. In addition, the extracted themes represent well the topics of clusters on the MEDLINE dataset.
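The abstract describes clustering documents in a probabilistic framework with EM, so that each cluster comes with both a set of documents and a set of descriptive terms. The paper's own theme-based algorithm is not reproduced in this record. The sketch below swaps in a generic multinomial mixture model fitted with EM purely to illustrate that dual output. The function names (`em_multinomial_mixture`, `top_terms`), the smoothing constant, the deterministic initialization, and the toy documents are all assumptions made for this example.

```python
import math
from collections import Counter


def em_multinomial_mixture(docs, k, iters=30, smoothing=0.1):
    """Soft-cluster tokenized documents with a k-component multinomial mixture.

    Returns (resp, term_probs, vocab): resp[d][c] is the posterior probability
    of cluster c for document d, and term_probs[c][i] is the probability of
    vocab[i] under cluster c. Requires iters >= 1.
    """
    vocab = sorted({term for doc in docs for term in doc})
    index = {term: i for i, term in enumerate(vocab)}
    counts = [Counter(index[term] for term in doc) for doc in docs]
    n, v = len(docs), len(vocab)

    # Deterministic soft initialization (an illustrative assumption):
    # document d leans toward cluster d % k.
    resp = []
    for d in range(n):
        row = [1.0 if c == d % k else 0.1 for c in range(k)]
        total = sum(row)
        resp.append([x / total for x in row])

    for _ in range(iters):
        # M-step: re-estimate cluster priors and smoothed term distributions.
        priors = [sum(resp[d][c] for d in range(n)) / n for c in range(k)]
        term_probs = []
        for c in range(k):
            mass = [smoothing] * v
            for d in range(n):
                for i, cnt in counts[d].items():
                    mass[i] += resp[d][c] * cnt
            total = sum(mass)
            term_probs.append([m / total for m in mass])

        # E-step: posterior cluster membership per document, in log space.
        for d in range(n):
            logs = [
                math.log(max(priors[c], 1e-300))
                + sum(cnt * math.log(term_probs[c][i]) for i, cnt in counts[d].items())
                for c in range(k)
            ]
            peak = max(logs)
            weights = [math.exp(x - peak) for x in logs]
            total = sum(weights)
            resp[d] = [w / total for w in weights]

    return resp, term_probs, vocab


def top_terms(term_probs, vocab, size=3):
    """Return the `size` most probable terms for each cluster."""
    return [
        [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -probs[i])[:size]]
        for probs in term_probs
    ]


# Toy corpus (invented for illustration): two topics with disjoint vocabularies.
docs = [
    "gene protein cell gene".split(),
    "protein cell gene protein".split(),
    "cell gene protein cell".split(),
    "cluster vector model cluster".split(),
    "vector model cluster vector".split(),
    "model cluster vector model".split(),
]

resp, term_probs, vocab = em_multinomial_mixture(docs, k=2)
labels = [max(range(2), key=lambda c: row[c]) for row in resp]  # document view
themes = top_terms(term_probs, vocab)                           # term view
print(labels)
for c, terms in enumerate(themes):
    print(c, terms)
```

The two return values give the two views of each cluster: `resp` groups the documents, while `term_probs` ranks the terms that characterize each group. EM in general is only guaranteed to reach a local optimum of the likelihood, so on real data the result depends on the initialization.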
Author Sun Kim, sun.kim@nih.gov, National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD, USA
Author W. J. Wilbur, wilbur@ncbi.nlm.nih.gov, National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD, USA
Discipline Computer Science
EISBN 0769546072, 9780769546070
Subject Terms Algorithm design and analysis; Clustering algorithms; Humans; Parkinson's disease; Probabilistic logic; Vectors
URI https://ieeexplore.ieee.org/document/6147054