Novel similarity measure for document clustering based on topic phrases
Document clustering is a subfield of data clustering that groups a large set of documents into clusters of similar, related documents. In the traditional vector space model (VSM), researchers have treated each unique word occurring in the document set as a candidate feature. Recently a new tre...
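The abstract of this record describes replacing words with topic phrases as the VSM terms and feeding the resulting similarity into the Buckshot method (HAC on a sample to seed K-means). The following is a minimal sketch of that idea, not the authors' implementation. It assumes topic phrases have already been extracted for each document (the record does not say how), uses raw phrase counts as weights, uses cosine similarity, and uses the commonly cited Buckshot sample size of sqrt(k·n). All function names are hypothetical.

```python
import math
import random
from collections import Counter


def phrase_vector(phrases):
    """VSM vector whose terms are topic phrases (assumed pre-extracted)."""
    return Counter(phrases)


def cosine(u, v):
    """Cosine similarity between two sparse phrase-weight mappings."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def centroid(vectors):
    """Average of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: w / len(vectors) for t, w in total.items()}


def hac_centers(vectors, k):
    """Agglomerative step: merge the two most similar clusters until k remain."""
    clusters = [[v] for v in vectors]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return [centroid(c) for c in clusters]


def buckshot(vectors, k, iterations=5, seed=0):
    """Seed K-means with centers found by HAC on a random sample."""
    rng = random.Random(seed)
    size = min(len(vectors), max(k, math.ceil(math.sqrt(k * len(vectors)))))
    centers = hac_centers(rng.sample(vectors, size), k)
    labels = []
    for _ in range(iterations):
        # Assignment step: each document goes to its most similar center.
        labels = [max(range(len(centers)), key=lambda c: cosine(v, centers[c]))
                  for v in vectors]
        # Update step: recompute each center from its current members.
        for c in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = centroid(members)
    return labels


# Toy corpus: each document is represented only by its topic phrases.
docs = [
    ["document clustering", "vector space model"],
    ["document clustering", "similarity measure"],
    ["neural network", "image recognition"],
    ["neural network", "deep learning"],
]
vectors = [phrase_vector(d) for d in docs]
labels = buckshot(vectors, k=2)
```

On this toy corpus the two documents sharing "document clustering" end up in one cluster and the two sharing "neural network" in the other. A real system would likely weight phrases (for example with TF-IDF) rather than use raw counts; that choice is outside what this record states.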
Published in | 2009 International Conference on Networking and Media Convergence, pp. 92-96 |
---|---|
Main Authors | ELdesoky, A.E., Saleh, M., Sakr, N.A. |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 01.03.2009 |
Subjects | |
ISBN | 9781424437764 1424437768 |
DOI | 10.1109/ICNM.2009.4907196 |
Abstract | Document clustering is a subfield of data clustering that groups a large set of documents into clusters of similar, related documents. In the traditional vector space model (VSM), researchers have treated each unique word occurring in the document set as a candidate feature. Recently, a trend has emerged of treating the phrase as a more informative feature, which contributes to improving document clustering accuracy and effectiveness. This paper proposes a new approach to computing the similarity measure of the traditional VSM: the topic phrases of a document, rather than the traditional term "word", serve as the constituent terms of the VSM. The approach is applied to the Buckshot method, which combines the Hierarchical Agglomerative Clustering (HAC) algorithm with the K-means partitioning algorithm. Such a mechanism may raise the effectiveness of the clustering by increasing the values of the evaluation metrics. |
---|---|
Author | ELdesoky, A.E. Sakr, N.A. Saleh, M. |
Author_xml | 1. ELdesoky, A.E. (Dept. of Comput. & Syst., Mansoura Univ., Mansoura); 2. Saleh, M. (Dept. of Comput. & Syst., King AbdulAziz Univ., Jeddah); 3. Sakr, N.A. (Dept. of Comput. & Syst., Mansoura Univ., Mansoura) |
ContentType | Conference Proceeding |
EISBN | 9781424437788 1424437784 |
EndPage | 96 |
ExternalDocumentID | 4907196 |
Genre | orig-research |
IsPeerReviewed | false |
IsScholarly | false |
LCCN | 2008912186 |
Language | English |
PageCount | 5 |
PublicationCentury | 2000 |
PublicationDate | 2009-March |
PublicationDateYYYYMMDD | 2009-03-01 |
PublicationDate_xml | – month: 03 year: 2009 text: 2009-March |
PublicationDecade | 2000 |
PublicationTitle | 2009 International Conference on Networking and Media Convergence |
PublicationTitleAbbrev | ICNM |
PublicationYear | 2009 |
Publisher | IEEE |
Publisher_xml | – name: IEEE |
StartPage | 92 |
SubjectTerms | Clustering algorithms; Clustering methods; Frequency; Humans; Information retrieval; Natural language processing; Organizing; Partitioning algorithms; Taxonomy; Text mining |
Title | Novel similarity measure for document clustering based on topic phrases |
URI | https://ieeexplore.ieee.org/document/4907196 |