Novel similarity measure for document clustering based on topic phrases

Bibliographic Details
Published in 2009 International Conference on Networking and Media Convergence, pp. 92-96
Main Authors ELdesoky, A.E., Saleh, M., Sakr, N.A.
Format Conference Proceeding
Language English
Published IEEE 01.03.2009
ISBN 9781424437764
1424437768
DOI 10.1109/ICNM.2009.4907196

Abstract Document clustering is a branch of data clustering that groups a large set of documents into clusters of similar, related documents. The traditional vector space model (VSM) treats each unique word occurring in the document set as a candidate feature. A more recent trend treats the phrase as a more informative feature, which helps improve clustering accuracy and effectiveness. This paper proposes a new approach to computing the VSM similarity measure that uses the topic phrases of a document, rather than the traditional term "word", as the constituent terms of the VSM. The approach is applied to the Buckshot method, which combines the Hierarchical Agglomerative Clustering (HAC) algorithm with the K-means partitioning algorithm. Such a mechanism may raise the effectiveness of the clustering, as reflected in higher values of the evaluation metrics.
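The idea of using phrases instead of words as VSM terms can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract does not say how topic phrases are extracted or weighted, so the sketch assumes the phrases are already given as lists, uses raw phrase counts as weights, and uses plain cosine similarity. The names `phrase_vector` and `cosine_similarity` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def phrase_vector(phrases):
    """Build a sparse VSM vector whose terms are phrases, not single words.

    Assumption: `phrases` is a list of topic phrases already extracted
    from one document; weights are raw occurrence counts.
    """
    return Counter(p.lower().strip() for p in phrases)


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse phrase vectors."""
    # Only phrases present in both documents contribute to the dot product.
    dot = sum(weight * vec_b[term] for term, weight in vec_a.items() if term in vec_b)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical documents, each represented by its topic phrases.
doc1 = ["document clustering", "vector space model", "similarity measure"]
doc2 = ["document clustering", "similarity measure", "k-means"]
doc3 = ["wireless networks", "media convergence"]

sim12 = cosine_similarity(phrase_vector(doc1), phrase_vector(doc2))
sim13 = cosine_similarity(phrase_vector(doc1), phrase_vector(doc3))
print(f"sim(doc1, doc2) = {sim12:.3f}")
print(f"sim(doc1, doc3) = {sim13:.3f}")
```

A pairwise similarity of this kind is what a Buckshot-style procedure would consume: HAC on a small sample of documents to obtain initial cluster centres, then K-means over the full set. The clustering step itself is left out of the sketch.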
Author ELdesoky, A.E.
Sakr, N.A.
Saleh, M.
Author_xml – sequence: 1
  givenname: A.E.
  surname: ELdesoky
  fullname: ELdesoky, A.E.
  organization: Dept. of Comput. & Syst., Mansoura Univ., Mansoura
– sequence: 2
  givenname: M.
  surname: Saleh
  fullname: Saleh, M.
  organization: Dept. of Comput. & Syst., King AbdulAziz Univ., Jeddah
– sequence: 3
  givenname: N.A.
  surname: Sakr
  fullname: Sakr, N.A.
  organization: Dept. of Comput. & Syst., Mansoura Univ., Mansoura
EISBN 9781424437788
1424437784
LCCN 2008912186
SubjectTerms Clustering algorithms
Clustering methods
Frequency
Humans
Information retrieval
Natural language processing
Organizing
Partitioning algorithms
Taxonomy
Text mining
URI https://ieeexplore.ieee.org/document/4907196