Improved Keyword Extraction for Persian Academic Texts Using RAKE Algorithm; Case Study: Persian Theses and Dissertations

Keywords and key phrases are subsets of most relevant words or phrases that summarize contents of a document while they play a critical role in information and document retrieval. Keyword extraction from scientific text is challenging and time-consuming due to the technical and multi-subject nature...

Bibliographic Details
Published in Pizhūhishnāmah-i pardāzish va mudiriyyat-i iṭṭilāʻāt (Online) Vol. 37; no. 1; pp. 197-228
Main Authors Mehrabi, Elaheh, Mohebi, Azadeh, Ahmadi, Abbas
Format Journal Article
Language Persian
Published Iranian Research Institute for Information and Technology 01.09.2021
Abstract Keywords and key phrases are the small set of most relevant words or phrases that summarize a document's content, and they play a critical role in information and document retrieval. Extracting keywords from scientific text is challenging and time-consuming because of the technical, multi-subject nature of the text, while the number of documents requiring keywords keeps increasing. Many algorithms and methods have been developed for automatic keyword extraction; Rapid Automatic Keyword Extraction (RAKE) is a popular one. RAKE is based on the observation that keywords generally contain multiple words and rarely include stopwords or other words with minimal lexical meaning. Candidate keywords are single-word or multi-word sequences, selected according to the scores that RAKE's scoring criteria assign to them. This research proposes a modified version of RAKE in which the candidate keyword scoring scheme is improved to increase precision and recall in keyword extraction. The proposed algorithm addresses some of the main weaknesses of RAKE, especially on Persian scientific documents. To study these weaknesses and to evaluate the modified version, a set of metadata from Persian theses and dissertations is used. The test and evaluation results confirm improvements in precision, recall, and F-measure. We study the effectiveness of RAKE in extracting keywords from Persian texts and find that it often extracts long phrases containing redundant words, leading to low accuracy. We examine the sources of scoring inefficiency in RAKE and propose an improved version with a novel scoring mechanism that overcomes some of the weaknesses of RAKE's original scoring for Persian texts and yields better results.
Our evaluations on a Persian corpus demonstrate that the improved RAKE algorithm outperforms the original by extracting more accurate keywords. Our results show that improved RAKE achieves, on average, more than 20% higher precision and recall than the original RAKE.
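As a rough illustration of the baseline the abstract describes, the following Python sketch implements the original RAKE idea: candidates are the word runs left after splitting at stopwords and punctuation, each word is scored as degree divided by frequency, and a candidate's score is the sum of its word scores. This is not the authors' modified scoring scheme, which the abstract does not specify. The function names, the tiny English stopword list, and the sample sentence are all assumptions made for illustration; a Persian run would need its own stopword list and tokenizer.

```python
import re
from collections import defaultdict

# Tiny illustrative English stopword list (an assumption, not from the paper).
STOPWORDS = {"a", "an", "and", "are", "for", "from", "in", "is",
             "of", "on", "the", "to", "with"}


def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text at punctuation and stopwords; the remaining word runs are candidates."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]+", text.lower()):
        current = []
        for word in re.findall(r"\w+", fragment):
            if word in stopwords:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_scores(text, stopwords=STOPWORDS):
    """Rank candidates by the original RAKE score: sum of degree(w) / frequency(w)."""
    phrases = candidate_phrases(text, stopwords)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Degree counts co-occurrences inside the phrase, the word itself included.
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    scores = {" ".join(p): sum(word_score[w] for w in p) for p in set(phrases)}
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def precision_recall_f1(extracted, gold):
    """Exact-match precision, recall, and F-measure against reference keywords."""
    extracted, gold = set(extracted), set(gold)
    hits = len(extracted & gold)
    p = hits / len(extracted) if extracted else 0.0
    r = hits / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


if __name__ == "__main__":
    sample = ("Keyword extraction from scientific text is challenging. "
              "Automatic keyword extraction is popular.")
    for phrase, score in rake_scores(sample):
        print(f"{score:4.1f}  {phrase}")
```

On the sample sentence, the three-word candidate "automatic keyword extraction" outranks the two-word "keyword extraction" because every added word raises both the degree of its neighbours and the number of terms summed. That bias toward longer phrases is consistent with the weakness the abstract reports for Persian texts.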
CorporateAuthor Amirkabir University of Technology; Tehran, Iran
Department of Industrial Engineering and Management Systems; Amirkabir University of Technology; Tehran, Iran
Faculty of Information Technology; Iranian Research Institute for Information Science and Technology (IranDoc); Tehran, Iran
DOI 10.52547/jipm.37.1.197
Discipline Library & Information Science
EISSN 2251-8231
ISSN 2251-8223
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
OpenAccessLink https://doaj.org/article/a4fe9e94c6eb4cc6bc566568d532f0dd
PageCount 32
SubjectTerms keyword extraction
natural language processing
part-of-speech tagging
Persian scientific document
RAKE algorithm