Improved Keyword Extraction for Persian Academic Texts Using RAKE Algorithm; Case Study: Persian Theses and Dissertations

Bibliographic Details
Published in	Pizhūhishnāmah-i pardāzish va mudiriyyat-i iṭṭilāʻāt (Online) Vol. 37; no. 1; pp. 197 - 228
Main Authors	Mehrabi, Elaheh; Mohebi, Azadeh; Ahmadi, Abbas
Format	Journal Article
Language	Persian
Published	Iranian Research Institute for Information and Technology 01.09.2021
Subjects	keyword extraction; natural language processing; part of speech tagging; Persian scientific document; RAKE algorithm
Abstract	Keywords and key phrases are subsets of the most relevant words or phrases that summarize the content of a document, and they play a critical role in information and document retrieval. Extracting keywords from scientific text is challenging and time-consuming because such text is technical and spans multiple subjects, while the number of documents that need keywords keeps growing. Various algorithms and methods have been developed for automatic keyword extraction, and Rapid Automatic Keyword Extraction (RAKE) is a popular one. RAKE rests on the observation that keywords generally contain multiple words and rarely include stopwords or words with minimal lexical meaning. Candidate keywords are single-word or multi-word sequences, selected according to the scores that RAKE’s scoring criteria assign to them. This research proposes a modified version of RAKE in which the candidate keyword scoring scheme is improved to increase precision and recall in keyword extraction. The proposed algorithm addresses some of the main weaknesses of RAKE, especially on Persian scientific documents. To study these weaknesses and to evaluate the modified version, a set of metadata from Persian theses and dissertations is used. The test and evaluation results confirm improvements in precision, recall and F-measure. We study the effectiveness of RAKE in extracting keywords from Persian texts and find that it often extracts long phrases containing redundant words, which leads to low accuracy. In this paper, we study the sources of RAKE’s scoring inefficiency and propose an improved version with a novel scoring mechanism. Our scoring mechanism overcomes some of the weaknesses of RAKE’s original scoring for Persian texts and yields better results.
Our evaluations on a Persian corpus demonstrate that the improved RAKE algorithm outperforms the original by extracting more accurate keywords. On average, the improved RAKE achieves more than 20% higher precision and recall than the original RAKE.
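For readers unfamiliar with the baseline, the following is a minimal sketch of the original RAKE scoring that the abstract refers to, not the authors' modified algorithm, whose scoring scheme is not given in this record. It assumes the commonly described rule in which each word is scored as degree divided by frequency and a candidate phrase receives the sum of its word scores. The stopword list, the function names (`candidate_phrases`, `rake_scores`, `precision_recall_f1`) and the exact-match evaluation are illustrative assumptions; a Persian pipeline would need a Persian stopword list and tokenizer.

```python
import re
from collections import defaultdict

# Hypothetical, tiny English stopword list for illustration only; Persian
# text would need a Persian stopword list and a suitable tokenizer.
STOPWORDS = {"a", "an", "and", "are", "for", "from", "in", "is", "of", "on", "the", "to", "with"}


def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    for fragment in re.split(r'[.,;:!?()\[\]"\n]+', text.lower()):
        current = []
        for word in fragment.split():
            if word in stopwords:
                # A stopword closes the current candidate phrase.
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_scores(text, stopwords=STOPWORDS):
    """Score each candidate as the sum of deg(w) / freq(w) over its words."""
    phrases = candidate_phrases(text, stopwords)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Degree counts co-occurrences in the phrase, the word itself included.
            degree[word] += len(phrase)
    return {
        " ".join(phrase): sum(degree[w] / freq[w] for w in phrase)
        for phrase in set(phrases)
    }


def precision_recall_f1(extracted, gold):
    """Exact-match precision, recall and F-measure (matching rule is an assumption)."""
    extracted, gold = set(extracted), set(gold)
    hits = len(extracted & gold)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1
```

Because a phrase score in this sketch is a sum over its words, longer candidates tend to accumulate higher scores, which is consistent with the abstract's observation that RAKE often returns long phrases with redundant words.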
CorporateAuthor	Amirkabir University of Technology, Tehran, Iran; Department of Industrial Engineering and Management Systems, Amirkabir University of Technology, Tehran, Iran; Faculty of Information Technology, Iranian Research Institute for Information Science and Technology (IranDoc), Tehran, Iran
DOI	10.52547/jipm.37.1.197
Discipline	Library & Information Science
EISSN	2251-8231
ISSN	2251-8223
IsDoiOpenAccess	true
IsOpenAccess	true
IsPeerReviewed	true
IsScholarly	true
OpenAccessLink	https://doaj.org/article/a4fe9e94c6eb4cc6bc566568d532f0dd
PageCount	32