Improved Keyword Extraction for Persian Academic Texts Using RAKE Algorithm; Case Study: Persian Theses and Dissertations
Published in | Pizhūhishnāmah-i pardāzish va mudiriyyat-i iṭṭilāʻāt (Online) Vol. 37; no. 1; pp. 197 - 228 |
---|---|
Main Authors | Mehrabi, Elaheh; Mohebi, Azadeh; Ahmadi, Abbas |
Format | Journal Article |
Language | Persian |
Published | Iranian Research Institute for Information and Technology, 01.09.2021 |
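Subjects | keyword extraction; natural language processing; part of speech tagging; Persian scientific document; RAKE algorithm |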
Abstract | Keywords and key phrases are small sets of the most relevant words or phrases that summarize the content of a document, and they play a critical role in information and document retrieval. Extracting keywords from scientific text is challenging and time-consuming because such text is technical and spans multiple subjects, while the number of documents that need keywords keeps growing. Many algorithms and methods have been developed for automatic keyword extraction, and Rapid Automatic Keyword Extraction (RAKE) is a popular one. RAKE relies on the observation that keywords generally contain multiple words and rarely include stopwords or other words with minimal lexical meaning. Candidate keywords are single-word or multi-word sequences, selected according to the scores that RAKE’s scoring criteria assign to them. This research proposes a modified version of RAKE whose candidate scoring scheme is improved to increase precision and recall in keyword extraction. The proposed algorithm addresses some of the main weaknesses of RAKE, especially on Persian scientific documents. To study those weaknesses and to evaluate the modified version, a set of metadata from Persian theses and dissertations is used. The study finds that on Persian texts RAKE often extracts long phrases containing redundant words, which leads to low accuracy. It examines the sources of this scoring inefficiency and proposes a novel scoring mechanism that overcomes some of the weaknesses of RAKE’s original scoring for Persian texts. Evaluation on the Persian corpus confirms improvements in precision, recall, and F-measure: the improved RAKE extracts more accurate keywords than the original and achieves more than 20% higher precision and recall on average. |
---|---|
Author | Mohebi, Azadeh; Ahmadi, Abbas; Mehrabi, Elaheh |
Author_xml | – sequence: 1; givenname: Elaheh; surname: Mehrabi; fullname: Mehrabi, Elaheh – sequence: 2; givenname: Azadeh; surname: Mohebi; fullname: Mohebi, Azadeh – sequence: 3; givenname: Abbas; surname: Ahmadi; fullname: Ahmadi, Abbas |
CitedBy_id | crossref_primary_10_1007_s42835_023_01704_8 |
ContentType | Journal Article |
CorporateAuthor | Amirkabir University of Technology, Tehran, Iran; Department of Industrial Engineering and Management Systems, Amirkabir University of Technology, Tehran, Iran; Faculty of Information Technology, Iranian Research Institute for Information Science and Technology (IranDoc), Tehran, Iran |
CorporateAuthor_xml | – name: Department of Industrial Engineering and Management Systems; Amirkabir University of Technology; Tehran, Iran – name: Faculty of Information Technology; Iranian Research Institute for Information Science and Technology (IranDoc); Tehran, Iran – name: Amirkabir University of Technology; Tehran, Iran |
DOI | 10.52547/jipm.37.1.197 |
DatabaseName | CrossRef DOAJ Directory of Open Access Journals |
DatabaseTitle | CrossRef |
Discipline | Library & Information Science |
EISSN | 2251-8231 |
EndPage | 228 |
ISSN | 2251-8223 |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | true |
IsScholarly | true |
Issue | 1 |
Language | Persian |
OpenAccessLink | https://doaj.org/article/a4fe9e94c6eb4cc6bc566568d532f0dd |
PageCount | 32 |
PublicationCentury | 2000 |
PublicationDate | 2021-09-01 |
PublicationDateYYYYMMDD | 2021-09-01 |
PublicationDate_xml | – month: 09 year: 2021 text: 2021-9-01 day: 01 |
PublicationDecade | 2020 |
PublicationTitle | Pizhūhishnāmah-i pardāzish va mudiriyyat-i iṭṭilāʻāt (Online) |
PublicationYear | 2021 |
Publisher | Iranian Research Institute for Information and Technology |
Publisher_xml | – name: Iranian Research Institute for Information and Technology |
SourceID | doaj crossref |
SourceType | Open Website Enrichment Source Index Database |
StartPage | 197 |
SubjectTerms | keyword extraction; natural language processing; part of speech tagging; Persian scientific document; RAKE algorithm |
Title | Improved Keyword Extraction for Persian Academic Texts Using RAKE Algorithm; Case Study: Persian Theses and Dissertations |
URI | https://doaj.org/article/a4fe9e94c6eb4cc6bc566568d532f0dd |
Volume | 37 |
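
The abstract describes baseline RAKE only in outline: candidates are word sequences that contain no stopwords, and each candidate receives a score. The following is a minimal sketch of that baseline as it is commonly formulated (word score = degree / frequency, phrase score = sum of its word scores). It is not the modified scoring proposed in the article, which this record does not specify. The stopword list is a toy English one chosen for illustration, the function names are hypothetical, and the exact-match evaluation in `precision_recall_f1` is an assumption, since the record does not say how extracted and reference keywords were matched.

```python
import re
from collections import defaultdict

# Toy English stopword list, for illustration only (an assumption);
# a Persian pipeline would need its own stopword list and tokenizer.
STOPWORDS = {"a", "an", "and", "are", "for", "from", "in", "is",
             "of", "on", "the", "to", "with"}


def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text at punctuation and stopwords; return phrases as word tuples."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()\[\]\n]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in stopwords:
                # A stopword closes the phrase collected so far.
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_scores(text, stopwords=STOPWORDS):
    """Score each candidate phrase as the sum of deg(w) / freq(w) over its words."""
    phrases = candidate_phrases(text, stopwords)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Degree counts co-occurrences inside the phrase, the word itself included.
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    return {" ".join(p): sum(word_score[w] for w in p) for p in set(phrases)}


def precision_recall_f1(extracted, gold):
    """Exact-match precision, recall and F-measure (matching rule is an assumption)."""
    extracted, gold = set(extracted), set(gold)
    hits = len(extracted & gold)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    sample = ("Keyword extraction from scientific text is challenging. "
              "Automatic keyword extraction is popular.")
    for phrase, score in sorted(rake_scores(sample).items(), key=lambda kv: -kv[1]):
        print(f"{score:4.1f}  {phrase}")
```

On this sample the three-word phrase "automatic keyword extraction" scores 8.0 while "keyword extraction" scores 5.0, because summing per-word scores favors longer candidates. That tendency is consistent with the weakness the abstract reports for Persian texts, where RAKE often returns long phrases with redundant words.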