A Decision Tree Based Approach for Pashto Coreference Resolution: The Case of Person Name Aliases

Bibliographic Details
Published in VFAST Transactions on Software Engineering, Vol. 13, no. 2, pp. 161-169
Main Authors Zuhra, Fatima Tuz; Ali, Hina; Naz, Surayya
Format Journal Article
Language English
Published 06.06.2025
ISSN 2411-6246
EISSN 2309-3978
DOI 10.21015/vtse.v13i2.2143

Abstract Coreference resolution is an important problem in fields such as natural language understanding, natural language generation, named entity recognition, text summarization, and anaphora resolution. Determining whether two proper nouns are aliases of each other (i.e., alias identification) is a classification problem: a binary classifier is needed that returns “Yes” if the two input nouns are aliases and “No” otherwise. This paper proposes a binary decision-tree-based classifier, augmented with a cosine similarity measure, for personal name alias identification in Pashto. The classifier is trained on alias records containing feature vectors. A total of 10,000 proper-noun pairs were extracted from the Pashto corpus and a collection of crawled Pashto text, and their features were recorded, yielding 10,000 example records with 12 attributes each. The dataset covers different genres of the corpus, e.g., novels, dramas, news, sports, letters, and essays, and contains 5,000 positive instances (class “Yes”) and 5,000 negative instances (class “No”). The records are divided into a training part and a testing part in a 7:3 ratio. The 7,000 training examples are used to induce the decision tree, which is built with RapidMiner, a data mining tool. First-order logic rules are then derived from the decision tree and transformed into an algorithm implemented in Python. The rules are evaluated on the testing part, which contains 3,000 labeled examples. Of these, 2,794 are classified correctly, an accuracy of approximately 93%. An error analysis of the roughly 7% of misclassified examples is performed to guide future improvements to the system.
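The abstract does not list the 12 attributes or the rules induced from the tree, so the following Python is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes cosine similarity is computed over character-bigram counts of the two names (the paper may vectorize differently), and `is_alias` is a hypothetical single-threshold rule standing in for the rule set derived from the decision tree; the threshold 0.5 is purely illustrative. The `accuracy` helper simply reproduces the arithmetic behind the reported figure.

```python
import math
from collections import Counter


def char_bigrams(name: str) -> Counter:
    """Count character bigrams of a name (assumed feature choice, not from the paper)."""
    s = name.strip()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector shares nothing with any other vector
    return dot / (norm_a * norm_b)


def is_alias(name1: str, name2: str, threshold: float = 0.5) -> str:
    """Hypothetical stand-in rule: 'Yes' when similarity reaches the threshold."""
    sim = cosine_similarity(char_bigrams(name1), char_bigrams(name2))
    return "Yes" if sim >= threshold else "No"


def accuracy(correct: int, total: int) -> float:
    """Fraction of test examples classified correctly."""
    return correct / total
```

With the counts given in the abstract, `accuracy(2794, 3000)` evaluates to about 0.931, which matches the reported accuracy of approximately 93%, and a 7:3 split of 10,000 records gives the stated 7,000 training and 3,000 testing examples.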
Author details
Zuhra, Fatima Tuz (ORCID 0000-0003-2427-9483)
Ali, Hina (ORCID 0009-0006-5690-5089)
Naz, Surayya (ORCID 0009-0004-9015-3488)
OpenAccessLink https://vfast.org/journals/index.php/VTSE/article/download/2143/1707
Score 1.9129322
Snippet Coreference resolution is an important problem in fields such as natural language understanding, natural language generation, named entity recognition, text...
SourceID crossref
SourceType Index Database
StartPage 161
Title A Decision Tree Based Approach for Pashto Coreference Resolution: The Case of Person Name Aliases
Volume 13
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
link http://utb.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwnV1La9wwEBbb9NJLSWlLH2nQoZdinFaWLNu9OUuWpZASiFtyM7IlkUCzKftIyB7yA_qrO9LYXmVpoenFLMIMWs_H6JvRPAh5D2dCq6XUcSqMjQUzKm64LuJPuc1lW9jCNi4OefxVTr-JL2fp2Wj0K8haWi2bg3b9x7qS_9EqrIFeXZXsAzQ7CIUF-A36hSdoGJ7_pOMSzAWOyImquTHRIRxJ2hFLLJNyGYQnanEO9HK8mSfiI_a4qz7lYqwwon_i6TdY3EsTlT8uYHURktfvk_K0wnboWA7hrxpOwZDfuPyxoLXhEI5enfs5RtEEdnypomq1HhCGddnTbny3D0irNSYKzdXtrQrjEUnq86bkxmwBJWCxTLrAosE1d48DzCe_Z3d5gK8kMKIM27N35zHDUS7bph5cVd8W43q5MAfXjF8ksIQdn-531d467YYcRPB-vIzaSai9hNpJeEQeJ1nmr_yP74562yREJkXqW7AOfxBvvb2Qj1vbCFhOQFeqXfK08zNoiaB5RkZm9pyokvaAoQ4w1AOG9oChABiKgKEBYOgGMJ8pwIU6uNArSxEu1MGFdnB5QarJUTWext2MjbjNOY-FAToJHKVxPFExyy04DJrbRhsOvoRQGTNACoXNC9YAOTepKuBQMCITBeda8pdkZ3Y1M68IlbJ13fLghQJ82pSpxCRCK9sU3CZaiNfkQ_9J6p_YSaX-mw7ePODdt-TJBoZ7ZGc5X5l3QBSXzb7X4G9DCWiq
linkProvider ISSN International Centre
openUrl ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=A+Decision+Tree+Based+Approach+for+Pashto+Coreference+Resolution%3A+The+Case+of+Person+Name+Aliases&rft.jtitle=VFAST+Transactions+on+Software+Engineering&rft.au=Zuhra%2C+Fatima+Tuz&rft.au=Ali%2C+Hina&rft.au=Naz%2C+Surayya&rft.date=2025-06-06&rft.issn=2411-6246&rft.eissn=2309-3978&rft.volume=13&rft.issue=2&rft.spage=161&rft.epage=169&rft_id=info:doi/10.21015%2Fvtse.v13i2.2143&rft.externalDBID=n%2Fa&rft.externalDocID=10_21015_vtse_v13i2_2143
thumbnail_l http://covers-cdn.summon.serialssolutions.com/index.aspx?isbn=/lc.gif&issn=2411-6246&client=summon
thumbnail_m http://covers-cdn.summon.serialssolutions.com/index.aspx?isbn=/mc.gif&issn=2411-6246&client=summon
thumbnail_s http://covers-cdn.summon.serialssolutions.com/index.aspx?isbn=/sc.gif&issn=2411-6246&client=summon