A Decision Tree Based Approach for Pashto Coreference Resolution: The Case of Person Name Aliases

Bibliographic Details
Published in: VFAST Transactions on Software Engineering, Vol. 13, no. 2, pp. 161-169
Main Authors: Zuhra, Fatima Tuz; Ali, Hina; Naz, Surayya
Format: Journal Article
Language: English
Published: 06.06.2025
ISSN: 2411-6246, 2309-3978
DOI: 10.21015/vtse.v13i2.2143

Summary: Coreference resolution is an important problem in fields such as natural language understanding, natural language generation, named entity recognition, text summarization, and anaphora resolution. Determining whether two proper nouns are aliases of each other (i.e., alias identification) is a classification problem: a binary classifier is needed that returns “Yes” if the two input nouns are aliases and “No” otherwise. This paper proposes a binary decision-tree classifier, augmented with a cosine similarity measure, for identifying personal name aliases in Pashto. The classifier is trained on alias records containing feature vectors. A total of 10,000 proper-noun pairs were extracted from the Pashto corpus and a collection of crawled Pashto text, and their features were recorded, yielding 10,000 example records with 12 attributes each. The dataset covers different genres of the corpus, e.g., novels, dramas, news, sports, letters, and essays, and contains 5,000 positive instances (class “Yes”) and 5,000 negative instances (class “No”). The records are divided into a training part and a testing part in a 7:3 ratio. The 7,000 training examples are used to induce the decision tree, which is built with RapidMiner, a data mining tool. First-order logic rules are then derived from the decision tree and transformed into an algorithm implemented in Python. The rules are evaluated on the testing part, which contains 3,000 labeled examples. Of these, 2,794 are classified correctly, an accuracy of approximately 93% (2794/3000 ≈ 93.1%). An error analysis of the remaining 7% of misclassified examples is performed to guide future improvements to the system.
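
The pipeline described in the summary can be illustrated with a minimal Python sketch. The summary does not list the 12 attributes or the rules derived from the tree, so everything specific here is an assumption made for illustration: the character-bigram representation, the single threshold rule in `is_alias`, the threshold value 0.5, the helper `split_7_3`, and the transliterated example names. Only the 7:3 split and the figures 2,794 and 3,000 come from the record itself.

```python
import math
import random
from collections import Counter


def char_bigrams(name: str) -> Counter:
    """Count character bigrams of a name (an assumed feature, not the paper's)."""
    padded = f" {name.strip()} "
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_alias(name_a: str, name_b: str, threshold: float = 0.5) -> str:
    """Toy one-rule classifier standing in for the tree-derived rules.

    The threshold is hypothetical; the paper's rules are not given in the summary.
    """
    sim = cosine_similarity(char_bigrams(name_a), char_bigrams(name_b))
    return "Yes" if sim >= threshold else "No"


def split_7_3(records: list, seed: int = 0) -> tuple[list, list]:
    """Shuffle a copy of the records and split it 7:3 into train and test."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10
    return shuffled[:cut], shuffled[cut:]


# Placeholder records: integers stand in for the 10,000 feature vectors.
train, test = split_7_3(list(range(10000)))
print(len(train), len(test))

# Accuracy as reported in the summary: 2,794 correct out of 3,000.
accuracy = 2794 / 3000
print(f"{accuracy:.1%}")

# Hypothetical transliterated name pair, for illustration only.
print(is_alias("Ahmad", "Ahmad Khan"))
```

A real decision tree would combine several attributes per pair rather than a single similarity threshold; the sketch only shows where a cosine similarity feature could enter such a classifier and how the reported split and accuracy figures fit together.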