Machine learning and ontology-based novel semantic document indexing for information retrieval

Bibliographic Details
Published in	Computers & industrial engineering Vol. 176; p. 108940
Main Authors	Sharma, Anil; Kumar, Suresh
Format	Journal Article
Language	English
Published	Elsevier Ltd 01.02.2023
Subjects	Computer science ontology; Concept extraction; Document indexing; Information retrieval; Machine learning; Natural language processing; Semantic web
Online Access	Get full text

Summary:	Highlights:
• Document key phrases are mapped to ontology concepts even when the ontology contains few or no related concepts.
• Application-specific concept feature weights are derived with the analytic hierarchy process.
• Term variations and synonyms of document concepts are mapped onto the domain ontology.
• Average F-measure improves by 25% over the state of the art.

The goal of information retrieval (IR) systems is to find, from a pool of information, the content most closely related to the user's information needs. However, conventional IR methods index documents by the words they contain and neglect semantic descriptions of document content. When users and indexing systems use different terms to express the same subject, a vocabulary gap emerges. To overcome this limitation and improve the effectiveness of IR systems, this paper introduces a novel hybrid semantic document indexing technique that combines machine learning with a domain ontology. The technique uses a machine learning model based on skip-gram with negative sampling, together with a domain ontology, to determine the concepts for annotating unstructured documents. The work also introduces a novel multi-feature concept ranking algorithm in which statistical, semantic, and scientific named entity features of a concept are used to assign a relevance weight to each annotation. The fuzzy analytical hierarchy process is used to derive the parameters of these feature weights. In the final step, the concepts are ranked according to their relevance to the document.

Five publicly accessible benchmark datasets from the computer science domain were used in a series of experiments to validate the presented method. The experiments showed that the proposed method outperforms state-of-the-art techniques on these datasets, improving average accuracy by 29% and F-measure by 25%. The improvement in average accuracy demonstrates that the proposed approach extracts document concepts more accurately than the state-of-the-art methods, even when the document and the domain ontology refer to the same concept by distinct terms. The improvement in F-measure demonstrates the system's ability to find similar concepts when a document contains no concept from the domain ontology; it is attributed to the high recall of the proposed indexing scheme, achieved while maintaining high accuracy.
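
Illustrative note: the summary describes deriving feature weights with an analytical hierarchy process and then ranking concepts by relevance. The sketch below is not the authors' implementation. It assumes the classic (crisp) AHP with the row geometric mean approximation in place of the fuzzy variant named in the summary, and a plain weighted sum as the relevance score. The comparison matrix, concept names, and feature scores are all invented for illustration.

```python
import math

# Hypothetical pairwise comparison matrix (Saaty 1-9 scale) over three
# feature groups, in this order: statistical, semantic, named entity.
# Entry [i][j] says how much more important feature i is than feature j.
PAIRWISE = [
    [1.0, 1 / 2, 2.0],
    [2.0, 1.0, 4.0],
    [1 / 2, 1 / 4, 1.0],
]


def ahp_weights(matrix):
    """Approximate the AHP priority vector with row geometric means."""
    n = len(matrix)
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def rank_concepts(concept_features, weights):
    """Sort concepts by the weighted sum of their feature scores, descending."""
    scored = {
        concept: sum(w * f for w, f in zip(weights, feats))
        for concept, feats in concept_features.items()
    }
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)


WEIGHTS = ahp_weights(PAIRWISE)

# Hypothetical feature scores per candidate concept, already scaled to [0, 1].
CONCEPTS = {
    "information retrieval": [0.9, 0.8, 1.0],
    "ontology": [0.6, 0.9, 0.0],
    "dataset": [0.7, 0.2, 0.0],
}
RANKED = rank_concepts(CONCEPTS, WEIGHTS)

for concept, score in RANKED:
    print(f"{concept}: {score:.3f}")
```

With this invented matrix the semantic feature receives the largest weight, so a concept with strong semantic evidence outranks one supported mainly by term statistics. A fuzzy AHP would replace each crisp comparison with a fuzzy number (for example a triangular one) before the weights are derived.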
ISSN:	0360-8352 1879-0550
DOI:	10.1016/j.cie.2022.108940