A scalable parallel Chinese online encyclopedia knowledge denoising method based on entry tags and Spark cluster

Bibliographic Details
Published in: Applied intelligence (Dordrecht, Netherlands), Vol. 51, no. 10, pp. 7573-7599
Main Authors: Wang, Ting; Li, Jie; Guo, Jiale
Format: Journal Article
Language: English
Published: New York, Springer US, 01.10.2021
Springer Nature B.V
Summary: Because of the open, collaborative nature of online encyclopedias, a large number of knowledge triples are improperly classified in online encyclopedia systems, so open-domain encyclopedia Knowledge Bases (KBs) must be denoised and refined to improve their quality and precision. However, the lack and inaccuracy of triple semantic features lead to a poor refinement effect. In addition, for large-scale encyclopedia KBs, processing massive amounts of knowledge leads to excessive computing time and poor algorithm scalability. To solve the problems of knowledge denoising in the Chinese encyclopedia system, first, based on data field theory, this paper proposes a new Cartesian product mapping-based method (TripleES) to calculate the semantic similarity of entity triples, based on which a method for quantifying the quality of entry tags is proposed. Second, to further improve the denoising effect on KBs, this paper proposes a new method (TriplePV) that computes the potential value of a triple: it uses a multi-feature fusion strategy to calculate the semantic distance between “out-of-vocabulary” entry tags and embeds that distance into the potential function. Third, to ensure that the algorithms scale well, the proposed denoising algorithms are implemented and optimized in parallel on the Spark cluster-computing framework. Specifically, Spark-based TripleES (ES_Spark) and Spark-based TriplePV (PV_Spark) algorithms are proposed to calculate the semantic similarity and the potential value of triples, respectively. Finally, a comprehensive comparative analysis of denoising effect and time efficiency is performed against the state-of-the-art distributed Chinese encyclopedia knowledge denoising algorithm. The experimental results on real-world datasets show that the parallel denoising algorithm proposed in this paper improves both the efficiency of knowledge denoising and the accuracy of KBs, outperforming the state-of-the-art methods.
ISSN: 0924-669X, 1573-7497
DOI: 10.1007/s10489-021-02295-5