A machine learning approach to query generation in plagiarism source retrieval

Plagiarism source retrieval is the core task of plagiarism detection. It has become the standard for plagiarism detection to use the queries extracted from suspicious documents to retrieve the plagiarism sources. Generating queries from a suspicious document is one of the most important steps in pla...

Full description

Saved in:
Bibliographic Details
Published inFrontiers of information technology & electronic engineering Vol. 18; no. 10; pp. 1556 - 1572
Main Authors Kong, Lei-lei, Lu, Zhi-mao, Qi, Hao-liang, Han, Zhong-yuan
Format Journal Article
LanguageEnglish
Published Hangzhou Zhejiang University Press 01.10.2017
Springer Nature B.V
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:Plagiarism source retrieval is the core task of plagiarism detection. It has become the standard for plagiarism detection to use the queries extracted from suspicious documents to retrieve the plagiarism sources. Generating queries from a suspicious document is one of the most important steps in plagiarism source retrieval. Heuristic-based query generation methods are widely used in the current research. Each heuristic-based method has its own advantages, and no one statistically outperforms the others on all suspicious document segments when generating queries for source retrieval. Further improvements on heuristic methods for source retrieval rely mainly on the experience of experts. This leads to difficulties in putting forward new heuristic methods that can overcome the shortcomings of the existing ones. This paper paves the way for a new statistical machine learning approach to select the best queries from the candidates. The statistical machine learning approach to query generation for source retrieval is formulated as a ranking framework. Specifically, it aims to achieve the optimal source retrieval performance for each suspicious document segment. The proposed method exploits learning to rank to generate queries from the candidates. To our knowledge, our work is the first research to apply machine learning methods to resolve the problem of query generation for source retrieval. To solve the essential problem of an absence of training data for learning to rank, the building of training samples for source retrieval is also conducted. We rigorously evaluate various aspects of the proposed method on the publicly available PAN source retrieval corpus. With respect to the established baselines, the experimental results show that applying our proposed query generation method based on machine learning yields statistically significant improvements over baselines in source retrieval effectiveness.
Bibliography:33-1389/TP
ISSN:2095-9184
2095-9230
DOI:10.1631/FITEE.1601344