A machine learning approach to query generation in plagiarism source retrieval

Bibliographic Details
Published in	Frontiers of information technology & electronic engineering Vol. 18; no. 10; pp. 1556 - 1572
Main Authors	Kong, Lei-lei, Lu, Zhi-mao, Qi, Hao-liang, Han, Zhong-yuan
Format	Journal Article
Language	English
Published	Hangzhou: Zhejiang University Press, 01.10.2017; Springer Nature B.V.
Subjects	Communications Engineering; Computer Hardware; Computer Science; Computer Systems Organization and Communication Networks; Documents; Electrical Engineering; Electronics and Microelectronics; Heuristic; Heuristic methods; Instrumentation; Machine learning; Methods; Networks; Plagiarism; Queries; Retrieval; Segments; Learning to rank; TP391.3; Source retrieval; Query generation; Plagiarism detection
Summary:	Plagiarism source retrieval is the core task of plagiarism detection. It has become the standard for plagiarism detection to use the queries extracted from suspicious documents to retrieve the plagiarism sources. Generating queries from a suspicious document is one of the most important steps in plagiarism source retrieval. Heuristic-based query generation methods are widely used in the current research. Each heuristic-based method has its own advantages, and no one statistically outperforms the others on all suspicious document segments when generating queries for source retrieval. Further improvements on heuristic methods for source retrieval rely mainly on the experience of experts. This leads to difficulties in putting forward new heuristic methods that can overcome the shortcomings of the existing ones. This paper paves the way for a new statistical machine learning approach to select the best queries from the candidates. The statistical machine learning approach to query generation for source retrieval is formulated as a ranking framework. Specifically, it aims to achieve the optimal source retrieval performance for each suspicious document segment. The proposed method exploits learning to rank to generate queries from the candidates. To our knowledge, our work is the first research to apply machine learning methods to resolve the problem of query generation for source retrieval. To solve the essential problem of an absence of training data for learning to rank, the building of training samples for source retrieval is also conducted. We rigorously evaluate various aspects of the proposed method on the publicly available PAN source retrieval corpus. With respect to the established baselines, the experimental results show that applying our proposed query generation method based on machine learning yields statistically significant improvements over baselines in source retrieval effectiveness.
Bibliography:	33-1389/TP
ISSN:	2095-9184 2095-9230
DOI:	10.1631/FITEE.1601344
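
Illustrative sketch (not taken from the article): the Summary describes query generation as a ranking problem in which a learned model picks the best queries from a pool of candidates for each suspicious document segment. The record does not say which learning-to-rank algorithm or features the authors use, so the example below substitutes a simple pairwise perceptron over a linear scoring function. The two-dimensional feature vectors, the quality labels (standing in for the retrieval performance a query achieved), and the names `train_pairwise_ranker`, `select_queries`, `training_segments`, and `new_candidates` are all hypothetical.

```python
from itertools import combinations


def score(weights, features):
    """Linear score of one candidate query."""
    return sum(w * x for w, x in zip(weights, features))


def train_pairwise_ranker(segments, epochs=50, lr=0.1):
    """Learn linear weights from within-segment preference pairs.

    segments: list of segments; each segment is a list of
    (feature_vector, quality) tuples, one per candidate query.
    Quality is an assumed label, e.g. the retrieval performance
    observed when that query was submitted.
    """
    dim = len(segments[0][0][0])
    weights = [0.0] * dim
    for _ in range(epochs):
        for candidates in segments:
            # Only compare queries generated from the same segment.
            for (xa, ya), (xb, yb) in combinations(candidates, 2):
                if ya == yb:
                    continue  # no preference between equally good queries
                better, worse = (xa, xb) if ya > yb else (xb, xa)
                diff = [p - q for p, q in zip(better, worse)]
                # Perceptron update when the pair is ranked wrongly (or tied).
                if score(weights, diff) <= 0:
                    weights = [w + lr * d for w, d in zip(weights, diff)]
    return weights


def select_queries(weights, candidates, k=1):
    """Return the k candidate queries with the highest learned score.

    candidates: list of (query_text, feature_vector) tuples.
    """
    ranked = sorted(candidates, key=lambda c: score(weights, c[1]), reverse=True)
    return [query for query, _ in ranked[:k]]


# Hypothetical training data: two segments, each with candidate queries
# described by two made-up features and an assumed quality label.
training_segments = [
    [([0.9, 0.2], 0.8), ([0.4, 0.6], 0.3), ([0.1, 0.9], 0.0)],
    [([0.7, 0.5], 0.6), ([0.2, 0.4], 0.1)],
]

# Hypothetical candidates for an unseen segment.
new_candidates = [
    ("q_a", [0.2, 0.8]),
    ("q_b", [0.8, 0.3]),
    ("q_c", [0.5, 0.5]),
]

weights = train_pairwise_ranker(training_segments)
selected = select_queries(weights, new_candidates, k=2)
```

In this toy setup the candidates could come from several heuristic generators, with the ranker deciding per segment which of their outputs to submit. Building the quality labels corresponds to the training-sample construction step the Summary mentions, which is not reproduced here.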