Two improvements to detect duplicates in Stack Overflow

Bibliographic Details
Published in	2017 IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER) pp. 563 - 564
Main Authors	Mizobuchi, Yuji; Takayama, Kuniharu
Format	Conference Proceeding
Language	English
Published	IEEE 01.02.2017
Subjects	Computational modeling; Conferences; Data mining; duplicate question; Feature extraction; HTML; information retrieval; machine learning; Mathematical model; Software; Stack Overflow; word-embedding
Summary:	Stack Overflow is one of the most popular question-and-answer sites for programmers. However, it contains a large number of duplicate questions, which should be detected automatically and quickly. In this paper, we introduce two approaches to improve detection accuracy: splitting the question body into different types of data, and using word embeddings to handle word ambiguities that are not covered by general corpora. The evaluation shows that these approaches improve accuracy compared with the traditional method.
DOI:	10.1109/SANER.2017.7884678
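
Illustrative note: the summary's first approach, splitting a question body into different types of data, can be sketched as follows. This is a minimal sketch and not the authors' implementation. It assumes the body is HTML and that the two data types of interest are code (text inside <code> elements) and prose (everything else); the record does not say which types the paper uses. The names BodySplitter and split_body are hypothetical.

```python
from html.parser import HTMLParser


class BodySplitter(HTMLParser):
    """Separate text inside <code> elements from the surrounding prose.

    Assumption: "different types of data" means code vs. prose.
    """

    def __init__(self):
        super().__init__()
        self.code_depth = 0  # greater than 0 while inside a <code> element
        self.prose = []
        self.code = []

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self.code_depth += 1

    def handle_endtag(self, tag):
        if tag == "code" and self.code_depth > 0:
            self.code_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Route each text fragment to the bucket for its data type.
            (self.code if self.code_depth else self.prose).append(text)


def split_body(html):
    """Return the prose and code parts of an HTML question body."""
    parser = BodySplitter()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return {"prose": " ".join(parser.prose), "code": "\n".join(parser.code)}
```

Each part could then be turned into features separately, for example by comparing the prose of two questions with one similarity measure and their code with another, before a classifier decides whether the pair is a duplicate. That downstream step is likewise an assumption here, not something stated in the record.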