A context-aware approach to detection of short irrelevant texts

Bibliographic Details
Published in 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 1-10
Main Authors Sihong Xie, Jing Wang, Mohammad S. Amin, Baoshi Yan, Anmol Bhasin, Clement Yu, Philip S. Yu
Format Conference Proceeding
Language English
Published IEEE 01.10.2015
Summary: This paper presents a simple and effective framework for detecting irrelevant short texts, such as comments posted after blog and news articles, in a context-aware and timely fashion. Websites such as Linkedin.com and CNN.com allow visitors to leave comments after articles, and spammers exploit this feature to post irrelevant content. Because these websites are visited by millions of readers per day and have extremely high visibility, irrelevant comments have a detrimental effect on their traffic and revenue. It is therefore critical to eliminate these comments as accurately and as early as possible. Unlike traditional text mining tasks, comments following news and blog articles are characterized by brevity and context-dependent semantics, which makes semantic relevance difficult to measure. Worse, there may be only a handful of comments soon after an article is posted, leaving too little information to measure semantics and relevance. We propose to infer "context-aware semantics" to address these challenges in a unified framework. Specifically, we construct contexts for comments using either blocks of surrounding comments or comments collected via a principled transfer learning approach. The constructed contexts mitigate sparseness and sharply define the context-dependent semantics of comments, even at the early stage of commenting activity, allowing traditional dimension reduction methods to better capture the semantics of short texts in a context-aware way. We confirm the effectiveness of the proposed method on two real-world datasets consisting of news and blog articles and their comments, with a maximum improvement of 20% in area under the precision-recall curve.
ISBN: 9781467382724, 1467382728
DOI: 10.1109/DSAA.2015.7344831
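
Illustrative sketch: the summary describes building a context for each comment from a block of surrounding comments and then measuring how relevant the comment is to the article. The Python below is a minimal sketch of that idea under stated assumptions, not the authors' implementation. It substitutes a plain TF-IDF cosine similarity for the dimension reduction step mentioned in the summary, and it omits the transfer learning variant entirely. All names (`build_context`, `relevance_scores`, `block_size`) and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_context(comments, index, block_size=2):
    """Return the block of comments surrounding comments[index], inclusive.

    Assumption: a 'block' is a fixed-size window over the comment thread.
    """
    start = max(0, index - block_size)
    end = min(len(comments), index + block_size + 1)
    return comments[start:end]


def idf_weights(docs):
    """Smoothed inverse document frequency computed only over the given docs."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(text, idf):
    """Sparse TF-IDF vector of a text as a dict; unknown terms get weight 0."""
    tf = Counter(tokenize(text))
    return {term: freq * idf.get(term, 0.0) for term, freq in tf.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def relevance_scores(article, comments, block_size=2):
    """Score each comment against the article, weighting terms by its own context."""
    scores = []
    for i, comment in enumerate(comments):
        context = build_context(comments, i, block_size)
        idf = idf_weights([article] + context)  # term weights depend on the context
        scores.append(cosine(tfidf(comment, idf), tfidf(article, idf)))
    return scores


# Hypothetical sample data, not taken from the paper's datasets.
article = "The new phone battery lasts two days on a single charge"
comments = [
    "Great battery life on this phone",
    "Buy cheap watches at my site",
    "I charge my phone every two days",
]
scores = relevance_scores(article, comments, block_size=1)
for comment, score in zip(comments, scores):
    print(f"{score:.3f}  {comment}")
```

In this toy thread the second comment shares no words with the article, so its score is 0.0 and it would be the one flagged as irrelevant. Word overlap of this kind is exactly what breaks down for very short comments, which is the problem the paper addresses by applying dimension reduction to the constructed contexts.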