HASKER: An efficient algorithm for string kernels. Application to polarity classification in various languages

Bibliographic Details
Published in Procedia Computer Science, Vol. 112, pp. 1755-1763
Main Authors Popescu, Marius; Grozea, Cristian; Tudor Ionescu, Radu
Format Journal Article
Language English
Published Elsevier B.V. 2017
Summary: String kernels have successfully been used for various NLP tasks, ranging from text categorization by topic to native language identification. In this paper, we present a simple and efficient algorithm for computing various spectrum string kernels. When comparing two strings, we store the p-grams in the first string into a hash table, and then we apply a hash table lookup for the p-grams that occur in the second string. In terms of time, we show that our algorithm can outperform a state-of-the-art tool for computing string similarity. In terms of accuracy, we show that our approach can reach state-of-the-art performance for polarity classification in various languages. Our efficient implementation is provided online for free at http://string-kernels.herokuapp.com.
ISSN: 1877-0509
DOI:10.1016/j.procs.2017.08.207
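
The hash-table idea described in the summary can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `spectrum_kernel`, its parameters, and the choice of the plain (count-based) p-spectrum kernel are assumptions made here for illustration. The p-grams of the first string are stored with their counts in a hash table, and each p-gram of the second string is then looked up in that table.

```python
from collections import Counter


def spectrum_kernel(s: str, t: str, p: int) -> int:
    """Count-based p-spectrum kernel (hypothetical helper, not the paper's code).

    Returns the sum, over all distinct p-grams v, of
    (occurrences of v in s) * (occurrences of v in t).
    """
    if p <= 0:
        raise ValueError("p must be a positive integer")

    # Step 1: store the p-grams of the first string in a hash table,
    # mapping each p-gram to its number of occurrences.
    table = Counter(s[i:i + p] for i in range(len(s) - p + 1))

    # Step 2: look up every p-gram occurrence of the second string.
    # Adding the stored count once per occurrence in t yields the
    # product of the two occurrence counts, summed over shared p-grams.
    # A Counter returns 0 for p-grams that are absent from the table.
    return sum(table[t[j:j + p]] for j in range(len(t) - p + 1))


if __name__ == "__main__":
    # "abab" contains "ab" twice and "ba" once; "ab" contains "ab" once.
    print(spectrum_kernel("abab", "ab", 2))    # 2
    print(spectrum_kernel("abab", "abab", 2))  # 2*2 + 1*1 = 5
```

With expected constant-time hash lookups, this sketch does work proportional to the number of p-grams in the two strings, plus the cost of hashing each p-gram. Other spectrum variants mentioned in the summary would change only how the stored and looked-up counts are combined.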