Learning similarity functions for binary strings via genetic programming

Bibliographic Details
Published in: 2016 International Conference on Advanced Computer Science and Information Systems (ICACSIS), pp. 476-483
Main Authors: Pebriadi, Muhammad Syahid; Dewanto, Vektor; Kusuma, Wisnu Ananta; Afendi, Farit Mochamad; Heryanto, Rudi
Format: Conference Proceeding
Language: English
Published: IEEE, 01.10.2016
Summary: Data that encode the presence of certain characteristics can typically be represented as binary strings. Classifying or clustering such data requires similarity functions for binary strings. Existing similarity functions, however, do not take advantage of training data, which are often available. We believe that similarity functions should be data-specific. To this end, we use genetic programming (GP) to learn similarity functions from training data. We propose a novel fitness function that considers five aspects of good similarity functions, namely recall, magnitude, zero-division, identity, and symmetry. We also report the most frequently used mathematical operators, identified through an extensive literature review. Experimental results show that GP-based similarity functions outperform the well-known Tanimoto function on most datasets in terms of classification accuracy using SVMs. In addition, the GP-based similarity functions are simpler: they use fewer operators and operands. This suggests that our proposed fitness function for GP is justifiable for learning similarity functions.
DOI: 10.1109/ICACSIS.2016.7872773
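
Illustration: the summary names the Tanimoto function as the baseline and lists identity, symmetry, and zero-division among the aspects of a good similarity function. The Python sketch below is not taken from the paper. It shows the standard Tanimoto coefficient on binary strings, a / (a + b + c), together with two simple property checks. The helper names (`counts`, `is_symmetric`, `satisfies_identity`) and the convention of returning 1.0 for two all-zero strings are assumptions made for illustration, and the checks are only one plausible reading of the property names, not the paper's fitness function.

```python
from typing import Callable, Iterable, Tuple


def counts(x: str, y: str) -> Tuple[int, int, int, int]:
    """Return (a, b, c, d) for two equal-length binary strings.

    a: positions where both strings have '1'
    b: positions where only x has '1'
    c: positions where only y has '1'
    d: positions where both strings have '0'
    """
    if len(x) != len(y):
        raise ValueError("binary strings must have equal length")
    a = sum(1 for p, q in zip(x, y) if p == "1" and q == "1")
    b = sum(1 for p, q in zip(x, y) if p == "1" and q == "0")
    c = sum(1 for p, q in zip(x, y) if p == "0" and q == "1")
    d = sum(1 for p, q in zip(x, y) if p == "0" and q == "0")
    return a, b, c, d


def tanimoto(x: str, y: str) -> float:
    """Tanimoto coefficient a / (a + b + c) for binary strings."""
    a, b, c, _ = counts(x, y)
    denom = a + b + c
    # Assumed convention: two all-zero strings count as identical,
    # which also avoids a division by zero.
    return 1.0 if denom == 0 else a / denom


Similarity = Callable[[str, str], float]


def is_symmetric(sim: Similarity, pairs: Iterable[Tuple[str, str]]) -> bool:
    """Check sim(x, y) == sim(y, x) on the given pairs."""
    return all(abs(sim(x, y) - sim(y, x)) < 1e-12 for x, y in pairs)


def satisfies_identity(sim: Similarity, strings: Iterable[str]) -> bool:
    """Check sim(x, x) == 1 on the given strings."""
    return all(abs(sim(s, s) - 1.0) < 1e-12 for s in strings)


if __name__ == "__main__":
    print(tanimoto("1100", "1010"))
    print(is_symmetric(tanimoto, [("1100", "1010"), ("0000", "1111")]))
    print(satisfies_identity(tanimoto, ["1100", "0000"]))
```

A learned similarity function could be passed to the same two checks in place of `tanimoto`, since they only assume a callable that takes two binary strings and returns a number.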