Learning similarity functions for binary strings via genetic programming

Bibliographic Details
Published in	2016 International Conference on Advanced Computer Science and Information Systems (ICACSIS), pp. 476-483
Main Authors	Pebriadi, Muhammad Syahid; Dewanto, Vektor; Kusuma, Wisnu Ananta; Afendi, Farit Mochamad; Heryanto, Rudi
Format	Conference Proceeding
Language	English
Published	IEEE 01.10.2016
Subjects	Bibliographies; Face; Genetic programming; Measurement; Sociology; Statistics; Training data
Summary:	Data that encode the presence of certain characteristics can typically be represented as binary strings. Classifying or clustering such strings requires a similarity function. Existing similarity functions, however, do not take advantage of training data, which are often available. We believe that similarity functions should be data-specific. To this end, we use genetic programming (GP) to learn similarity functions from training data. We propose a novel fitness function that considers five aspects of a good similarity function: recall, magnitude, zero-division, identity and symmetry. We also report the most frequently used mathematical operators, identified through an extensive literature review. Experimental results show that GP-based similarity functions outperform the well-known Tanimoto function on most datasets in terms of classification accuracy using SVMs. In addition, the GP-based similarity functions are simpler: they use fewer operators and operands. This suggests that our proposed fitness function for GP is justifiable for learning similarity functions.
DOI:	10.1109/ICACSIS.2016.7872773
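
Note: The Tanimoto function that the summary uses as a baseline can be sketched as follows, under stated assumptions. This is a minimal illustration, not code from the paper. The function name `tanimoto`, the representation of binary strings as Python `str` values of "0" and "1", and the choice to return 1.0 for two all-zero strings are all assumptions made here. The closing checks are only an informal reading of three of the five aspects named in the summary (identity, symmetry, zero-division); the record does not give the paper's own definitions.

```python
def tanimoto(x: str, y: str) -> float:
    """Tanimoto similarity for two equal-length binary strings."""
    if len(x) != len(y):
        raise ValueError("binary strings must have equal length")
    # a: positions set in both strings; b: set only in x; c: set only in y
    a = sum(1 for p, q in zip(x, y) if p == "1" and q == "1")
    b = sum(1 for p, q in zip(x, y) if p == "1" and q == "0")
    c = sum(1 for p, q in zip(x, y) if p == "0" and q == "1")
    denom = a + b + c
    # Assumption: two all-zero strings count as identical, which also
    # avoids a division by zero.
    return 1.0 if denom == 0 else a / denom


# Informal checks (hypothetical, not the paper's fitness function):
s, t = "1100", "1010"
identity_ok = tanimoto(s, s) == 1.0            # a string matches itself
symmetry_ok = tanimoto(s, t) == tanimoto(t, s)  # argument order is irrelevant
zero_div_ok = tanimoto("0000", "0000") == 1.0   # no ZeroDivisionError
print(tanimoto(s, t), identity_ok, symmetry_ok, zero_div_ok)
```

For "1100" and "1010", one position is set in both strings and two positions are set in exactly one, so the similarity is 1/3.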