Learning similarity functions for binary strings via genetic programming
Published in | 2016 International Conference on Advanced Computer Science and Information Systems (ICACSIS) pp. 476 - 483 |
---|---|
Main Authors | , , , , |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 01.10.2016 |
Summary: | Data that encode the presence of certain characteristics can typically be represented as binary strings. Similarity functions for binary strings are needed in order to classify or cluster such data. Existing similarity functions, however, do not take advantage of training data, which are often available. We believe that similarity functions should be data-specific. To this end, we use genetic programming (GP) to learn similarity functions from training data. We propose a novel fitness function that considers five aspects of a good similarity function: recall, magnitude, zero-division, identity, and symmetry. We also report the most frequently used mathematical operators, identified through an extensive literature review. Experimental results show that GP-based similarity functions outperform the well-known Tanimoto function on most datasets in terms of classification accuracy using SVMs. In addition, the GP-based similarity functions are simpler: they use fewer operators and operands. This suggests that our proposed fitness function for GP is justifiable for learning similarity functions. |
---|---|
DOI: | 10.1109/ICACSIS.2016.7872773 |
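
For readers unfamiliar with the baseline named in the summary, the following is a minimal sketch of the Tanimoto similarity for binary strings, together with a simple check of two of the properties the summary lists (identity and symmetry). It is not the paper's fitness function, whose definition is not given in this record. The names `tanimoto` and `check_properties`, the sample strings, and the convention that two all-zero strings have similarity 1.0 are assumptions made for illustration.

```python
def tanimoto(x: str, y: str) -> float:
    """Tanimoto similarity of two equal-length binary strings.

    Computed as (bits set in both) / (bits set in at least one).
    """
    if len(x) != len(y):
        raise ValueError("binary strings must have the same length")
    both = sum(a == "1" and b == "1" for a, b in zip(x, y))
    either = sum(a == "1" or b == "1" for a, b in zip(x, y))
    # Assumed convention: two all-zero strings count as identical,
    # which also avoids a division by zero.
    return 1.0 if either == 0 else both / either


def check_properties(sim, samples):
    """Report whether `sim` satisfies identity and symmetry on `samples`.

    Illustrative only; this is not the fitness function from the paper.
    """
    identity = all(sim(s, s) == 1.0 for s in samples)
    symmetry = all(sim(a, b) == sim(b, a) for a in samples for b in samples)
    return {"identity": identity, "symmetry": symmetry}


samples = ["1100", "1010", "0000", "1111"]
print(tanimoto("1100", "1010"))  # one shared set bit out of three set bits
print(check_properties(tanimoto, samples))
```

A learned similarity function could be passed to `check_properties` in place of `tanimoto` to test the same two properties on a sample of strings.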