Predicting protein-protein interactions in unbalanced data using the primary structure of proteins

Bibliographic Details
Published in	BMC Bioinformatics, Vol. 11, no. 1, p. 167
Main Authors	Yu, Chi-Yuan; Chou, Lih-Ching; Chang, Darby Tien-Hao
Format	Journal Article
Language	English
Published	England: BioMed Central Ltd, 02.04.2010
Subjects	Algorithms; Amino Acid Sequence; Amino acids; Binding Sites; Bioinformatics; Classification; Computational biology; Databases, Protein; Dealing; Experiments; Handling; Mathematical analysis; Methods; Protein Interaction Mapping - methods; Protein-protein interactions; Proteins; Proteins - chemistry; Proteins - metabolism; Proteomics - methods; Research article; Sequence Analysis, Protein; Structure; Studies; Tasks; Vectors (mathematics); Taiwan

More Information
Summary:	Elucidating protein-protein interactions (PPIs) is essential to constructing protein interaction networks and to understanding the general principles of biological systems. Previous studies have revealed that interacting protein pairs can be predicted from their primary structure. Most of these approaches have achieved satisfactory performance on datasets comprising equal numbers of interacting and non-interacting protein pairs. However, this ratio is highly unbalanced in nature, and these techniques have not been comprehensively evaluated with respect to the effect of the large number of non-interacting pairs in realistic datasets. Moreover, since highly unbalanced distributions usually lead to large datasets, more efficient predictors are needed for such challenging tasks. This study presents a method for PPI prediction based only on sequence information and makes three contributions. First, we propose a probability-based mechanism for transforming protein sequences into feature vectors. Second, the proposed predictor is built on an efficient classification algorithm, since efficiency is essential for handling highly unbalanced datasets. Third, the proposed PPI predictor is assessed on several unbalanced datasets with different positive-to-negative ratios (from 1:1 to 1:15). This analysis provides solid evidence that the degree of dataset imbalance matters to PPI predictors. Dealing with data imbalance is a key issue in PPI prediction, since there are far fewer interacting protein pairs than non-interacting ones. This article provides a comprehensive study of this issue and develops a practical tool that achieves both good prediction performance and efficiency using only protein sequence information.
ISSN:	1471-2105
DOI:	10.1186/1471-2105-11-167
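
Note on the summary's imbalance claim: the following is a minimal arithmetic sketch, not the authors' method or results, of why the positive-to-negative ratio can matter when a predictor is evaluated. It assumes a hypothetical classifier whose sensitivity and specificity stay fixed at 0.9 (values chosen only for illustration) and computes its precision at a few ratios within the 1:1 to 1:15 range named in the summary. The intermediate ratios 1:5 and 1:10 and the function names are likewise assumptions.

```python
def precision_at_ratio(sensitivity, specificity, negatives_per_positive):
    """Precision on a dataset with a 1:N positive-to-negative ratio.

    Works per positive example: each positive contributes `sensitivity`
    true positives, and the N negatives contribute (1 - specificity) * N
    false positives.
    """
    true_pos = sensitivity
    false_pos = (1.0 - specificity) * negatives_per_positive
    return true_pos / (true_pos + false_pos)


def trivial_accuracy(negatives_per_positive):
    """Accuracy of a predictor that labels every pair as non-interacting."""
    return negatives_per_positive / (1.0 + negatives_per_positive)


# Hypothetical operating point, for illustration only (not from the paper).
SENSITIVITY = 0.9
SPECIFICITY = 0.9

if __name__ == "__main__":
    for n in (1, 5, 10, 15):  # 1:1 and 1:15 are the endpoints in the summary
        p = precision_at_ratio(SENSITIVITY, SPECIFICITY, n)
        a = trivial_accuracy(n)
        print(f"1:{n:<2d}  precision={p:.3f}  all-negative accuracy={a:.3f}")
```

Under these assumed values, precision drops as negatives are added even though the classifier itself is unchanged, while a predictor that rejects every pair looks increasingly accurate. This is one reason results on balanced 1:1 datasets may not carry over to realistic ratios.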