Protein–protein interaction sites prediction by ensemble random forests with synthetic minority oversampling technique

Abstract Motivation The prediction of protein–protein interaction (PPI) sites is a key to mutation design, catalytic reaction and the reconstruction of PPI networks. It is a challenging task considering the significant abundant sequences and the imbalance issue in samples. Results A new ensemble lea...

Full description

Saved in:
Bibliographic Details
Published inBioinformatics Vol. 35; no. 14; pp. 2395 - 2402
Main Authors Wang, Xiaoying, Yu, Bin, Ma, Anjun, Chen, Cheng, Liu, Bingqiang, Ma, Qin
Format Journal Article
LanguageEnglish
Published England Oxford University Press 15.07.2019
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:Abstract Motivation The prediction of protein–protein interaction (PPI) sites is a key to mutation design, catalytic reaction and the reconstruction of PPI networks. It is a challenging task considering the significant abundant sequences and the imbalance issue in samples. Results A new ensemble learning-based method, Ensemble Learning of synthetic minority oversampling technique (SMOTE) for Unbalancing samples and RF algorithm (EL-SMURF), was proposed for PPI sites prediction in this study. The sequence profile feature and the residue evolution rates were combined for feature extraction of neighboring residues using a sliding window, and the SMOTE was applied to oversample interface residues in the feature space for the imbalance problem. The Multi-dimensional Scaling feature selection method was implemented to reduce feature redundancy and subset selection. Finally, the Random Forest classifiers were applied to build the ensemble learning model, and the optimal feature vectors were inserted into EL-SMURF to predict PPI sites. The performance validation of EL-SMURF on two independent validation datasets showed 77.1% and 77.7% accuracy, which were 6.2–15.7% and 6.1–18.9% higher than the other existing tools, respectively. Availability and implementation The source codes and data used in this study are publicly available at http://github.com/QUST-AIBBDRC/EL-SMURF/. Supplementary information Supplementary data are available at Bioinformatics online.
Bibliography:ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 23
The authors wish it to be known that, in their opinion, Xiaoying Wang and Bin Yu authors should be regarded as Joint First Authors.
ISSN:1367-4803
1367-4811
1460-2059
1367-4811
DOI:10.1093/bioinformatics/bty995