UsIL-6: An unbalanced learning strategy for identifying IL-6 inducing peptides by undersampling technique

Bibliographic Details
Published in Computer methods and programs in biomedicine Vol. 250; p. 108176
Main Authors Liao, Yan-hong, Chen, Shou-zhi, Bin, Yan-nan, Zhao, Jian-ping, Feng, Xin-long, Zheng, Chun-hou
Format Journal Article
Language English
Published Ireland Elsevier B.V. 01.06.2024
Summary:
•We propose a bioinformatics tool (UsIL-6) for accurately identifying IL-6 inducing peptides.
•The model is based on the NearMiss3 undersampling technique, the Boruta feature selection method, and the extremely randomized trees (extra-trees) classification algorithm.
•To explain how each feature relates to the prediction results and to the positive or negative class prediction, we used the SHapley Additive exPlanations (SHAP) framework to interpret the output of the ML classifier.
•UsIL-6 achieved 0.870 AUC and 0.808 BACC on the independent test dataset, outperforming the state-of-the-art models.

Interleukin-6 (IL-6) is a critical factor for early warning, monitoring, and prognosis in the inflammatory storm of COVID-19 cases. IL-6 inducing peptides, which can induce production of the cytokine IL-6, are very important for the development of diagnosis and immunotherapy. Although existing methods have had some success in predicting IL-6 inducing peptides, their performance in practical application still leaves room for improvement. In this study, we proposed UsIL-6, a high-performance bioinformatics tool for identifying IL-6 inducing peptides. First, we extracted five groups of physicochemical properties and sequence structural information from IL-6 inducing peptide sequences and obtained a 636-dimensional feature vector. We then processed the data with the NearMiss3 undersampling method and the StandardScaler normalization method. Next, a 40-dimensional optimal feature vector was obtained by the Boruta feature selection method. Finally, we combined this feature vector with an extremely randomized trees classifier to build the final model, UsIL-6. On the independent test dataset, UsIL-6 achieved an AUC of 0.870 and a BACC of 0.808, which indicated that UsIL-6 performed better than the existing methods in IL-6 inducing peptide recognition. The performance comparison on the independent test dataset confirmed that UsIL-6 achieved the highest performance, the best robustness, and the best generalization ability among the compared methods. We hope that UsIL-6 will become a valuable method to identify, annotate, and characterize new IL-6 inducing peptides.
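
To make two of the terms in the summary concrete, here is a minimal standard-library sketch of NearMiss-3 style undersampling and of balanced accuracy (BACC). It is not the authors' code: the record does not say which implementation they used, and the function names, the parameters `m_shortlist` and `k_neighbors`, and the toy data are all hypothetical choices made for illustration. The undersampling follows the commonly described two-step rule: shortlist the majority samples nearest to each minority sample, then keep the shortlisted ones whose mean distance to their nearest minority samples is largest.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearmiss3(majority, minority, n_keep, m_shortlist=3, k_neighbors=3):
    """Return sorted indices of majority samples kept by a NearMiss-3 style rule.

    Step 1: for every minority sample, shortlist its `m_shortlist` nearest
            majority samples.
    Step 2: among the shortlisted samples, keep the `n_keep` whose mean
            distance to their `k_neighbors` nearest minority samples is largest.
    Parameter names are illustrative assumptions, not a library API.
    """
    shortlist = set()
    for p in minority:
        ranked = sorted(range(len(majority)),
                        key=lambda i: euclidean(majority[i], p))
        shortlist.update(ranked[:m_shortlist])

    def mean_dist_to_nearest_minority(i):
        dists = sorted(euclidean(majority[i], p) for p in minority)
        nearest = dists[:k_neighbors]
        return sum(nearest) / len(nearest)

    ranked = sorted(shortlist, key=mean_dist_to_nearest_minority, reverse=True)
    return sorted(ranked[:n_keep])


def balanced_accuracy(y_true, y_pred):
    """BACC = (sensitivity + specificity) / 2 for binary labels 0 and 1.

    Assumes both classes occur in `y_true`; otherwise a division by zero occurs.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Toy 2-D data (hypothetical): two minority points, five majority points.
toy_minority = [(0.0, 0.0), (2.0, 0.0)]
toy_majority = [(0.0, 1.0), (2.0, 2.0), (5.0, 5.0), (6.0, 5.0), (1.0, 3.0)]
kept = nearmiss3(toy_majority, toy_minority, n_keep=2,
                 m_shortlist=3, k_neighbors=2)

# Toy predictions (hypothetical): 4 positives (3 found), 10 negatives (9 found).
toy_true = [1, 1, 1, 1] + [0] * 10
toy_pred = [1, 1, 1, 0] + [0] * 9 + [1]
bacc = balanced_accuracy(toy_true, toy_pred)

print("kept majority indices:", kept)
print("balanced accuracy:", round(bacc, 3))
```

In the toy data, the two distant majority points (indices 2 and 3) are never shortlisted in step 1, so they cannot be kept. BACC averages the per-class recall, which is why it is a common choice for unbalanced datasets such as the one described here: a classifier that always predicts the larger class scores only 0.5.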
ISSN:0169-2607
1872-7565
DOI:10.1016/j.cmpb.2024.108176