A novel feature and sample joint transfer learning method with feature selection in semi-supervised scenarios for identifying the sequence of some species with less known genetic data

When identifying the sequence of some species using fewer known gene training data (named target domain), the data of closely related species and unlabeled data of the species (named source domain) could be considered for auxiliary training. However, there are differences in the statistical distribu...

Full description

Saved in:
Bibliographic Details
Published inSoft computing (Berlin, Germany) Vol. 27; no. 9; pp. 5411 - 5423
Main Authors Wen, Jianghui, Huang, Haoran, Pu, Zhenyu, Deng, Bing
Format Journal Article
LanguageEnglish
Published Berlin/Heidelberg Springer Berlin Heidelberg 01.05.2023
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:When identifying the sequence of some species using fewer known gene training data (named target domain), the data of closely related species and unlabeled data of the species (named source domain) could be considered for auxiliary training. However, there are differences in the statistical distribution of the feature space comprising of genetic data of different species. Therefore, this paper proposes a feature and sample jointed transfer (FSJT) method for semi-supervised scenarios, consisting of two modules. In the first module, the distance between the sample probability distribution functions in the feature space is taken as the optimization objective, and a hybrid balanced distribution adaptation method is constructed to transform the feature space of the two domains to increase the similarity between the domains. In the second module, the confidence of the unlabeled data in the target domain is defined and a self-learning sample transfer method is proposed to reduce the impact of samples having large differences in source-domain training data. Simultaneously, to select the suitable source-domain samples and the target domain when the sample size between the fields is very different, the transferred Lasso and the nearest-neighbor (TLR) feature selection method is proposed using FSJT. Then, the whole framework and algorithm flow of the TLR-FSJT model is presented and verified using the transfer learning standard dataset and ribonucleic acid data from GenBank database by comparing it with three machine learning methods and the FSJT model. Results show that the TLR-FSJT model has the highest accuracy in semi-supervised scenarios.
ISSN:1432-7643
1433-7479
DOI:10.1007/s00500-022-07773-7