Genomic prediction in plants: opportunities for ensemble machine learning based approaches [version 2; peer review: 1 approved, 2 approved with reservations]

Bibliographic Details
Published in	F1000 research Vol. 11; p. 802
Main Authors	Farooq, Muhammad; van Dijk, Aalt D.J.; Nijveen, Harm; Mansoor, Shahid; de Ridder, Dick
Format	Journal Article
Language	English
Published	London: Faculty of 1000 Ltd (F1000 Research Ltd), 2023
Subjects	Bayesian analysis; Binomial distribution; Genomes; Genomic Prediction; Genomic Selection; Genomics; Genotype & phenotype; Heritability; Learning algorithms; Linear Mixed Models; Linkage disequilibrium; Machine Learning; Mathematical models; Neural networks; Nucleotides; Phenotypes; Population genetics; Population structure; Predictions; Quantitative trait loci; Regression analysis; Sample size; Single-nucleotide polymorphism

More Information
Summary:	Background: Many studies have demonstrated the utility of machine learning (ML) methods for genomic prediction (GP) of various plant traits, but a clear rationale for choosing ML over conventionally used, often simpler parametric methods, is still lacking. Predictive performance of GP models might depend on a plethora of factors including sample size, number of markers, population structure and genetic architecture. Methods: Here, we investigate which problem and dataset characteristics are related to good performance of ML methods for genomic prediction. We compare the predictive performance of two frequently used ensemble ML methods (Random Forest and Extreme Gradient Boosting) with parametric methods including genomic best linear unbiased prediction (GBLUP), reproducing kernel Hilbert space regression (RKHS), BayesA and BayesB. To explore problem characteristics, we use simulated and real plant traits under different genetic complexity levels determined by the number of Quantitative Trait Loci (QTLs), heritability ($h^2$ and $h^2_e$), population structure and linkage disequilibrium between causal nucleotides and other SNPs. Results: Decision tree based ensemble ML methods are a better choice for nonlinear phenotypes and are comparable to Bayesian methods for linear phenotypes in the case of large effect Quantitative Trait Nucleotides (QTNs). Furthermore, we find that ML methods are susceptible to confounding due to population structure but less sensitive to low linkage disequilibrium than linear parametric methods. Conclusions: Overall, this provides insights into the role of ML in GP as well as guidelines for practitioners.
Bibliography:	No competing interests were disclosed.
ISSN:	2046-1402
DOI:	10.12688/f1000research.122437.2