Automatic specimen identification of Harpacticoids (Crustacea:Copepoda) using Random Forest and MALDI‐TOF mass spectra, including a post hoc test for false positive discovery

Bibliographic Details
Published in: Methods in ecology and evolution, Vol. 9, no. 6, pp. 1421-1434
Main Authors: Rossel, Sven; Martínez Arbizu, Pedro; Kembel, Steven
Format: Journal Article
Language: English
Published: London, John Wiley & Sons, Inc, 01.06.2018
Summary: Ecological studies require accurate identification of specimens. This is very time consuming when processing plankton, meiobenthos or soil biota samples, because they contain large numbers of minute specimens. A solution to this problem may be MALDI‐TOF MS, an emerging technique for identification of metazoan species. As an alternative to factory‐delivered software or clustering approaches, Random Forest (RF) models can be trained to identify species using MALDI‐TOF data. However, in a real‐world scenario, RF models will also fail to detect species that were not included in the training dataset, thus producing false positives (misidentifications). We produced MALDI‐TOF MS spectra for meiofauna species and trained RF models, using MALDI‐TOF bins as predictors and species as the multi‐level target class. We used the empirical beta distribution of the probability of class assignment in the model to design a post hoc test for false positive discovery. Two strategies increase the final accuracy of species identification: (1) “class smoothing”, consisting of in silico observations of classes created by bootstrapping the value of each predictor within each class, and (2) adding a “null class” to the training dataset by bootstrapping predictor values and shuffling predictor labels, creating a class without multivariate signal. We prove that RF is an excellent method for species identification using MALDI‐TOF MS data. The models are flexible enough to correctly classify observations created in silico by smoothing the classes. Our post hoc test unmasks false positive classifications successfully. Smoothing the classes and adding a null class to the training model attract assignment of false positives to this class. In our example, a 100% false positive discovery could be achieved, while maintaining very high overall prediction accuracy.
Combining MALDI‐TOF MS and RF models is a step towards a fully automatic species identification workflow, which is particularly necessary for species‐rich communities of small organisms, both in ecological studies and in routine monitoring. The post hoc test for false positive discovery can be applied to any RF multilevel classification model, not only in a biological context.
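
The two augmentation strategies named in the summary, class smoothing and the null class, can be illustrated with a minimal Python sketch. This is not the authors' code. It is one possible reading of the summary, and the function names (`bootstrap_rows`, `null_rows`, `augment`), the toy data, and the number of synthetic observations per class are all hypothetical. In particular, the null class is interpreted here as bootstrapping each predictor over all specimens and then permuting which predictor each value is assigned to; the summary does not specify the procedure in more detail.

```python
import random
from collections import defaultdict


def bootstrap_rows(rows, n_new, rng):
    """Draw n_new synthetic rows; each predictor value is sampled with
    replacement from that predictor's values in `rows`."""
    columns = list(zip(*rows))  # one tuple of observed values per predictor
    return [[rng.choice(col) for col in columns] for _ in range(n_new)]


def null_rows(rows, n_new, rng):
    """Assumed reading of the null class: bootstrap each predictor over all
    specimens regardless of species, then permute which predictor each
    value is assigned to, so that no multivariate signal remains."""
    out = bootstrap_rows(rows, n_new, rng)
    for row in out:
        rng.shuffle(row)
    return out


def augment(X, y, n_new=50, null_label="null", seed=0):
    """Return X and y extended with smoothed classes and one null class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    X_aug, y_aug = [list(r) for r in X], list(y)
    # Class smoothing: in silico observations bootstrapped within each class.
    for label, rows in by_class.items():
        X_aug.extend(bootstrap_rows(rows, n_new, rng))
        y_aug.extend([label] * n_new)
    # Null class: built from all specimens, ignoring species labels.
    X_aug.extend(null_rows(X, n_new, rng))
    y_aug.extend([null_label] * n_new)
    return X_aug, y_aug


# Toy data: four specimens, three intensity bins, two hypothetical species.
X = [[1.0, 0.0, 5.0], [2.0, 0.1, 6.0], [9.0, 3.0, 0.0], [8.0, 4.0, 0.5]]
y = ["sp_A", "sp_A", "sp_B", "sp_B"]
X_aug, y_aug = augment(X, y, n_new=3)
print(len(X_aug), y_aug[-1])
```

The augmented table could then be passed to any RF implementation, for example scikit-learn's `RandomForestClassifier`. A specimen assigned to the null class would be treated as a candidate false positive, before or alongside the post hoc test described in the summary.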
ISSN:2041-210X
DOI:10.1111/2041-210X.13000