Cocoa origin classifiability through LC-MS data: A statistical approach for large and long-term datasets
Published in | Food research international Vol. 140; p. 109983 |
---|---|
Main Authors | , , , , , , |
Format | Journal Article |
Language | English |
Published | Canada Elsevier Ltd 01.02.2021 |
Summary: |
• A large dataset of 297 LC-MS profiles covering 10 countries was employed.
• An LC-MS dataset gathered over a prolonged period was analyzed.
• Popular unsupervised (PCA) and supervised (LDA) methods were used.
• The result of LDA depends nonlinearly on the number of compounds used.
• A statistical approach to compound selection greatly improves the result of LDA.
Classifying food samples by country of origin is an important task in the food industry for quality assurance and the development of fine-flavor products. Liquid chromatography-mass spectrometry (LC-MS) provides a fast technique for obtaining in-depth information about the chemical composition of foods. However, in a large dataset gathered over a period of a few years, multiple, incoherent, and hard-to-avoid sources of variation (e.g., experimental conditions, transportation, batch and instrumental effects) pose technical challenges that make origin classification a difficult problem. Here, we use a large dataset gathered over a period of four years, containing 297 LC-MS profiles of cocoa sourced from 10 countries, to demonstrate these challenges using two popular multivariate analysis methods: principal component analysis (PCA) and linear discriminant analysis (LDA). We show that PCA provides only limited separation by bean origin, while LDA suffers from a strong non-linear dependence on the set of compounds used. Further, we show for LDA that a compound selection criterion based on the Gaussian distribution of intensities across samples dramatically enhances origin clustering of samples, thereby suggesting possibilities for studying marker compounds in such a disparate dataset through this approach. In essence, we develop and demonstrate a new approach that maximizes the utility of multivariate analysis in a highly complex dataset while avoiding overfitting. |
---|---|
ISSN: | 0963-9969 1873-7145 |
DOI: | 10.1016/j.foodres.2020.109983 |
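The summary mentions a compound selection criterion based on the Gaussian distribution of intensities across samples, but this record does not spell out the exact test. As a rough illustration only, the sketch below keeps compounds whose intensities have small skewness and excess kurtosis, a simple moment-based stand-in that may differ from the authors' actual criterion. The function names, the thresholds `max_skew` and `max_kurt`, the compound names, and the synthetic data are all assumptions made for this example.

```python
import math
from statistics import NormalDist, fmean


def skew_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis); both are 0 for an ideal Gaussian."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        # A constant column carries no shape information: treat as non-Gaussian.
        return float("inf"), float("inf")
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def select_gaussian_like(intensities, max_skew=0.5, max_kurt=1.0):
    """Keep compounds whose intensity distribution looks roughly Gaussian.

    intensities: dict mapping compound name -> list of intensities across samples.
    The thresholds are hypothetical and would need tuning on real data.
    """
    kept = []
    for name, values in intensities.items():
        skew, kurt = skew_and_excess_kurtosis(values)
        if abs(skew) <= max_skew and abs(kurt) <= max_kurt:
            kept.append(name)
    return kept


# Synthetic, deterministic data (NOT from the paper): evenly spaced quantiles
# of a Gaussian and of a strongly right-skewed exponential distribution.
n_samples = 297
probs = [(i + 0.5) / n_samples for i in range(n_samples)]
gaussian = NormalDist(mu=100.0, sigma=10.0)
data = {
    "compound_A": [gaussian.inv_cdf(p) for p in probs],
    "compound_B": [-50.0 * math.log(1.0 - p) for p in probs],
}

selected = select_gaussian_like(data)
print(selected)
```

In a full workflow, only the columns for the selected compounds would be passed on to LDA; that step is omitted here to keep the example dependency-free.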