Combining parallel factor analysis and machine learning for the classification of dissolved organic matter according to source using fluorescence signatures

Bibliographic Details
Published in	Chemosphere (Oxford) Vol. 155; pp. 283–291
Main Authors	Cuss, C.W., McConnell, S.M., Guéguen, C.
Format	Journal Article
Language	English
Published	England Elsevier Ltd 01.07.2016
Subjects	Chemical Fractionation; Data mining/machine learning; Dissolved organic matter (DOM); Ecosystem; Environmental Monitoring - methods; Excitation-emission matrix (EEM); Factor Analysis, Statistical; Fluorescence; K-nearest neighbours (kNN); Leaf leachate; Machine Learning; Models, Theoretical; Neural Networks (Computer); Organic Chemicals - analysis; Organic Chemicals - chemistry; Parallel factor analysis (PARAFAC); Plant Leaves - chemistry; Rivers - chemistry; Spectrometry, Fluorescence; Water Pollutants, Chemical - analysis; Water Pollutants, Chemical - chemistry

More Information
Summary:	Parallel factor (PARAFAC) analysis of dissolved organic matter (DOM) fluorescence has facilitated a surge of investigation into its biogeochemical cycling. However, rigorous, PARAFAC-based methods for holistically distinguishing DOM sources are lacking. This study used four machine learning methods (MLM) to classify 1029 PARAFAC-analyzed excitation-emission matrices (EEMs) measured on DOM isolated from 24 different leaf leachates, rivers, and organic matter standards. EEMs were also divided into subsets to assess the impact of experimental treatments (i.e., whole EEMs, size fractionation, mixtures, quenching) and dataset properties (i.e., different numbers of EEMs from each leachate/river) on classification. A split-half validated, 10-component PARAFAC model was extended to 12 components to remove consistent peaks evident in model residuals. The 12-component model performed better than the 10-component model, correctly classifying up to 80 additional EEMs, when the dataset included size-fractionated DOM or several different sources (i.e., many leaf species and rivers); however, the 10-component model performed better for whole-sample EEMs when comparing leaf leachates to rivers. The MLM correctly classified whole EEMs of riverine DOM by source with up to 87.0% accuracy, leachates with up to 92.5% accuracy, and distinguished leachates from rivers with 97.2% accuracy. A difference of up to 17.3% in classification accuracy was observed depending on the MLM used, in the following order: multilayer perceptron = support vector machine > k-nearest neighbours ≫ decision tree; however, performances differed widely depending on the data subset. Classification accuracy for whole and size-fractionated rivers compared to whole and size-fractionated leachates using N-way partial least-squares discriminant analysis (NPLS-DA; 97.7%) was similar to that achieved using MLM. Combining MLM with PARAFAC is an effective method for classifying DOM based on its fluorescence signature because PARAFAC can isolate meaningful fluorescent species and, unlike PLS-DA, each MLM constructs a single model that simultaneously classifies EEMs as belonging to one of several categories. A complete accounting of carbon flows through ecosystems should include the processes and sources that contribute to the disparate fluorescence signatures of riverine and leached DOM.
• Machine learning applied to 1029 PARAFAC-modeled EEMs to classify 24 DOM sources.
• Classification accuracy: 97% river vs leachate; 93% leachate by species; 87% by river.
• Some machine learning algorithms achieved higher classification accuracies than others.
• Accuracy similar to NPLS-DA, but faster and with simultaneous multiclass comparison.
• Extending the number of components past the cross-validated PARAFAC model improved accuracy.
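The classification step named in the summary (assigning an EEM to a source from its PARAFAC component scores) can be illustrated with a minimal k-nearest-neighbours sketch. This is not the authors' implementation: the three-component scores, the `river`/`leachate` labels, the sum-to-one normalization, and the function names `normalize`, `knn_classify`, and `leave_one_out_accuracy` are all assumptions made for illustration (the study itself used 10- and 12-component models and does not state these details here).

```python
import math
from collections import Counter

# Hypothetical PARAFAC component scores (3 components for brevity) with
# made-up source labels; these are NOT data from the study.
TRAIN = [
    ([5.0, 1.0, 0.5], "river"),
    ([4.5, 1.2, 0.6], "river"),
    ([5.2, 0.9, 0.4], "river"),
    ([1.0, 3.0, 4.0], "leachate"),
    ([0.8, 3.5, 4.2], "leachate"),
    ([1.2, 2.8, 3.9], "leachate"),
]


def normalize(scores):
    """Scale component scores to sum to 1 (an assumed preprocessing step)."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("component scores must have a positive sum")
    return [s / total for s in scores]


def knn_classify(train, query, k=3):
    """Return the majority label among the k nearest training samples."""
    q = normalize(query)
    # Euclidean distance between normalized score vectors.
    neighbours = sorted(
        (math.dist(normalize(scores), q), label) for scores, label in train
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(train, k=3):
    """Classify each sample using all the others; return the fraction correct."""
    correct = 0
    for i, (scores, label) in enumerate(train):
        rest = train[:i] + train[i + 1:]
        if knn_classify(rest, scores, k) == label:
            correct += 1
    return correct / len(train)


if __name__ == "__main__":
    print(knn_classify(TRAIN, [4.8, 1.1, 0.5]))
    print(leave_one_out_accuracy(TRAIN))
```

A single model of this kind assigns each sample to one of any number of categories at once, which is the multiclass property the summary contrasts with PLS-DA; adding more labelled sources to `TRAIN` requires no change to the classifier.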
ISSN:	0045-6535 1879-1298
DOI:	10.1016/j.chemosphere.2016.04.061