Prediction of Microcystis Occurrences and Analysis Using Machine Learning in High-Dimension, Low-Sample-Size and Imbalanced Water Quality Data

Bibliographic Details
Published in Harmful Algae, Vol. 117, p. 102273
Main Authors Mori, Masaya; Gonzalez Flores, Roberto; Suzuki, Yoshihiro; Nukazawa, Kei; Hiraoka, Toru; Nonaka, Hirofumi
Format Journal Article
Language English
Published Elsevier B.V., 01.08.2022
Summary:
•Prediction and analysis of Microcystis occurrences in Hitotsuse Dam, Japan.
•Feature Engineering and Feature Selection algorithms applied to water quality data.
•Five water quality factors are implicated in Microcystis occurrences.
•Basic statistics of water quality over a year are related to Microcystis occurrences.
•Presents a suitable method for High-Dimension, Low-Sample-Size, and imbalanced data.

Machine learning, deep learning, and water quality data have been used in recent years to predict outbreaks of harmful algae, especially Microcystis, and to analyze their causes. However, for various reasons, water quality data are often High-Dimension, Low-Sample-Size (HDLSS), meaning that the number of samples is smaller than the number of dimensions. Moreover, imbalance problems may arise because Microcystis does not occur with equal frequency across classes. These problems make it difficult to predict the occurrence of Microcystis and to analyze its causes with machine learning. In this study, a machine learning model that applies Feature Engineering (FE) and Feature Selection (FS) algorithms is used to predict outbreaks of Microcystis and to analyze the outbreak factors from imbalanced HDLSS water quality data. Prediction performance was verified as a binary classification of whether Microcystis would occur in the future, by applying three machine learning models to four data patterns. The cause analysis of Microcystis occurrence was performed by visualizing the results of applying FE and FS. On the test data, the predictive performance of the FE and FS methods was significantly better than that of the conventional method, with an accuracy 0.108 points higher and an F-value 0.691 points higher. Prediction performance also improved with a smaller model capacity.
Data-driven analysis suggested that total nitrogen, chemical oxygen demand, chlorophyll-a, dissolved oxygen saturation, and water temperature are associated with Microcystis occurrences. The results also indicated that basic statistics of the water quality distribution over a year (especially the mean, standard deviation, and skewness), rather than the concentrations of water components themselves, are related to the occurrence of Microcystis. These findings have not been reported in previous studies and are expected to contribute significantly to future studies of algae. This study provides a method for analyzing water quality data that are high-dimensional with small samples, imbalanced, or both.
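
Note: the summary names the mean, standard deviation, and skewness of a year of water quality readings as informative features. This record does not describe how the authors computed them, so the following is only a minimal sketch under stated assumptions: the function names, the variable names, and the sample readings are hypothetical, and the population (not sample) forms of the standard deviation and skewness are an arbitrary choice made here.

```python
from statistics import fmean, pstdev


def yearly_summary(values):
    """Mean, population standard deviation, and skewness of one year of readings."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All readings identical: the distribution has no asymmetry to measure.
        skew = 0.0
    else:
        # Population skewness: the mean of the cubed standardized deviations.
        skew = fmean([((v - mu) / sigma) ** 3 for v in values])
    return {"mean": mu, "std": sigma, "skewness": skew}


def build_features(year_of_readings):
    """Flatten per-variable summaries into one feature dict for a single year."""
    features = {}
    for variable, values in year_of_readings.items():
        for stat, value in yearly_summary(values).items():
            features[f"{variable}_{stat}"] = value
    return features


if __name__ == "__main__":
    # Hypothetical monthly readings, invented for illustration only.
    readings = {
        "water_temperature": [6.1, 6.8, 9.5, 13.2, 17.9, 21.4,
                              25.0, 26.3, 23.1, 18.2, 12.7, 8.0],
        "total_nitrogen": [0.42, 0.40, 0.45, 0.51, 0.48, 0.60,
                           0.95, 0.72, 0.55, 0.47, 0.44, 0.41],
    }
    for name, value in build_features(readings).items():
        print(f"{name}: {value:.3f}")
```

Each year then contributes one row of such features, which could be passed to a feature selection step and a binary classifier. Those later steps are not sketched here because the record does not specify them.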
ISSN: 1568-9883, 1878-1470
DOI: 10.1016/j.hal.2022.102273