Self-optimization of training dataset improves forecasting of cyanobacterial bloom by machine learning


Bibliographic Details
Published in: The Science of the total environment, Vol. 866; p. 161398
Main Authors: Kim, Jayun; Jung, Woosik; An, Jusuk; Oh, Hyun Je; Park, Joonhong
Format: Journal Article
Language: English
Published: Netherlands, Elsevier B.V., 25.03.2023
Summary: Data-driven model (DDM) prediction of aquatic ecological responses, such as cyanobacterial harmful algal blooms (CyanoHABs), is critically influenced by the choice of training dataset. However, a systematic method to choose the optimal training dataset considering data history has not yet been developed. Providing a comprehensive procedure with self-based optimal training dataset-selecting algorithm would self-improve the DDM performance. In this study, a novel algorithm was developed to self-generate possible training dataset candidates from the available input and output variable data and self-choose the optimal training dataset that maximizes CyanoHAB forecasting performance. Nine years of meteorological and water quality data (input) and CyanoHAB data (output) from a site on the Nakdong River, South Korea, were acquired and pretreated via an automated process. An artificial neural network (ANN) was chosen from among the DDM candidates by first-cut training and validation using the entire collected dataset. Optimal training datasets for the ANN were self-selected from among the possible self-generated training datasets by systematically simulating the performance in response to 46 periods and 40 sizes (number of data elements) of the generated training datasets. The best-performing models were screened to identify the candidate models. The best performance corresponded to 6–7 years of training data (∼18 % lower error) for forecasting 1–28 d ahead (1–28 d of forecasting lead time (FLT)). After the hyperparameters of the screened model candidates were fine-tuned, the best-performing model (7 years of data with 14 d FLT) was self-determined by comparing the forecasts with unseen CyanoHAB events. The self-determined model could reasonably predict CyanoHABs occurring in Korean waters (cyanobacteria cells/mL ≥ 1000). Thus, our proposed method of self-optimizing the training dataset effectively improved the predictive accuracy and operational efficiency of the DDM prediction of CyanoHAB.

Highlights:
• Novel algorithm was developed to self-determine optimal training dataset.
• Duration of data collection was key to optimizing training dataset.
• Novel algorithm improved cyanobacterial bloom prediction by machine learning.
• Algorithm applicable to cognitive programming for cyanobacterial bloom forecasting.
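
The selection step described in the summary (generate candidate training datasets that differ in period and size, score each one, keep the best) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that "period" can be modelled as how far back from the validation boundary a window ends (`offsets`) and "size" as the number of samples in the window (`sizes`), and it substitutes a one-variable least-squares line for the ANN so the example stays dependency-free. All function and parameter names are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b. Stand-in for the ANN (assumption)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx if sxx else 0.0
    return a, my - a * mx


def rmse(model, xs, ys):
    """Root-mean-square error of the fitted line on (xs, ys)."""
    a, b = model
    return math.sqrt(sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


def make_pairs(inputs, target, lead):
    """Pair the input at time t with the target at time t + lead (the FLT)."""
    return [(inputs[t], target[t + lead]) for t in range(len(target) - lead)]


def select_training_window(inputs, target, lead, offsets, sizes, n_valid):
    """Score every (offset, size) candidate window and return the best one.

    The last n_valid pairs are held out for validation. Each candidate is a
    contiguous slice of the remaining pairs that ends `offset` samples before
    the validation boundary and contains `size` samples.
    """
    pairs = make_pairs(inputs, target, lead)
    pool, valid = pairs[:-n_valid], pairs[-n_valid:]
    vx, vy = zip(*valid)
    scores = {}
    for offset in offsets:
        for size in sizes:
            end = len(pool) - offset
            start = end - size
            if start < 0:
                continue  # not enough history for this candidate
            tx, ty = zip(*pool[start:end])
            scores[(offset, size)] = rmse(fit_line(tx, ty), vx, vy)
    best = min(scores, key=scores.get)
    return best, scores
```

With daily data, a call such as `select_training_window(x, y, lead=14, offsets=[0, 365], sizes=[365 * k for k in range(1, 8)], n_valid=365)` would compare one to seven years of history for a 14 d lead time. In practice the scoring model would be the ANN itself, and the final choice would be checked against unseen bloom events, as the summary describes.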
ISSN: 0048-9697; 1879-1026
DOI: 10.1016/j.scitotenv.2023.161398