Self-optimization of training dataset improves forecasting of cyanobacterial bloom by machine learning
Published in | The Science of the total environment Vol. 866; p. 161398 |
---|---|
Main Authors | , , , , |
Format | Journal Article |
Language | English |
Published | Netherlands, Elsevier B.V, 25.03.2023 |
Summary: | Data-driven model (DDM) prediction of aquatic ecological responses, such as cyanobacterial harmful algal blooms (CyanoHABs), is critically influenced by the choice of training dataset. However, a systematic method to choose the optimal training dataset considering data history has not yet been developed. Providing a comprehensive procedure with an algorithm that selects its own optimal training dataset would allow the DDM to improve its own performance. In this study, a novel algorithm was developed to self-generate possible training dataset candidates from the available input and output variable data and self-choose the optimal training dataset that maximizes CyanoHAB forecasting performance. Nine years of meteorological and water quality data (input) and CyanoHAB data (output) from a site on the Nakdong River, South Korea, were acquired and pretreated via an automated process. An artificial neural network (ANN) was chosen from among the DDM candidates by first-cut training and validation using the entire collected dataset. Optimal training datasets for the ANN were self-selected from among the possible self-generated training datasets by systematically simulating the performance in response to 46 periods and 40 sizes (number of data elements) of the generated training datasets. The best-performing models were screened to identify the candidate models. The best performance corresponded to 6–7 years of training data (∼18% lower error) for forecasting 1–28 d ahead, i.e., a forecasting lead time (FLT) of 1–28 d. After the hyperparameters of the screened model candidates were fine-tuned, the best-performing model (7 years of data with 14 d FLT) was self-determined by comparing the forecasts with unseen CyanoHAB events. The self-determined model could reasonably predict CyanoHABs occurring in Korean waters (cyanobacteria cells/mL ≥ 1000).
Thus, our proposed method of self-optimizing the training dataset effectively improved the predictive accuracy and operational efficiency of the DDM prediction of CyanoHAB.
• Novel algorithm was developed to self-determine optimal training dataset.
• Duration of data collection was key to optimizing training dataset.
• Novel algorithm improved cyanobacterial bloom prediction by machine learning.
• Algorithm applicable to cognitive programming for cyanobacterial bloom forecasting. |
---|---|
ISSN: | 0048-9697, 1879-1026 |
DOI: | 10.1016/j.scitotenv.2023.161398 |
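
The summary describes generating candidate training datasets that differ in period and size, scoring each against held-out data, and keeping the best one. The Python sketch below illustrates that general idea only and is not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the synthetic data with an artificial regime change, the way windows are enumerated (the record does not say how the 46 periods and 40 sizes were defined), and the ridge-regularised linear model that stands in for the ANN.

```python
import numpy as np

def make_supervised(features, target, lead):
    """Pair the inputs at time t with the target at time t + lead."""
    return features[:-lead], target[lead:]

def fit_linear(X, y, ridge=1e-3):
    """Ridge-regularised least squares; a simple stand-in for the ANN."""
    A = np.hstack([X, np.ones((len(X), 1))])  # append an intercept column
    return np.linalg.solve(A.T @ A + ridge * np.eye(A.shape[1]), A.T @ y)

def predict_linear(w, X):
    return np.hstack([X, np.ones((len(X), 1))]) @ w

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def candidate_windows(pool_end, sizes, step):
    """Yield (start, stop) row indices for every window size and start offset."""
    for size in sizes:
        for start in range(0, pool_end - size + 1, step):
            yield start, start + size

def select_training_window(X, y, n_val, sizes, step, gap=0):
    """Train on each candidate window and keep the one with the lowest
    validation RMSE. The last n_val rows are held out; `gap` rows before
    them are skipped so that training targets do not overlap that period."""
    val_start = len(X) - n_val
    pool_end = val_start - gap
    X_val, y_val = X[val_start:], y[val_start:]
    best = None
    for start, stop in candidate_windows(pool_end, sizes, step):
        w = fit_linear(X[start:stop], y[start:stop])
        err = rmse(y_val, predict_linear(w, X_val))
        if best is None or err < best[0]:
            best = (err, start, stop)
    return best

# Synthetic demonstration (hypothetical data): the input-output relationship
# changes at t = 600, so older records describe a different regime.
rng = np.random.default_rng(0)
n, lead, n_val = 1000, 14, 150
features = rng.normal(size=(n, 3))
coef = np.where(np.arange(n)[:, None] < 600, [2.0, 0.5, 0.0], [-1.0, 0.5, 1.0])
signal = (features * coef).sum(axis=1) + 0.1 * rng.normal(size=n)
target = np.concatenate([np.zeros(lead), signal[:-lead]])  # response lags by `lead`

X, y = make_supervised(features, target, lead)
pool_end = len(X) - n_val - lead
sizes = [100, 200, 400, pool_end]  # the last size is the whole training pool

best_err, best_start, best_stop = select_training_window(
    X, y, n_val=n_val, sizes=sizes, step=50, gap=lead)

# Baseline: train on the whole pool without any window selection.
w_full = fit_linear(X[:pool_end], y[:pool_end])
full_err = rmse(y[-n_val:], predict_linear(w_full, X[-n_val:]))
print(f"selected rows {best_start}:{best_stop}, RMSE {best_err:.3f}; "
      f"full pool RMSE {full_err:.3f}")
```

Because the whole pool is itself one of the candidates, the selected window can never score worse on the validation rows than the full-pool baseline. A real application would also need a separate, unseen test period, as in the study's comparison against unseen CyanoHAB events, since the validation rows are used for the selection itself.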