On the Choice of Training Data for Machine Learning of Geostrophic Mesoscale Turbulence


Bibliographic Details
Published in Journal of Advances in Modeling Earth Systems Vol. 16; no. 2
Main Authors Yan, F. E., Mak, J., Wang, Y.
Format Journal Article
Language English
Published Washington John Wiley & Sons, Inc 01.02.2024
American Geophysical Union (AGU)
Summary: Data plays a central role in data‐driven methods, but is not often the focus of investigations of machine learning algorithms applied to Earth System Modeling problems. Here we consider the problem of eddy‐mean interaction in rotating stratified turbulence in the presence of lateral boundaries, where it is known that rotational components of the eddy flux play no direct role in the sub‐grid forcing onto the mean state variables, and their presence is expected to affect the performance of the trained machine learning models. While an often utilized choice in the literature is to train a model from the divergence of the eddy fluxes, here we provide theoretical arguments and numerical evidence that learning from the eddy fluxes with the rotational component appropriately filtered out, achieved in this work by means of an object called the eddy force function, results in models with comparable or better skill, but substantially reduced sensitivity to the presence of small‐scale features. We argue that while the choice and/or quality of data may not be critical if we simply want a model to have predictive skill, it is highly desirable and perhaps even necessary if we want to leverage data‐driven methods to aid in discovering unknown or hidden physical processes within the data itself.

Plain Language Summary
Data‐driven methods are increasingly being utilized in various problems relating to the numerical modeling of the Earth system. While there are many investigations focusing on the machine learning algorithms or the problems themselves, there have been relatively few investigations into the impact of data choice or quality, despite the central role of data. We consider here the impact of the choice of data for a particular problem relevant to ocean modeling, that of eddy‐mean interaction, where it is known that the training data generically contains a component that plays no role in the eddy‐mean interaction, and its presence in the training phase is expected to degrade the model performance. We provide arguments and evidence that one choice is preferable over a more standard choice utilized in related research. While the choice and/or quality of data may not be critical if we simply want a data‐driven model to be skillful, we argue it is highly desirable, possibly even a necessity, if we want to leverage data‐driven methods as a means to aid in discovery of unknown or hidden physical processes within the data itself.

Key Points
Investigated the dependence of convolutional neural networks on the choice of training data for geostrophic turbulence
Models are trained on eddy fluxes with the rotational component filtered out by means of an eddy force function
Resulting models are as accurate but less sensitive to small‐scale features than models trained on the divergence of eddy fluxes
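The statement that a rotational flux component plays no role in the forcing can be illustrated with a small numerical sketch. The example below is not the eddy force function used in the article, which is defined for a domain with lateral boundaries. It is a simplified stand-in that assumes a doubly periodic square domain and uses an FFT-based Helmholtz decomposition. The function names helmholtz_split and divergence, the grid size, and the synthetic flux are all hypothetical choices made for illustration.

```python
import numpy as np


def helmholtz_split(fx, fy, length=2.0 * np.pi):
    """Split a doubly periodic 2D flux (fx, fy) into divergent and rotational parts.

    Assumption: square n-by-n grid, axis 0 is x and axis 1 is y.
    """
    n = fx.shape[0]
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=length / n)  # angular wavenumbers
    kx, ky = np.meshgrid(k, k, indexing="ij")
    k2 = kx**2 + ky**2
    k2[0, 0] = 1.0  # placeholder to avoid 0/0; the mean mode is zeroed below

    # Spectral divergence of the flux.
    div_hat = 1j * kx * np.fft.fft2(fx) + 1j * ky * np.fft.fft2(fy)

    # Potential phi solving laplacian(phi) = div(F), i.e. -k^2 phi_hat = div_hat.
    phi_hat = -div_hat / k2
    phi_hat[0, 0] = 0.0

    # Divergent part is grad(phi); the remainder (including any domain mean)
    # is divergence-free.
    dx = np.real(np.fft.ifft2(1j * kx * phi_hat))
    dy = np.real(np.fft.ifft2(1j * ky * phi_hat))
    return (dx, dy), (fx - dx, fy - dy)


def divergence(fx, fy, length=2.0 * np.pi):
    """Spectral divergence of a doubly periodic 2D vector field."""
    n = fx.shape[0]
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=length / n)
    kx, ky = np.meshgrid(k, k, indexing="ij")
    div_hat = 1j * kx * np.fft.fft2(fx) + 1j * ky * np.fft.fft2(fy)
    return np.real(np.fft.ifft2(div_hat))


# Synthetic flux: gradient of phi = sin(x) cos(2y) plus the rotational
# field (-d(psi)/dy, d(psi)/dx) with psi = cos(3x) sin(y).
n = 64
x = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
X, Y = np.meshgrid(x, x, indexing="ij")
grad_x = np.cos(X) * np.cos(2 * Y)
grad_y = -2.0 * np.sin(X) * np.sin(2 * Y)
rot_x = -np.cos(3 * X) * np.cos(Y)
rot_y = -3.0 * np.sin(3 * X) * np.sin(Y)
fx, fy = grad_x + rot_x, grad_y + rot_y

(div_fx, div_fy), (rot_fx, rot_fy) = helmholtz_split(fx, fy)
```

In this sketch the recovered divergent part matches the gradient component that was put in, and the removed rotational part has zero divergence to round-off, so filtering it out leaves the divergence of the flux unchanged. With lateral boundaries the decomposition is not unique without a boundary condition, which this periodic example does not address.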
ISSN:1942-2466
DOI:10.1029/2023MS003915