Iterative Error‐Driven Ensemble Labeling (IEDEL) Algorithm for Enhanced Data Quality Control for the Atmospheric Radiation Measurement (ARM) Program User Facility

Bibliographic Details
Published in Journal of Geophysical Research: Machine Learning and Computation, Vol. 1, no. 3
Main Authors Li, Lishan, Kehoe, Kenneth E., Hu, Jiaxi, Peppler, Randy A., Sockol, Alyssa J., Godine, Corey A.
Format Journal Article
Language English
Published 01.09.2024
Subjects anomaly detection, time series analysis, transfer learning
Abstract

For over three decades, the Atmospheric Radiation Measurement (ARM) Program user facility has provided researchers with invaluable benchmark atmospheric data. Ensuring the accuracy and integrity of ARM data is vital, and to achieve this, the ARM Data Quality Office (DQO) has implemented customized quality control tests tailored to each variable, with guidance from instrument mentors. These tests are designed to pinpoint common issues, such as data exceeding valid ranges or persisting with little change over extended periods, and ARM offers tools for users to review and exclude contaminated data efficiently. However, certain quality issues, such as spikes in time series or data drift over time, sometimes evade detection by existing tests and require manual identification by data analysts and instrument mentors through visualization tools. To tackle these challenges more efficiently, the DQO has developed and implemented the Iterative Error‐Driven Ensemble Labeling (IEDEL) algorithm with unanimous voting and transfer learning techniques to efficiently generate labeled data at scale. This initiative has empowered the creation of high‐performing machine learning models, enabling real‐time monitoring of data quality issues within the ARM data and thereby enhancing data integrity and accessibility.

Plain Language Summary

For more than 30 years, the Atmospheric Radiation Measurement (ARM) Program user facility has been providing scientists with important atmospheric data. Ensuring these data are accurate and trustworthy is crucial. To achieve this, the ARM Data Quality Office (DQO) establishes tailored quality control (QC) checks for each data variable, based on thresholds designed by the ARM instrument mentors, who are experts in meteorology. These checks help identify common data issues, such as data falling outside the normal range or not changing as expected over time.

However, some problems, like sporadic data spikes or shifts in the average of data over time, might not be detected by these QC checks. These issues require visual identification by data analysts and ARM instrument mentors using ARM's visualization tools. To become more efficient at detecting these problems, the DQO has developed a new method called the Iterative Error‐Driven Ensemble Labeling algorithm to label data issues and used a machine learning algorithm to categorize them. This innovative approach enables the DQO to build intelligent applications that monitor data in real time, around the clock, and allow instrument mentors to resolve data issues promptly.

Key Points

Unsupervised learning methods can't generalize well to new data due to their reliance on the estimate of training data's anomaly ratio
The Iterative Error‐Driven Ensemble Labeling (IEDEL) algorithm effectively guides abnormal data pattern discovery in large data sets using pre‐trained models
The IEDEL algorithm reduces review effort by up to 95% without sacrificing accuracy in labeling abnormal data patterns
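Illustrative sketch (not part of the catalog record or the article). The abstract names unanimous voting across an ensemble as one ingredient of IEDEL but gives no implementation details. The short Python example below shows only the general idea of unanimous voting for labeling: a point is labeled automatically when every detector agrees, and is routed to a human reviewer otherwise. The two detectors (a z-score test and a median-absolute-deviation test), their thresholds, the function names, and the sample series are all assumptions chosen for illustration; they are not the detectors, models, or data used by the authors.

```python
from statistics import mean, median, pstdev


def zscore_flags(values, threshold=3.0):
    """Flag points whose z-score magnitude exceeds the (assumed) threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [False] * len(values)
    return [abs(v - mu) / sigma > threshold for v in values]


def mad_flags(values, threshold=3.5):
    """Flag points using a modified z-score based on the median absolute deviation."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [False] * len(values)
    return [0.6745 * abs(v - med) / mad > threshold for v in values]


def unanimous_vote(flag_lists):
    """Combine per-detector flags into one label per point.

    'abnormal' only if every detector flags the point,
    'normal' only if no detector flags it,
    'review' when the detectors disagree (left for a human analyst).
    """
    labels = []
    for votes in zip(*flag_lists):
        if all(votes):
            labels.append("abnormal")
        elif not any(votes):
            labels.append("normal")
        else:
            labels.append("review")
    return labels


# Hypothetical series: 20 well-behaved readings followed by one spike.
series = [10.0, 10.1, 9.9, 10.2, 9.8] * 4 + [50.0]
labels = unanimous_vote([zscore_flags(series), mad_flags(series)])
print(labels)
```

Under this scheme only the points on which the detectors disagree need manual inspection, which is one plausible way a labeling workflow could cut review effort; how IEDEL actually iterates on errors and applies transfer learning is described in the full article, not here.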
Author_xml – sequence: 1
  givenname: Lishan
  orcidid: 0009-0004-7428-5392
  surname: Li
  fullname: Li, Lishan
  email: miali@ou.edu
  organization: University of Oklahoma
– sequence: 2
  givenname: Kenneth E.
  surname: Kehoe
  fullname: Kehoe, Kenneth E.
  organization: University of Oklahoma
– sequence: 3
  givenname: Jiaxi
  orcidid: 0000-0002-7795-334X
  surname: Hu
  fullname: Hu, Jiaxi
  organization: NOAA/OAR National Severe Storms Laboratory
– sequence: 4
  givenname: Randy A.
  surname: Peppler
  fullname: Peppler, Randy A.
  organization: University of Oklahoma
– sequence: 5
  givenname: Alyssa J.
  surname: Sockol
  fullname: Sockol, Alyssa J.
  organization: University of Oklahoma
– sequence: 6
  givenname: Corey A.
  surname: Godine
  fullname: Godine, Corey A.
  organization: University of Oklahoma
ContentType Journal Article
Copyright 2024 The Author(s). Journal of Geophysical Research: Machine Learning and Computation published by Wiley Periodicals LLC on behalf of American Geophysical Union.
DOI 10.1029/2024JH000192
EISSN 2993-5210
Genre researchArticle
GrantInformation_xml – fundername: National Oceanic and Atmospheric Administration
  funderid: NA21OAR4320204
ISSN 2993-5210
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 3
Language English
License Attribution
http://creativecommons.org/licenses/by/4.0
ORCID 0009-0004-7428-5392
0000-0002-7795-334X
OpenAccessLink https://onlinelibrary.wiley.com/doi/abs/10.1029%2F2024JH000192
PageCount 21
PublicationCentury 2000
PublicationDate September 2024
2024-09-00
PublicationDateYYYYMMDD 2024-09-01
PublicationDate_xml – month: 09
  year: 2024
  text: September 2024
PublicationDecade 2020
PublicationTitle Journal of geophysical research. Machine learning and computation
PublicationYear 2024
SubjectTerms anomaly detection
time series analysis
transfer learning
URI https://onlinelibrary.wiley.com/doi/abs/10.1029%2F2024JH000192
Volume 1