Iterative Error‐Driven Ensemble Labeling (IEDEL) Algorithm for Enhanced Data Quality Control for the Atmospheric Radiation Measurement (ARM) Program User Facility
For over three decades, the Atmospheric Radiation Measurement (ARM) Program user facility has provided researchers with invaluable benchmark atmospheric data. Ensuring the accuracy and integrity of ARM data is vital, and to achieve this, the ARM Data Quality Office (DQO) has implemented customized q...
Published in | Journal of geophysical research. Machine learning and computation Vol. 1; no. 3 |
---|---|
Main Authors | Li, Lishan; Kehoe, Kenneth E.; Hu, Jiaxi; Peppler, Randy A.; Sockol, Alyssa J.; Godine, Corey A. |
Format | Journal Article |
Language | English |
Published | 01.09.2024 |
Subjects | anomaly detection; time series analysis; transfer learning |
Online Access | Get full text |
Abstract | For over three decades, the Atmospheric Radiation Measurement (ARM) Program user facility has provided researchers with invaluable benchmark atmospheric data. Ensuring the accuracy and integrity of ARM data is vital, and to achieve this, the ARM Data Quality Office (DQO) has implemented customized quality control tests tailored to each variable, with guidance from instrument mentors. These tests are designed to pinpoint common issues, such as data exceeding valid ranges or persisting with little change over extended periods, and ARM offers tools for users to review and exclude contaminated data efficiently. However, certain quality issues, such as spikes in time series or data drift over time, sometimes evade detection by existing tests and require manual identification by data analysts and instrument mentors through visualization tools. To tackle these challenges more efficiently, the DQO has developed and implemented the Iterative Error‐Driven Ensemble Labeling (IEDEL) algorithm with unanimous voting and transfer learning techniques to efficiently generate labeled data at scale. This initiative has empowered the creation of high‐performing machine learning models, enabling real‐time monitoring of data quality issues within the ARM data and thereby enhancing data integrity and accessibility.
Plain Language Summary
For more than 30 years, the Atmospheric Radiation Measurement (ARM) Program user facility has been providing scientists with important atmospheric data. Ensuring these data are accurate and trustworthy is crucial. To achieve this, the ARM Data Quality Office (DQO) establishes tailored quality control (QC) checks for each data variable, based on thresholds designed by the ARM instrument mentors, who are experts in meteorology. These checks help identify common data issues, such as data falling outside the normal range or not changing as expected over time. However, some problems, like sporadic data spikes or shifts in the average of data over time, might not be detected by these QC checks. These issues require visual identification by data analysts and ARM instrument mentors using ARM's visualization tools. To become more efficient at detecting these problems, the DQO has developed a new method called the Iterative Error‐Driven Ensemble Labeling algorithm to label data issues and used a machine learning algorithm to categorize them. This innovative approach enables the DQO to build intelligent applications that monitor data in real time, around the clock, and allow instrument mentors to resolve data issues promptly.
Key Points
Unsupervised learning methods can't generalize well to new data due to their reliance on the estimate of training data's anomaly ratio
The Iterative Error‐Driven Ensemble Labeling (IEDEL) algorithm effectively guides abnormal data pattern discovery in large data sets using pre‐trained models
The IEDEL algorithm reduces review effort by up to 95% without sacrificing accuracy in labeling abnormal data patterns |
---|---|
Author | Li, Lishan; Sockol, Alyssa J.; Hu, Jiaxi; Peppler, Randy A.; Godine, Corey A.; Kehoe, Kenneth E. |
Author_xml | 1. Li, Lishan (University of Oklahoma; ORCID 0009-0004-7428-5392; miali@ou.edu); 2. Kehoe, Kenneth E. (University of Oklahoma); 3. Hu, Jiaxi (NOAA/OAR National Severe Storms Laboratory; ORCID 0000-0002-7795-334X); 4. Peppler, Randy A. (University of Oklahoma); 5. Sockol, Alyssa J. (University of Oklahoma); 6. Godine, Corey A. (University of Oklahoma) |
ContentType | Journal Article |
Copyright | 2024 The Author(s). Journal of Geophysical Research: Machine Learning and Computation published by Wiley Periodicals LLC on behalf of American Geophysical Union. |
DOI | 10.1029/2024JH000192 |
EISSN | 2993-5210 |
EndPage | n/a |
ExternalDocumentID | 10_1029_2024JH000192 JGR130 |
Genre | researchArticle |
GrantInformation_xml | Funder: National Oceanic and Atmospheric Administration; funder ID: NA21OAR4320204 |
ISSN | 2993-5210 |
IngestDate | Tue Jul 01 03:43:13 EDT 2025 Wed Aug 20 07:26:06 EDT 2025 |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | true |
IsScholarly | true |
Issue | 3 |
License | Attribution http://creativecommons.org/licenses/by/4.0 |
ORCID | 0009-0004-7428-5392 0000-0002-7795-334X |
OpenAccessLink | https://onlinelibrary.wiley.com/doi/abs/10.1029%2F2024JH000192 |
PageCount | 21 |
PublicationCentury | 2000 |
PublicationDate | September 2024 2024-09-00 |
PublicationDateYYYYMMDD | 2024-09-01 |
PublicationDate_xml | – month: 09 year: 2024 text: September 2024 |
PublicationDecade | 2020 |
PublicationTitle | Journal of geophysical research. Machine learning and computation |
PublicationYear | 2024 |
SubjectTerms | anomaly detection; time series analysis; transfer learning |
Volume | 1 |
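
The abstract mentions ensemble labeling with unanimous voting but this record does not give the details. The sketch below is one plausible reading, not the authors' implementation: a point is labeled automatically only when every detector in the ensemble agrees, and any disagreement is set aside for manual review. The three detectors (`zscore_detector`, `mad_detector`, `jump_detector`), their thresholds, and the label names are hypothetical stand-ins chosen for illustration. The iterative error-driven loop and the transfer learning step are not shown.

```python
from statistics import mean, median, pstdev
from typing import Callable, List, Sequence

# A detector maps a series to one boolean flag per point (True = looks abnormal).
Detector = Callable[[Sequence[float]], List[bool]]


def zscore_detector(values: Sequence[float], threshold: float = 2.0) -> List[bool]:
    """Flag points far from the mean, in units of standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [False] * len(values)
    return [abs(v - mu) / sigma > threshold for v in values]


def mad_detector(values: Sequence[float], threshold: float = 3.5) -> List[bool]:
    """Flag points far from the median, in units of median absolute deviation."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [False] * len(values)
    return [abs(v - med) / mad > threshold for v in values]


def jump_detector(values: Sequence[float], max_step: float = 5.0) -> List[bool]:
    """Flag points that differ from the previous point by more than max_step."""
    flags = [False]  # the first point has no predecessor
    for prev, cur in zip(values, values[1:]):
        flags.append(abs(cur - prev) > max_step)
    return flags[: len(values)]


def unanimous_vote(values: Sequence[float], detectors: Sequence[Detector]) -> List[str]:
    """Label each point 'abnormal' or 'normal' only if all detectors agree.

    Points where the detectors disagree are labeled 'review' so that a
    human analyst only inspects the contested cases.
    """
    per_detector = [detect(values) for detect in detectors]
    labels = []
    for votes in zip(*per_detector):
        if all(votes):
            labels.append("abnormal")
        elif not any(votes):
            labels.append("normal")
        else:
            labels.append("review")
    return labels


if __name__ == "__main__":
    series = [10.0, 10.2, 9.9, 10.1, 50.0, 10.0, 10.3, 9.8]
    ensemble = [zscore_detector, mad_detector, jump_detector]
    for value, label in zip(series, unanimous_vote(series, ensemble)):
        print(f"{value:6.1f}  {label}")
```

In this toy series the spike at 50.0 is flagged by all three detectors and labeled `abnormal`, while the point right after it is flagged only by the jump detector and lands in `review`. Requiring unanimity trades coverage for label confidence, which is in line with the record's claim that review effort drops while labeling accuracy is preserved.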