Data leakage detection in machine learning code: transfer learning, active learning, or low-shot prompting?
Published in | PeerJ. Computer science Vol. 11; p. e2730 |
Main Authors | Alturayeif, Nouf; Hassine, Jameleddine |
Format | Journal Article |
Language | English |
Published | United States: PeerJ Inc., 05.03.2025 |
ISSN | 2376-5992 |
DOI | 10.7717/peerj-cs.2730 |
Abstract | With the increasing reliance on machine learning (ML) across diverse disciplines, ML code has been subject to a number of issues that impact its quality, such as lack of documentation, algorithmic biases, overfitting, lack of reproducibility, inadequate data preprocessing, and potential for data leakage, all of which can significantly affect the performance and reliability of ML models. Data leakage undermines the quality of ML models: information from the test set inadvertently influences the training process, leading to inflated performance metrics that do not generalize well to new, unseen data. Data leakage can occur either at the dataset level (i.e., during dataset construction) or at the code level. Existing studies introduced methods to detect code-level data leakage using manual and code-analysis approaches. However, automated tools with advanced ML techniques are increasingly recognized as essential for efficiently identifying quality issues in large and complex codebases, enhancing the overall effectiveness of code review processes. In this article, we explore ML-based approaches that work with limited annotated datasets to detect code-level data leakage in ML code. We propose three approaches, namely, transfer learning, active learning, and low-shot prompting. Additionally, we introduce an automated approach to handle the class-imbalance issue in code data. Our results show that active learning outperformed the other approaches with an F-2 score of 0.72 and reduced the number of needed annotated samples from 1,523 to 698. We conclude that existing ML-based approaches can effectively mitigate the challenges associated with limited data availability. |
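The abstract reports results as an F-2 score. As a hedged illustration (not code from the paper), F-beta with beta = 2 weights recall twice as heavily as precision, which suits leak detection where missing a real leak costs more than raising a false alarm; the sample numbers below are made up:

```python
# Minimal sketch of the F-beta metric (beta = 2 gives the "F-2 score"
# the abstract reports). Illustrative only; not the paper's code.

def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Weighted harmonic mean of precision and recall; beta > 1 favors recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A detector with modest precision but high recall still scores well on F-2,
# while the mirror-image detector scores worse:
print(round(f_beta(0.55, 0.80), 2))  # → 0.73
print(round(f_beta(0.80, 0.55), 2))  # → 0.59
```

Swapping precision and recall changes the score, which is exactly why F-2 (rather than the symmetric F-1) is a reasonable metric for a recall-sensitive task like leak detection.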
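The code-level leakage the article targets can be illustrated with a minimal, hypothetical Python sketch (not taken from the paper): fitting a preprocessing step on the full dataset before the train/test split lets test-set statistics bleed into training, which is exactly the bug such detectors look for.

```python
# Toy illustration of code-level data leakage: a min-max "scaler" fitted
# before vs. after the train/test split. Names and data are hypothetical.

def fit_minmax(rows):
    """Stand-in for a scaler's fit(): learn min and max from the given rows."""
    return min(rows), max(rows)

def scale(rows, lo, hi):
    """Stand-in for a scaler's transform(): map rows into [0, 1]."""
    return [(x - lo) / (hi - lo) for x in rows]

data = [1.0, 2.0, 3.0, 4.0, 100.0]   # the outlier lands in the test split
train, test = data[:4], data[4:]

# LEAKY: the scaler sees the test outlier (100.0) before the split,
# so test-set statistics influence the training features.
lo, hi = fit_minmax(data)
leaky_train = scale(train, lo, hi)

# CORRECT: the scaler is fitted on the training split only.
lo, hi = fit_minmax(train)
clean_train = scale(train, lo, hi)

print(leaky_train)  # squashed toward 0 by the leaked outlier
print(clean_train)  # spans the full [0, 1] range
```

Both variants are syntactically fine, which is why such leaks survive code review and motivate the automated, ML-based detection the article studies.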
ArticleNumber | e2730 |
Audience | Academic |
Author | Hassine, Jameleddine Alturayeif, Nouf |
Author_xml | – sequence: 1 givenname: Nouf orcidid: 0000-0002-2761-8420 surname: Alturayeif fullname: Alturayeif, Nouf organization: Information and Computer Science Department, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia, Computing Department, Imam Abdulrahman Bin Faisal University, Dammam, Saudi Arabia – sequence: 2 givenname: Jameleddine orcidid: 0000-0001-8170-9860 surname: Hassine fullname: Hassine, Jameleddine organization: Information and Computer Science Department, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia, Interdisciplinary Research Center for Intelligent Secure Systems, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia |
BackLink | https://www.ncbi.nlm.nih.gov/pubmed/40134878 (View this record in MEDLINE/PubMed) |
ContentType | Journal Article |
Copyright | 2025 Alturayeif and Hassine. COPYRIGHT 2025 PeerJ. Ltd. 2025 Alturayeif and Hassine 2025 Alturayeif and Hassine |
DOI | 10.7717/peerj-cs.2730 |
DatabaseName | CrossRef PubMed Gale In Context: Science MEDLINE - Academic PubMed Central (Full Participant titles) DOAJ Directory of Open Access Journals |
DatabaseTitle | CrossRef PubMed MEDLINE - Academic |
Discipline | Computer Science |
EISSN | 2376-5992 |
ExternalDocumentID | oai_doaj_org_article_2287405671c84ce59f1d0727ecc8af3a PMC11935776 A829778945 40134878 10_7717_peerj_cs_2730 |
Genre | Journal Article |
GrantInformation_xml | – fundername: Interdisciplinary Research Center for Intelligent Secure Systems at KFUPM grantid: INSS2406 |
ISSN | 2376-5992 |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | true |
IsScholarly | true |
Keywords | Data leakage Code quality Low-shot prompting Active learning Transfer learning |
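Among the keywords above, active learning is the approach the study found most effective: it cuts annotation cost by querying labels only for the samples the model is least certain about (1,523 down to 698 in the article's experiments). A minimal pool-based uncertainty-sampling sketch, with toy data and hypothetical names rather than the paper's setup:

```python
# Hedged sketch of pool-based active learning with uncertainty sampling.
# Each pool entry is a toy model's predicted P(leak) for one code sample;
# the most uncertain predictions (closest to 0.5) are sent for labeling.
import random

random.seed(0)
pool = [random.random() for _ in range(20)]  # unlabeled pool of predictions
labeled, batch_size = [], 5

for _ in range(2):                            # two query rounds
    # uncertainty = closeness of the predicted probability to 0.5
    pool.sort(key=lambda p: abs(p - 0.5))
    batch, pool = pool[:batch_size], pool[batch_size:]
    labeled.extend(batch)                     # send this batch to the annotator

print(len(labeled), len(pool))  # → 10 10
```

In a real loop the model would be retrained on `labeled` after each round, updating the predictions before the next query; this sketch only shows the selection step.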
Language | English |
License | https://creativecommons.org/licenses/by/4.0 2025 Alturayeif and Hassine. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited. |
ORCID | 0000-0001-8170-9860 0000-0002-2761-8420 |
OpenAccessLink | https://doaj.org/article/2287405671c84ce59f1d0727ecc8af3a |
PMID | 40134878 |
PQID | 3181370058 |
PQPubID | 23479 |
PublicationDate | 2025-03-05 |
PublicationPlace | United States |
PublicationTitle | PeerJ. Computer science |
PublicationTitleAlternate | PeerJ Comput Sci |
PublicationYear | 2025 |
Publisher | PeerJ. Ltd PeerJ Inc |
SourceID | doaj pubmedcentral proquest gale pubmed crossref |
SourceType | Open Website Open Access Repository Aggregation Database Index Database |
StartPage | e2730 |
SubjectTerms | Active learning Artificial Intelligence Code quality Computational linguistics Data leakage Data Mining and Machine Learning Evaluation Language processing Low-shot prompting Machine learning Natural language interfaces Neural Networks Software Engineering Transfer learning |
Title | Data leakage detection in machine learning code: transfer learning, active learning, or low-shot prompting? |
URI | https://www.ncbi.nlm.nih.gov/pubmed/40134878 https://www.proquest.com/docview/3181370058 https://pubmed.ncbi.nlm.nih.gov/PMC11935776 https://doaj.org/article/2287405671c84ce59f1d0727ecc8af3a |
Volume | 11 |