Data leakage detection in machine learning code: transfer learning, active learning, or low-shot prompting?


Bibliographic Details
Published in: PeerJ Computer Science, Vol. 11, p. e2730
Main Authors: Alturayeif, Nouf; Hassine, Jameleddine
Format: Journal Article
Language: English
Published: United States: PeerJ Inc., 05.03.2025
ISSN: 2376-5992
EISSN: 2376-5992
DOI: 10.7717/peerj-cs.2730

Abstract: With the increasing reliance on machine learning (ML) across diverse disciplines, ML code has been subject to a number of issues that impact its quality, such as lack of documentation, algorithmic biases, overfitting, lack of reproducibility, inadequate data preprocessing, and potential for data leakage, all of which can significantly affect the performance and reliability of ML models. Data leakage degrades the quality of ML models when information from the test set inadvertently influences the training process, leading to inflated performance metrics that do not generalize to new, unseen data. Data leakage can occur at either the dataset level (i.e., during dataset construction) or the code level. Existing studies introduced methods to detect code-level data leakage using manual and code-analysis approaches. However, automated tools with advanced ML techniques are increasingly recognized as essential for efficiently identifying quality issues in large and complex codebases, enhancing the overall effectiveness of code review. In this article, we explore ML-based approaches that work with limited annotated datasets to detect code-level data leakage in ML code. We propose three approaches, namely transfer learning, active learning, and low-shot prompting. Additionally, we introduce an automated approach to handle the class imbalance of code data. Our results show that active learning outperformed the other approaches with an F2 score of 0.72 while reducing the number of annotated samples needed from 1,523 to 698. We conclude that existing ML-based approaches can effectively mitigate the challenges associated with limited data availability.
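Code-level leakage of the kind described above typically enters through the order of preprocessing operations. Below is a minimal, hypothetical sketch (not code from the article) contrasting a leaky scikit-learn pipeline with a correct one; the dataset and variable names are illustrative assumptions:

```python
# Hypothetical illustration of code-level data leakage (not code from the article).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)

# LEAKY: the scaler is fit on the full dataset, so test-set statistics
# (mean and std) flow into the transformed training data.
X_all_scaled = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_all_scaled, y, random_state=0)

# CORRECT: split first, fit the scaler on the training split only,
# then apply the fitted scaler to the held-out split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_tr)
model = LogisticRegression().fit(scaler.transform(X_tr), y_tr)
print(model.score(scaler.transform(X_te), y_te))
```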
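The class-imbalance issue the abstract mentions arises because leaky code samples are far rarer than clean ones. As a generic sketch of one common remedy, here is minority oversampling with SMOTE via the imbalanced-learn library; the article's own automated balancing approach may differ:

```python
# Hypothetical class-imbalance handling via SMOTE oversampling
# (a generic sketch; not the article's automated approach).
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Simulated feature vectors for code samples: ~10% "leaky" minority class.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
print("before:", Counter(y))

# SMOTE synthesizes new minority-class points by interpolating
# between existing minority-class neighbors.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))
```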
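The headline result (active learning reaching an F2 score of 0.72 with 698 rather than 1,523 annotated samples) rests on pool-based querying: label only the samples the current model is least sure about. A generic, hypothetical uncertainty-sampling loop in scikit-learn follows; the article's actual classifier and query strategy are not reproduced here:

```python
# Hypothetical pool-based active learning with uncertainty sampling
# (a generic sketch; not the article's implementation).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score

X_pool, y_pool = make_classification(n_samples=1523, weights=[0.9], random_state=0)
X_test, y_test = make_classification(n_samples=400, weights=[0.9], random_state=1)

labeled = list(range(20))  # small seed set of annotated samples
unlabeled = [i for i in range(len(X_pool)) if i not in labeled]

for _ in range(10):
    clf = LogisticRegression(max_iter=1000).fit(X_pool[labeled], y_pool[labeled])
    # Query the pool samples the current model is least certain about.
    proba = clf.predict_proba(X_pool[unlabeled])
    uncertainty = 1.0 - proba.max(axis=1)
    query = np.argsort(uncertainty)[-50:]  # 50 most uncertain positions
    for q in sorted(query, reverse=True):  # pop from the back so indices stay valid
        labeled.append(unlabeled.pop(q))   # "annotate" the queried sample

clf = LogisticRegression(max_iter=1000).fit(X_pool[labeled], y_pool[labeled])
# F2 weights recall twice as heavily as precision, which suits leakage
# detection, where missing a real leak costs more than a false alarm.
print(fbeta_score(y_test, clf.predict(X_test), beta=2))
```

In the article's setting the pool would presumably hold code snippets represented by a code model's embeddings rather than synthetic features.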
Article Number: e2730
Audience: Academic
Authors:
– Alturayeif, Nouf (ORCID: 0000-0002-2761-8420). Affiliations: Information and Computer Science Department, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia; Computing Department, Imam Abdulrahman Bin Faisal University, Dammam, Saudi Arabia
– Hassine, Jameleddine (ORCID: 0000-0001-8170-9860). Affiliations: Information and Computer Science Department, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia; Interdisciplinary Research Center for Intelligent Secure Systems, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia
Copyright: 2025 Alturayeif and Hassine.
Discipline: Computer Science
Funding: Interdisciplinary Research Center for Intelligent Secure Systems at KFUPM (grant INSS2406)
Keywords: Data leakage; Code quality; Low-shot prompting; Active learning; Transfer learning
License: Creative Commons Attribution 4.0 (https://creativecommons.org/licenses/by/4.0). 2025 Alturayeif and Hassine. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
Open Access Link: https://doaj.org/article/2287405671c84ce59f1d0727ecc8af3a
PMID: 40134878
Subject Terms: Active learning; Artificial Intelligence; Code quality; Computational linguistics; Data leakage; Data Mining and Machine Learning; Evaluation; Language processing; Low-shot prompting; Machine learning; Natural language interfaces; Neural Networks; Software Engineering; Transfer learning
Online Access:
https://www.ncbi.nlm.nih.gov/pubmed/40134878
https://www.proquest.com/docview/3181370058
https://pubmed.ncbi.nlm.nih.gov/PMC11935776
https://doaj.org/article/2287405671c84ce59f1d0727ecc8af3a