Data leakage detection in machine learning code: transfer learning, active learning, or low-shot prompting?


Bibliographic Details
Published in: PeerJ Computer Science, Vol. 11, p. e2730
Main Authors: Alturayeif, Nouf; Hassine, Jameleddine
Format: Journal Article
Language: English
Published: United States: PeerJ Inc., 05.03.2025
ISSN: 2376-5992
EISSN: 2376-5992
DOI: 10.7717/peerj-cs.2730

Abstract: With the increasing reliance on machine learning (ML) across diverse disciplines, ML code has been subject to a number of issues that impact its quality, such as lack of documentation, algorithmic biases, overfitting, lack of reproducibility, inadequate data preprocessing, and potential for data leakage, all of which can significantly affect the performance and reliability of ML models. Data leakage degrades the quality of ML models when information from the test set inadvertently influences the training process, leading to inflated performance metrics that do not generalize to new, unseen data. Data leakage can occur at either the dataset level (i.e., during dataset construction) or the code level. Existing studies introduced methods to detect code-level data leakage using manual and code-analysis approaches. However, automated tools with advanced ML techniques are increasingly recognized as essential for efficiently identifying quality issues in large and complex codebases, enhancing the overall effectiveness of code review. In this article, we explore ML-based approaches that work with limited annotated datasets to detect code-level data leakage in ML code. We propose three approaches, namely transfer learning, active learning, and low-shot prompting. Additionally, we introduce an automated approach to handle the class imbalance of code data. Our results show that active learning outperformed the other approaches with an F2 score of 0.72 while reducing the number of annotated samples needed from 1,523 to 698. We conclude that existing ML-based approaches can effectively mitigate the challenges associated with limited data availability.
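Code-level leakage of the kind described above typically enters through the order of preprocessing operations. Below is a minimal, hypothetical sketch (not code from the article) contrasting a leaky scikit-learn pipeline with a correct one; the dataset and variable names are illustrative assumptions:

```python
# Hypothetical illustration of code-level data leakage (not code from the article).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)

# LEAKY: the scaler is fit on the full dataset, so test-set statistics
# (mean and std) flow into the transformed training data.
X_all_scaled = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_all_scaled, y, random_state=0)

# CORRECT: split first, fit the scaler on the training split only,
# then apply the fitted scaler to the held-out split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_tr)
model = LogisticRegression().fit(scaler.transform(X_tr), y_tr)
print(model.score(scaler.transform(X_te), y_te))
```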
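The class-imbalance issue the abstract mentions arises because leaky code samples are far rarer than clean ones. As a generic sketch of one common remedy, here is minority oversampling with SMOTE via the imbalanced-learn library; the article's own automated balancing approach may differ:

```python
# Hypothetical class-imbalance handling via SMOTE oversampling
# (a generic sketch; not the article's automated approach).
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Simulated feature vectors for code samples: ~10% "leaky" minority class.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
print("before:", Counter(y))

# SMOTE synthesizes new minority-class points by interpolating
# between existing minority-class neighbors.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))
```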
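The headline result (active learning reaching an F2 score of 0.72 with 698 rather than 1,523 annotated samples) rests on pool-based querying: label only the samples the current model is least sure about. A generic, hypothetical uncertainty-sampling loop in scikit-learn follows; the article's actual classifier and query strategy are not reproduced here:

```python
# Hypothetical pool-based active learning with uncertainty sampling
# (a generic sketch; not the article's implementation).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score

X_pool, y_pool = make_classification(n_samples=1523, weights=[0.9], random_state=0)
X_test, y_test = make_classification(n_samples=400, weights=[0.9], random_state=1)

labeled = list(range(20))  # small seed set of annotated samples
unlabeled = [i for i in range(len(X_pool)) if i not in labeled]

for _ in range(10):
    clf = LogisticRegression(max_iter=1000).fit(X_pool[labeled], y_pool[labeled])
    # Query the pool samples the current model is least certain about.
    proba = clf.predict_proba(X_pool[unlabeled])
    uncertainty = 1.0 - proba.max(axis=1)
    query = np.argsort(uncertainty)[-50:]  # 50 most uncertain positions
    for q in sorted(query, reverse=True):  # pop from the back so indices stay valid
        labeled.append(unlabeled.pop(q))   # "annotate" the queried sample

clf = LogisticRegression(max_iter=1000).fit(X_pool[labeled], y_pool[labeled])
# F2 weights recall twice as heavily as precision, which suits leakage
# detection, where missing a real leak costs more than a false alarm.
print(fbeta_score(y_test, clf.predict(X_test), beta=2))
```

In the article's setting the pool would presumably hold code snippets represented by a code model's embeddings rather than synthetic features.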
Article Number: e2730
Audience: Academic
Authors:
– Alturayeif, Nouf (ORCID: 0000-0002-2761-8420). Affiliations: Information and Computer Science Department, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia; Computing Department, Imam Abdulrahman Bin Faisal University, Dammam, Saudi Arabia
– Hassine, Jameleddine (ORCID: 0000-0001-8170-9860). Affiliations: Information and Computer Science Department, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia; Interdisciplinary Research Center for Intelligent Secure Systems, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia
Copyright: 2025 Alturayeif and Hassine.
Discipline: Computer Science
Funding: Interdisciplinary Research Center for Intelligent Secure Systems at KFUPM (grant INSS2406)
Keywords: Data leakage; Code quality; Low-shot prompting; Active learning; Transfer learning
License: Creative Commons Attribution 4.0 (https://creativecommons.org/licenses/by/4.0). 2025 Alturayeif and Hassine. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
Open Access Link: https://doaj.org/article/2287405671c84ce59f1d0727ecc8af3a
PMID: 40134878
Subject Terms: Active learning; Artificial Intelligence; Code quality; Computational linguistics; Data leakage; Data Mining and Machine Learning; Evaluation; Language processing; Low-shot prompting; Machine learning; Natural language interfaces; Neural Networks; Software Engineering; Transfer learning
Online Access:
https://www.ncbi.nlm.nih.gov/pubmed/40134878
https://www.proquest.com/docview/3181370058
https://pubmed.ncbi.nlm.nih.gov/PMC11935776
https://doaj.org/article/2287405671c84ce59f1d0727ecc8af3a