Contextualized Data-Wrangling Code Generation in Computational Notebooks
Data wrangling, the process of preparing raw data for further analysis in computational notebooks, is a crucial yet time-consuming step in data science. Code generation has the potential to automate the data wrangling process to reduce analysts' overhead by translating user intents into executa...
Published in | IEEE/ACM International Conference on Automated Software Engineering : [proceedings] pp. 1282 - 1294 |
---|---|
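Main Authors | Huang, Junjie; Guo, Daya; Wang, Chenglong; Gu, Jiazhen; Lu, Shuai; Inala, Jeevana Priya; Yan, Cong; Gao, Jianfeng; Duan, Nan; Lyu, Michael R. |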
Format | Conference Proceeding |
Language | English |
Published | ACM, 27.10.2024 |
Subjects | |
ISSN | 2643-1572 |
DOI | 10.1145/3691620.3695503 |
Abstract | Data wrangling, the process of preparing raw data for further analysis in computational notebooks, is a crucial yet time-consuming step in data science. Code generation has the potential to automate the data wrangling process to reduce analysts' overhead by translating user intents into executable code. Precisely generating data wrangling code necessitates a comprehensive consideration of the rich context present in notebooks, including textual context, code context and data context. However, notebooks often interleave multiple non-linear analysis tasks into a linear sequence of code blocks, where the contextual dependencies are not clearly reflected. Directly training models with source code blocks fails to fully exploit the contexts for accurate wrangling code generation. To bridge the gap, we aim to construct a high-quality dataset with clear and rich contexts to help train models for data-wrangling code generation tasks. In this work, we first propose an automated approach, CoCoMine, to mine data-wrangling code generation examples with clear multi-modal contextual dependency. It first adopts data flow analysis to identify the code blocks containing data-wrangling code. Then, CoCoMine extracts the contextualized data-wrangling code examples through tracing and replaying notebooks. With CoCoMine, we construct CoCoNote, a dataset containing 58,221 examples for Contextualized Data-wrangling Code generation in Notebooks. To demonstrate the effectiveness of our dataset, we finetune a range of pretrained code models and prompt various large language models on our task. Furthermore, we also propose DataCoder, which encodes data context separately from code and textual contexts to enhance code generation. Experiment results demonstrate the significance of incorporating data context in data-wrangling code generation and the effectiveness of our model.
We release code and data at https://github.com/Jun-jie-Huang/CoCoNote. CCS Concepts: Software and its engineering → Automatic programming. |
---|---|
Author | Lu, Shuai; Lyu, Michael R.; Wang, Chenglong; Gu, Jiazhen; Guo, Daya; Huang, Junjie; Gao, Jianfeng; Inala, Jeevana Priya; Yan, Cong; Duan, Nan |
Author_xml | – sequence: 1 givenname: Junjie surname: Huang fullname: Huang, Junjie email: jjhuang23@cse.cuhk.edu.hk organization: The Chinese University of Hong Kong, China – sequence: 2 givenname: Daya surname: Guo fullname: Guo, Daya email: guody5@mail2.sysu.edu.cn organization: Sun Yat-sen University – sequence: 3 givenname: Chenglong surname: Wang fullname: Wang, Chenglong email: chenglong.wang@microsoft.com organization: Microsoft Research – sequence: 4 givenname: Jiazhen surname: Gu fullname: Gu, Jiazhen email: jiazhengu@cse.cuhk.edu.hk organization: The Chinese University of Hong Kong, China – sequence: 5 givenname: Shuai surname: Lu fullname: Lu, Shuai email: shuailu@microsoft.com organization: Microsoft Research Asia – sequence: 6 givenname: Jeevana Priya surname: Inala fullname: Inala, Jeevana Priya organization: Microsoft Research – sequence: 7 givenname: Cong surname: Yan fullname: Yan, Cong organization: Microsoft Research – sequence: 8 givenname: Jianfeng surname: Gao fullname: Gao, Jianfeng email: jfgao@microsoft.com organization: Microsoft Research – sequence: 9 givenname: Nan surname: Duan fullname: Duan, Nan email: jinala@microsoft.com organization: Microsoft Research Asia – sequence: 10 givenname: Michael R. surname: Lyu fullname: Lyu, Michael R. email: lyu@cse.cuhk.edu.hk organization: The Chinese University of Hong Kong, China |
CODEN | IEEPAD |
ContentType | Conference Proceeding |
DBID | 6IE 6IL CBEJK RIE RIL |
DOI | 10.1145/3691620.3695503 |
DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings; IEEE Xplore POP ALL; IEEE Xplore All Conference Proceedings; IEEE Xplore; IEEE Proceedings Order Plans (POP All) 1998-Present |
DatabaseTitleList | |
Database_xml | – sequence: 1 dbid: RIE name: IEEE Xplore url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/ sourceTypes: Publisher |
DeliveryMethod | fulltext_linktorsrc |
Discipline | Computer Science |
EISBN | 9798400712487 |
EISSN | 2643-1572 |
EndPage | 1294 |
ExternalDocumentID | 10764918 |
Genre | orig-research |
GroupedDBID | 6IE 6IF 6IH 6IK 6IL 6IM 6IN 6J9 AAJGR AAWTH ABLEC ACREN ADYOE ADZIZ AFYQB ALMA_UNASSIGNED_HOLDINGS AMTXH BEFXN BFFAM BGNUA BKEBE BPEOZ CBEJK CHZPO IEGSK IPLJI M43 OCL RIE RIL |
ID | FETCH-LOGICAL-a248t-3f5f22a16be066f789a5892a042a7708bd18919ef6ac2958fe5135dd2fade4c63 |
IEDL.DBID | RIE |
IngestDate | Wed Jan 15 06:20:43 EST 2025 |
IsDoiOpenAccess | false |
IsOpenAccess | true |
IsPeerReviewed | false |
IsScholarly | true |
Language | English |
LinkModel | DirectLink |
MergedId | FETCHMERGED-LOGICAL-a248t-3f5f22a16be066f789a5892a042a7708bd18919ef6ac2958fe5135dd2fade4c63 |
PageCount | 13 |
ParticipantIDs | ieee_primary_10764918 |
PublicationCentury | 2000 |
PublicationDate | 2024-Oct.-27 |
PublicationDateYYYYMMDD | 2024-10-27 |
PublicationDate_xml | – month: 10 year: 2024 text: 2024-Oct.-27 day: 27 |
PublicationDecade | 2020 |
PublicationTitle | IEEE/ACM International Conference on Automated Software Engineering : [proceedings] |
PublicationTitleAbbrev | ASE |
PublicationYear | 2024 |
Publisher | ACM |
Publisher_xml | – name: ACM |
SSID | ssib057256116 ssj0051577 |
Score | 2.2863002 |
SourceID | ieee |
SourceType | Publisher |
StartPage | 1282 |
SubjectTerms | code generation; Codes; Computational modeling; computational notebooks; Context modeling; Data models; Data science; data wrangling; Large language models; Software; Software engineering; Source coding; Training |
Title | Contextualized Data-Wrangling Code Generation in Computational Notebooks |
URI | https://ieeexplore.ieee.org/document/10764918 |
hasFullText | 1 |
inHoldings | 1 |
isFullTextHit | |
isPrint | |
linkProvider | IEEE |
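
The abstract states that CoCoMine first uses data flow analysis to identify the code blocks that contain data-wrangling code. The record gives no implementation details, so the following is only a minimal sketch of that general idea and not the authors' method. It assumes notebook cells are available as plain Python source strings, treats a small hypothetical set of pandas reader calls (`READERS`) as the origin of table data, and flags a cell when an assignment in it reads from a variable already traced back to such a reader. The function names are illustrative.

```python
import ast

# Hypothetical seed set: pandas readers assumed to produce DataFrames.
READERS = {"read_csv", "read_excel", "read_json"}


def names_loaded(node):
    """Return every variable name that is read inside an AST node."""
    return {n.id for n in ast.walk(node)
            if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)}


def is_reader_call(node):
    """True if the expression is an attribute call such as pd.read_csv(...)."""
    return (isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr in READERS)


def assigned_names(target):
    """Base variable written by an assignment target (df, df['x'], df.x)."""
    while isinstance(target, (ast.Subscript, ast.Attribute)):
        target = target.value
    return {target.id} if isinstance(target, ast.Name) else set()


def find_wrangling_cells(cells):
    """Return indices of cells that derive a value from a tracked DataFrame."""
    frames = set()   # variables currently believed to hold table data
    wrangling = []
    for idx, source in enumerate(cells):
        flagged = False
        # ast.walk is breadth-first, so top-level statements come in source order.
        for stmt in ast.walk(ast.parse(source)):
            if not isinstance(stmt, ast.Assign):
                continue
            targets = set()
            for t in stmt.targets:
                targets |= assigned_names(t)
            if is_reader_call(stmt.value):
                frames |= targets      # data loading, not wrangling
            elif names_loaded(stmt.value) & frames:
                frames |= targets      # data flows from a tracked frame
                flagged = True
        if flagged:
            wrangling.append(idx)
    return wrangling


cells = [
    "import pandas as pd\ndf = pd.read_csv('a.csv')",
    "x = 1 + 2",
    "clean = df.dropna()\nclean['total'] = clean['a'] + clean['b']",
    "print(clean.head())",
]
print(find_wrangling_cells(cells))
```

The sketch only parses the cells and never executes them, so it cannot observe actual DataFrame contents. The tracing and replaying of notebooks mentioned in the abstract is outside its scope.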