Contextualized Data-Wrangling Code Generation in Computational Notebooks

Bibliographic Details
Published in IEEE/ACM International Conference on Automated Software Engineering : [proceedings] pp. 1282 - 1294
Main Authors Huang, Junjie, Guo, Daya, Wang, Chenglong, Gu, Jiazhen, Lu, Shuai, Inala, Jeevana Priya, Yan, Cong, Gao, Jianfeng, Duan, Nan, Lyu, Michael R.
Format Conference Proceeding
Language English
Published ACM 27.10.2024
Subjects
ISSN 2643-1572
DOI 10.1145/3691620.3695503

Abstract Data wrangling, the process of preparing raw data for further analysis in computational notebooks, is a crucial yet time-consuming step in data science. Code generation has the potential to automate the data wrangling process and reduce analysts' overhead by translating user intents into executable code. Precisely generating data wrangling code requires comprehensive consideration of the rich context present in notebooks, including textual context, code context and data context. However, notebooks often interleave multiple non-linear analysis tasks into a linear sequence of code blocks, where the contextual dependencies are not clearly reflected. Directly training models on source code blocks fails to fully exploit the contexts for accurate wrangling code generation. To bridge the gap, we aim to construct a high-quality dataset with clear and rich contexts to help train models for data wrangling code generation tasks. In this work, we first propose an automated approach, CoCoMine, to mine data-wrangling code generation examples with clear multi-modal contextual dependency. It first adopts data flow analysis to identify the code blocks containing data-wrangling code. Then, CoCoMine extracts the contextualized data-wrangling code examples through tracing and replaying notebooks. With CoCoMine, we construct CoCoNote, a dataset containing 58,221 examples for Contextualized Data-wrangling Code generation in Notebooks. To demonstrate the effectiveness of our dataset, we finetune a range of pretrained code models and prompt various large language models on our task. We also propose DataCoder, which encodes data context separately from code and textual contexts to enhance code generation. Experimental results demonstrate the significance of incorporating data context in data-wrangling code generation and the effectiveness of our model.
We release code and data at https://github.com/Jun-jie-Huang/CoCoNote.
CCS CONCEPTS: Software and its engineering → Automatic programming.
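The abstract notes that notebooks interleave several analysis tasks in one linear sequence of code blocks, and that data flow analysis is used to recover the dependencies between them. The sketch below is only an illustration of that general idea, not the CoCoMine implementation: it uses Python's standard ast module to record which names each cell defines and reads, then walks backwards from a target cell to collect the earlier cells it depends on. The function names defs_and_uses and context_cells, and the sample cells, are hypothetical. It assumes every cell is plain Python (notebook magics such as %matplotlib would need to be stripped first) and it is deliberately conservative: it never drops a name once it is needed, so it may keep more cells than a precise analysis would.

```python
import ast


def defs_and_uses(cell_source):
    """Return (names defined or mutated, names read) for one notebook cell."""
    defined, used = set(), set()
    for node in ast.walk(ast.parse(cell_source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            else:
                used.add(node.id)
        elif isinstance(node, (ast.Subscript, ast.Attribute)):
            # df["col"] = ... or df.attr = ... mutates df, so count df as defined
            if isinstance(node.ctx, ast.Store) and isinstance(node.value, ast.Name):
                defined.add(node.value.id)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            # "import pandas as pd" binds the name pd
            for alias in node.names:
                defined.add(alias.asname or alias.name.split(".")[0])
    return defined, used


def context_cells(cells, target_index):
    """Indices of earlier cells the target cell (transitively) depends on."""
    summaries = [defs_and_uses(src) for src in cells]
    needed = set(summaries[target_index][1])
    selected = []
    for i in range(target_index - 1, -1, -1):
        defined, used = summaries[i]
        if defined & needed:
            selected.append(i)
            needed |= used  # whatever this cell reads is now needed too
    return sorted(selected)


# Hypothetical notebook: cells 1 and 3 belong to unrelated tasks.
cells = [
    "import pandas as pd\ndf = pd.read_csv('sales.csv')",
    "import matplotlib.pyplot as plt\nplt.plot([1, 2, 3])",
    "df['total'] = df['price'] * df['qty']",
    "other = pd.DataFrame({'a': [1]})",
    "summary = df.groupby('region')['total'].sum()",
]

print(context_cells(cells, 4))  # [0, 2]
```

Because the cells are only parsed and never executed, the sketch recovers code context only. The data context the abstract refers to (the actual contents of the data at a given point) would require running the notebook, which this example does not attempt.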
Author Lu, Shuai
Lyu, Michael R.
Wang, Chenglong
Gu, Jiazhen
Guo, Daya
Huang, Junjie
Gao, Jianfeng
Inala, Jeevana Priya
Yan, Cong
Duan, Nan
Author_xml – sequence: 1
  givenname: Junjie
  surname: Huang
  fullname: Huang, Junjie
  email: jjhuang23@cse.cuhk.edu.hk
  organization: The Chinese University of Hong Kong, China
– sequence: 2
  givenname: Daya
  surname: Guo
  fullname: Guo, Daya
  email: guody5@mail2.sysu.edu.cn
  organization: Sun Yat-sen University
– sequence: 3
  givenname: Chenglong
  surname: Wang
  fullname: Wang, Chenglong
  email: chenglong.wang@microsoft.com
  organization: Microsoft Research
– sequence: 4
  givenname: Jiazhen
  surname: Gu
  fullname: Gu, Jiazhen
  email: jiazhengu@cse.cuhk.edu.hk
  organization: The Chinese University of Hong Kong, China
– sequence: 5
  givenname: Shuai
  surname: Lu
  fullname: Lu, Shuai
  email: shuailu@microsoft.com
  organization: Microsoft Research Asia
– sequence: 6
  givenname: Jeevana Priya
  surname: Inala
  fullname: Inala, Jeevana Priya
  organization: Microsoft Research
– sequence: 7
  givenname: Cong
  surname: Yan
  fullname: Yan, Cong
  organization: Microsoft Research
– sequence: 8
  givenname: Jianfeng
  surname: Gao
  fullname: Gao, Jianfeng
  email: jfgao@microsoft.com
  organization: Microsoft Research
– sequence: 9
  givenname: Nan
  surname: Duan
  fullname: Duan, Nan
  email: jinala@microsoft.com
  organization: Microsoft Research Asia
– sequence: 10
  givenname: Michael R.
  surname: Lyu
  fullname: Lyu, Michael R.
  email: lyu@cse.cuhk.edu.hk
  organization: The Chinese University of Hong Kong, China
CODEN IEEPAD
ContentType Conference Proceeding
DBID 6IE
6IL
CBEJK
RIE
RIL
DOI 10.1145/3691620.3695503
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Xplore POP ALL
IEEE Xplore All Conference Proceedings
IEEE Xplore
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Xplore
  url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Computer Science
EISBN 9798400712487
EISSN 2643-1572
EndPage 1294
ExternalDocumentID 10764918
Genre orig-research
IEDL.DBID RIE
IngestDate Wed Jan 15 06:20:43 EST 2025
IsDoiOpenAccess false
IsOpenAccess true
IsPeerReviewed false
IsScholarly true
Language English
LinkModel DirectLink
PageCount 13
ParticipantIDs ieee_primary_10764918
PublicationCentury 2000
PublicationDate 2024-Oct.-27
PublicationDateYYYYMMDD 2024-10-27
PublicationDate_xml – month: 10
  year: 2024
  text: 2024-Oct.-27
  day: 27
PublicationDecade 2020
PublicationTitle IEEE/ACM International Conference on Automated Software Engineering : [proceedings]
PublicationTitleAbbrev ASE
PublicationYear 2024
Publisher ACM
Publisher_xml – name: ACM
SourceID ieee
SourceType Publisher
StartPage 1282
SubjectTerms code generation
Codes
Computational modeling
computational notebooks
Context modeling
Data models
Data science
data wrangling
Large language models
Software
Software engineering
Source coding
Training
Title Contextualized Data-Wrangling Code Generation in Computational Notebooks
URI https://ieeexplore.ieee.org/document/10764918
hasFullText 1
inHoldings 1
isFullTextHit
isPrint