Semantically Aligned Question and Code Generation for Automated Insight Generation

Bibliographic Details
Published in 2024 IEEE/ACM International Workshop on Large Language Models for Code (LLM4Code) pp. 127-134
Main Authors Singha, Ananya; Chopra, Bhavya; Khatry, Anirudh; Gulwani, Sumit; Henley, Austin Z.; Le, Vu; Parnin, Chris; Singh, Mukul; Verbruggen, Gust
Format Conference Proceeding
Language English
Published ACM 20.04.2024
DOI 10.1145/3643795.3648381

Abstract Automated insight generation is a common tactic for helping knowledge workers, such as data scientists, to quickly understand the potential value of new and unfamiliar data. Unfortunately, automated insights produced by large language models can generate code that does not correctly correspond (or align) to the insight. In this paper, we leverage the semantic knowledge of large language models to generate targeted and insightful questions about data and the corresponding code to answer those questions. Then, through an empirical study on data from Open-WikiTable, we show that embeddings can be effectively used for filtering out semantically unaligned pairs of questions and code. Additionally, we found that generating questions and code together yields more diverse questions.
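The embedding-based filtering mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `toy_embed` function (a bag-of-words counter), the `filter_aligned` helper, the example pairs, and the 0.3 threshold are all assumptions made for illustration. A real setup would presumably substitute a learned text/code embedding model and a tuned threshold.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in embedding (assumption): lowercase token counts, not a learned model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_aligned(pairs, threshold=0.3, embed=toy_embed):
    """Keep (question, code) pairs whose embedding similarity meets the threshold.

    The threshold value is hypothetical; it is not taken from the paper.
    """
    return [(q, c) for q, c in pairs if cosine(embed(q), embed(c)) >= threshold]


# Hypothetical question/code pairs: the first is aligned, the second is not.
pairs = [
    ("What is the average population per country?",
     "df.groupby('country')['population'].mean()"),
    ("Which year had the highest revenue?",
     "df['name'].str.upper()"),
]
kept = filter_aligned(pairs)
for question, code in kept:
    print(question, "->", code)
```

Passing a different `embed` callable that returns a compatible sparse vector is enough to swap in another representation; the filtering logic itself does not change.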
Author Henley, Austin Z.
Singh, Mukul
Le, Vu
Parnin, Chris
Singha, Ananya
Chopra, Bhavya
Khatry, Anirudh
Verbruggen, Gust
Gulwani, Sumit
Author_xml – sequence: 1
  givenname: Ananya
  surname: Singha
  fullname: Singha, Ananya
  email: t-asingha@microsoft.com
  organization: Microsoft, India
– sequence: 2
  givenname: Bhavya
  surname: Chopra
  fullname: Chopra, Bhavya
  email: t-bhchopra@microsoft.com
  organization: Microsoft, India
– sequence: 3
  givenname: Anirudh
  surname: Khatry
  fullname: Khatry, Anirudh
  email: t-anikhatry@microsoft.com
  organization: Microsoft, India
– sequence: 4
  givenname: Sumit
  surname: Gulwani
  fullname: Gulwani, Sumit
  email: sumitg@microsoft.com
  organization: Microsoft, USA
– sequence: 5
  givenname: Austin Z.
  surname: Henley
  fullname: Henley, Austin Z.
  email: azh321@gmail.com
  organization: Microsoft, USA
– sequence: 6
  givenname: Vu
  surname: Le
  fullname: Le, Vu
  email: levu@microsoft.com
  organization: Microsoft, USA
– sequence: 7
  givenname: Chris
  surname: Parnin
  fullname: Parnin, Chris
  email: chrisparnin@microsoft.com
  organization: Microsoft, USA
– sequence: 8
  givenname: Mukul
  surname: Singh
  fullname: Singh, Mukul
  email: singhmukul@microsoft.com
  organization: Microsoft, India
– sequence: 9
  givenname: Gust
  surname: Verbruggen
  fullname: Verbruggen, Gust
  email: gverbruggen@microsoft.com
  organization: Microsoft, Belgium
CODEN IEEPAD
ContentType Conference Proceeding
DBID 6IE
6IL
CBEJK
RIE
RIL
DOI 10.1145/3643795.3648381
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
EISBN 9798400705793
EndPage 134
ExternalDocumentID 10734616
Genre orig-research
GroupedDBID 6IE
6IL
ACM
ALMA_UNASSIGNED_HOLDINGS
APO
CBEJK
LHSKQ
RIE
RIL
ID FETCH-LOGICAL-a248t-267e6f773cfa0a329e668ab7afa8d8c9307641a52c55dca6671113080389bbb3
IEDL.DBID RIE
IngestDate Wed Aug 27 03:01:19 EDT 2025
IsPeerReviewed false
IsScholarly false
Language English
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-a248t-267e6f773cfa0a329e668ab7afa8d8c9307641a52c55dca6671113080389bbb3
PageCount 8
ParticipantIDs ieee_primary_10734616
PublicationCentury 2000
PublicationDate 2024-April-20
PublicationDateYYYYMMDD 2024-04-20
PublicationDate_xml – month: 04
  year: 2024
  text: 2024-April-20
  day: 20
PublicationDecade 2020
PublicationTitle 2024 IEEE/ACM International Workshop on Large Language Models for Code (LLM4Code)
PublicationTitleAbbrev LLM4CODE
PublicationYear 2024
Publisher ACM
Publisher_xml – name: ACM
SSID ssib057256032
Score 1.8693117
SourceID ieee
SourceType Publisher
StartPage 127
SubjectTerms alignment
Code-generation
Codes
Conferences
Costs
Filtering
Large language models
LLM
Semantics
Title Semantically Aligned Question and Code Generation for Automated Insight Generation
URI https://ieeexplore.ieee.org/document/10734616
hasFullText 1
inHoldings 1
isFullTextHit
isPrint