Semantically Aligned Question and Code Generation for Automated Insight Generation

Bibliographic Details
Published in	2024 IEEE/ACM International Workshop on Large Language Models for Code (LLM4Code) pp. 127 - 134
Main Authors	Singha, Ananya; Chopra, Bhavya; Khatry, Anirudh; Gulwani, Sumit; Henley, Austin Z.; Le, Vu; Parnin, Chris; Singh, Mukul; Verbruggen, Gust
Format	Conference Proceeding
Language	English
Published	ACM 20.04.2024
Subjects	alignment; Code-generation; Codes; Conferences; Costs; Filtering; Large language models; LLM; Semantics
DOI	10.1145/3643795.3648381

Abstract	Automated insight generation is a common tactic for helping knowledge workers, such as data scientists, quickly understand the potential value of new and unfamiliar data. Unfortunately, large language models used to produce automated insights can generate code that does not correctly correspond (or align) to the insight. In this paper, we leverage the semantic knowledge of large language models to generate targeted and insightful questions about data, along with the corresponding code to answer those questions. Then, through an empirical study on data from Open-WikiTable, we show that embeddings can be effectively used to filter out semantically unaligned pairs of question and code. Additionally, we find that generating questions and code together yields more diverse questions.
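
The embedding-based filtering mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not describe their embedding model, similarity measure, or threshold. The sketch assumes cosine similarity over a toy bag-of-words vector standing in for a learned embedding, and the names `embed`, `cosine`, `filter_aligned`, and the threshold of 0.2 are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model (assumption): token counts.

    A real pipeline would call a learned text/code embedding model here.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_aligned(pairs, threshold=0.2):
    """Keep (question, code) pairs whose similarity meets the threshold.

    The threshold value is illustrative, not taken from the paper.
    """
    return [(q, c) for q, c in pairs if cosine(embed(q), embed(c)) >= threshold]


# Hypothetical question/code pairs; the second pair is deliberately mismatched.
pairs = [
    ("What is the average population per country?",
     "df.groupby('country')['population'].mean()"),
    ("Which year had the highest revenue?",
     "df['name'].str.upper()"),
]
kept = filter_aligned(pairs)
```

In this toy setup the first pair shares the tokens `country` and `population` and is kept, while the second pair shares no tokens and is dropped. A learned embedding would capture alignment that does not depend on literal token overlap.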
Author_xml	1. Singha, Ananya (t-asingha@microsoft.com), Microsoft, India; 2. Chopra, Bhavya (t-bhchopra@microsoft.com), Microsoft, India; 3. Khatry, Anirudh (t-anikhatry@microsoft.com), Microsoft, India; 4. Gulwani, Sumit (sumitg@microsoft.com), Microsoft, USA; 5. Henley, Austin Z. (azh321@gmail.com), Microsoft, USA; 6. Le, Vu (levu@microsoft.com), Microsoft, USA; 7. Parnin, Chris (chrisparnin@microsoft.com), Microsoft, USA; 8. Singh, Mukul (singhmukul@microsoft.com), Microsoft, India; 9. Verbruggen, Gust (gverbruggen@microsoft.com), Microsoft, Belgium
CODEN	IEEPAD
ContentType	Conference Proceeding
EISBN	9798400705793
EndPage	134
ExternalDocumentID	10734616
Genre	orig-research
IsPeerReviewed	false
IsScholarly	false
Language	English
PageCount	8
ParticipantIDs	ieee_primary_10734616
PublicationCentury	2000
PublicationDate	2024-April-20
PublicationDateYYYYMMDD	2024-04-20
PublicationDate_xml	– month: 04 year: 2024 text: 2024-April-20 day: 20
PublicationDecade	2020
PublicationTitle	2024 IEEE/ACM International Workshop on Large Language Models for Code (LLM4Code)
PublicationTitleAbbrev	LLM4CODE
PublicationYear	2024
Publisher	ACM
Publisher_xml	– name: ACM
SourceID	ieee
SourceType	Publisher
StartPage	127
Title	Semantically Aligned Question and Code Generation for Automated Insight Generation
URI	https://ieeexplore.ieee.org/document/10734616